High-Level Synthesis Hardware Design for FPGA-Based Accelerators: Models, Methodologies, and Frameworks

Hardware accelerators based on field programmable gate array (FPGA) and system on chip (SoC) devices have gained attention in recent years. One of the main reasons is that these devices contain reconfigurable logic, which makes them feasible for boosting the performance of applications. High-level synthesis (HLS) tools facilitate the creation of FPGA code from a high level of abstraction using different directives to obtain an optimized hardware design based on performance metrics. However, the complexity of the design space depends on different factors such as the number of directives used in the source code, the available resources in the device, and the clock frequency. Design space exploration (DSE) techniques comprise the evaluation of multiple implementations with different combinations of directives to obtain a design with a good compromise between different metrics. This paper presents a survey of models, methodologies, and frameworks proposed for metric estimation, FPGA-based DSE, and power consumption estimation on FPGA/SoC. The main features, limitations, and trade-offs of these approaches are described. We also present the integration of existing models and frameworks in diverse research areas and identify the different challenges to be addressed.


I. INTRODUCTION
Nowadays, the development of algorithms focuses on performance-efficient and energy-efficient computations. Technologies such as field programmable gate array (FPGA) and system on chip (SoC) based on FPGA (FPGA/SoC) [1], [2], [3], [4] have shown their ability to accelerate compute-intensive applications while reducing power consumption, owing to their high parallelism and architectural reconfigurability.
The associate editor coordinating the review of this manuscript and approving it for publication was Vincenzo Conti.
Several high-level synthesis (HLS) tools [5] have been proposed by vendors and academics such as Vivado HLS [6], formerly AutoPilot [7], Intel HLS [8], LegUp [9], Bambu [10], and others [5]. These tools facilitate the adoption of FPGAs in different fields, as they allow the creation of a register transfer level (RTL) code from a high level of abstraction. Nevertheless, the efficient use of these technologies usually requires the knowledge of the underlying hardware and the use of code restructuring techniques in the original algorithm [11]. This is a time-consuming task for algorithm designers, who want to take advantage of the inherent characteristics of these reconfigurable technologies.
HLS tools support C/C++, SystemC, and OpenCL [12] codes to generate the final RTL code. These tools provide the designer with a detailed report for each algorithmic solution, including information about the estimation of latency, resource utilization (also known as area occupied), and throughput. The use of directives allows code optimization through parallel techniques, such as loop pipelining, loop unrolling, array partitioning, and array reshaping. For each solution, the designer can specify different combinations of directives; comparing the reports provided by these tools, the best option can be determined according to different performance metrics.
Furthermore, these tools allow a design space exploration (DSE), which involves the evaluation of multiple implementations with different combinations of user design constraints, FPGA features, and directives (also known as knobs or optimizations). The number of candidate configurations grows exponentially as the designer applies more directives and as the program contains more complex code structures. The generated hardware is directly determined by the applied directives, and applying and tuning them can require considerable effort to obtain a suitable hardware implementation. An effective DSE process yields a hardware design with a good compromise between metrics such as latency, area, throughput, and power consumption.
Over the years, parallel computing models have proven their benefits across different architectures, such as clusters of distributed single-core and multicore processors, GPUs, and the cloud. These models act as a bridge between the architecture and the software developer. The current trend in parallel computer architectures is toward hybrid architectures that combine many-core processors, superscalar processors, single instruction/multiple data (SIMD) units, hardware accelerators, and on-chip communication systems, among others, which require handling computations and data locality at several levels to achieve suitable performance [13].
Using computing models, methodologies, and frameworks to predict the performance of FPGA/SoC architectures may reduce design times and improve productivity, which are critical issues when choosing these architectures. In this survey, a model is an abstraction that represents a simplified system. A methodology describes the steps involved in the process for systematically solving a problem. A framework provides the structure needed in the form of a template or conceptual scheme to simplify the elaboration of a task.

A. CONTRIBUTION
In this paper, we present a thorough analysis of the computing models, methodologies, and frameworks proposed for reconfigurable hardware accelerators based on FPGA. We compare their main features, including the inputs, outputs, and techniques employed for their development. Then, we show how these approaches for FPGA/SoC can be applied in different research fields, exposing their benefits in improving the design process and productivity.
Consequently, the reader will become more confident about the fundamental and technical aspects of the computing models, methodologies, and frameworks designed for FPGA/SoC, acquiring a clear idea of the main parameters required by each one. We highlight the importance of having simple approaches with few parameters, such as those proposed for other parallel architectures, so that they have a greater scope and can be widely used. Based on this literature review, the FPGA developer can select the approach that best suits the application, hardware architecture, and programming skills.
Some survey articles are available in the literature for FPGA-based reconfigurable hardware. Schafer and Wang [14] divide HLS DSE techniques into two main groups: synthesis-based and model-based. In addition to this classification, a third group appears including DSE synthesis-based and supervised learning. According to [15], HLS DSE can be developed using model-based and model-free techniques. Model-based techniques are composed of tools and methodologies that use analytical models, whereas model-free techniques include approaches where the HLS tool is treated as a black box. A survey of automatic high-level code deployment for HLS tools and toolchains is presented in [16]. The authors analyze commercial HLS tools, academic HLS tools, HLS code generation tools, domain-specific language tools for HLS, dataflow HLS tools, and automatic code deployment tools (including automated DSE). Yehya et al. [17] focus on power consumption. They classify different estimation techniques as analytical, table-based, polynomial-based, and neural networks. The work in [18] analyzes different performance and power estimation models for CPU, GPU, and FPGA. Moreover, reconfigurable architectures can be categorized as coarse-grained and fine-grained according to [19], [20]. In this work, we focus on FPGA and FPGA/SoC architectures included in the last category.
To the best of our knowledge, there is no previous work that jointly: years (2016 to 2022) and have been selected based on the topics addressed in this survey. Several papers published before 2016 have been considered because of their contributions to the current literature.

C. OUTLINE
The remainder of this paper is organized as follows. Section II briefly presents the most widely used parallel computing models for CPU, GPU, and multicore processors. Section III introduces the FPGA-based reconfigurable hardware accelerator architectures, hardware/software co-design, DSE and metrics, and the techniques to improve latency, area, and power for this technology. In Section IV, we describe previous works on models, methodologies, and frameworks proposed for FPGA/SoC according to their main features: metrics estimation (IV-A), FPGA-based DSE (IV-B), and power consumption estimation (IV-C); and in Section IV-D, we present a summary and discussion. The integration of models and frameworks for FPGA-based reconfigurable hardware accelerators in different research fields is exposed in Section V. Challenges are analyzed in Section VI. Finally, conclusions are presented in Section VII.

II. PARALLEL COMPUTING MODELS FOR PERFORMANCE ESTIMATION
Computing models make it easier to analyze algorithms by reducing the computational world to a small set of parameters that define the cost of arithmetic operations, memory accesses, and communication. These models contribute to the search for efficient algorithms for a given architecture, improving the productivity of designers, programmers, and engineers. A small amount of communication, a small number of operations, and a high degree of parallelism are key points that directly contribute to the efficiency of a parallel algorithm.
This section summarizes the characteristics of the most widely used parallel computing models for performance estimation. It is not aimed at providing a comprehensive presentation or a thorough classification of parallel models, languages, and architectures. In addition, we present some examples of their application in different architectures.

A. RANDOM ACCESS MACHINE AND PARALLEL RANDOM ACCESS MACHINE
The random access machine (RAM) model is proposed in [21] for sequential algorithms. It is composed of a memory, control unit, processor, and program. In 1978, Fortune and Wyllie proposed the parallel random access machine (PRAM) model [22] based on the RAM model. The main idea behind PRAM is that there is a shared memory m connected to several processing units with a global clock, as shown in Fig. 1. In this scenario, one processor P can execute one operation (arithmetic, memory access, or logic) within one single clock cycle. However, this model does not consider the communication or synchronization overheads.
PRAM sub-models like the exclusive read exclusive write (EREW), exclusive read concurrent write (ERCW), concurrent read exclusive write (CREW), and concurrent read concurrent write (CRCW) are introduced to handle read/write operations in a shared memory model [23].

B. BULK SYNCHRONOUS PARALLEL MODEL
The bulk synchronous parallel (BSP) model [24], proposed for distributed computing, is a bridging model between hardware and algorithms that offers a high degree of abstraction. A BSP program is divided into supersteps separated by a barrier synchronization. Each superstep comprises several blocks of computation and communication. Fig. 2 shows the workflow of the BSP model.
A BSP computer is represented by parameters P, s, L, and G, where:
• P: number of processors of the BSP computer.
• s: processor speed.
• L: cost, in steps, to complete a barrier synchronization.
• G: cost, in words, of delivering a message.
The normalized cost G is defined by Eq. 1, where $Op_{local}$ is the number of local operations executed in a processor and $W_{sec}$ is the number of words communicated by the network per second.
$$G = \frac{Op_{local}}{W_{sec}} \qquad (1)$$
L represents the barrier synchronization cost at the end of each superstep. The cost of a superstep is the sum of two terms. The first is the maximum number of local computations executed on the parallel processors. The second comprises the cost of the communications, weighted by G, plus the synchronization cost L at the end of the superstep.
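As a rough illustration of how a BSP cost estimate is assembled, the sketch below assumes the textbook superstep cost w + h·G + L, where w is the maximum local work and h the maximum number of words sent or received by any processor. The names `Superstep`, `superstep_cost`, and `program_cost` are hypothetical and only serve this example.

```python
from dataclasses import dataclass

@dataclass
class Superstep:
    # Hypothetical container: one entry per processor.
    work: list   # local operations executed by each processor
    words: list  # words sent or received by each processor

def superstep_cost(step, G, L):
    """Cost of one superstep, assuming the textbook form w + h*G + L."""
    w = max(step.work)    # slowest processor bounds the computation phase
    h = max(step.words)   # busiest processor bounds the communication phase
    return w + h * G + L

def program_cost(steps, G, L):
    """A BSP program costs the sum of its superstep costs."""
    return sum(superstep_cost(s, G, L) for s in steps)

steps = [Superstep([100, 120], [10, 8]), Superstep([50, 40], [0, 4])]
print(program_cost(steps, G=4, L=20))
```

With these illustrative numbers, the first superstep is dominated by computation, while the barrier cost L accounts for a noticeable share of the second, which is the kind of imbalance such a model is meant to expose.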
The multi-BSP model [25] extends the BSP to multicore architectures by considering the architecture as a tree with d leaves. This is a multilevel model with explicit parameters for the number of processors, memory/cache sizes, communication, and synchronization costs. The multi-BSP allows: (i) modelling a multicore computer as a tree, (ii) designing a parallel algorithm as a single program multiple data (SPMD) program with strict separation between computation and communication, and (iii) computing the cost of an algorithm on a specific computer based on computation, data movement, and latency. For a tree with i levels, the main parameters related to this model are as follows:
• $P_i$: number of processors at the i-th level.
• $g_i$: communication bandwidth.
• $L_i$: cost, in steps, to complete a barrier synchronization at level i.
• $m_i$: words of memory at the i-th level.
BSP and multi-BSP have been widely used in multiple contexts and applications because of their flexibility in allowing portable and efficient parallel programs for a wide range of computers [26], [27], [28], [29], [30], [31], [32]. The results presented in [33] demonstrate the feasibility of the BSP-based machine learning (ML) computing model in the field of intrusion detection. An elastic BSP for relaxing the synchronization stage in the context of distributed deep learning is presented in [34]. The authors focus on the data parallelism approach, in which weight synchronization during training is crucial. The BSP is adapted for CUDA applications in [35]. This BSP for the CUDA model allows the prediction of execution times for a single kernel function on the GPU. This proposal focuses on a number of computational and communication steps, but removes synchronization at the end of each step.

C. LogP MODEL
The LogP model [36] describes a parallel machine using four main parameters: communication delay (L), communication overhead (o), gap between each message (g, from a local point of view), and the number of processors (P). A graphical representation of the different parameters is presented in Fig. 3. The model decomposes each communication step into three elements: L, o, and g, measured in clock cycles, but it does not include a model for application/computation. LogP is devised for distributed computation, is based on message passing, and can simulate a BSP model. Different variants of LogP, such as LogGP [38], LogGPC [39], and LogPQ [40], were introduced to improve the model. LogPQ includes communication queues for sending, receiving, and transferring operations. LogGP introduces a new parameter G, defined as the time per byte for a long message (gap per byte). This allows for the modelling of short and long messages. Finally, LogGPC uses LogP parameters for short messages and LogGP parameters for longer messages. Its contribution relies on the inclusion of network contention and network interface direct memory access (DMA).
PLogP [41] and mPlogP [42] have been introduced for multicore architectures. The former includes the overhead (sender and receiver), latency, gap message, and number of nodes. It is suitable for modelling inter-node communication, but lacks a memory access model. The latter is proposed as an extension of PlogP. Unlike PlogP, mPlogP considers multi-grain parallelism (through vector parameters), intranode communication, and inter-level memory access. The parameters of mPlogP are the overhead (o), which includes the overhead of the sender and receiver, latency (l), gap between messages (g), memory access time (m), and number of cores (P).
For CPU/GPU heterogeneous clusters, the work in [43] presents the mHLogGP model based on the mPlogP, LogGP, and LogP models. It is used to predict the performance of point-to-point and broadcast communications, and the running time of parallel algorithms. The model uses parameters such as overhead, latency, gap per byte, gap between messages, and number of computer nodes. The model also helps to estimate possible bottlenecks.
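To make the role of the LogP and LogGP parameters concrete, the following sketch uses the commonly quoted point-to-point expressions: o + L + o for one short message, a spacing of max(g, o) between consecutive short messages, and o + (k - 1)·G + L + o for one k-byte message under LogGP. The function names are hypothetical, and all times are assumed to share the same unit (for example, clock cycles).

```python
def logp_send_time(L, o, g, k=1):
    """Time until the k-th short message is received under LogP.

    The sender pays overhead o per message, consecutive injections are
    spaced by max(g, o), and the last message needs L (network delay)
    plus o (receiver overhead) to complete.
    """
    return o + (k - 1) * max(g, o) + L + o

def loggp_long_message_time(L, o, G, nbytes):
    """Time to deliver one long message of nbytes bytes under LogGP.

    G is the gap per byte; only the first byte pays the sender overhead.
    """
    return o + (nbytes - 1) * G + L + o

print(logp_send_time(L=6, o=2, g=4))         # one short message
print(logp_send_time(L=6, o=2, g=4, k=3))    # three back-to-back messages
print(loggp_long_message_time(L=6, o=2, G=0.5, nbytes=101))
```

These closed forms ignore contention and computation, which is consistent with the observation above that LogP does not include an application model.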

D. COLLECTIVE COMPUTING MODEL
The collective computing model (CCM) [44] is based on the BSP model and is composed of processors, memory, and two types of supersteps: normal and division. The normal superstep is characterized by computation, followed by the execution of a collective communication function (f ). The division superstep considers that the machine can be divided into submachines. Based on this assumption, several steps are performed: P processors are divided into r groups and the input data are distributed in tasks, each one is executed, followed by a phase of re-joinment. Finally, the distribution of the results is performed.
The parameters of the CCM are P: number of processors; F: group of collective functions f; $T_F$: cost functions for each $f \in F$; $\mathcal{P}$: group of partition functions p; and $T_{\mathcal{P}}$: cost functions for each $p \in \mathcal{P}$.

E. ROOFLINE MODEL
The Roofline [45] is a throughput-oriented performance model for auto-tuning the performance of multicore computers. It provides information about data movement and computation to understand the limitations of the code and combines bandwidth, locality, and different parallelization paradigms. Fig. 4 shows the output of the model, which includes the computational intensity, peak computation (PC), peak memory bandwidth (PMB), and architectural and algorithmic features. The main parameter of the Roofline model is the arithmetic intensity (or computational/operational intensity, CI, in Flops per byte), which corresponds to the x-axis and is defined as the ratio of the number of (floating-point) operations to the total data movement (bytes). The attainable performance (AP), in GFlops per second, is defined by Eq. 2 and corresponds to the y-axis.
$$AP = \min(PC,\; PMB \times CI) \qquad (2)$$
Some contributions in the literature, such as [46], [47], extend the Roofline to cache hierarchy (hierarchical Roofline) by considering L1, L2, device memory, and system memory bandwidths.
Fig. 4 [48]: The x-axis represents the operational or computational intensity (CI) and the y-axis represents the attainable performance (AP) or throughput. The computational roof and the I/O bandwidth roof limit the achievable AP. On the right (yellow area), the algorithms are compute-bound, while on the left (orange area), they are memory-bound.
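The Roofline bound can be evaluated in a few lines. The sketch below assumes the usual form AP = min(PC, PMB × CI); the function names and the numeric values (100 GFlops/s peak compute, 25 GB/s peak bandwidth) are illustrative assumptions, not measurements.

```python
def attainable_performance(ci, peak_compute, peak_bandwidth):
    """Roofline bound: limited by compute or by bandwidth times intensity."""
    return min(peak_compute, peak_bandwidth * ci)

def ridge_point(peak_compute, peak_bandwidth):
    """Intensity at which the bandwidth roof meets the compute roof."""
    return peak_compute / peak_bandwidth

def classify(ci, peak_compute, peak_bandwidth):
    """Label a kernel as memory-bound or compute-bound from its intensity."""
    if ci < ridge_point(peak_compute, peak_bandwidth):
        return "memory-bound"
    return "compute-bound"

PC, PMB = 100.0, 25.0  # hypothetical device: GFlops/s and GB/s
for ci in (2.0, 8.0):
    print(ci, attainable_performance(ci, PC, PMB), classify(ci, PC, PMB))
```

A kernel left of the ridge point benefits from reducing data movement, whereas a kernel to the right only benefits from more compute resources.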
In recent years, this model has been used for performance analysis of different computer architectures and application domains. A toolkit for modelling based on Roofline is presented in [49] for multicore, manycore, and accelerated architectures. Roofline has been applied in the context of deep learning using GPU [50]. The model includes time and complexity to add new features pertinent to applications. The authors in [47] propose a practical methodology for GPU that allows a hierarchical Roofline performance analysis.

F. CLASSIFICATION OF PARALLEL COMPUTING MODELS
Zhang et al. [51] classify parallel computing models into three groups based on their evolution over the years and on the memory model of their target parallel computers. The first group includes the shared memory parallel computing model (PRAM), which has four approaches: asynchronous, memory contentions, latency-bandwidth, and hierarchical parallelism. The second group includes distributed memory parallel computing models (BSP, CCM, and LogP and its variants). The third group includes hierarchical memory parallel computing models (P-HMM [52], UHM [53], LogP-HMM [54], HPM [55], among others). The authors highlight the simplicity, portability, and structured programming style of the BSP model, concluding that BSP offers a better level of abstraction than LogP for designing and programming parallel algorithms.
The third group is based on the speed gap between the processor and the memory system. To reflect the memory access costs, the models incorporate a local memory hierarchy. Models within this last category are subdivided into uniform hierarchical models, LogP extended models, DRAM (h, k) model, and HPM model. Some models cannot be strictly classified into these three groups. This is the case with the traditional Roofline model [45], which quantifies the traffic between memory and cache rather than between processors and cache. The processor performance depends on the off-chip memory traffic. In contrast, DRAM-only Roofline is extended and improved in the recent hierarchical Roofline [46], [47] supporting different cache levels.
A technical literature survey is presented in [56] for performance modelling and prediction of parallel and distributed computing systems. It analyzes different techniques: mathematical modelling, measurements, and simulations. A recent study by Riahi et al. [57] compares analytical and machine learning models for predicting CPU/GPU data transfer time. Table 1 presents a comparison of the main features of the models described in this section. The table includes the type of communication supported by the model (shared, distributed, or hierarchical), the different costs considered by the model (synchronization, asynchronous communication, computation, or memory), and the parameters used in each model.

III. FPGA-BASED RECONFIGURABLE HARDWARE ACCELERATORS
FPGA architectures contain a large number of reconfigurable circuits, which makes them feasible for accelerating applications that require high parallelism, high performance, and low power consumption.
FPGAs have been commonly used with ''soft'' processors, which are designed using programmable logic resources instead of being built into the silicon. As reconfigurable devices have been adopted in increasingly sophisticated applications, the need for FPGA-based systems that include processors has arisen.
Integrating a processor and an FPGA into a single chip allows the exploitation of the different but complementary computational resources of both devices. A performance boost can be achieved by offloading critical functions to the FPGA while keeping data transfers between the devices fast and coherent.
The SoC based on FPGA architecture combines a processing system with programmable logic (FPGA). The architecture also includes specific interfaces that provide high bandwidth and low latency in the connections between the two parts of the device. The processing system has a fixed architecture formed of a ''hard'' processor and RAM, while the FPGA is completely flexible for hardware design.
Within this context, a processing element (PE) can perform an entire computation containing all the elements required for its replication, which improves the performance of the entire system through coarse-grain parallelism. As an example of this architecture, Fig. 5 depicts the different components of the Zynq-7000 SoC and Zynq UltraScale+ multiprocessor system on chip (MPSoC) architectures from AMD-Xilinx. We refer to Xilinx because it is one of the main providers of this technology. Zynq-7000 SoC combines a dual-core processor with an FPGA. Zynq UltraScale+ MPSoC devices include quad-core and dual-core real-time processors, a GPU, and an FPGA.

A. HARDWARE/SOFTWARE CO-DESIGN
Hardware/software co-design aims to exploit the inherent features of different technologies, deciding which part of the algorithm should be implemented with sequential instructions (in the processor) and which part in the hardware (such as ASIC or FPGA). Usually, a profiling of the algorithm helps to determine which part is suitable to accelerate. Typically, the most expensive section of the code, in terms of runtime, is a good candidate for hardware acceleration.
The communication overhead between both technologies (that is, between the processor and the FPGA) should be minimized. Energy efficiency can also be achieved through this technique. Recent contributions in the literature, such as [58], [59], [60], [61], [62], show the benefits of the hardware/software co-design strategy.

B. DESIGN SPACE EXPLORATION AND METRICS
HLS tools are used to create RTL components from a high level of abstraction using directives to optimize a hardware design described in a high-level language. Each hardware design obtained is unique, depending on the strategies and optimizations used to describe it. DSE involves the evaluation of multiple implementations with different combinations of directives, also known as knobs or optimizations. In this context, DSE plays a key role in obtaining a hardware design with a good compromise between different metrics.
In the last few years, most DSE techniques have applied multi-objective optimization algorithms (MOOA), which are dedicated to optimizing objective functions in the presence of conflicting metrics. In this scenario, trade-off solutions contribute to forming an objective space plotted with the objective values, which builds a Pareto-optimal frontier (PF) and a set of configurations (trade-off solutions) called Pareto-optimal designs.
Let us denote D as the design space composed of q design points, thus $|D| = q$. PF can be defined as a set of hardware designs $PF = \{d_1, d_2, \ldots, d_k\}$, where the sub-index k defines the number of elements in PF. Each $d_i$ with $1 \le i \le q$ represents a hardware design with unique features such as latency, resource utilization, and clock frequency. With area (A) and latency (L) as the objective functions, any hardware design $d_i$ is considered a Pareto-optimal design, and in consequence $d_i \in PF$, if there is no other design $d_n$ with $1 \le n \le q$ in the search space such that it simultaneously has less area (A) and less latency (L) than $d_i$ [14], as shown in Eq. 3.
$$d_i \in PF \iff \not\exists\, d_n \in D : A(d_n) < A(d_i) \,\wedge\, L(d_n) < L(d_i) \qquad (3)$$
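The Pareto filter described above can be sketched as follows. Each design is assumed to be an (area, latency) tuple with both objectives minimized, and the helper names `dominates` and `pareto_front` are hypothetical. Note that the sketch uses the customary dominance test (no worse in both objectives and strictly better in at least one), which is slightly stricter than requiring a strict improvement in both.

```python
def dominates(a, b):
    """True if design a dominates design b; both are (area, latency) tuples."""
    no_worse = a[0] <= b[0] and a[1] <= b[1]
    strictly_better = a[0] < b[0] or a[1] < b[1]
    return no_worse and strictly_better

def pareto_front(designs):
    """Keep every design that no other design dominates (O(q^2) scan)."""
    return [d for d in designs
            if not any(dominates(other, d) for other in designs)]

# Illustrative (area, latency) pairs, not taken from any real synthesis run.
designs = [(100, 50), (80, 70), (120, 40), (110, 60), (80, 80)]
print(pareto_front(designs))
```

The quadratic scan is adequate for the small design sets used in illustrations; large design spaces usually call for sorting-based or incremental filters.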
A survey on MOOA for HLS, presented by Fernandez de Bulnes et al. [63], remarks on the expansion of these techniques for the FPGA DSE process. The authors conclude that the most common objective functions are: latency (clock cycles), area (LUT, BRAM, DSP, and FF), power (static and dynamic), wire length, digital noise, reliability, temperature, and security. They claim that all metrics should be minimized, except reliability and security. The authors remark on six main multi-objective methods applied for HLS DSE: evolutionary algorithms, single-solution-based heuristics, problem-specific heuristics, branch-and-X, learning-based methods, and swarm intelligence systems. Some examples are the studies presented in [64], [65], [66], [67], [68], [69], [70], [71], [72], and [73].
An overview of the general DSE process using HLS tools in the loop, based on [14], is shown in Fig. 6. An application, described mainly in C/C++, SystemC, or OpenCL, is the input of this type of system. A low-level virtual machine intermediate representation (LLVM IR) [74] is obtained from the input code through the Clang front-end compiler [75], generating a control data flow graph (CDFG). Each node of the graph represents an operation, and the edges capture the control and data dependencies between operations. The DSE phase generates a unique batch of directives to minimize a specific cost function. The HLS tool then uses the generated optimizations, application, and technology library to generate the final optimized RTL.
Among the main objective functions associated with FPGA/SoC, we can identify the performance, area, and power. The performance includes the latency (L) and throughput (T). This is directly related to the maximum frequency ($f_{max}$) of the synthesized design, given by $T = f_{max}/L$. The area includes hardware resources: reconfigurable hardware (LUTs, CLBs, and slices) and static hardware (DSPs and BRAMs). The power is the total power consumed (static and dynamic).
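A minimal sketch of the exploration loop is given below, assuming an exhaustive search over a small set of directives. Everything in it is hypothetical: the knob names and ranges, and in particular `toy_estimate`, which is a made-up stand-in for the latency and area figures that an HLS report or an analytical model would supply.

```python
from itertools import product

# Hypothetical knobs; real ranges depend on the source code and the HLS tool.
KNOBS = {
    "unroll": [1, 2, 4, 8],
    "pipeline": [False, True],
    "partition": [1, 2, 4],
}

def enumerate_designs(knobs):
    """Yield every combination of knob values as a configuration dict."""
    names = list(knobs)
    for values in product(*(knobs[name] for name in names)):
        yield dict(zip(names, values))

def toy_estimate(cfg, trip_count=1024, cycles_per_iter=4):
    """Made-up cost model standing in for an HLS report (assumption)."""
    parallel = min(cfg["unroll"], cfg["partition"])  # memory limits parallelism
    per_iter = 1 if cfg["pipeline"] else cycles_per_iter
    latency = (trip_count // parallel) * per_iter
    area = 100 * cfg["unroll"] + 50 * cfg["partition"] + (30 if cfg["pipeline"] else 0)
    return latency, area

def explore(knobs, area_budget):
    """Return (latency, area, cfg) with the lowest latency within the budget."""
    feasible = []
    for cfg in enumerate_designs(knobs):
        latency, area = toy_estimate(cfg)
        if area <= area_budget:
            feasible.append((latency, area, cfg))
    if not feasible:
        return None
    return min(feasible, key=lambda t: (t[0], t[1]))

print(len(list(enumerate_designs(KNOBS))))  # size of the design space
print(explore(KNOBS, area_budget=400))
```

Even with three knobs the space already holds 24 points, and each additional directive multiplies that count, which is why practical DSE methods avoid running a full synthesis for every configuration.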
Other metrics could be added, such as scalability measured as the number of PE inside the FPGA, bytes per operation, processing system features, and off-chip and on-chip memory bandwidths.

C. TECHNIQUES TO IMPROVE LATENCY, AREA, AND POWER
Different techniques can be used to improve the performance of algorithms running on FPGAs through HLS tools [76]. One of the most common approaches is to use a set of directives (or knobs) provided by HLS tools to improve throughput, latency, and resource utilization. To this end, pragmas (compiler directives) are inserted into the source code [6], [8]. Some of the most used optimization techniques are:
• Pipelining: in the presence of sequential operations executed multiple times, this technique allows the insertion of registers at the output of each stage, so that each operation can run in parallel on different input data, increasing the overall throughput at the expense of area. Pipelining can be applied at the instruction and function levels.
• Loop unrolling: let us denote f as the unroll factor. In a rolled loop, one iteration is executed every n clock cycles. When the loop is unrolled by a factor of f, f iterations can be executed within n clock cycles, so the latency is reduced to n/f clock cycles per iteration (in the absence of data dependencies). This technique can improve both latency and throughput, but it is expensive in terms of resource utilization, which grows proportionally to f.
• Memory optimizations:
  - Array partition: let us denote $p_f$ as the partitioning factor. Array partition splits an array into $p_f$ sections, each mapped to a dedicated memory element, allowing multiple simultaneous accesses to the array at the cost of a higher utilization of memory elements.
  - Array reshape: this technique creates smaller arrays from the original array by concatenating elements into wider words, thus reducing the number of BRAMs consumed and allowing parallel access to the data.
Nevertheless, memory performance can be degraded by array partitioning because improper partitioning generates a large number of multiplexers, which introduces additional delays [77].
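The effect of these directives on loop latency can be approximated with simple closed forms. The sketch below is a first-order model under stated assumptions: no loop-carried dependencies, the usual pipelined-loop estimate depth + (N - 1)·II, and two ports per memory bank (a common dual-port BRAM arrangement). The function names are hypothetical and the numbers are not tool outputs.

```python
import math

def unrolled_latency(trip_count, n_cycles, f):
    """Cycles for a loop whose iterations take n_cycles, unrolled by factor f."""
    return math.ceil(trip_count / f) * n_cycles

def pipelined_latency(trip_count, depth, ii=1):
    """First result after `depth` cycles, then one new result every `ii` cycles."""
    return depth + (trip_count - 1) * ii

def memory_limited_unroll(f, p_f, ports_per_bank=2):
    """Effective unroll factor when each iteration needs one array access.

    Assumes p_f banks after partitioning and ports_per_bank accesses per
    cycle per bank; unrolling beyond that only adds area.
    """
    return min(f, p_f * ports_per_bank)

N, n = 64, 4
print(unrolled_latency(N, n, f=1))    # rolled loop
print(unrolled_latency(N, n, f=4))    # unrolled by 4
print(pipelined_latency(N, depth=n))  # pipelined with II = 1
print(memory_limited_unroll(f=8, p_f=2))
```

The last call illustrates why unrolling and array partitioning are usually tuned together: an unroll factor of 8 brings no extra parallelism if the array only offers four accesses per cycle.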
Code restructuring techniques [78], [79], [80], [81], [82] are also used to improve the hardware design of the algorithms. Ferreira et al. [83] introduce an approach for automatic code restructuring targeting HLS tools. A detailed survey is presented in [82], where the sets of optimizing transformations techniques are classified into: pipelining, scaling, and memory-enhancing transformations.
Quantization techniques aim to reduce memory footprint by selecting the number of bits to represent the data structures and operations to improve objective functions such as latency, resource utilization, and throughput. Moreover, by reducing the computational intensity, the power consumption also decreases [84], [85], [86], [87].
The dynamic power consumption $P_d$ depends on the design and can be improved by considering each element present in Eq. 4 [88].
$$P_d = V^2 f \sum_{i} C_i R_i S_i \qquad (4)$$
As can be noticed, $P_d$ is directly proportional to the clock frequency f, increases with the square of the supply voltage V, and is also affected by the effective capacitance $C_i$, resource utilization $R_i$, and switching activity $S_i$ of each resource i. Hence, reducing any of these factors lowers the dynamic power. A survey on this topic is presented in [89], considering ultra-low-power techniques for FPGA-based IoT systems. Contributions devoted to improving power consumption on FPGA are presented in [90] and [91].
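As a numerical illustration, the sketch below assumes the commonly used form $P_d = V^2 f \sum_i C_i R_i S_i$, which matches the dependencies described above. The function name and all resource values are hypothetical and chosen only to show the scaling with voltage and frequency.

```python
def dynamic_power(voltage, freq_hz, resources):
    """Dynamic power in watts, assuming P_d = V^2 * f * sum(C_i * R_i * S_i).

    resources: iterable of (C_i, R_i, S_i) tuples, i.e. effective
    capacitance in farads, number of used resources, and switching activity.
    """
    return voltage ** 2 * freq_hz * sum(c * r * s for c, r, s in resources)

# Hypothetical resource classes (e.g. LUTs and DSPs); values are illustrative.
resources = [(1e-12, 1000, 0.1), (2e-12, 500, 0.2)]

base = dynamic_power(1.0, 100e6, resources)
scaled = dynamic_power(0.9, 100e6, resources)     # lower supply voltage
half_clock = dynamic_power(1.0, 50e6, resources)  # halved clock frequency
print(base, scaled, half_clock)
```

Lowering the supply voltage by 10% reduces the estimate by 19%, whereas halving the clock only halves it, which reflects the quadratic dependence on V.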

IV. MODELS, METHODOLOGIES, AND FRAMEWORKS FOR FPGA/SoC
We present the models, methodologies, and frameworks that have been proposed to estimate the performance metrics associated with FPGA/SoC to reduce design times and improve productivity. Some of these models, methodologies, and frameworks propose an exploration of the design space to grant a hardware design with good compromises between different metrics. Other ones include power consumption estimation because low power is one of the main highlights of FPGA-based hardware accelerators.
In this section, we classify models, methodologies, and frameworks into the following categories according to their main features: metrics estimation, FPGA-based DSE, and power consumption estimation.

A. METRICS ESTIMATION
1) METHODOLOGIES
Among the methodologies, we can find the works presented in [92] and [93]. HLScope [92] is a performance debugging methodology that helps to identify potential bottlenecks and their causes. HLScope has two flows: in-FPGA (accurate analysis) and software simulation (rapid analysis). For each hardware design described by the designer, the tool provides execution times and analyzes various stall causes: external DRAM access, synchronization, and dependency. HLScope+ [93] extends HLScope to overcome its main drawbacks. HLScope+ includes a fast and accurate HLS-based cycle estimation and an improved memory access model that considers several PEs in the FPGA connected to an external memory through a DRAM controller, avoiding cache modelling.
Kapre et al. [94] present a communication discipline inspired by synchronous dataflow [95] and BSP computational models for OpenCL pipes in FPGA devices, considering that one of the strategies to exploit FPGA wiring is through pipes, by reducing the communication latency between kernels.

2) MODELS
In the early stage of the design, models have been applied to FPGA/SoC to mainly estimate latency and area.
Hora et al. [96] propose the pipelining circuit RAM (PCRAM), a computational model that considers only synchronous circuits. Several algorithms are described, and the model is used to obtain time complexities, leaving the comparison with experimental results for future work. In this model, the computer comprises a word-RAM of word size w with a circuit composed of an execution module, gates, and inputs/outputs.
A cost model for FPGA partial reconfiguration, proposed by Papadimitriou et al. [97], considers all physical elements involved in the reconfiguration process, where each phase contributes to the total reconfiguration time. The authors also explore the parameters that affect the reconfiguration performance.
FlexCL, introduced by Wang et al. [98], is an analytical performance model that takes an OpenCL kernel as input and returns the estimated performance on the FPGA. A high-level scheme of this model is presented in Fig. 7. The input source code is transformed into an LLVM IR trace through Clang. Information such as the code structure and operation latency is extracted using a kernel analyzer and sent to three models: a computation model, a communication model, and a global memory model. As a result of the integration of these three models, the execution time for a given kernel is estimated. FlexCL contributes to identifying performance bottlenecks on FPGA, where PEs, computation units, and kernels have their own models. FlexCL considers eight global memory access patterns and can also be used to explore the design space to identify solutions under given user constraints.
Currently, Roofline is used to identify the peak attainable performance and potential bottlenecks in FPGA, owing to its intuitiveness and simplicity while providing insights about the arithmetic computation and attainable performance. An extended version of the Roofline multicore model for hardware accelerators is presented by Silva et al. [48], maintaining the core of the original proposal, but adding the resource utilization and parameters obtained through HLS tools. The unit for the performance operation is byte operations (Bops), considering that fixed-point operations are more suitable for this technology than floating-point operations. The authors also include the scalability parameter to determine the PE replication factor, considering the available resources and resource utilization per PE. Starting from this initial proposal, contributions in the literature [99], [100] extend this model to FPGA devices. Calore et al. [99] present an FPGA empirical Roofline (FER) to estimate the throughput and memory bandwidth of FPGAs for high-performance computing (HPC) applications based on HLS tools. Nguyen et al. [100] extend the empirical Roofline toolkit (ERT) to FPGAs, presenting a benchmark for the energy efficiency.

FIGURE 7. High-level overview of FlexCL, based on [98]. The input is the OpenCL kernel code, which is transformed to LLVM IR through Clang. Information from the source code is extracted by a kernel analyzer, which is sent to a computation model, a communication model, and a global memory model. The results of each model are integrated in one model to estimate the final kernel execution time.
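To make the Roofline discussion concrete, the following minimal sketch computes the classic Roofline bound (the minimum of the compute roof and the memory roof) together with a PE replication factor derived from the available resources and the utilization per PE. It is an illustration under stated assumptions, not the formulation of [48]: the function names, the resource figures, and the choice of integer division over each resource type are all hypothetical.

```python
def attainable_performance(peak_ops_per_s, bandwidth_bytes_per_s, intensity_ops_per_byte):
    """Classic Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * intensity_ops_per_byte)


def pe_replication_factor(available, used_per_pe):
    """Largest number of PE copies that fits in every resource type (assumed rule)."""
    return min(available[r] // used_per_pe[r] for r in used_per_pe if used_per_pe[r] > 0)


# Hypothetical device capacity and per-PE utilization taken from an HLS report.
available = {"LUT": 50000, "DSP": 200, "BRAM": 140}
used_per_pe = {"LUT": 4000, "DSP": 24, "BRAM": 10}

copies = pe_replication_factor(available, used_per_pe)  # limited by DSP here
peak = copies * 2 * 100e6  # assumed: 2 operations per cycle per PE at 100 MHz
memory_bound = attainable_performance(peak, 4e9, 0.25)
compute_bound = attainable_performance(peak, 4e9, 1.0)
```

In this example the DSP count limits the replication factor, and the kernel moves from the memory-bound to the compute-bound region as the operational intensity grows.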

3) FRAMEWORKS
Pyramid, developed by Makrani et al. [101], is a machine learning based framework to estimate timing and resource utilization, and to overcome the differences between the post-implementation results and intellectual property (IP) cores created with HLS. It is developed by employing ensemble machine learning techniques, such as linear regression, artificial neural networks, support vector machines, and random forests. As part of the framework, Minerva [102], which is an automated hardware optimization tool based on a heuristic model, is used to obtain a good throughput and throughput-to-area ratio for the RTL code generated by HLS.
Wang et al. [103] present a framework based on a performance analysis model combined with code tuning techniques for OpenCL applications only on FPGAs, assuming that an incremental development model is adopted by designers [104]. The model includes four FPGA-centric metrics to detect possible bottlenecks related to memory, parallelism, and computation.

4) SUMMARY
For metric estimation, a few contributions have considered the use of traditional parallel computing models such as BSP and PRAM [94], [96] on FPGA. Nevertheless, the Roofline model has been widely adopted for estimating performance and bottlenecks on FPGA devices due to its intuitiveness and simplicity [48], [99], [100].
Furthermore, the differences between the metric estimation reported by HLS tools and the post-implementation results are a key point to consider when designing the estimators of performance metrics [101].

B. FPGA-BASED DESIGN SPACE EXPLORATION
Design space explorers aim to minimize the execution time of HLS tools, which is highly dependent on the size of the space to be analyzed. Different methodologies, models, and frameworks have been proposed based on the analysis of HLS directives, where the exploration of the design space [105], [106] is important because the space grows exponentially with the number of directives. The challenge is to find the set of hardware designs that offer the best trade-offs between metrics, also known as Pareto-optimal designs. Considering that there is a limited number of resources (LUT, BRAM, DSP, and FF) available in the reconfigurable architecture, the hardware design cannot request more resources than those available in the FPGA.
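The notion of Pareto-optimal designs under resource constraints can be sketched as follows. This is a minimal illustration, assuming two objectives to minimize (latency and LUT count) and hypothetical design points and device capacities; it is not taken from any of the surveyed tools.

```python
def pareto_front(designs, capacity, objectives=("latency", "lut")):
    """Return the feasible designs that no other feasible design dominates."""
    # A design is feasible only if it fits in every constrained resource.
    feasible = [d for d in designs if all(d[r] <= capacity[r] for r in capacity)]

    def dominates(a, b):
        # a dominates b if it is no worse everywhere and strictly better somewhere.
        return (all(a[k] <= b[k] for k in objectives)
                and any(a[k] < b[k] for k in objectives))

    return [d for d in feasible if not any(dominates(o, d) for o in feasible)]


# Hypothetical design points, e.g. obtained with different unroll factors.
designs = [
    {"name": "baseline", "latency": 1000, "lut": 2000, "dsp": 4},
    {"name": "unroll4", "latency": 300, "lut": 6000, "dsp": 16},
    {"name": "unroll4_bad", "latency": 350, "lut": 7000, "dsp": 16},
    {"name": "unroll16", "latency": 90, "lut": 30000, "dsp": 64},
    {"name": "unroll64", "latency": 30, "lut": 120000, "dsp": 256},
]
capacity = {"lut": 53200, "dsp": 220}  # assumed device limits

front_names = [d["name"] for d in pareto_front(designs, capacity)]
```

Here `unroll64` is discarded because it exceeds the assumed capacity, and `unroll4_bad` is discarded because `unroll4` is better in both objectives.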
Surveys related to this topic are presented in [63] and [14]. In particular, the latter proposes a classification of HLS DSE techniques, as depicted in Fig. 8, into two main groups: synthesis-based and model-based. In this classification, a third category is composed of a combination of supervised learning and synthesis-based DSE techniques.
According to Sohrabizadeh et al. [15], HLS DSE can be developed using model-based and model-free techniques. Model-based techniques comprise tools and methodologies that use analytical models. They estimate the resources and performance of each point in the design space. Model-free techniques include approaches in which the HLS tool is treated as a black box, such as Bayesian optimization and reinforcement learning techniques [112], [113], [114], [115].

1) METHODOLOGIES
Nabi et al. [117] propose the TyTra flow, which integrates performance and cost models based on Roofline analysis to obtain an optimized FPGA solution for scientific HPC applications. The methodology adopts the models defined in the OpenCL standard: platform and memory hierarchy, kernel execution, memory execution, and data pattern. The Roofline model is the base for the design space explorer and is used to assist the selection of the best instance to be downloaded into the hardware. Additionally, the authors propose an intermediate representation language (TyTra-IR). For the calculation of resource utilization to obtain scalability of the system, the authors consider a maximum utilization of the FPGA of 80%, as suggested by [119].
Siracusa et al. [118] propose a DSE methodology, presented in Fig. 9. The system input is the C/C++ source code, which is translated to an LLVM IR trace, obtaining the baseline of performance estimation and resource utilization through the synthesis process. From this base implementation, the Roofline model chart (RooflineOrig) determines memory bottlenecks. Afterward, an automated DSE estimates resources and performance, generating the optimal design points. The Roofline for the best feasible design is plotted along with the RooflineOrig chart, to compare the current design's performance and the performance of the solution derived by the DSE. The explorer includes resource sharing and HLS-specific IR optimizations during sample estimations. This work is extended in [116], with the hierarchical version of Roofline, estimating peak performance analytically and integrating a guide to reaching memory-transfer and data-locality optimizations.
Ferretti et al. [120] propose a method for inferring knowledge from past design explorations, as shown in Fig. 10. The authors introduce signature encoding for code and directives, composed of specification encoding (SE), configuration space descriptor (CSD), and the longest common subsequence (LCS) similarity metric. The methodology uses signature encoding to create a string with design and configuration spaces (directives and their modes), combining CSD and SE. The LCS metric is then used to measure the similarity between the current DSE and the previous ones stored in a database.
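The role of the LCS metric can be illustrated with a short sketch. The dynamic-programming routine below is the standard LCS length algorithm; the normalization by the longer string and the signature strings themselves are assumptions made for illustration and do not reproduce the encoding of [120].

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """LCS length normalized to [0, 1] by the longer signature (assumed normalization)."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


# Hypothetical signatures stored from previous explorations.
database = ["LLAP", "LPU8", "AAPP"]
new_signature = "LPU4"
best_match = max(database, key=lambda s: similarity(new_signature, s))
```

The most similar stored signature would then be used as the source of knowledge for the new exploration.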
COSMOS, an automatic and scalable methodology for DSE, is introduced by Piccolboni et al. [121] for complex accelerators. It generates a set of Pareto-optimal designs and reduces the number of HLS invocations. It comprises two main phases: component characterization and DSE (based on two steps: synthesis planning and mapping). The computing model used for DSE is based on timed marked graphs.
COSMOS includes memory as part of the DSE process and applies synthesis constraints to reduce the variability of the HLS tools.
The adaptive threshold non-Pareto elimination strategy (ATNE) [122] accounts for the inaccuracy of the estimations when exploring the design space on FPGA for implementations based on OpenCL. The ATNE algorithm is based on a random forest for regression. The prediction quality is obtained using two metrics: average distance from reference set (ADRS) and hypervolume error (HVE). The results are shown for matrix multiplication, Sobel filter, finite impulse response filter (FIR), histogram, and discrete cosine transform.
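As an illustration of the ADRS metric, the sketch below follows a formulation commonly used in the HLS DSE literature: for each point of the reference Pareto front, take the smallest normalized worsening offered by any point of the approximate front, and average over the reference points. The exact variant used in [122] may differ, and the fronts shown are hypothetical (latency, area) pairs.

```python
def adrs(reference, approximate):
    """Average distance from reference set; 0.0 means the fronts coincide."""
    def distance(ref, cand):
        # Largest relative worsening of the candidate over the reference point,
        # clamped at zero when the candidate is at least as good everywhere.
        return max(0.0, *((c - r) / r for r, c in zip(ref, cand)))

    total = sum(min(distance(g, w) for w in approximate) for g in reference)
    return total / len(reference)


# Hypothetical fronts as (latency, area) pairs, both objectives minimized.
reference_front = [(100, 10), (50, 20), (20, 40)]
approximate_front = [(100, 10), (55, 20), (20, 50)]

score = adrs(reference_front, approximate_front)
```

A lower value indicates that the approximate front found by the explorer is closer to the reference front.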
Xu et al. [123] propose a methodology for performing DSE using MPSoC devices. This work presents three methods to automatically carry out the exploration: two based on simulation (cycle-accurate and fast cycle-accurate) and one based on hardware acceleration. For this purpose, the authors consider several IP cores in an FPGA. The proposed methodology is called fast explorer for behavioral systems (FEBS), and it accepts the number N of IP cores and their testbenches as input. The output is a set of dominant systems with area vs performance trade-off. In this methodology, design space exploration is performed for each IP core. The general overview for this design space explorer is shown in Fig. 11.

2) MODELS
Lo et al. [113] propose a sequential model-based optimization, using a transfer-learning mechanism, to select directive configurations in HLS, minimizing the number of tool evaluations/executions while obtaining solutions with LUTs-latency optimal trade-offs.
Kwon et al. [124] propose the mixed-sharing multidomain model for reusing the knowledge obtained from previous HLS DSE while exploring a new target design space, showing its effectiveness when approximating quality of results (QoR) without running HLS tools.
Dai et al. [125] present a fast and accurate QoR estimation based on HLS. For this purpose, they use final HLS reports from a set of synthesized applications to identify relevant features and metrics, and construct the dataset to be used for training machine learning models (linear regression, artificial neural networks, and gradient tree boosting). To create the dataset, the authors employ the information obtained from HLS reports for different directives and targeting different FPGA platforms. In addition, C-to-bitstream flow for different clock periods is performed to obtain features such as post-implementation resources and the worst negative slack. Finally, the authors obtain 234 features, which were reduced to 87 after an elimination process to remove irrelevant features.

3) FRAMEWORKS
Mehrabi et al. propose the Prospector framework [114], which uses Bayesian techniques to obtain the best configurations with fewer resources and reduced latency near Pareto-efficient designs. The HLS tool is considered as a black box (or function), which has to be modelled and optimized. Prospector is shown in Fig. 12, where the inputs are the source code, clock frequency, and directives, and the outputs are the synthesized designs. The Bayesian optimization unit (BOU) is used to explore the design space and control the selection of directives. The HLS tool is used to generate RTL from the high-level source code. At the end of the process, the framework can obtain different designs with a latency-area trade-off, which belong to the Pareto frontier.

FIGURE 9. A DSE methodology presented in [116], [118]. The input source code is translated to LLVM IR trace, obtaining the baseline for performance estimation and resource utilization. Subsequently, the Roofline model chart estimates memory bottlenecks. An automated DSE phase allows resource and performance estimations, and the best feasible design is plotted along with the original Roofline chart.

FIGURE 10. A DSE methodology presented in [120] that uses past design explorations to infer knowledge. The signature encoding is used to create a string with the design and configuration spaces. The new signature is compared with the ones obtained from previous DSE (DSE database). After the similarity evaluation, the signature selected is used as input for the inference stage, to finally obtain the optimal configuration.

FIGURE 11. MPSoC DSE, based on [123]. Different IP cores coexist in the MPSoC: some developed with HLS tools (IP1 and IP2) and others using RTL description. A design space is generated with the HLS tools. The system level exploration receives as input the number of IP cores described in ANSI-C or SystemC and their testbenches. The output is a Pareto-design with throughput-area trade-off. The system level exploration is composed of three methods: two based on simulation and one based on hardware acceleration.
Lin-Analyzer [130] is a tool that allows accurate and fast FPGA performance estimation and DSE, considering fine-grained parallelism. With this framework, runtime scales linearly while increasing the design space complexity; however, only a few optimizations are considered, mainly loop unrolling, loop pipelining, and array partitioning. Regarding resource utilization, the authors assume that DSP and BRAM are the bottlenecks in accelerator designs. The communication cost between the FPGA and global memory is not considered. The framework is divided into three main stages: instrumentation, optimization of dynamic data dependence graph (DDDG) generation, and DDDG scheduling. In the last stage, latency is used as a performance metric under resource constraints. Lina is proposed in [131] as an extension of Lin-Analyzer, and it includes non-perfect loop nests and timing analyses.
MPSeeker is proposed by Zhong et al. [132] to estimate the performance and resource utilization from a given code (C/C++), considering fine- and coarse-grained parallelism, allowing fast DSE. Because MPSeeker contemplates multi-parallelism using the loop tiling technique, a gradient boosted machine is proposed to obtain an accurate resource model for FF and LUT, while Lin-Analyzer is used for BRAM and DSP estimation. The authors also extend the features of Lin-Analyzer by including the data communication cost. The performance cost in MPSeeker is modelled as the sum of the kernel computation and data communication costs.
Choi et al. [78] present a DSE and clock cycle estimator using HLS, including code transformations in the presence of variable loop bounds. They propose a resource prediction method based on HLS reports through shareable and non-shareable operators from a loop. Using linear interpolation, non-shareable resources are obtained, whereas the resources estimated for shareable operators are computed as the maximum of all loops. An analytical model is proposed for clock cycle prediction. In this framework, the design with the best performance is the output.
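The resource prediction idea described above can be sketched as follows. This is a simplified reading under explicit assumptions, not the method of [78]: it assumes that non-shareable resources are interpolated linearly against the unroll factor from two synthesized sample points per loop, that non-shareable estimates are summed across loops, and that the shareable part is the maximum over all loops. All names and numbers are hypothetical.

```python
def interpolate(x, x0, y0, x1, y1):
    """Linear interpolation between two measured points (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def estimate_resources(loops, unroll):
    """Sum of interpolated non-shareable resources plus the largest shareable share."""
    non_shareable = sum(
        interpolate(unroll[name], *loop["samples"]) for name, loop in loops.items()
    )
    # Shareable operators can be reused across loops, so only the largest demand counts.
    shareable = max(loop["shareable"] for loop in loops.values())
    return non_shareable + shareable


# Hypothetical HLS report data: (unroll0, luts0, unroll1, luts1) per loop.
loops = {
    "L1": {"samples": (1, 100, 8, 800), "shareable": 50},
    "L2": {"samples": (1, 40, 4, 160), "shareable": 120},
}
unroll = {"L1": 4, "L2": 2}

estimated_luts = estimate_resources(loops, unroll)
```

With these sample points, the estimate combines 400 and 80 non-shareable LUTs with a shareable share of 120 LUTs.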
COMBA [77], [133] is a framework that focuses on selecting the optimal configuration of directives in HLS, taking into account the use and availability of hardware resources, and provides an estimation of performance and resource utilization. The authors propose the metric-guided DSE II (MGDSE-II) algorithm to prune and explore the design space based on three metrics: the number of DSP, BRAM, and LUT. An overview of COMBA, which is composed of a recursive data collector, analytical models (latency and resources), and DSE, is presented in Fig. 13. In COMBA, the input is the C/C++ source code, which is transformed into an LLVM IR trace through Clang. The IR trace is the input for the recursive data collector, which extracts static and dynamic information that will be used for the analytical models. MGDSE-II then evaluates the configuration and establishes the next set of directives to be applied to the input code. This iteration is repeated until a high-performance configuration is obtained.
Ferretti et al. [134] present a framework for HLS DSE using a cluster-based heuristic integrally developed in MATLAB. The algorithm identifies different clusters in the DSE, reducing the number of regions to be analyzed; intra-cluster exploration is performed, followed by inter-cluster exploration. A lattice-traversing DSE framework [135] is proposed to explore the design space by transforming it into a lattice representation. The framework includes three stages: lattice creation and initial sampling, selection of lattice Pareto neighbours, and synthesis and lattice labelling.
IronMan [115] is an end-to-end flexible and automated framework for DSE composed of a performance and resource predictor based on a graph-neural network (GPP), multiobjective DSE engine based on reinforcement-learning (RLMD), and code transformer (CT). One of the main features of this framework is that it retrieves the final code with the discovered optimizations, ready to generate the corresponding RTL through HLS.
Sherlock [136], introduced by Gautier et al., is a DSE framework based on multi-objective optimization devoted to finding Pareto-optimal solutions (or the Pareto front), handling multiple conflicting optimization objectives. This framework uses active learning to exploit a surrogate design space model to find the Pareto-optimal designs as quickly as possible.

4) SUMMARY
A summary of most of the contributions devised for DSE and presented in this section is listed in Table 2, considering the following aspects:
• Reference.
• Pruning of the design space (P-DS).
• Whether it is based on the Roofline model.
• Whether it considers quality of results (QoR) in relation to the place and route estimation.
• Whether it applies transfer learning (TL).
• The amount of estimated resources (N Resource): 1 stands for one resource, 2 for two, and so on. NS stands for not specified.

FIGURE 13. Overview of COMBA [77]. LLVM IR is extracted from the source code. This trace is the input for the recursive data collector, which will extract the parameters used by the analytical models (latency and resource). MGDSE-II evaluates the configuration and defines the next set of directives to be applied. The output of the complete flow is the high-performance configuration.
Table 2 shows that only a few contributions include more than two aspects when developing DSE. A design space explorer can benefit from a reduction of the design space by focusing on obtaining design points near the Pareto frontier, a parallel computing model to guide performance estimation, a good estimation of QoR, and resource utilization. Transfer learning, a technique linked mainly with ML approaches, could help to obtain underlying patterns when developing hardware through HLS tools.
There are contributions that only estimate some FPGA resources, as follows. The LUT-latency trade-off is estimated by [113], and BRAM and LUT are computed by [137]. COMBA [77], [133] estimates DSP, BRAM, and LUT. Lin-Analyzer [130] computes BRAM and DSP, whereas MPSeeker [132] estimates FF and LUT, combining Lin-Analyzer for DSP and BRAM utilization. Nevertheless, overestimating resource utilization can lead to pruning valid design points in the exploration phase. LUT, FF, DSP, and BRAM post-implementation estimation is performed by [125]. A challenge with HLS tools is efficiently predicting resource sharing for unrolling factors and array partitions when using HLS pragmas [78], [118].

C. POWER CONSUMPTION ESTIMATION
Power consumption is an important topic, especially with the growth of green technology, internet of things (IoT) systems, and the expansion of communication networks. Power estimation techniques are categorized based on the abstraction levels of the FPGA design process as follows: system, RTL, gate, and layout levels. One of the requirements when designing IP cores under power, energy, or thermal constraints is the estimation of these metrics in the first steps of the design process for a given application.
FPGA vendors have proposed different tools to estimate power consumption, such as Maxim® integrated power solution with a USB-to-PMBus interface dongle [139], USB interface adapter EVM from Texas Instruments® [140], Xilinx® power estimator based on spreadsheets (XPE) [141], and Intel® FPGA power and thermal calculator [142]. With FPGA/SoC devices, power is classified as static (fixed and technology-dependent) and dynamic (data and design-dependent). A recent survey on power consumption in FPGA and ASIC devices [17] classifies the techniques for its estimation into analytical, table-based, polynomial-based, and neural networks.

1) METHODOLOGIES
KAPow, proposed by Davis et al. [143], is an online activity-based power methodology that includes a signal pruning strategy. The flow has two phases: signal selection (nets with strong relationships between activity and power) and instrumentation (implying the accumulation of events to monitor the relevant signals). A linear model is used to estimate the power contribution of the overall system by computing the power consumption of each IP core.
In the context of approximate computing, Xu et al. [144] investigate the use of linear regression and multilayer perceptron (MLP) models to generate a new approximated RTL design with a trade-off between area and power. Using this approach, the search space is extended by reducing the precision of the weights obtained for the predictive models. The proposed method is divided into three stages: kernel extraction and training data generation, model fitting and substitution, and model precision optimization with bit width reduction.

2) MODELS
Lorandel et al. [145] propose the use of neural networks to estimate the dynamic power consumption and output signal activities for different IP cores involved in a system. In this study, two stages are considered: IP characterization and high-level system modelling. Nasser et al. [146] present a model for the characterization phase by extracting the relevant information for each component that has an impact on power.
Tripathi et al. [147] introduce an MLP architecture to calculate power consumption, using LLVM IR instructions as input, and modelling only dynamic power.
Verma et al. [148] present a power estimation model that improves on Deng's model [149] and is designed using nonlinear regression techniques. For this purpose, they use the power data of different types of digital circuits (described in VHDL) after the synthesis process. The data is divided into designs with and without clock gating, and based on this separation, two power models are developed.
In [150], Verma et al. propose two techniques, stressing the importance of predicting the power consumption at an early stage of the accelerator design: a heuristic approach based on a backpropagation neural network and a regression based on statistics.
FlexCL is extended in [151] through the incorporation of three modes of communication for the memory model: direct, burst, and stream access patterns, and an analytical power model for dynamic and static power.

3) FRAMEWORKS
HLSPredict, developed by O'Neal et al. [152], is a framework based on an ensemble of ten machine learning models to predict performance and power consumption without analytical models or HLS-in-the-loop. Two types of IP cores are considered: without directives (base IP core) or with directives (optimized IP core). Accelerators for training the models are based on a template with DMA for memory transactions, which implies that for every source code implemented through HLS, the functionality of the IP core is encapsulated and integrated within the hardware template.
HL-Pow, proposed by Lin et al. [153], is based on machine learning techniques and overcomes the gap between the HLS synthesis phase and power consumption estimation (usually performed after the RTL implementation flow). A DSE is introduced to obtain the latency vs power trade-off, with pruning to reduce the design space when finding Pareto-optimal designs. For the machine learning implementation, the training dataset is constructed by a feature construction (HLS report) and power collection (post-implementation report), with a total of 256 elements per feature. The experiments are performed with different machine learning models, including linear regression, support vector machines, tree-based models, and neural networks.
PowerGear, described by Lin et al. [154], is a graph-learning-assisted power estimator for FPGA HLS, and is composed of a graph construction flow and a power-aware graph neural network model called HEC-GNN. This study considers the impact of interconnections in the hardware design that affects the power modelling. The authors benefit from the HLS front-end and HLS back-end to recover dataflow graphs because it is possible to obtain the IR traces and finite state machine with data path information. PowerGear can be used to guide a design space explorer with a trade-off between latency and power to obtain the Pareto frontier.
Aladdin, introduced by Shao et al. [155], estimates the performance, power, and area of accelerators. It generates a dependence graph from the input code and produces a fast cycle estimate before RTL construction. HAPE, presented by Makni et al. [156], is a framework for area-power estimation based on analytical models, and it aims to assist the DSE in reducing HLS runtime. HAPE focuses only on the main subtraces present in a source code containing the directives provided by the designer. HAPE integrates Lin-Analyzer for computation cost.

4) SUMMARY
Regarding the power consumption, there is an evident trend in estimating this metric in the early stages of design using HLS tools. Moreover, some of the presented frameworks integrate the performance, power, and area estimations with a DSE engine.

D. SUMMARY AND DISCUSSION
The studies described in this section are summarized in Table 3, including for each one:
• Reference and year of publication.
• Whether it is a model, a methodology, or a framework.
- In the case of a model, the number of input parameters is included. For example, the model presented in [157] uses more than 10 input parameters (10+), and the model presented in [98] uses 21 parameters. The symbol (−) indicates that the number of parameters is not defined in the corresponding study.
• Techniques used to implement the proposed approaches: statistical, analytical, machine learning (ML), and others.
Table 3 shows that the described contributions are fairly distributed between models (35%) and frameworks (41%), whereas 24% propose methodologies. In line with the growing tendency in developing design space explorers, 55.2% of the contributions include DSE.
We can observe that most DSE solutions use high-level abstraction languages as input, showing a tendency to increase productivity in the design phase. Likewise, many studies are focused on obtaining Pareto-optimal designs.
Regarding metrics, latency and area are the most frequently estimated metrics, followed by power: 65.3%, 57%, and 26.5%, respectively. We also present this result in Fig. 14. The area and latency metrics are widely estimated because reconfigurable platforms are resource constrained and are used for algorithm acceleration.
Concerning the power consumption, some described contributions highlight the benefits of estimating this metric for a given application at an early stage of its design. Some of the most recent studies benefit from HLS tools to estimate this metric before the implementation stage of the overall system into the hardware platform. This approach is becoming commonplace in the literature when considering FPGA/SoC as a development architecture. Table 3 also shows that the C/C++ source code is preferably used as input (65.3%), and the Pareto frontier is the most applied solution to obtain optimal designs (33%) in terms of trade-off between area and latency, area and power, latency and power, among other metrics. Machine learning and analytical methods are almost equally used to obtain accurate, fast, and robust models (43% and 41%, respectively), as shown in Fig. 15. However, in recent years, machine learning has become the most widely used technique.
The models, methodologies, and frameworks for metric estimation, FPGA-based DSE, and power consumption described in this section are illustrated in Fig. 16. It can be observed that, in recent years, there has been an increasing number of frameworks including DSE, whereas the power consumption is mainly estimated by models, with a preponderance of analytical techniques. Fig. 17 summarizes the main topics presented in the research works reviewed in this paper and discussed in this section.

V. INTEGRATION IN DIFFERENT RESEARCH FIELDS
In this section, we present contributions in the literature that propose models and frameworks for specific hardware acceleration applications. Some of them are based on general models such as Roofline. We show that the frameworks and models for FPGA/SoC are used in diverse research areas, exposing their benefits in the design of hardware.

A. MODELS
The Roofline model has been used to assist the designer when targeting hardware acceleration of HPC applications, so as to explore the design space, estimate the performance, and evaluate the throughput, which depends on both communication and computation.
Roofline is applied by Du et al. [158] in the acceleration of the stencil computation kernels, by Karp et al. [159] for the hardware implementation of a spectral element method, and by Nagasu et al. [160] in the context of an FPGA-based tsunami simulation.
In computational fluid dynamics (CFD), Du et al. [161] present an FPGA-based CFD simulation architecture using a performance model to guide the DSE while achieving the maximum performance of the lattice Boltzmann method, searching for an optimal combination of the parameters of the unroll directive.
Reggiani et al. [162] present the acceleration of iterative stencil computation using Verilog to describe hardware. An analytical model that considers memory transfer and computation is proposed to estimate the attainable performance of the accelerator and speed up the DSE.
Through efficiency degradation, it is possible to obtain hardware designs with higher performance, lower power consumption, and lower resource utilization at the cost of QoR. Manuel et al. [129] propose a DSE in the context of model-based approximate computing for image processing using a multi-objective genetic algorithm, finding a wide range of Pareto-optimal solutions, from which the desired trade-off between quality and resources can be chosen.
In recent years, ML techniques have been applied in multiple fields such as fluid dynamics, high-energy physics, information retrieval, image processing, video processing, security, and biology [163], [164], [165]. Because of this trend, models for FPGA-based architectures are being developed to accelerate ML applications with efficient exploitation of hardware resources, with the aim of improving productivity in the design phase [166], [167], [168].
Resource and performance models are proposed by Reggiani et al. [169] for convolutional neural network (CNN) accelerators, to drive an automatic Pareto-optimal DSE, exploring network performance on different hardware platforms. These models are applied to convolutional cores, which are critical components of the design, directly affecting the overall latency and DSP utilization. The final relation to obtain the Pareto-optimal solutions is the number of DSP vs the initiation interval (input rate of the pipeline in clock cycles).
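The DSP-versus-initiation-interval relation can be illustrated with a small Pareto filter. The following is a minimal sketch, not the model proposed in [169]: the design points are invented numbers, and `pareto_front` is a hypothetical helper that keeps every design that no other design dominates, that is, no other design is at least as good in both metrics while being different.

```python
from typing import List, Tuple

# (dsp_count, initiation_interval); both metrics are to be minimized.
Design = Tuple[int, int]


def dominates(a: Design, b: Design) -> bool:
    """True if a is no worse than b in both metrics and is a different point."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b


def pareto_front(designs: List[Design]) -> List[Design]:
    """Return the non-dominated designs, sorted by DSP count."""
    unique = set(designs)
    front = [d for d in unique if not any(dominates(o, d) for o in unique)]
    return sorted(front)


# Hypothetical candidates: more DSPs usually buy a lower initiation interval.
candidates = [(16, 8), (32, 4), (48, 4), (64, 2), (64, 3), (128, 1), (20, 9)]
front = pareto_front(candidates)
print(front)
```

With these assumed points, designs such as (48, 4) are discarded because (32, 4) reaches the same initiation interval with fewer DSPs.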
Gysel et al. [170] present an analytical model for deep CNN design, which is useful for obtaining the computational cost and inferring the required memory bandwidth for the hardware design.
CaFPGA, developed by Xu et al. [171], is an FPGA-based DSE for CNN that focuses on convolutional and fully connected layers. To improve the productivity in the design phase, the authors propose an automatic generation model, including incremental searching and flexible layer-folding algorithms, considering that the on-chip memory is a limited resource in FPGA. The analysis of the design space is performed using time, resource, memory, and performance models.
Shan et al. [172] introduce a CNN multi-kernel application and its implementation on AWS-F1, where an analytical model is used to compute data transfers (CPU to DDR, DDR to FPGA, FPGA to DDR, and DDR to CPU) and kernel computation times.
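A timing model of this kind can be sketched as a sum of transfer and computation terms. The code below is an illustrative assumption rather than the model of [172]: it supposes that the four transfer stages and the kernel execution do not overlap, and the function names, bandwidths, and cycle counts are hypothetical.

```python
def transfer_time(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time in seconds to move num_bytes over a link of the given bandwidth."""
    return num_bytes / bandwidth_bytes_per_s


def total_time(in_bytes, out_bytes, host_bw, ddr_bw, kernel_cycles, clock_hz):
    """Sum of the four transfer stages and the kernel time (no overlap assumed)."""
    # CPU to DDR, then DDR to FPGA.
    t_in = transfer_time(in_bytes, host_bw) + transfer_time(in_bytes, ddr_bw)
    # FPGA to DDR, then DDR to CPU.
    t_out = transfer_time(out_bytes, ddr_bw) + transfer_time(out_bytes, host_bw)
    # Kernel latency in cycles converted to seconds.
    t_kernel = kernel_cycles / clock_hz
    return t_in + t_kernel + t_out


# Hypothetical workload: 8 MB in, 2 MB out, 2.5 M cycles at 250 MHz.
t = total_time(in_bytes=8e6, out_bytes=2e6, host_bw=8e9, ddr_bw=16e9,
               kernel_cycles=2.5e6, clock_hz=250e6)
print(f"estimated time: {t * 1e3:.3f} ms")
```

If transfers and computation are overlapped (for example, through double buffering), the sum would be replaced by a maximum over the overlapped stages.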
The Roofline model is employed as a performance predictor for FPGA-based CNN accelerators [173], [174], [175], [176]. Ayat et al. [173] present an optimization for an FPGA-based CNN accelerator for energy efficiency. Xie et al. [174] use this model to quantitatively analyze the design phase of a CNN accelerator, depending on the available computing and memory resources. Park et al. [175] propose a model based on Roofline to effectively compute convolutional layers using metrics such as throughput, on-chip memory, off-chip memory bandwidth, and the computation-to-communication ratio.
Ma et al. [176] introduce a coarse-grained analytical performance model for CNN accelerators. For this purpose, the modelling of DRAM access, latency, and on-chip buffer is analyzed to obtain the final model. Regarding DSE, convolution throughput is the main focus, considering factors such as operating frequency, external memory bandwidth, and loop unrolling variables, using Roofline to analyze the throughput of the CNN accelerator. Resource costs are obtained by considering the loop unrolling and tiling knobs.
Table 4 summarizes the models used in the contributions described in this section. The first two columns are the reference and the year of publication. The third column is the research area in which the model is applied. The fourth and fifth columns are the aim and type of model used, respectively, and the last one is the target platform.
We can observe that most contributions focus on CNN accelerators, and that the models are devoted to carrying out DSE and performance estimation and are mainly based on Roofline. The use of this model is based on the premise that communication and computation are two basic constraints to improve the throughput of an accelerator, especially when developing hardware for highly demanding applications.
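The premise behind Roofline can be written as P = min(P_peak, B x I), where P_peak is the peak computational throughput, B is the memory bandwidth, and I is the operational intensity (operations per byte transferred). The following minimal sketch evaluates this bound; the peak throughput and bandwidth values are hypothetical and do not correspond to any device or contribution cited above.

```python
def attainable_performance(peak_ops_per_s, bandwidth_bytes_per_s, intensity):
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * intensity)


def ridge_point(peak_ops_per_s, bandwidth_bytes_per_s):
    """Operational intensity (ops/byte) at which a kernel becomes compute bound."""
    return peak_ops_per_s / bandwidth_bytes_per_s


PEAK = 500e9       # hypothetical compute roof, ops/s
BANDWIDTH = 10e9   # hypothetical off-chip bandwidth, bytes/s

for intensity in (5, 50, 100):
    perf = attainable_performance(PEAK, BANDWIDTH, intensity)
    bound = "memory" if intensity < ridge_point(PEAK, BANDWIDTH) else "compute"
    print(f"I = {intensity:>3} ops/byte -> {perf / 1e9:.0f} Gops/s ({bound} bound)")
```

Under these assumptions, a kernel with an intensity below 50 ops/byte is limited by communication, so adding PE replicas would not raise its throughput unless data reuse is also improved.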

B. FRAMEWORKS
Frameworks (or toolflows) have been proposed to map ML inference and training onto SoC-based platforms, integrating models that mainly estimate hardware resource utilization, latency, and throughput. An exhaustive survey is presented in [166].
Concerning training acceleration, Geng et al. [177] developed FPDeep, a toolflow for a scalable CNN training acceleration on deeply-pipelined FPGA clusters, proposing a model for operator graph partitioning and hardware resource allocation (with a distinction between small and large FPGA clusters). Roofline is used to evaluate the throughput, because of its dependency on communication and computation.
F-CNN, introduced by Zhao et al. [178], is an automatic framework for CNN training based on the reconfiguration of a streaming data path at runtime. The proposed models for resource and bandwidth estimation guide the space exploration under design constraints to obtain an optimal performance.
HP-GNN, proposed by Lin et al. [179], is a framework for training graph neural networks (GNN) on a CPU-FPGA platform. It incorporates an engine dedicated to exploring the design space through an exhaustive search using performance and resource utilization models. HP-GNN also incorporates hardware templates to implement different GNN architectures.
Regarding inference acceleration, Ghaffari et al. [180] present CNN2Gate, a framework based on OpenCL to map a CNN onto an FPGA with fixed-point arithmetic, including a hardware-aware DSE based on resource utilization. The DSE is implemented using manual directive tuning, reinforcement learning, and hill-climbing methods.
Venieris et al. [181] propose the fpgaConvNet toolflow to map a CNN onto an FPGA, thereby optimizing the neural network workload. It includes a DSE using a multi-objective algorithm (simulated annealing), where the explorer optimizes the design according to latency, throughput, or maximum throughput with a latency constraint. Performance estimation and resource utilization models are proposed for DSE.
Cloud-DNN [182], introduced by Chen et al., is a framework for mapping DNN to cloud-FPGA, generating the corresponding HLS project to obtain the final IP core. The proposed accelerator model is based on hardware resource cost (considering DSP and BRAM) and a performance model for each layer (convolutional, max pooling, and fully connected). A greedy algorithm is employed to search for the best accelerator configuration under constraints such as the DSP, BRAM, bandwidth, and DNN layers.
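A greedy search under resource constraints can be sketched as follows. This is an illustrative simplification and not the Cloud-DNN algorithm: it assumes a hypothetical per-layer cost model in which the latency of a layer scales as `ops / factor` and its DSP and BRAM costs scale linearly with the parallelism factor. At each step, the slowest layer receives one more parallel unit until the budget would be exceeded.

```python
def greedy_allocate(layers, dsp_budget, bram_budget):
    """Return one parallelism factor per layer, chosen greedily.

    Each layer is a dict with 'ops' (work) and 'dsp'/'bram' (cost per unit).
    The search starts from factor 1 for every layer and does not check
    whether that starting point already exceeds the budget.
    """
    factors = [1] * len(layers)

    def cost(key, fs):
        return sum(layer[key] * f for layer, f in zip(layers, fs))

    while True:
        # The slowest layer bounds the throughput of the layer pipeline.
        slowest = max(range(len(layers)),
                      key=lambda i: layers[i]["ops"] / factors[i])
        trial = list(factors)
        trial[slowest] += 1
        if cost("dsp", trial) > dsp_budget or cost("bram", trial) > bram_budget:
            return factors  # the bottleneck cannot be improved any further
        factors = trial


# Hypothetical three-layer network and resource budget.
layers = [
    {"ops": 1200, "dsp": 10, "bram": 4},
    {"ops": 600, "dsp": 8, "bram": 2},
    {"ops": 100, "dsp": 4, "bram": 1},
]
factors = greedy_allocate(layers, dsp_budget=100, bram_budget=40)
print(factors)
```

A greedy strategy of this type is fast but may miss better configurations, which is one reason why other contributions rely on metaheuristics or Pareto-based exploration.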
FRED [183], developed by Biondi et al., is a framework for real-time applications that benefits from a dynamic partial reconfiguration (DPR). It includes a hardware task model for the tasks carried out by the FPGA with partial reconfiguration enabled, a software model for the tasks executed on the processor, and a scheduling infrastructure.
Mu et al. [184] present a collaborative framework to obtain OpenCL-based hardware designs for CNN implementation. A DSE based on LoopTrees is generated and pruned to reduce the design space. Fine-grained and coarse-grained analytical models are introduced to generate the final optimized solution. The former estimates the latency and resource utilization, whereas the latter applies further optimization to the best candidate designs obtained after applying the fine-grained model.
The heterogeneous image processing acceleration (Hipacc), proposed by Reiche et al. [185], is a framework that allows the generation of image processing accelerators. Several steps are performed by analyzing the IR trace: data dependency analysis, dependency graph restructuring, and transformations (streaming objects, memory allocation, and replication of the innermost kernel to improve throughput).
A framework named Spark-to-FPGA-Accelerator (S2FA), introduced by Yu et al. [186], transforms Scala computational kernels based on Apache Spark applications into optimized accelerator designs. For this, a learning-based DSE is employed to obtain high-performance RTL designs using an ensemble of reinforcement learning algorithms: uniform greedy mutation, differential evolution genetic algorithm, particle swarm optimization, and simulated annealing. The HLS tool is executed in the loop to verify each optimization.
AutoDNNchip [187] is proposed by Xu et al. to facilitate fast chip designs based on DNN, targeting FPGA and ASIC platforms. The main factors involved in the DNN acceleration process are bit precision, clock frequency, memory technology, PE architecture, width for data transfer, memory allocation, and DNN mapping. AutoDNNchip is composed of a chip predictor and a chip builder. The former predicts metrics such as area, latency, energy, and throughput, whereas the latter performs the DSE, optimizing the chip design using the results obtained by the predictor. The chip predictor has two modes: (i) coarse-grained and (ii) fine-grained. In (i), analytical models are used to obtain the energy, critical path, and area for a DNN model, while in (ii), an algorithm is implemented to obtain the final latency through runtime simulations, considering the results of the coarse-grained mode. The chip builder is composed of a DSE based on two phases: early stage architecture and IP configuration exploration, and inter-IP pipeline exploration and IP optimization. Finally, the RTL is generated and executed to validate the results.
Table 5 summarizes the frameworks used in the contributions described in this section. The first two columns are the reference and the year of publication. The third column is the research area in which the model is applied. The fourth is the name of the framework and the last is the target platform.
As we can observe, most frameworks are devoted to mapping ML-based inference into FPGA/SoC architectures. The components of these frameworks are usually expressed as pre-defined optimized templates, mainly implemented in C++ and OpenCL, where parallelism can be controlled by changing the parameters associated with the different directives.

VI. CHALLENGES
Nowadays, the explosive growth of accelerators promises greater computational capabilities. FPGA/SoC devices are widely used as hardware accelerators in different areas of research and development. However, the structured study we have presented in the previous sections indicates the necessity to address some challenges. Coping with them will permit a more widespread adoption of models, methodologies, and frameworks for performance estimation of HLS-based hardware designs for FPGA/SoC technology.
Even using HLS tools, reconfiguring an FPGA/SoC with an efficient hardware design is a challenging task. This is easily made apparent by some observations:
• Physical resources, such as memory bandwidth, reconfigurable hardware (LUTs, CLBs, and slices), and static hardware (DSPs and BRAMs) are limited in FPGA/SoC devices. Thus, the available physical resources should be used skilfully, considering techniques to improve the latency, area, and power, as introduced in Section III-C.
• Code restructuring techniques aid in creating efficient FPGA implementations using HLS tools, modifying the original source code of the application according to the FPGA architecture. Suggestions for this topic are presented in [82].
• The number of PE replicas in a hardware design, and consequently the level of coarse-grain parallelism that can be obtained, is limited by the available physical resources. Therefore, different strategies should be implemented to exploit the architecture so as to increase the scalability of the system.
• There is a trade-off between the different metrics to be optimized, as was presented in Section III-B. As an example, the area occupied is likely to increase if the latency is reduced, and vice versa. Thus, the FPGA designer should choose a good compromise between the metrics in terms of resources, computing operations, and throughput, among others.
• The hardware generated through HLS tools is directly associated with the applied directives, but sometimes applying and tuning directives requires considerable effort to obtain a proper FPGA implementation. Moreover, generating a solution for each directive combination incurs synthesis time, which reduces productivity.
• The exploration of the design space is linked to the human effort of performing combinations of directives, user design constraints, FPGA features, and code restructuring, among others.
We can cope with the above considerations through models, methodologies, and frameworks to reduce design time, as follows:
• The level of coarse-grain parallelism can be obtained by means of a model such as Roofline, identifying the computation-to-communication ratio, exposing the relationship between communication bottlenecks, computations, and number of replicas, as was presented in Section II-E and demonstrated in contributions such as [48], [118].
• Design space explorers aim to identify the optimal combination of directives to obtain an HLS-based hardware design with the best trade-off among different metrics, generating the Pareto-optimal set of designs. Reducing the design space and avoiding HLS in the exploration process can improve the design time, as was described in Section IV-B.
• Models integrated within a methodology or framework can automatically estimate the performance of HLS-based hardware designs without executing HLS tools, as presented in Section IV.
• Some frameworks and methodologies including DSE provide automatic directive-insertion optimizations and code transformation insights, as in contributions such as [115], [116], [118].
Nevertheless, the literature review shows that a number of challenges still have to be addressed in order to make optimal use of models, methodologies, and frameworks, such as:
• Recent HLS tools generate more comprehensive reports with more accurate information on total resource availability, latency, clock frequency, and resource utilization. These reports can be integrated with models, methodologies, and frameworks to estimate metrics and provide an initial value for the replication factor of a single PE. However, the report generation is linked to the synthesis time of the FPGA implementation. Reducing the design time without losing hardware quality is an important factor when using FPGA/SoC. Thus, if the HLS tool is in the loop for performance estimation using reports, it can lead to an increased design time. One way to overcome this is to use approaches such as [113], [121], [124], [152], [156], which avoid running HLS in the loop or reduce the number of its invocations.
• The performance metrics reported by HLS tools make them suitable to be combined with a parallel computation model to reduce the time required to obtain the necessary statistics for each implementation for a specific application. However, there is a gap between the HLS report and the real hardware implementation [101] that can be addressed with a performance model that includes the results obtained from the source-code-to-bitstream flow using the values related to final hardware utilization, power consumption, and timing reports.
• Computing models for FPGA-based reconfigurable hardware accelerators have to consider that the inherent hardware is not fixed. Rather, it is defined by how the application is described. Therefore, a higher number of parameters have to be included in the model, such as hardware resources (DSP, BRAM, LUT, and FF), programmable logic clock, latency, byte-operations (Bops), scalability in the number of PE, and power consumption. This contrasts with the computing models proposed for other parallel platforms, such as PRAM or BSP, that use a few parameters. Including more parameters in the model increases the analysis accuracy, but also the complexity of the model analysis. Therefore, the trade-off between these two features has to be addressed. In addition, the parameters should be adjusted according to the particular combination of directives applied to the source code.
• The compatibility among different versions of HLS tools is not guaranteed by models, methodologies, and frameworks. As a consequence, calibration techniques can help maintain compatibility between high-level tools, thereby avoiding being tied to one version of an HLS tool in particular [14].
• Methodologies and frameworks are typically linked to a tool [77], [130], [131], [136]. However, most such tools are not easily available or do not have user support. This is a critical point in the adoption of methodologies and frameworks for performance estimation, which makes it difficult to include them in the design flow. This may be solved by making methodologies and frameworks available to the FPGA designer through a repository system, such as contributions in [77] and [136], among others.
• The integration of frameworks into the different steps of the flow for designing IP cores can be affected by the installation of libraries, dependencies, and tools, such as LLVM IR and Clang, needed for the execution of the frameworks. Users should be guaranteed a simple way of installing and maintaining these frameworks in order to facilitate their integration in the design flow. This concern can be addressed by providing a script with dependencies to be installed, an executable file, or a library package.
• For heterogeneous architectures, the hardware-software co-design can be considered by models, methodologies, and frameworks taking into account the inherent features of different technologies, to ease the decision on which part of the algorithm should be implemented in software and which part in hardware. The performance of the overall system may be estimated by combining traditional parallel computing models presented in Section II (for the sequential part) and the contributions discussed in Section IV (for the FPGA part). In addition, a single parallel model, such as Roofline, can be applied to both architectures.
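As noted in the challenges above, the resource figures of an HLS report can provide an initial value for the replication factor of a single PE. The following minimal sketch illustrates the arithmetic only. The resource totals, the per-PE utilization, and the 20% margin reserved for interconnect and control logic are all hypothetical assumptions, and `max_replicas` is not part of any tool or contribution cited in this survey.

```python
def max_replicas(available, per_pe, reserve=0.2):
    """Upper bound on PE replicas and the resource that limits it.

    available: total device resources, e.g. {"DSP": 220, ...}
    per_pe:    resources used by one PE according to the HLS report
    reserve:   fraction of each resource kept free (assumed margin)
    """
    counts = {
        name: int(available[name] * (1.0 - reserve)) // need
        for name, need in per_pe.items()
        if need > 0
    }
    limiting = min(counts, key=counts.get)
    return counts[limiting], limiting


# Hypothetical device totals and single-PE utilization.
available = {"LUT": 53200, "FF": 106400, "DSP": 220, "BRAM": 140}
per_pe = {"LUT": 4100, "FF": 5200, "DSP": 24, "BRAM": 10}

replicas, limiting = max_replicas(available, per_pe)
print(f"up to {replicas} PE replicas, limited by {limiting}")
```

Such a bound only reflects resource availability. Whether the replicas can be kept busy still depends on the memory bandwidth, which is where a model such as Roofline complements this estimate.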
Moreover, when a DSE engine is integrated with models, methodologies, and frameworks, the following aspects need to be considered:
• One of the key points in the DSE is the execution of HLS tools during the exploration stage to validate the configuration obtained. This behaviour can lead to a long runtime, becoming a drawback in the DSE phase. Therefore, the adoption of different techniques to reduce the execution time of the exploration phase is indispensable, as shown in contributions such as [113], [121], [134], [136], [137].
• It is often sufficient to find a suboptimal combination of knobs based on specific metrics and user constraints. An important strategy is pruning the design space using intermediate Pareto-optimal designs, giving priority to the points that permit high-performance behaviours, as introduced in [136], [188], and [134].
• The DSE engine should guarantee a good compromise among the QoR and performance metrics.
• Approximate computing [189] can lead to an expansion of the design space, generating Pareto-optimal designs with a trade-off between area-power-latency estimation and error computation [129], [144]. A reduction in the space to be explored is fundamental to minimizing the invocations of HLS tools.
• It is important to identify the strengths and weaknesses of a given design space explorer. This can be performed using benchmarks, as was done in [15], [115], [114], [132], and [77], among others.
• Mapping an optimal design from the DSE to the FPGA/SoC can be challenging while maintaining the QoR reported by the DSE engine, mainly latency. Contributions in the literature [77], [130], [155] have implemented their own scheduler to obtain solutions with better timing than HLS tools (with no guarantee that HLS will implement it in the same way) [78].
To address this, some contributions [78], [118] use a baseline implementation obtained after HLS synthesis to consider the impact of the compiler optimizations and use the estimated critical path that affects the latency. This implementation is considered the starting point for the DSE engine to search for Pareto-optimal designs. Moreover, in the process of mapping the final hardware design onto the FPGA/SoC, the place-and-route phase plays an important role and different strategies provided by commercial tools can be used in this phase, adding another factor to be analyzed.
• It is fundamental to consider the application of HLS-specific compiler optimizations, due to the impact that they have on the hardware quality, in terms of latency, area, and power consumption [190].
Fig. 18 summarizes the main aspects presented in this section: those required to create efficient hardware to reconfigure the FPGA, how some of these aspects may be addressed through models, methodologies, and frameworks, and the challenges that need to be considered to bridge the gap between designers and FPGA-based reconfigurable hardware accelerators.

VII. CONCLUSION
In this survey, different models, methodologies, and frameworks proposed for metrics estimation, FPGA-based design space exploration, and power consumption estimation on FPGA/SoC have been described. The main features and limitations, as well as trade-offs of these approaches, have been presented, and different challenges to be addressed have been identified.
The integration of models and frameworks in different research areas has also been described, indicating a growing tendency to apply them in the field of machine learning accelerators for diverse applications.
Based on our literature review, it can be observed that existing models, methodologies, and frameworks are very difficult to compare against one another. One reason is the lack of standards limiting their evaluation on different hardware and applications, together with the fact that the different approaches do not analyze the same performance metrics.
In addition, it can be affirmed that the inherent hardware reconfigurability of FPGA/SoC affects the complexity of the associated models. Indeed, the models for FPGA/SoC usually have a higher complexity than those commonly used for CPUs, GPUs, and multicore processors, among other architectures.
We believe this survey can help readers understand the benefits of integrating models, methodologies, and frameworks for FPGA-based hardware accelerators into the design flow. Therefore, the FPGA designer can select the approach that best suits the application, hardware architecture, and programming skills.
The literature review shows that several challenges still have to be addressed to achieve an optimal integration of models, methodologies, and frameworks in the design flow. By highlighting these challenges, this survey reveals what has to be considered to bridge the gap between the FPGA designer and hardware accelerators based on FPGA.
VERONICA GIL-COSTA is currently a Former Researcher at Yahoo! Labs Santiago hosted by the University of Chile. She is also an Associate Professor at the University of San Luis, a Researcher at the National Research Council (CONICET) of Argentina, and a Researcher at the CITIAPS, Chile. Her research work is on parallel computing and distributed systems, with applications in query processing and capacity planning for large scale systems.

MARÍA LIZ CRESPO is currently a Research Officer at The Abdus Salam International Centre for Theoretical Physics (ICTP) and an Associate Researcher of the Italian National Institute of Nuclear Physics (INFN), Trieste, Italy. She is also coordinating the research and training program of the Multidisciplinary Laboratory (MLab), ICTP. She has organized several international schools and workshops on fully programmable systems on chip for nuclear and scientific instrumentation. She is the coauthor of more than 100 scientific publications in prestigious peer-reviewed journals. Her main research interests include advanced scientific instrumentation for particle physics experiments and experimental multidisciplinary research.
GIOVANNI RAMPONI (Life Senior Member, IEEE) was born in 1956. Since 2000, he has been a Full Professor of electronics at the Department of Engineering and Architecture, University of Trieste, Italy. He is the co-inventor of international patents, and has published more than 200 papers in international journals, conference proceedings, and book chapters. His research interests include nonlinear digital signal processing, enhancement and feature extraction in images and image sequences, image visualization, image quality evaluation, and deep learning techniques for image processing. More information can be found at: www.units.it/ramponi. VOLUME 10, 2022