Raising the Abstraction Level of a Deep Learning Design on FPGAs

Autonomous and intelligent systems based on deep learning continuously attract the attention of researchers and engineers. As deep learning is applied to modern applications, the challenge of reaching real-time processing arises. To face this challenge, Field Programmable Gate Arrays (FPGAs) can be used; however, generic deep learning implementations on an FPGA are still a topic of research. Advances in FPGA technology allow for designs based on High-Level Synthesis (HLS) for accelerating and facilitating implementations of complex problems on hardware. A platform based on HLS for emulating a generic parameterizable deep learning system on an FPGA is proposed in this paper, allowing for the implementation of any structure based on the following layers: convolution, max-pooling, batch-normalization, and fully connected. Through this platform, it is possible to implement a deep learning system on FPGAs using an N-Fold or a Flow architecture without the assistance of central processing units. Whereas the N-Fold architecture requires fewer hardware resources, as it re-uses resources, the Flow architecture presents a higher throughput. The developed platform improves deep learning design productivity by automating the generation of the system, achieving efficiency and raising the level of abstraction, as was experimentally verified and evaluated.


I. INTRODUCTION
A large quantity of computation is required to analyze data based on deep learning [37]. Thus, the hardware required to implement deep learning efficiently has been the subject of intensive investigation and investment, mostly targeting Field Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs) [7], [25], [26]. Even though GPU providers have positioned GPUs as the most performant devices for this new era, FPGAs are becoming widely adopted since they provide a better trade-off between performance and power consumption [25], making them efficient in real-time embedded systems.
The associate editor coordinating the review of this manuscript and approving it for publication was Szidónia Lefkovits.
The advance of FPGA technology makes it a challenge to design systems that fully utilize the capacities of these devices to address complex problems [12]. The design methodology on FPGAs can be structured hierarchically over several levels of abstraction. To better understand the levels of abstraction, Gajski and Kuhn proposed in 1983 a model called the Y-chart, which is represented in Figure 1 [8], [14]. The three domains of the Y-chart are represented on its radial axes. Each of them can be divided into levels of abstraction, using concentric rings. The Y-chart has five concentric circles representing the following levels of abstraction [22]: Circuit level, Logical level, Register Transfer Level (RTL), Algorithmic level, and System level. Each level represents the information in different ways, with the amount of detail increasing from higher to lower levels [8]. Previously, designers would manually refine the behavioral system specifications down to the RTL. From that point on, RTL synthesis would complete the design. Recently, however, High-Level Synthesis (HLS) has improved design productivity by automating the refinement from the Algorithmic level to the RTL [22]. Even though HLS can produce the design from the Algorithmic level, a significant amount of time and a certain level of hardware design expertise are required to deploy deep learning on an FPGA. HLS tools require the algorithm to be expressed in a specific way so that the synthesis tools can identify and exploit parallelism. Treating a hardware description at the same level of abstraction as pure software has been explored but often leads to inefficient use of the hardware. The key factor to bear in mind is that the FPGA has hardware characteristics that are not taken into consideration in a software-based design: each statement describes hardware that must be built rather than providing a set of instructions to be executed.
A novel way of solving this problem, proposed in this paper, is to raise the abstraction to the system level by providing a platform to deploy generic, parameterizable deep learning systems on an FPGA. Such a platform leaves designers free to choose or develop their own deep learning framework while providing efficient implementations. To achieve this, designers provide the network topology description and the parameterization of each layer.
The proposed platform allows for the designing of Convolutional Neural Networks (CNNs), Deep Neural Networks (DNNs), and Artificial Neural Networks (ANNs). Two different architectures are provided to make the platform more efficient and useful, allowing the designer to easily adjust the achieved throughput and the required resources. In this way, the designer has the freedom to perform design space exploration, which allows for finer customization. Thus, the platform offers the advantage of optimization for different objectives to find the best solution for each application and according to the desired specifications. The main advantage of using such a platform is that the designer can easily implement deep learning on FPGAs without the necessity of an in-depth understanding of the underlying hardware.

II. BACKGROUND
CNNs are divided into two main components: feature extraction and classification. The feature extraction component makes use of convolution, max-pooling and batch-normalization functions, while the classification component consists of fully-connected layers, common in ANNs. All layers of these networks can be organized into two sub-layers. The first one, called the main operation sub-layer, applies the main function, i.e. the convolution, the max-pooling, the batch-normalization, or the fully-connected operation. The second one, called the non-linear sub-layer, applies the hyperbolic tangent or the ReLU function. ReLU has become the most successful and widely used non-linearity, given its simplicity and effectiveness [29]. Additionally, for classification models, a softmax non-linear function is commonly used in the last layer.
A convolutional layer contains a set of kernels whose parameters need to be learned for extracting multiple features, typically from an image [3]. The kernel is smaller than the input data. The kernel is slid across the width and height of the input matrix and the scalar products between the input and kernel are computed at every spatial position. Assuming the matrix $F$ as an input ($F \in \mathbb{R}^{W \times W \times L}$, where $W \times W \times L$ corresponds to height × width × number of input matrices), the matrix $K$ as a kernel ($K \in \mathbb{R}^{H \times H \times L \times M}$, where $H \times H \times L \times M$ corresponds to height × width × number of input matrices × number of output matrices), and $b$ as the bias, each element of the output matrix ($C \in \mathbb{R}^{R \times R \times M}$, where $R \times R \times M$ corresponds to height × width × number of output matrices) is computed by equation (1).
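The computation described by equation (1) can be sketched in Python as follows. This is a behavioral model only, assuming unit stride and no padding (so that R = W − H + 1), neither of which is stated in the text; the function name `conv_layer` and the nested-list layout are illustrative choices, not part of the platform.

```python
def conv_layer(F, K, b):
    """Valid, stride-1 convolution (assumed; stride and padding are not given in the text).

    F: input,  indexed F[i][j][l],    shape W x W x L
    K: kernel, indexed K[r][c][l][m], shape H x H x L x M
    b: bias,   indexed b[m],          length M
    Returns C, indexed C[i][j][m], shape R x R x M with R = W - H + 1.
    """
    W, L = len(F), len(F[0][0])
    H, M = len(K), len(K[0][0][0])
    R = W - H + 1
    C = [[[0.0] * M for _ in range(R)] for _ in range(R)]
    for m in range(M):                      # output matrices
        for i in range(R):                  # output rows
            for j in range(R):              # output columns
                acc = b[m]
                for l in range(L):          # input matrices
                    for r in range(H):      # kernel rows
                        for c in range(H):  # kernel columns
                            acc += K[r][c][l][m] * F[i + r][j + c][l]
                C[i][j][m] = acc
    return C


# 3 x 3 single-channel input, one 2 x 2 kernel of ones, bias 0.5
F = [[[1.0], [2.0], [3.0]],
     [[4.0], [5.0], [6.0]],
     [[7.0], [8.0], [9.0]]]
K = [[[[1.0]], [[1.0]]],
     [[[1.0]], [[1.0]]]]
C = conv_layer(F, K, [0.5])
```

With these inputs the top-left output element is 0.5 + 1 + 2 + 4 + 5 = 12.5. The six nested loops are the structure that pipelining and unrolling directives would act on in an HLS description.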
The max-pooling layer smooths and progressively down-samples the spatial size of the representation, to reduce the number of parameters and to control overfitting. The max-pooling layer operates independently on every input matrix and resizes it spatially without performing any learning. Typically, the maximum value within an H × H pooling window is selected, reducing the number of parameters to be learned in the next layer [24]. Thus, the max-pooling response of each element in the output matrix is given by equation (2).
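As an illustration of the operation behind equation (2), the sketch below selects the maximum within each H × H window of every input matrix. The `stride` parameter and the function name `max_pool` are assumptions made for the example.

```python
def max_pool(F, H=2, stride=2):
    """Max-pooling applied independently to each input matrix.

    F is indexed F[i][j][l] with shape W x W x L.
    """
    W, L = len(F), len(F[0][0])
    R = (W - H) // stride + 1               # output height and width
    P = [[[0.0] * L for _ in range(R)] for _ in range(R)]
    for l in range(L):
        for i in range(R):
            for j in range(R):
                P[i][j][l] = max(F[i * stride + r][j * stride + c][l]
                                 for r in range(H) for c in range(H))
    return P


# 4 x 4 single-channel input holding the values 1..16 row by row
F = [[[float(4 * i + j + 1)] for j in range(4)] for i in range(4)]
P = max_pool(F)
kept = len(P) * len(P[0])                   # 4 of the 16 input values survive
```

For a 2 × 2 window with a stride of 2, only 4 of the 16 activations are kept, i.e. 75% are discarded.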
The most common form is a max-pooling layer with a 2 × 2 pooling window size and a stride of 2 along the width and height of each input matrix, discarding 75% of the activations. Each operation, in this case, takes the maximum of four numbers. The batch-normalization layer normalizes each input matrix processed by the convolution layer. Using batch-normalization, the internal covariate shift is reduced and, in consequence, the training of deep learning systems accelerates. Batch-normalization standardizes values, $x_i$, by calculating the mean $\mu_B$ and variance $\sigma_B^2$ over a mini-batch as shown in equation (3).
where $\varepsilon$ is a small constant that ensures numerical stability when the mini-batch variance is too small. Since the standardized values have a mean equal to 0 and a variance equal to 1, the standardized value of $x_i$ is scaled and shifted as shown in equation (4). The offset, $\beta$, and scale factor, $\gamma$, are parameters that are updated during the training.
In a Fully Connected layer all the input elements from the previous layer are connected to all the neurons. Typically, equation (5) is used to compute the response of each of the neurons present in the layer. The activation function σ is a non-linear function applied to the weighted input sum to produce the response y.
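A minimal behavioral sketch of the fully connected computation follows: each neuron forms a weighted sum of all inputs, adds a bias, and applies the activation. The bias term, the row-per-neuron weight layout, and the names used are assumptions made for illustration.

```python
def relu(v):
    """ReLU non-linearity, used here as an example activation."""
    return v if v > 0 else 0.0


def fully_connected(x, weights, bias, activation):
    """Response of every neuron: activation(sum_i w[n][i] * x[i] + bias[n])."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


x = [1.0, -2.0]
weights = [[1.0, 0.5],     # neuron 0
           [-1.0, 1.0]]    # neuron 1
bias = [0.5, 1.0]
y = fully_connected(x, weights, bias, relu)
```

Neuron 0 yields 1.0 − 1.0 + 0.5 = 0.5, while neuron 1 yields −1.0 − 2.0 + 1.0 = −2.0, which ReLU clamps to 0.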

III. HARDWARE ACCELERATORS FOR DEEP LEARNING
Ordinary processors are not efficient for deep learning; they can hardly meet performance requirements [9]. Since CNN processing exhibits parallelism at different levels, there has been a significant amount of work investigating efficient dedicated parallel systems [17], [28], [36], [46]. While spatial parallelism can be exploited extensively on hardware, there are application fields for which the use of specific hardware is a requirement. These requirements could include speeding up the processing [26]. Many examples of neural vision systems, which likewise cannot be used while attached to computers, can be found in [34]. Several designs based on FPGAs [45], GPUs [7], [35] and Application-Specific Integrated Circuits (ASICs) [10] have been prototyped to implement high-performance deep learning systems. The first FPGAs, launched in the 1980s [39], allowed spatial parallelism to be exploited extensively. FPGAs outperform systems based on general-purpose processors by eliminating the paradigm of sequential execution. Ku proposed high-level synthesis to implement ASICs, which have the advantage of a very finely controlled and optimized power consumption [19]. Nevertheless, ASICs are not suitable for application fields where the designs might need to be upgraded frequently or even occasionally. Deep learning is an example of a fast-evolving field, for which the re-configurability of the hardware is useful. Contrary to FPGAs, ASICs have very high Non-Recurring Engineering (NRE) costs [21].
NVIDIA's Compute Unified Device Architecture (CUDA) platform, first announced in 2007, was the earliest widely adopted programming model for General Purpose computing on a GPU (GPGPU) [13]. The GPU architecture adopts a Single Instruction Multiple Thread (SIMT) approach, which is more efficient than general-purpose Central Processing Units (CPUs) when exploiting data parallelism. However, the power consumption of GPUs is relatively high. In contrast, the FPGA architecture allows one to exploit parallelism without the limitations imposed by SIMT. On the one hand, FPGAs allow for more flexibility and are more energy efficient than GPUs [7], [25]. GPUs, on the other hand, have become more attractive for system designers because, unlike FPGAs, an in-depth understanding of the underlying hardware is not required. To counteract this tendency, Xilinx has been making a considerable effort to mitigate these constraints by providing tools such as Vivado HLS. Vivado HLS is a tool that has greatly facilitated the implementation of custom logic in the programmable logic (PL), starting from a high-level description in the C language, which can be automatically translated to HDL. Table 1 presents a comparison of the platforms found in the literature for supporting neural network models, according to the front ends, the FPGAs supported, the operation precision, and the need for off-chip microinstructions to execute the network. Note that the platforms support the convolution, pooling and fully-connected layers.

IV. RELATED WORK
The Automatic Neural GEnerator (ANGE) is one of the first platforms for developing artificial neural networks [4], [30], [31]. The first version of this platform allows only the mapping of ANNs, but a second version extended it to DNNs as well (those with only two hidden fully-connected layers). The ANGE tool uses Matlab and the System Generator from Xilinx. Currently, several approaches to automatically mapping CNNs to FPGAs have been proposed. Platforms such as Haddoc2 [1], DeepBurning [42], and DnnWeaver [33] generate fully synthesizable Verilog or VHDL as output. The evolution of the HLS tools has enabled the emergence of platforms such as fpgaConvNet [41], FINN [40] and FP-DNN [16]. ALAMO, DeepBurning, DnnWeaver and the proposed platform all support the normalization layer. Most of the platforms focus on the automated implementation of CNNs, except FINN, which is focused on Binarized Neural Networks (BNNs).
Most of the platforms have been integrated into existing deep learning software libraries and frameworks. Thus far, Caffe has been the best-supported framework for CNN-to-FPGA mapping. However, limiting the designer to a specific framework is a disadvantage, since other or even newer frameworks may gain prominence. The design methodology used for each platform can also be analyzed based on whether microinstructions are required to implement the network. Microinstructions allow the processing engines to be controlled by software. This process corresponds to the sequential execution of the layers or sets of layers in a time-sharing manner. On the other hand, some platforms implement the network on hardware without off-chip microinstructions. Haddoc2 and the herein proposed platform store the trainable parameters on-chip, but the supported model size is constrained by the storage resources of the target device. To circumvent this constraint, the proposed platform allows one to extend a CNN implementation to multiple FPGAs. In this case, the input is inserted in one device and the output is obtained from another device. The front end is relevant for making the platform accessible to developers. Most of the platforms present a graphical user interface associated with the framework to develop and train the networks. As mentioned previously, Caffe has been the best-supported framework by CNN-to-FPGA automated tools. However, these platforms do not integrate different deep learning tools. Conversely, ANGE, FFTCodeGen, and the proposed platform have up to this point adopted custom front ends. With the proposed platform, the network description is held by the designer. Thus, this tool can reach a wider community of deep learning researchers and practitioners.
Furthermore, the proposed platform is a unique platform that provides alternative architectures to design a deep learning network on an FPGA: N-Fold and Flow architectures. The N-Fold architecture is a new and original architecture presented by this platform.

V. ARCHITECTURES TO DESIGN DEEP NEURAL NETWORKS
Two different architectures are proposed to implement deep learning on an FPGA, namely the Flow architecture and the N-Fold architecture.

A. N-FOLD ARCHITECTURE
The N-Fold architecture is depicted in Figure 2. It is composed of two sub-layers, which are iterated N times to process the complete network. The first sub-layer implements the main operation while the second sub-layer applies a non-linear function.
In this architecture, the network can only start processing a new input after the previous one has been completely processed. Consequently, it exhibits high latency and low throughput. On synthesizing this architecture, the non-linear operation is shared, reducing the circuit area required. Figure 2 illustrates this approach. In the first sub-layer, the design switches to the respective module where the main operation is applied. Kernels, Bias and Weights represent values that are stored in the memory to execute the convolution and fully connected layers. In the second sub-layer, the design switches the signal to the respective module to compute the non-linearity. Here, the non-linear part has access to two different memories: one of them stores exponential values while the other stores hyperbolic tangent values. An attractive feature of the N-Fold approach is the possibility of implementing several layers in hardware that share the non-linear modules.
On the other hand, in Figure 2 one can also observe that the resulting output of each layer is temporarily stored. Even though the output of each sub-layer is a 3D map with a different size, a single array was used to temporarily store these outputs during the application of the algorithm. Since the outputs resulting from the layers are temporarily buffered in a common memory that is re-used by all the layers, resources are saved. The number of block RAMs (BRAMs) required for data buffering between sub-layers can be calculated with equation (6), where $N_{words}$ is the number of values to store, $N_{bits}$ is the number of bits used to represent a word, and $S_{BRAM}$ is the capacity of a BRAM. To improve the performance, a dual-port RAM is used to allow read operations on one port and write operations on the other port. For this purpose, the directive < #pragma HLS resource > is used, specifying that the array variable is mapped to the BRAMs. Each BRAM has two completely independent access ports. Each port has its own address, data in, data out, clock, clock enable, and write enable signals.
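The buffer sizing can be illustrated with a small calculation. The sketch assumes that equation (6) amounts to the total number of bits divided by the BRAM capacity, rounded up; this reading follows from the symbol definitions above but ignores the width and depth configurations a real BRAM supports. The 36 Kb capacity in the example corresponds to a 7-series block RAM, and the `n_bank` argument is an addition that covers the double-buffered memories of the Flow architecture.

```python
def brams_needed(n_words, n_bits, s_bram_bits, n_bank=1):
    """Assumed reading of equation (6): ceil(N_words * N_bits / S_BRAM) per bank."""
    total_bits = n_words * n_bits
    per_bank = -(-total_bits // s_bram_bits)    # integer ceiling division
    return n_bank * per_bank


S_BRAM = 36 * 1024                               # 36 Kb block RAM, in bits (example value)

n_fold = brams_needed(10_000, 32, S_BRAM)        # single shared buffer
flow = brams_needed(10_000, 32, S_BRAM, n_bank=2)  # one ping-pong pair
```

For 10,000 words of 32 bits, 320,000 bits are needed, which rounds up to 9 BRAMs for a single buffer and 18 for a two-bank buffer.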
With the previous directive, the HLS tool considers the memory in Figure 2 as a single array and it is implemented as one large memory. The array representation becomes a performance bottleneck due to the limited number of memory ports. However, when an array is partitioned into multiple blocks, the single array is implemented as multiple RTL BRAMs. Partitioning helps with the performance, allowing an increase in the number of read and write ports for the storage and, subsequently, an increase in the number of elements that can be accessed in parallel by the modules. Thus, there is a design trade-off between the performance and the number of RAMs required. Vivado HLS includes optimization directives for defining how arrays are implemented and accessed; the directive used in this architecture is < #pragma HLS array_partition variable=name block factor=int >, where variable=name specifies the array variable to be partitioned and factor=int is the number of BRAMs used to implement the buffer.
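The effect of block partitioning on addressing can be modelled in a few lines. The sketch below is a software illustration rather than HLS code: it maps each index of a one-dimensional array to a (bank, offset) pair, assuming equally sized blocks of consecutive elements, and reports which banks a set of simultaneous accesses would touch. The function names are hypothetical.

```python
def block_partition(n_elems, factor):
    """Map each array index to (bank, offset) under block partitioning."""
    block = -(-n_elems // factor)           # ceil(n_elems / factor) elements per bank
    return [(i // block, i % block) for i in range(n_elems)]


def banks_touched(indices, n_elems, factor):
    """Distinct banks hit by a group of accesses issued in the same cycle."""
    mapping = block_partition(n_elems, factor)
    return sorted({mapping[i][0] for i in indices})


mapping = block_partition(8, 4)
# mapping == [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1), (3, 0), (3, 1)]

spread = banks_touched([0, 2, 4, 6], 8, 4)  # one access per bank
packed = banks_touched([0, 1], 8, 4)        # both accesses land in bank 0
```

Accesses that fall in different banks do not compete for the same ports, whereas accesses that fall in one bank remain limited by that bank's two ports. This is the trade-off between performance and the number of RAMs noted above.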

B. FLOW ARCHITECTURE
A general view of the Flow architecture is illustrated in Figure 3. In the Flow architecture, the whole system is seen as a series of data transformations, where all operations are performed in sequence but independently of each other. This architecture emphasizes the incremental transformation of data by successive components. In this architecture, resources are not shared, which allows for pipelining at the cost of requiring more hardware and memory buffers between the layers. By using the directive < #pragma HLS dataflow >, the pipeline between the layers is implemented and, as a result, the data flow is optimized. In this case, data enter the system and then flow through the layers over time in a data flow approach. If no directive is used, Vivado HLS executes all the layers sequentially, without pipelining.
When a pipeline between the layers is used, there may be a data conflict if a value is written through one port to a specific address and, at the same time, a value is read from the same address through the other port. To avoid conflicts between the layers in the pipeline, double buffering (also known as ping-pong buffering) is used as an intermediate memory. Two equal-size memories with two independent access channels are shown in Figure 4. One memory bank holds the previous data so that the following layer can read it, while the preceding layer creates and transfers new data to the other memory bank. When the new data are completely transferred, the reader and writer layers swap the two memory banks. The usage of double buffering increases the overall throughput of a CNN and helps to prevent bottlenecks.
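The bank-swapping behavior can be sketched as a small software model. The class below is illustrative only (its name and methods are not taken from the platform): the producer layer always writes to one bank, the consumer layer always reads from the other, and `swap` exchanges their roles once a frame is complete.

```python
class PingPongBuffer:
    """Two equal-size banks; writes and reads never target the same bank."""

    def __init__(self, size):
        self.banks = [[0] * size, [0] * size]
        self.write_bank = 0                  # bank currently owned by the producer

    def write(self, addr, value):
        self.banks[self.write_bank][addr] = value

    def read(self, addr):
        return self.banks[1 - self.write_bank][addr]

    def swap(self):
        """Called when the producer has finished transferring a frame."""
        self.write_bank = 1 - self.write_bank


buf = PingPongBuffer(3)
for addr, v in enumerate([1, 2, 3]):         # producer fills frame 1
    buf.write(addr, v)
buf.swap()                                   # frame 1 becomes readable
for addr, v in enumerate([4, 5, 6]):         # producer fills frame 2 ...
    buf.write(addr, v)
frame1 = [buf.read(a) for a in range(3)]     # ... while the consumer still sees frame 1
buf.swap()
frame2 = [buf.read(a) for a in range(3)]
```

Because the two layers never address the same bank at the same time, the read/write conflict described above cannot occur, at the cost of doubling the storage.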
The procedure to compute the number of BRAMs in each intermediate memory of the Flow architecture is similar to the one for the N-Fold architecture, but the result is multiplied by the number of memory banks, $n_{bank}$; for each intermediate memory, $n_{bank}$ is equal to 2.

VI. PROCESSING MODULES
Two different types of modules were designed to implement deep learning on an FPGA: the main operation modules and the non-linear modules.

A. MAIN OPERATION MODULES
The main operation modules process a feature map based on equation (1) for the convolution module, equation (2) for the max-pooling module, and equation (5) for the fully-connected layer module. The batch-normalization layer computes each sample of the feature map by multiplying it by $k_1$ and adding $k_2$ (equation (7)). Note that equation (7) is derived from equations (3) and (4).
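One way to see how a single multiply-add can replace the standardize, scale and shift steps is sketched below. The closed forms used for k1 and k2, namely k1 = γ / sqrt(σ² + ε) and k2 = β − k1·µ, are an assumption obtained by combining the standardization with the scale-and-shift step; the function names are illustrative.

```python
import math


def fold_batch_norm(gamma, beta, mean, var, eps=1e-5):
    """Fold the batch-normalization parameters into y = k1 * x + k2 (assumed form)."""
    k1 = gamma / math.sqrt(var + eps)
    k2 = beta - k1 * mean
    return k1, k2


def batch_norm_reference(x, gamma, beta, mean, var, eps=1e-5):
    """Unfolded computation: standardize first, then scale and shift."""
    x_hat = (x - mean) / math.sqrt(var + eps)
    return gamma * x_hat + beta


k1, k2 = fold_batch_norm(gamma=2.0, beta=0.5, mean=1.0, var=4.0, eps=0.0)
y = k1 * 3.0 + k2                            # one multiplication and one addition
y_ref = batch_norm_reference(3.0, 2.0, 0.5, 1.0, 4.0, eps=0.0)
```

Since the mean, variance, γ and β are fixed after training, k1 and k2 can be computed offline, leaving one multiplier and one adder per sample at inference time.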
The main operation modules are designed for HLS description by using optimization directives required to reach the best architecture. The objective of the design is to enable the efficient application of loop unrolling and hardware pipeline techniques, and thereby improve the performance while using the resources provided by the FPGA. The pseudo code, using loop unrolling and pipelining directives, is presented in Table 2.
A fundamental first step of HLS consists of detecting and resolving loop issues based on program directives for latency optimization (pipeline and unroll directives). The HLS tool determines the dependencies between computations and applies those techniques to achieve the specifications. Detecting such loop dependencies and applying transformation is a complex task. Loop dependencies can be classified as loop-independent and loop-carried dependencies.
Loop-independent dependencies do not inhibit any parallelization of the outer loops, while loop-carried dependencies inhibit parallelization because the simultaneous execution of different iterations does not respect the dependencies.
At the algorithm level, the dimension of the feature maps processed by all the modules differs according to the architecture used, as can be verified from Table 2. The feature maps span a three-dimensional (3D) space if the Flow architecture is used, whereas they are seen as a vector if the N-Fold architecture is used. Although BRAMs are used in both cases to store the intermediate data, these differences lead the HLS synthesizer to interpret the design differently. The Flow architecture allows, for instance, a parallelization of loop 3 in the pseudocode of the convolution module (see Table 2), because it has loop-independent characteristics. For this purpose, the directive ''array partition'' is applied to the n loop-iterator at the top of the module. Thus, the buffer creates n smaller arrays from consecutive blocks of the original array. Consequently, direct and broadcast connections between the input and the processing elements are generated. In the N-Fold architecture, the resulting feature map is temporarily stored for each layer by using the same memory in the format of a vector, so the number of direct connections between the input and the processing elements depends on the number of BRAMs needed to build the memory.
For both architectures, loops 4 to 6 in pseudocode 1a) of Table 2 are examples of loop-carried dependencies, since each read operation cannot proceed until the write operation from the previous iteration is completed (Fig. 5a), so parallel calculation cannot be implemented. For example, in pseudocode 1a), the multiplication of the elements of Kernel and Input can be pipelined, but the respective addition requires the result of the addition in the previous iteration of the loop. This is a loop-carried dependency. The inner loop c ∈ [1; KC] accumulates into a temporary register, which is written back to a temporary register at the end of each iteration r ∈ [1; KR]. In turn, when these two loops are completed, the result is accumulated and written back to a variable Out at the end of each iteration of ns ∈ [1; NS]. This is a common scenario when accumulating into a single register (Fig. 5b), in cases where the accumulation latency, $L_{acc}$, is higher than 1 clock cycle ($L_{acc}$ is the latency of a 32-bit floating-point operation).
The loop-carried dependencies are resolved by a cascade of accumulations, allowing the pipeline to compute the output elements. The cascade of accumulations may generate interconnections with a complex topology: an iteration of a pipelined loop depends on a result produced by a previous iteration, accumulating partial results into registers.
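A common way of relaxing such an accumulation dependency is to interleave several partial sums, so that each accumulator is revisited only once every L_acc iterations, and to combine them at the end. The sketch below illustrates that general idea in software; it is not necessarily the circuit generated by the platform, and the lane count `l_acc` is a hypothetical stand-in for the adder latency.

```python
def dot_sequential(a, b):
    """Single accumulator: every addition waits for the previous one."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc


def dot_interleaved(a, b, l_acc=4):
    """Round-robin partial sums: lane k is updated only every l_acc iterations."""
    partial = [0.0] * l_acc
    for i, (x, y) in enumerate(zip(a, b)):
        partial[i % l_acc] += x * y
    total = 0.0
    for p in partial:                        # final reduction of the partial results
        total += p
    return total


a = [float(v) for v in range(1, 9)]          # 1.0 .. 8.0
b = [1.0] * 8
s_seq = dot_sequential(a, b)
s_int = dot_interleaved(a, b)
```

Both functions return 36.0 here. With general floating-point data the two orderings can differ in the last bits, since floating-point addition is not associative.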

B. NON-LINEAR MODULES
Three different non-linear modules are presented in this section: the hyperbolic tangent, the ReLU, and the softmax modules.

1) HYPERBOLIC TANGENT MODULE
The main challenge in designing the hyperbolic tangent module lies in the range −6 < x < 6 of its domain, because for x > 6 or x < −6 the hyperbolic tangent can simply be approximated by tanh(x) ≈ 1 or tanh(x) ≈ −1, respectively. This function is an odd function [2], that is, symmetric about the origin. Therefore, the function is computed for x > 0, and the hyperbolic tangent values for x < 0 are then obtained by using equation (8).
The developed solution stores the hyperbolic tangent values for 0 ≤ x < 6 in memory, filling a Look Up Table (LUT). Figure 6 illustrates the circuit for computing the hyperbolic tangent, from the calculation of the memory address until the output of the value.
The calculation of the memory address from the x input value is based on a linear relation (equation (9)).
The slope, y, is given by the ratio between the number of memory elements, $N_{tanh}$, and the difference between the last, $x_1$, and the first, $x_0$, values of the domain ($x_1 = 6$ and $x_0 = 0$). Then, the quotient in equation (9) is rounded to the nearest integer to obtain the address, selecting the best value of the hyperbolic tangent.
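The look-up scheme can be modelled as follows. The table size of 1024 entries is a hypothetical value (the text does not give N_tanh), the address is clamped to the last entry as a guard added for this sketch, and the negative half of the domain is obtained from the odd symmetry tanh(−x) = −tanh(x).

```python
import math

N_TANH = 1024                    # hypothetical table size; N_tanh is not given in the text
X0, X1 = 0.0, 6.0                # stored domain, 0 <= x < 6
SLOPE = N_TANH / (X1 - X0)       # addresses per unit of x
TABLE = [math.tanh(X0 + i / SLOPE) for i in range(N_TANH)]


def tanh_lut(x):
    """Hyperbolic tangent via table look-up with saturation and odd symmetry."""
    if x >= X1:
        return 1.0               # tanh(x) ~ 1 beyond the stored range
    if x <= -X1:
        return -1.0              # tanh(x) ~ -1 beyond the stored range
    addr = int((abs(x) - X0) * SLOPE + 0.5)   # linear map, nearest integer
    addr = min(addr, N_TANH - 1)              # guard added for this sketch
    y = TABLE[addr]
    return y if x >= 0 else -y


worst = max(abs(tanh_lut(v / 10) - math.tanh(v / 10)) for v in range(-80, 81))
```

With this table size the address step is 6/1024, so the rounding error in x is at most half a step and the output error stays below about 0.003 on the sampled grid. A larger table trades BRAM for accuracy.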

2) ReLU MODULE
As depicted in Figure 7, a ReLU module provides an output equal to zero if the input is less than zero, otherwise, the output is equal to the input: if x < 0, then ϕ(x) = 0; otherwise ϕ(x) = x.

3) SOFTMAX MODULE
The softmax function is used at the last layer of a deep learning-based classifier [23] and it is given by equation (11).
This function, also named the normalized exponential function, is used for a categorical distribution representation, giving a probability distribution over k different possible outcomes. The developed architecture is presented in Figure 8. The first step consists of sequentially calculating the exponential value of each input, x, with the result being stored in a buffer. Note that, before each value is stored in the buffer, it is accumulated into the sum $S = \sum_{j=0}^{k} e^{x_j}$. Then each exponential value is divided by S. To compute the exponential function on the hardware, a hybrid solution was used [18]. This solution decomposes x into an integer part, int_x, and a fractional part, frac_x, i.e., $e^{x} = e^{int\_x} \cdot e^{frac\_x}$. While $e^{frac\_x}$ is calculated by a polynomial interpolator, $e^{int\_x}$ is stored in and loaded from memory. Figure 9 shows a diagram of the adopted hybrid solution.
The int_x is used to compute the memory address. In this implementation, values between $e^{-30}$ and $e^{30}$ are stored in an LUT. The frac_x is used to compute the value of $e^{frac\_x}$ through a 5th-order interpolating polynomial [2], [20].
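The hybrid scheme can be sketched in software as below. Two points are assumptions: int_x is taken as the floor of x, so that frac_x lies in [0, 1), and a degree-5 Taylor polynomial stands in for the 5th-order interpolating polynomial, whose coefficients are not given in the text. Function names are illustrative.

```python
import math

EXP_LUT = {n: math.exp(n) for n in range(-30, 31)}     # e^-30 .. e^30, as in the text
COEFFS = [1.0, 1.0, 1 / 2, 1 / 6, 1 / 24, 1 / 120]     # Taylor stand-in (assumption)


def exp_hybrid(x):
    """e^x = e^int_x * e^frac_x: table look-up times a 5th-order polynomial."""
    int_x = math.floor(x)
    frac_x = x - int_x                                   # in [0, 1)
    if not -30 <= int_x <= 30:
        raise ValueError("integer part outside the stored range")
    poly = 0.0
    for c in reversed(COEFFS):                           # Horner evaluation
        poly = poly * frac_x + c
    return EXP_LUT[int_x] * poly


def softmax(xs):
    """Exponentials are buffered while their sum S is accumulated, then divided by S."""
    buffer, s = [], 0.0
    for x in xs:
        e = exp_hybrid(x)
        buffer.append(e)
        s += e
    return [e / s for e in buffer]


xs = [0.5, 1.25, -2.75]
probs = softmax(xs)
```

With the Taylor stand-in, the relative error of `exp_hybrid` stays below roughly 6e-4 (the worst case is frac_x close to 1); a fitted interpolating polynomial of the same order would be expected to do better.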

VII. BENCHMARKING THE N-FOLD AND FLOW ARCHITECTURES
In this section, the two proposed architectures, the N-Fold and the Flow, are benchmarked. Three different networks, namely ANN, DNN, and CNN, are considered. For each network, the resources required to implement it, based on each architecture on a Kintex7 [43], are presented. Additionally, the latency and the throughput are presented. To further complement the results, the performance of the implementation of these networks on a GPU through the toolbox of Matlab is evaluated. The GPU is an NVIDIA GeForce MX150 with 384 Compute Unified Device Architecture (CUDA) cores.

A. ARTIFICIAL NEURAL NETWORK
Two ANNs, with the same topology, but using different non-linear functions in the hidden layer, are considered in the proposed architectures. The main characteristics of the two ANNs are presented in Table 3. Figure 10 presents the resources required for implementing both ANNs, when adopting the different architectures. Figure 10 can be analyzed in two different perspectives: i) the resources required by both networks when the same architecture is used, and ii) the resources used by the two architectures when the same network is implemented.
Since an ANN contains just one hidden layer and one output layer, it is logical that the difference in resource utilization between both architectures is not large. However, as expected, an ANN implemented with a Flow architecture uses more resources than the same ANN implemented with an N-Fold architecture. The ANN-1 uses 13% and 14% of the FPGA resources for the N-Fold and Flow architectures, respectively, and the ANN-2 uses 14% and 15% of the FPGA resources for the N-Fold and Flow architectures, respectively. On the other hand, it can be observed that an ANN that applies the hyperbolic tangent as the non-linearity requires more resources than the same ANN using the ReLU as the non-linearity (1% increase of the total resources required). The increase in the total resources needed to implement an ANN using a hyperbolic tangent is essentially justified by the BRAMs used to store the hyperbolic tangent values and the DSPs applied to calculate the memory addresses. Figure 11 presents the latency and the throughput of the implemented ANNs on a GPU and on an FPGA with the proposed architectures. Figure 12 presents the power consumption.
An ANN implemented with the N-Fold architecture has a latency slightly higher than the same ANN implemented with the Flow architecture. On the other hand, an ANN using a ReLU as non-linearity has a latency slightly lower than the same ANN using a hyperbolic tangent (the difference not exceeding 0.1 µs), because more time is required to calculate the memory address for accessing the hyperbolic tangent values. The pipeline effect between layers in the Flow architecture is not very noticeable, since the ANNs contain just one hidden layer and one output layer.
The GPU becomes more efficient when several instances of the networks are batched. For example, if 15000 instances of the ANN-1 are processed simultaneously in the GPU, the throughput of the GPU is 11.4× higher than that of the FPGA. In the case of ANN-2, the throughput of the GPU is 9.96× higher than that of the FPGA. Larger batch sizes are almost always more efficient on GPUs, since massive parallelism is exploited to take advantage of all the GPU stream processors. However, batching inference work is sometimes not possible due to the characteristics of the application. In some common applications, such as a server that does inference per request, it is not always possible to implement opportunistic batching, in which the server waits for a fixed time after each incoming request and batches together any other requests that arrive during that time, continuing with a single-instance inference otherwise. From Figure 11, it can be verified that when exactly one instance of the ANN-1 is processed at a time in the GPU, the throughput of the FPGA is 2.9× higher with both architectures. In the case of ANN-2, the throughput of the FPGA is 3× higher with both architectures. As can be observed, situations where a single instance is processed do exist in practice, but they are not well suited to a GPU.
Regardless of the number of instances processed simultaneously, the power consumption does not change significantly. This could be explained by the fact that the system combines CPUs and GPUs, and the GPU does not disconnect the resources that are not used. A comparison between the power consumption of the implementations on the GPU alone and on the FPGA is presented in Figure 12. The power consumption when the ANNs are implemented on the GPU [38] through the Matlab toolbox is higher than when the same networks are implemented with the proposed architectures. In the ANN-1 case, the GPU consumes 7.9× and 8.2× more power than the N-Fold and the Flow implementations, respectively. In the ANN-2 case, the GPU consumes 7.7× and 8.1× more power than the N-Fold and the Flow implementations, respectively.
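The opportunistic batching policy mentioned earlier in this subsection (wait a fixed time after a request arrives, then batch whatever else came in) can be illustrated with a minimal Python sketch. It is not part of the evaluated platform: the function name `collect_batch`, the waiting time `wait_s`, and the limit `max_batch` are hypothetical names and values chosen for the example.

```python
import queue
import time


def collect_batch(pending, wait_s=0.005, max_batch=32):
    """Gather one inference batch from a queue of pending requests.

    Blocks until a first request arrives, then waits at most `wait_s`
    seconds for further requests (hypothetical policy parameters).
    """
    batch = [pending.get()]                    # blocks until one request exists
    deadline = time.monotonic() + wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                              # waiting window is over
        try:
            batch.append(pending.get(timeout=remaining))
        except queue.Empty:
            break                              # nothing else arrived in time
    return batch


# Three requests already queued are served as one batch.
requests = queue.Queue()
for request_id in range(3):
    requests.put(request_id)
first_batch = collect_batch(requests, wait_s=0.05, max_batch=8)

# A lone request falls back to single-instance inference.
requests.put("lone")
second_batch = collect_batch(requests, wait_s=0.05, max_batch=8)
```

A batch of length one corresponds to the single-instance case discussed above, which is where the FPGA implementations are reported to outperform the GPU.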

B. DEEP NEURAL NETWORK
The DNNs in Table 4 were used to evaluate the two proposed architectures. Figure 13 presents the FPGA resources required for implementing both DNNs with the two proposed architectures. It shows that a DNN with 3 hidden layers implemented using the Flow architecture uses more resources than the same DNN implemented using the N-Fold architecture. The DNN-1 uses 14% and 31% of the total resources available when implemented with the N-Fold and the Flow architectures, respectively. The total resources needed to implement the DNN-2 using the N-Fold and the Flow architectures are 16% and 35%, respectively. Moreover, a DNN using the hyperbolic tangent as the non-linearity requires more resources than one using the ReLU (an additional 2% in the case of the N-Fold and 4% in the case of the Flow architecture).
Figure 14 presents the latency and the throughput of the DNNs implemented on a GPU and with the proposed architectures. Figure 15 presents the power consumption. Overall, a DNN using a ReLU as the non-linearity (DNN-1) exhibits a latency slightly higher than the same DNN using the hyperbolic tangent as the non-linearity instead (DNN-2). On the other hand, the DNN implemented with the Flow architecture has a higher throughput than when implemented with the N-Fold architecture: the throughput increases by 5.1% for DNN-1 and 4.5% for DNN-2 when the Flow architecture is used. When 15000 instances of the DNN-1 are batched and processed simultaneously in the GPU, the throughput of the GPU is 13.1× and 13.8× higher than that of the Flow and the N-Fold FPGA architectures, respectively. In the case of DNN-2, the throughput of the GPU is 12× and 12.6× higher than that of the Flow and the N-Fold architectures, respectively. Furthermore, it can be verified that when exactly one instance of the DNN-1 is processed at a time in the GPU, the throughput of the FPGA is 3.4× and 3.6× higher with the N-Fold and the Flow architectures, respectively. In the case of DNN-2, the throughput of the FPGA is 3.7× and 3.9× higher with the N-Fold and the Flow architectures, respectively.
The power consumption of the DNNs implemented on the GPU through the Matlab toolbox is higher than that on the FPGA. In the DNN-1 case, the GPU consumes 7.6× and 7.4× more power than the N-Fold and the Flow implementations, respectively. In the DNN-2 case, the GPU consumes 7.5× and 7.2× more power than the N-Fold and the Flow implementations, respectively.
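The throughput and power ratios reported above can be combined into a rough throughput-per-watt comparison. The Python sketch below is only an illustrative calculation and not a result measured in this work. It assumes that "7.4× more power" means the GPU draws 7.4 times the power of the FPGA, and that the same power ratio applies to both the single-instance and the batched case, in line with the earlier observation that power consumption does not change significantly with the number of instances. The helper name `fpga_perf_per_watt_gain` is hypothetical.

```python
def fpga_perf_per_watt_gain(fpga_over_gpu_throughput, gpu_over_fpga_power):
    """Return (FPGA throughput per watt) / (GPU throughput per watt).

    Both arguments are dimensionless ratios, so the result is one as well.
    """
    return fpga_over_gpu_throughput * gpu_over_fpga_power


# DNN-1 on the Flow architecture, using the rounded ratios quoted in the text.
GPU_POWER_OVER_FPGA = 7.4          # GPU power relative to the FPGA (assumed reading)
FPGA_OVER_GPU_SINGLE = 3.6         # FPGA throughput advantage, one instance at a time
FPGA_OVER_GPU_BATCHED = 1 / 13.1   # GPU is 13.1x faster with 15000 batched instances

single = fpga_perf_per_watt_gain(FPGA_OVER_GPU_SINGLE, GPU_POWER_OVER_FPGA)
batched = fpga_perf_per_watt_gain(FPGA_OVER_GPU_BATCHED, GPU_POWER_OVER_FPGA)
print(f"single instance: {single:.1f}x, batched: {batched:.2f}x")
```

Under these assumptions the FPGA delivers roughly 26.6× more throughput per watt for single-instance inference, whereas with 15000 batched instances the ratio falls to about 0.56×, that is, in favour of the GPU. Since the inputs are rounded ratios, these figures are indicative only.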

C. CONVOLUTION NEURAL NETWORKS
Two CNNs were chosen to experimentally evaluate each of the proposed architectures. The main features of the CNNs applied for the experimental assessment are presented in Table 5. Figure 16 presents the FPGA resources required for implementing the CNN-1 and the NCNN-1 using the two proposed architectures. The CNN-1 implemented with the N-Fold architecture uses 39% of the total resources available, while the CNN-1 implemented with the Flow architecture uses 49% of the total resources. As expected, the CNN-1 designed with the N-Fold architecture uses fewer resources than when designed with the Flow architecture. Figure 16b shows an extreme situation: the impossibility of ''fitting'' a network implemented with the Flow architecture due to the lack of resources. The NCNN-1 implementation using the N-Fold architecture requires around 47% of the total resources available. However, the NCNN-1 cannot be implemented with the Flow architecture, due to the lack of LUTs and DSPs. In these situations, the N-Fold architecture is the only one that can be used for implementing the NCNN-1 in a single FPGA. Figure 18 presents the corresponding power consumption.
This experiment shows the advantage of the Flow architecture in terms of performance. The Flow architecture achieves significantly higher throughput than the N-Fold architecture: the increase in throughput is 2.12× and 4.44× for the CNN-1 and the NCNN-1, respectively. The GPU becomes more efficient when several instances are batched. For example, if 15000 instances of the CNN-1 are processed simultaneously in the GPU, the throughput of the GPU is 12.5× and 26.4× higher than that of the Flow and the N-Fold architectures, respectively. In the case of the NCNN-1, the throughput of the GPU is 5.6× and 25× higher than that of the Flow and the N-Fold architectures, respectively. It can be verified that when only one instance of the CNN-1 is processed at a time in the GPU, the throughput of the FPGA is 5.1× and 10.7× higher with the N-Fold and the Flow architectures, respectively. In the case of the NCNN-1, the throughput of the FPGA is 5.3× and 23.52× higher with the N-Fold and the Flow architectures, respectively. As can be observed, situations where a single instance is processed exist in practice but are not well suited to a GPU. The power consumption of the CNN-1 and the NCNN-1 implemented on a GPU through the Matlab toolbox is higher than that of the same networks implemented with the proposed architectures. In the CNN-1 case, the GPU consumes 6.1× and 6.2× more power than the N-Fold and the Flow implementations, respectively. In the NCNN-1 case, the GPU consumes 5.9× and 4.9× more power than the N-Fold and the Flow implementations, respectively.

VIII. COMPARATIVE EVALUATION WITH STATE-OF-THE-ART
The CNN used to compare the performance of the proposed architectures with state-of-the-art FPGA implementations has a first layer with 11 convolutions, using 3 × 3 kernels and 3 feature maps as input, producing maps of size 24 × 24. The next layer computes 12 high-level features by performing 3 × 3 convolutions. The third layer performs 2 × 2 spatial pooling. The fourth layer performs 3 × 3 convolutions resulting in 10 feature maps, followed by a 2 × 2 pooling layer. Finally, the last layer is a linear classifier with 10 neurons that applies the softmax function as the activation function. Two different implementations were tested, one based on the N-Fold architecture and the other on the Flow architecture. The CNN implemented with the Flow architecture uses 49% of the total resources, whereas the CNN implemented with the N-Fold architecture uses 43%. Table 6 compares the performance of some state-of-the-art CNNs implemented on FPGAs with the one implemented with the platform investigated in this paper.
Most of the implementations use fixed-point arithmetic, while the implementation proposed in this paper uses floating-point arithmetic. The implementation in [44] also uses floating-point arithmetic but, in contrast with the solution proposed in this paper, requires external memory to store/retrieve kernels as well as an on-/off-chip interconnect. As a result, both architectures proposed in this paper are more cost-effective.
In general, the two architectures developed in this work present a competitive throughput in comparison to the existing solutions in Table 6. However, the solution presented in [44] is the most noteworthy, as this implementation presented a throughput of 61.62 GOP/s, which is better than the implementation using the Flow architecture. One reason that could explain why the proposed architecture does not achieve a higher throughput is directly related to the number of access ports available in the memory between the layers. In [44] the CNN design is composed of PEs, an on-chip buffer, external memory, and an on-/off-chip interconnect. All the intermediate data are stored in the external memory, and the PE is the basic computation unit for convolution, which is tiled to fit in the programmable logic (PL) part. The tile data are first transferred from the external memory to the on-chip buffer before being fed to the PEs. This buffer contains several independent buffer banks, and the number of these buffer banks is equal to the number of inputs in the tile data. In this way, it is possible to access all the inputs simultaneously to process the tile, increasing the calculation speed. In the architectures proposed in this paper, the intermediate data between the layers are stored in a buffer. The size of that buffer is sufficient to hold all the intermediate values, but each input does not have an exclusive port. The fact that there are not enough ports to access all intermediate values simultaneously may delay the processing of these values. On the other hand, the implementation using the N-Fold architecture ranks in a favorable position, considering that the N-Fold architecture saves resources at the cost of throughput.
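The effect of the number of memory ports described above can be illustrated with a simplified cycle-count model. The Python sketch below is an assumption-laden approximation and not the timing model of either design: it supposes that every port delivers one value per clock cycle, and both the tile size `TILE_INPUTS` and the two-port buffer are hypothetical values chosen for the example.

```python
import math


def read_cycles(num_values, ports):
    """Clock cycles needed to read `num_values` values when `ports`
    values can be read in parallel in each cycle (simplified model)."""
    if ports < 1:
        raise ValueError("ports must be at least 1")
    return math.ceil(num_values / ports)


TILE_INPUTS = 9  # hypothetical tile: one 3 x 3 window

# One independent bank per input, as described for the buffer in [44].
banked = read_cycles(TILE_INPUTS, ports=TILE_INPUTS)

# A shared buffer with only two ports (hypothetical configuration).
shared = read_cycles(TILE_INPUTS, ports=2)

print(banked, shared)
```

In this model, the banked buffer supplies the whole tile in a single cycle, whereas the shared buffer needs several cycles to deliver the same inputs, which is the kind of delay the paragraph above attributes to the lack of exclusive ports.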

IX. CONCLUSION
A platform to deploy generic, parameterizable deep learning systems on an FPGA was proposed in this paper. In this platform, the parameterization of the networks is applied to easily design a deep learning system that adopts convolution, max-pooling, batch-normalization, and fully connected layers. The parameterization provides the designer with tools for configuring a deep learning system in a ''lego'' approach and deploying it automatically on an FPGA.
Two different architectures were proposed in this paper to design and implement those networks on FPGAs, namely the N-Fold architecture and the Flow architecture. While the N-Fold architecture is composed of a single hardware layer, which iterates N times to process the multiple layers, the Flow architecture processes ''flows'' between hardware layers that operate in a pipelined manner. In both architectures, each layer is composed of two different modules: the main operation and the non-linearity modules. Loop unrolling and pipelining were applied to improve the performance and the efficiency of the networks synthesized with HLS.
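The difference between the two architectures described above can be illustrated with an idealized timing model. The Python sketch below is a textbook pipeline approximation, not the timing model of the proposed platform: it ignores buffer transfers and control overhead, the function names are hypothetical, and the per-layer times are made-up values. The gains measured in this paper are lower than such a model predicts for some networks, ranging from about 5% for the DNNs to 4.44× for the NCNN-1.

```python
def n_fold_timing(layer_times):
    """N-Fold: one shared hardware layer executes every network layer in turn."""
    latency = sum(layer_times)
    throughput = 1.0 / latency           # the next input waits for the whole pass
    return latency, throughput


def flow_timing(layer_times):
    """Flow: one hardware layer per network layer, operating as a pipeline."""
    latency = sum(layer_times)
    throughput = 1.0 / max(layer_times)  # the slowest stage sets the steady-state rate
    return latency, throughput


layers = [4.0, 2.0, 2.0]                 # hypothetical per-layer times in microseconds
deeper = layers + [2.0]                  # same network with one extra layer

for name, net in (("3 layers", layers), ("4 layers", deeper)):
    _, fold_rate = n_fold_timing(net)
    _, flow_rate = flow_timing(net)
    print(f"{name}: N-Fold {fold_rate:.3f}/us, Flow {flow_rate:.3f}/us")
```

In this idealized model, adding a layer lowers the N-Fold throughput because the shared hardware stays busy for longer per input, while the Flow throughput is unchanged as long as the new layer is not the slowest stage. This matches the qualitative trend reported for deeper networks.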
The performance of the networks implemented with the proposed architectures has been evaluated and compared with the performance achieved with GPUs. For single-instance inference, the proposed architectures achieve a higher throughput and a lower power consumption than the implementation on a GPU. Comparing both architectures, the N-Fold architecture requires fewer FPGA resources than the Flow architecture, since the N-Fold reuses resources. On the other hand, the performance of networks using the Flow architecture is higher than that of networks using the N-Fold architecture. The deeper the network, the more significant the increase in latency and the decrease in throughput of the networks implemented with the N-Fold architecture. This characteristic is an advantage of the Flow over the N-Fold architecture. However, the architecture presented herein requires all the weights and kernels to be stored on-chip and, as a consequence, the supported model size is constrained by the storage resources of the target device.
DARÍO BAPTISTA received the master's degree in telecommunications and networks from the University of Madeira, Portugal, in 2009. He is currently pursuing the Ph.D. degree called NETSyS with the Instituto Superior Técnico, Lisbon. He has been involved in research projects, since 2010, with the Madeira Interactive Technologies Institute and the Centre of Exact Sciences and Engineering of the University of Madeira. His research interests include automation, artificial neural networks, deep learning, and field programmable gate array implementations.
LEONEL SOUSA (Senior Member, IEEE) received the Ph.D. degree in electrical and computer engineering from the Instituto Superior Técnico (IST), Universidade de Lisboa (UL), Lisbon, Portugal, in 1996. He is currently a Full Professor with Universidade de Lisboa (UL). He is also a Senior Researcher with the Research and Development Instituto de Engenharia de Sistemas e Computadores (INESC-ID). His research interests include VLSI architectures, computer architectures, parallel computing, computer arithmetic, and signal processing systems. He has contributed to more than 200 papers in journals and international conferences, for which he got several awards, such as DASIP'13 Best Paper Award, SAMOS'11 'Stamatis Vassiliadis' Best Paper Award, DASIP'10 Best Poster Award, and the Honorable Mention Award UTL/Santander Totta for the quality of the publications, in 2009. He has contributed to the organization of several international conferences, namely as program chair and as general and topic chair, and has given keynotes in some of them. He has edited four special issues of international journals, and he is currently Senior Editor of the IEEE JETCAS, Associate Editor of the IEEE TRANSACTIONS ON COMPUTERS, IEEE ACCESS, and Springer JRTIP. He is Fellow of the IET and Distinguished Scientist of ACM.
FERNANDO MORGADO-DIAS (Member, IEEE) received the master's degree in microelectronics from the University Joseph Fourier, Grenoble, France, in 1995, and the Ph.D. degree from the University of Aveiro, Portugal, in 2005. He is currently an Assistant professor with the University of Madeira and a Researcher with the Madeira Interactive Technologies Institute and Larsys/ITI. His research interests include renewable energy, artificial neural networks, and FPGA implementations. His affiliation is now at the University of Madeira, ITI/Larsys, and Madeira Interactive Technologies Institute. VOLUME 8, 2020