1 Introduction

The deep learning paradigm has emerged as a promising, scalable machine learning solution for large-scale data analytics and inference [1,2,3,4]. Many applications in smart transportation and smart and connected communities inherently require real-time or near real-time scalable deep learning inference. One major example is real-time video analytics for object localization, detection, and classification. Given tight latency requirements, long communication latencies, and scarce communication bandwidth, the cloud computing paradigm cannot offer a scalable, sustainable solution for real-time deep learning inference. Therefore, novel architecture and design paradigms are required to push deep learning from the cloud to the edge of the network, near the data producers (e.g., video cameras).

While GPUs are widely used for training, they are not an efficient platform for real-time deep learning inference at the edge. GPUs are inherently throughput-oriented machines, which makes them less suitable for the edge: they require large data batches (multiple video frames) to achieve high performance and power efficiency. Furthermore, GPUs lack deterministic execution patterns [5, 6]. To overcome these limitations, many custom hardware approaches for accelerating deep learning inference have emerged; industrial examples include the Google TPU [7] and Microsoft Brainwave [8]. While these platforms offer much higher performance and power efficiency than GPUs, they still rely on throughput-oriented processing principles, which are better suited to cloud computing than to real-time inference at the edge. There is a need for novel custom platforms that offer latency-aware, scalable acceleration for real-time deep learning analytics over streaming data at the edge.

In this paper, we propose a novel reconfigurable architecture template for real-time, low-power execution of Convolutional Neural Networks (CNNs) on edge devices next to the camera. In principle, our proposed architecture is a coarse-grain dataflow machine that performs CNN computation over streaming pixels. It consists of the basic functional blocks required for CNN processing. The blocks are configurable with respect to data window size (convolution size), stride, and other network hyperparameters. The macro datapath is constructed by chaining the function blocks according to the targeted network topology. Function blocks are fused together and work concurrently to realize the convolutional operations without storing the streaming pixels in the memory hierarchy. Furthermore, the architecture provides enough configurability to adapt to rapidly growing and continuously evolving CNN topologies. As a result, the proposed architecture offers a reconfigurable template (rather than a single solution) that can generate efficient architecture instances, making it easy to adapt the architecture to any desired network topology. Moreover, our architecture works in a streaming fashion with minimal memory access while exploiting the algorithm's intrinsic parallelism.

The major focus of our architecture is on accelerating the first two layers of CNNs, as they are the most compute-intensive kernels. The first two layers run on the edge device, next to the camera, while the remaining layers run on an edge server in proximity to the edge devices. Our implementation on Xilinx Zynq FPGAs of the first two layers of the SqueezeNet network [9] shows 315 mW on-chip power consumption with an execution time of 0.24 ms. In contrast, the Nvidia Tegra TX2 GPU achieves an execution time of 31.4 ms with much higher power consumption (7.5 W).

The rest of this paper is organized as follows. Section 2 presents a summary of existing methods and past literature on architectures for neural networks. Section 3 motivates the proposed architecture. Section 4 explains the details of the proposed architecture, and Sect. 4.4 presents function block integration and dimensioning. Section 5 presents our implementation results. Finally, Sect. 6 concludes the paper.

2 Related Work

The large power consumption of GPUs conflicts with the low-power requirements of mobile applications [10,11,12,13]. This has pushed designers toward customized hardware accelerators for implementing CNNs at the edge, targeting either ASICs [6] or FPGAs [14]. Most recent works have focused on converting direct convolution to matrix multiplication. Among them, some have focused on performing the multiplication in a low-latency and low-power manner. Tann et al. [15] propose mapping floating-point networks to 8-bit fixed-point networks with integer power-of-two weights, replacing multiplications with shift operations for low-power, low-latency execution.

A number of recent works have addressed the extensive memory requirements of CNNs and have proposed different methods to reduce memory accesses [16]. For example, [17] proposes mapping a CNN entirely inside an SRAM, exploiting the fact that weights are shared among many neurons, and eliminating all DRAM accesses for weights. Later, the authors of [18] proposed a hardware accelerator targeted at FPGAs that exploits the sparsity of neuron activations to accelerate computation and reduce external memory accesses; they leverage the flexibility of FPGAs to make their architecture work with different kernel sizes and numbers of feature maps. Han et al. [19] use deep compression to fit large networks into on-chip SRAMs and accelerate the resulting sparse matrix-vector multiplication by weight sharing; they decrease energy usage by moving from DRAMs to SRAMs, exploiting sparsity in multiplication, weight sharing, and so on. Jafri et al. [20] present an architecture targeted at ASICs that compresses and decompresses both input image pixels and kernels to minimize DRAM accesses; they also present an algorithm (LAMP) that intelligently partitions the memory and computational resources of a MOCHA accelerator to map a CNN onto it. [21] proposes a convolution engine that achieves energy efficiency by capturing data reuse patterns and enabling a large number of operations per memory access. The authors of [22] propose fusing the processing of multiple CNN layers by modifying the order in which input data are brought on chip; they cache intermediate data (data transferred between layers) between the evaluations of adjacent CNN layers.

Despite these different approaches to reducing memory accesses, there is still a lack of an architecture that separates computation data from memory data and operates directly on streaming pixels. This paper proposes such an architecture, which can further be configured for any desired network topology.

3 Background and Motivation

In this section, we briefly review the data access types in CNNs and the differences between General Matrix Multiplication (GEMM) and direct convolution. We conclude with the motivation to focus on the first two layers of the CNN.

3.1 Data Access Types

Convolutional Neural Networks (CNNs) are both memory- and compute-intensive applications, often reusing intermediate data while consistently performing millions of parallel operations. Furthermore, the inherently memory-intensive aspects of the algorithm are further exacerbated by complex multi-dimensional data accesses. In this regard, we consider two major types of data when performing CNN inference.

  1. 2D weights: The first type is the 2D weight matrices. Each weight matrix corresponds to a single channel, and these channel matrices are grouped together to construct a kernel. Multiple kernels form a layer, and multiple layers create a network topology.

  2. Frame pixels: The streaming pixels are the input to the CNN processing. Like the weight matrices, they form 2D matrices with multiple channels. This is the data that flows through the network topology.

3.2 GEMM vs Direct Convolution

Direct convolution is the point-wise multiply-and-accumulate (MAC) operation across the 2D weight matrices and the frame pixels. In direct convolution, similar to the algorithmic-level definition, the weight matrices are used to perform multiple multiply and accumulate operations directly on a 2D window of input pixels. Direct convolution proceeds in a sliding-window fashion with respect to a stride parameter that varies from layer to layer across network topologies. Figure 1 illustrates direct convolution for a 3 by 3 convolution window over a 5 by 5 pixel frame.

Fig. 1. Direct convolution
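
To make the operation concrete, the following minimal Python sketch (ours, not part of the original design) performs the direct convolution of Fig. 1: a point-wise multiply-and-accumulate of a K by K weight window slid over the frame, with a stride of 1 and no padding assumed for the example.

```python
# Minimal sketch of direct convolution: a point-wise multiply-and-accumulate of
# a KxK weight window slid over the input frame (stride 1 assumed, no padding).

def direct_conv2d(frame, weights, stride=1):
    H, W = len(frame), len(frame[0])
    K = len(weights)
    out = []
    for r in range(0, H - K + 1, stride):
        row = []
        for c in range(0, W - K + 1, stride):
            acc = 0
            for i in range(K):            # point-wise MAC over the window
                for j in range(K):
                    acc += frame[r + i][c + j] * weights[i][j]
            row.append(acc)
        out.append(row)
    return out

# Fig. 1 example: a 3x3 window over a 5x5 frame yields a 3x3 output.
frame = [[r * 5 + c for c in range(5)] for r in range(5)]
kernel = [[1, 0, -1]] * 3
print(direct_conv2d(frame, kernel))
```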

Fig. 2. General Matrix Multiplication (GEMM)

Traditionally, GPUs have seen much success in the cloud by using a linear algebra transformation called General Matrix Multiplication (GEMM) to lower convolution to a regular matrix multiplication. GEMM transforms all the temporal parallelism into spatial parallelism, which helps GPUs achieve high throughput when large data batches are available. However, this comes at a significant memory cost: the transformation rearranges the input image pixels with redundant copies. Our estimation reveals that this rearrangement results in roughly 11X data duplication for the first layer alone of a typical CNN, which translates to significant power and energy overhead for accessing the redundant pixel data throughout the memory hierarchy. Figure 2 illustrates the GEMM operation for the same example shown in Fig. 1. As we observe, duplication of frame pixels is required to transform the convolution into a large matrix multiplication: for this example, the pixel matrix becomes 9 by 9, compared to the original 5 by 5 frame.
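
The sketch below (illustrative only) shows the GEMM lowering, often called im2col: each receptive field is copied into one row of a large matrix. It reproduces the numbers quoted above, a 9 by 9 pixel matrix for the 5 by 5 frame of Fig. 2, and roughly 11X pixel duplication for a 227 by 227 input with a 7 by 7 kernel; a stride of 2 is assumed here, matching SqueezeNet's first layer.

```python
# Sketch of the GEMM lowering (im2col): every KxK receptive field is copied
# into one row of a large matrix, so convolution becomes a matrix multiplication.

def im2col(frame, K, stride=1):
    H, W = len(frame), len(frame[0])
    rows = []
    for r in range(0, H - K + 1, stride):
        for c in range(0, W - K + 1, stride):
            rows.append([frame[r + i][c + j] for i in range(K) for j in range(K)])
    return rows

# Fig. 2 example: the 5x5 frame with a 3x3 window becomes a 9x9 pixel matrix.
frame = [[r * 5 + c for c in range(5)] for r in range(5)]
mat = im2col(frame, K=3)
print(len(mat), len(mat[0]))              # 9 9

# Duplication estimate for a first layer: 227x227 input, 7x7 kernel.
# A stride of 2 is assumed here (as in SqueezeNet's conv0).
H, W, K, S = 227, 227, 7, 2
out = (H - K) // S + 1                    # 111 output positions per dimension
dup = (out * out * K * K) / (H * W)       # ~11.7x more pixel data than the frame
print(out, round(dup, 1))                 # 111 11.7
```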

3.3 CNN Execution Bottlenecks

In this paper, we primarily focus on accelerating the first two layers of the CNN, as they are the major execution bottlenecks. We specifically target SqueezeNet [9], a deep CNN designed with memory efficiency in mind. To motivate our argument, we have estimated the computation demand across the layers of SqueezeNet [9]. Figure 3 shows the computation distribution across the SqueezeNet layers. Overall, SqueezeNet contains 10 layers. The first and last layers are traditional convolution layers (conv0 and conv9). The intermediate layers are squeeze (s) and expand (e) convolutional layers: the squeeze layers combine feature maps to make the network more efficient, and the expand layers expand the feature maps. As we observe, the first layer (conv0) has the highest computation demand, contributing 21% of the overall computation. The first layer also generates the largest feature map of all layers, which can lead to significant communication and memory traffic. Figure 4 presents the contribution of each layer to the feature map size. To minimize memory accesses and communication demand, it is beneficial to accelerate the second layer (s1, e1) along with the first layer; in this way, much smaller feature maps are transferred to the edge server for processing of the remaining layers.

Fig. 3. Computation distribution across the SqueezeNet layers

Fig. 4. Feature map distribution

Fig. 5. From algorithm composition to architecture realization.

4 Architecture Template

This section introduces our proposed architecture template for real-time execution of CNN inference at the edge. The proposed template targets FPGA devices, as they offer both efficient execution and sufficient reconfigurability to cope with continuously evolving CNN topologies [23]. Furthermore, by targeting FPGAs, we are able to generate a customized datapath for each CNN network that best fits its processing requirements. The major premise of our proposed architecture is to remove the gap between the algorithm's execution semantics and the architecture's realization. Therefore, our proposed architecture is primarily a dataflow machine working on streaming data based on direct convolution. It consists of three main function blocks for realizing a wide range of CNN inference topologies: the Convolutional Processing Element (CPE), the Aggregation Processing Element (APE), and the Pooling Processing Element (PPE). The blocks are configured and connected according to the target network topology, creating a macro-pipeline datapath. Figure 5 presents the overall architecture realization from the logical domain (algorithm) to the physical domain (architecture).

Fig. 6. CNN computation mapping between the edge node and edge server

Our architecture is designed around the natural dataflow of CNNs. It is able to exploit both spatial parallelism across the convolutions within the same layer and temporal parallelism between the blocks across layers. The blocks are configurable with respect to network parameters such as convolution size and stride. This makes it possible to easily adapt the architecture to any desired network topology.

In this paper, we focus our architecture realization on the first two layers of CNNs. While our proposed architecture template is extensible and can support an entire CNN topology, the primary limitation is the hardware resources available on the FPGAs of edge devices. At this moment, we are targeting smaller FPGAs with small reconfigurable fabrics, e.g., the Xilinx Zynq [24]. However, by accelerating the first two layers on the edge node, we relax the computation demands on the edge server. Figure 6 shows the logical mapping of the network between the edge node and the edge server. The edge node performs the heavy computation of the first layer. Furthermore, it runs the second layer to significantly shrink the feature map. It then sends the feature maps to the edge server, which processes the remaining layers.

4.1 Convolutional Processing Elements (CPE)

The Convolutional Processing Element (CPE) is responsible for performing the primary computation of the CNN: direct convolution over a two-dimensional pixel stream. Figure 7 presents the internal architecture of the CPE block. It contains two primary blocks: (1) the 2D-line buffer and (2) the Multiply-and-Accumulate (MAC) engines.

Fig. 7. Convolutional Processing Element

2D-Line Buffers. The 2D-line buffer is what enables convolutional neural networks to operate in a streaming manner. This is achieved by retaining reused pixels for multiple cycles. The pixels that must be reused are determined by the network topology and the receptive field of the layer to which the 2D-line buffer is mapped. Every layer of a convolutional neural network has a hyperparameter called stride that dictates how the receptive field slides over the feature maps, both horizontally and vertically. No matter what the stride is, the minimum amount of data that must be stored is determined by the size of the receptive field, or filter window. However, when the stride is smaller than the filter dimension, extra feature-map data must be kept in a buffer.

To deal with horizontal reuse, at most twice the extra pixels must be kept; vertical reuse, however, requires all rows used in the filter window to be available. The 2D-line buffer used in our approach overcomes this by keeping only the minimum number of rows needed. We keep at least one row of the streaming input to preserve horizontal reuse and maintain extra rows, depending on the filter size, to preserve vertical reuse. This is done for all streaming feature-map data in each layer of the accelerator. The 2D-line buffer is extended with an input FIFO, which we call the stream accumulator, allowing the buffer to receive input while operating on data at the same time.
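
The following behavioral sketch captures the essential idea of the 2D-line buffer: only the last K rows of the pixel stream are retained, and a K by K window is emitted whenever one becomes available. It is a simplification under stated assumptions (a vertical stride of 1 and no stream accumulator FIFO), and the class and method names are ours.

```python
from collections import deque

# Behavioral sketch of the 2D-line buffer: only the last K rows of the pixel
# stream are retained (instead of the whole frame), and a KxK window is emitted
# whenever one becomes available. Vertical stride handling is simplified here.

class LineBuffer2D:
    def __init__(self, width, K, stride=1):
        self.width, self.K, self.stride = width, K, stride
        self.rows = deque(maxlen=K)     # at most K rows are ever buffered
        self.current = []               # the row currently streaming in

    def push(self, pixel):
        """Accept one streaming pixel; return any KxK windows now complete."""
        self.current.append(pixel)
        windows = []
        if len(self.current) == self.width:      # a full row has streamed in
            self.rows.append(self.current)
            self.current = []
            if len(self.rows) == self.K:         # enough rows for a window
                for c in range(0, self.width - self.K + 1, self.stride):
                    windows.append([row[c:c + self.K] for row in self.rows])
        return windows

# Example: stream a 5x5 frame through a 3x3 line buffer -> 9 windows (Fig. 1).
lb = LineBuffer2D(width=5, K=3)
n_windows = sum(len(lb.push(p)) for p in range(25))
print(n_windows)   # 9
```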

Multiply-and-Accumulate (MAC) Engines. The convolution unit is the heart of the architecture. It is composed of a series of independent MAC units that perform parallel multiply-and-accumulate operations every cycle. The MAC units can execute any kernel size by simply changing the number of cycles over which they operate on data. These MAC units further enable efficient and flexible convolution by exploiting the multiple forms of parallelism inherent to the convolution operation. The first form we exploit is intra-kernel parallelism, achieved by dividing the convolution of a single kernel across multiple MAC units. By exploiting this parallelism, a 7\(\,\times \,\)7 kernel that would normally take 49 cycles can take only 7 cycles, by dedicating 7 MAC units to operate on the pixel and weight data in parallel.

The next form of parallelism, inter-kernel parallelism, is achieved by fetching multiple kernels at once and dedicating at least one MAC unit to each. The main benefit of this form of parallelism comes when the full available inter-kernel parallelism is exploited: when all kernels are run together, the kernel weights can be kept in the buffer, removing unnecessary memory fetches. The 2D-line buffer also enables data-level parallelism by reusing the same kernel on all the feature-map data available in the buffer, which reduces the memory footprint on the system. Further feature-map parallelism is possible by running multiple feature-map sections concurrently; however, this would increase the memory footprint on the main system, so we leave it to future work.
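
The rough cycle model below illustrates the trade-off between these two forms of parallelism. Only the 7 by 7 kernel size and the 96 kernels of the first layer come from the text; the MAC-unit assignments in the example calls are illustrative assumptions.

```python
import math

# Rough cycle model for producing one output position in a layer, illustrating
# intra- vs inter-kernel parallelism. Only the 7x7 kernel and the 96 kernels
# come from the text; the MAC-unit assignments below are illustrative.

def cycles_per_output(kernel_macs, n_kernels, macs_per_kernel, kernels_in_parallel):
    """kernel_macs: MACs in one kernel window (e.g. 7*7 = 49).
    macs_per_kernel: MAC units assigned to a single kernel (intra-kernel).
    kernels_in_parallel: kernels evaluated concurrently (inter-kernel)."""
    cycles_one_kernel = math.ceil(kernel_macs / macs_per_kernel)
    kernel_batches = math.ceil(n_kernels / kernels_in_parallel)
    return cycles_one_kernel * kernel_batches

K2, N = 7 * 7, 96   # first layer: 7x7 kernels, 96 of them

# Intra-kernel: 7 MAC units on one kernel -> 7 cycles per kernel, kernels in turn.
print(cycles_per_output(K2, N, macs_per_kernel=7, kernels_in_parallel=1))    # 672

# Inter-kernel: one MAC unit per kernel, all 96 kernels at once -> 49 cycles.
print(cycles_per_output(K2, N, macs_per_kernel=1, kernels_in_parallel=96))   # 49
```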

Fig. 8. Aggregation Processing Element (APE)

4.2 Aggregation Processing Elements (APE)

This block performs aggregation across multiple output streams representing different channels. Figure 8 presents an overall view of our proposed APE module. The APE is perhaps the simplest functional block in our architecture: it takes the stream of input pixels, which have negative and positive values, rectifies the negative values to zero, and passes the positive values through unchanged. Therefore, the output of the APE is a non-negative sequence of pixel values.
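
A behavioral sketch of the APE is given below. The per-pixel summation across channel streams is our reading of the aggregation role (an assumption), while the rectification follows the description above.

```python
# Behavioral sketch of the APE: per-pixel aggregation across the per-channel
# convolution streams (summation assumed), followed by rectifying negative
# values to zero, so the output is a non-negative pixel stream.

def ape(channel_streams):
    """channel_streams: equally long per-channel output streams (lists)."""
    for pixels in zip(*channel_streams):
        s = sum(pixels)                 # aggregate across channels (assumption)
        yield s if s > 0 else 0         # rectify negatives to zero

# Example: three channel streams aggregated into one non-negative stream.
print(list(ape([[1, -4, 2], [0, 1, -1], [2, 1, -3]])))   # [3, 0, 0]
```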

Fig. 9. The sliding window with stride

4.3 Pooling Processing Elements (PPE)

Pooling Processing Elements (PPEs) are in charge of down-sampling the image. Every pooling operation has two parameters, stride and window (filter) size, and the degree of compression depends on the stride. The core idea of pooling with an \(n\times n\) window is to replace each window with the maximum of all the elements in that window. Figure 9 shows an example of pooling with a window size of \(3\times 3\) and a stride of 2.

To avoid the unnecessary memory and buffer requirements of storing the entire feature map, the proposed pooling block works on the stream of pixels while supporting variable horizontal and vertical pooling strides. Figure 10 presents the architecture details of our proposed PPE. In the following, we describe a \(3\times 3\) window with a stride of two as an example to illustrate the on-the-fly pooling process. For the horizontal stride, the pooling block receives the first pixel and keeps it in a register until the second pixel arrives. When the second pixel arrives, a comparator compares the two and keeps the result in a register, since the end of the window has not yet been reached. When the third pixel arrives, it is sent to the comparator to determine the maximum of the first three pixels. This maximum is then stored in the FIFO. The third pixel is also kept separately in a register to be compared with pixel 4, because with a stride of 2, pixel 3 is shared between the first and second windows. The same process repeats until all the pixels in the first row have been received. By this time, the maximum value of each window in the first row of the image is stored in the FIFO, and the FIFO is full.

Fig. 10. Pooling Processing Elements (PPE)

To take care of the vertical stride, when the second row arrives, the maximum of its first three pixels is calculated as for the first row. At this point, the oldest entry of the FIFO is popped; this element is the maximum of the first window in the first row. It is compared with the new maximum from the second row, and the larger of the two is fed back into the FIFO. Similarly, when the third row arrives, the process for the second row is repeated, and finally the maximum of all nine pixels in the first window is fed to the FIFO. Moreover, since the third row is also the vertical end of the first window, as its pixel stream arrives we also feed it to a second pooling block as the first row of the next vertical window, and the whole process described above is replicated in that block. The first pooling block is vacated once all the maxima of the first-row windows have been calculated and sent out, so by the time the fifth row (the start of the third vertical window) arrives, it is ready to receive that row as a first row.
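
The following simplified sketch mimics this on-the-fly pooling at the row level: per-row window maxima are kept in a FIFO and folded with the maxima of subsequent rows, and separate block instances handle the overlapping vertical windows. The class and its interface are illustrative, not the RTL.

```python
from collections import deque

def row_window_max(row, K, stride):
    """Maxima of the horizontal KxK windows within one row."""
    return [max(row[c:c + K]) for c in range(0, len(row) - K + 1, stride)]

class StreamingMaxPool:
    """Folds per-row window maxima across K rows using a FIFO of partial maxima."""
    def __init__(self, K=3, stride=2):
        self.K, self.stride = K, stride
        self.fifo = deque()
        self.rows_seen = 0

    def push_row(self, row):
        maxima = row_window_max(row, self.K, self.stride)
        if self.rows_seen == 0:
            self.fifo = deque(maxima)        # first row: fill the FIFO
        else:                                # later rows: pop, compare, push back
            self.fifo = deque(max(a, b) for a, b in zip(self.fifo, maxima))
        self.rows_seen += 1
        if self.rows_seen == self.K:         # vertical window complete
            out, self.fifo, self.rows_seen = list(self.fifo), deque(), 0
            return out
        return []

# Example matching Fig. 9: 5x5 frame, 3x3 window, stride 2 -> 2x2 pooled output.
frame = [[r * 5 + c for c in range(5)] for r in range(5)]
K, S = 3, 2
starts = range(0, len(frame) - K + 1, S)              # vertical window starts: 0, 2
blocks = {s: StreamingMaxPool(K, S) for s in starts}  # in hardware, two blocks
out = []                                              # reused in ping-pong suffice
for r, row in enumerate(frame):
    for s in starts:
        if s <= r < s + K:                            # row r feeds this window
            out += blocks[s].push_row(row)
print(out)   # [12, 14, 22, 24]
```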

4.4 Function Blocks Integration

The macro-pipeline consists of single CPEs, each mapped to one input channel of a layer. A full layer is then constructed from multiple CPEs operating in parallel. The CPEs are wired together by an APE, which aggregates the convolutions and passes data to the next layer. A PPE is optionally generated after the aggregation if the network topology demands it, with the data stream then being fed into the CPEs of the next layer.

This macro-pipelined datapath is generated layer by layer until the desired network topology is achieved. By changing the number of CPEs, we can support multiple layers with multiple channels. Each CPE is also able to accommodate each layer's hyperparameters, such as stride, kernel dimensions, and input frame size. The system receives image data directly from the sensor, which allows us to separate the input traffic from the memory traffic and minimize the memory footprint. To handle the kernel data, we include on-chip memory to double-buffer accesses to main memory and hide their latency. The result is an end-to-end accelerator capable of flexible acceleration over the domain of CNNs.
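
As an illustration of this composition step, the sketch below generates a per-layer stage description from a layer configuration. The class names, fields, and the stride/pooling values are assumptions for illustration and do not reflect the actual Chisel generator interface.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative sketch of the macro-pipeline composition: one CPE per input
# channel, an APE that aggregates them, and an optional PPE, chained layer by
# layer. Class names and fields are assumptions, not the actual Chisel API.

@dataclass
class LayerConfig:
    channels: int                # input channels -> number of CPEs
    kernels: int                 # kernels (output channels) of the layer
    kernel_size: int
    stride: int
    pool: Optional[dict] = None  # e.g. {"size": 3, "stride": 2}

@dataclass
class LayerStage:
    cpes: List[str]
    ape: str
    ppe: Optional[str]

def build_pipeline(layers: List[LayerConfig]) -> List[LayerStage]:
    stages = []
    for i, cfg in enumerate(layers):
        cpes = [f"CPE[L{i}.ch{c}](k={cfg.kernel_size}, s={cfg.stride})"
                for c in range(cfg.channels)]
        ppe = (f"PPE[L{i}](w={cfg.pool['size']}, s={cfg.pool['stride']})"
               if cfg.pool else None)
        stages.append(LayerStage(cpes, f"APE[L{i}](kernels={cfg.kernels})", ppe))
    return stages

# First layer as given in the text: 96 kernels of 7x7 over 3 input channels.
# The stride and pooling parameters here are assumptions for illustration.
conv0 = LayerConfig(channels=3, kernels=96, kernel_size=7, stride=2,
                    pool={"size": 3, "stride": 2})
for stage in build_pipeline([conv0]):
    print(len(stage.cpes), "CPEs ->", stage.ape, "->", stage.ppe)
```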

5 Evaluation

This section presents our evaluation results based on an implementation on Xilinx Zynq FPGAs.

5.1 Experimental Setup

The full architecture template was constructed using Chisel [25], a high-level hardware construction language. We feed the Chisel code the network topology parameters as well as design parameters describing how the natural parallelism should be extracted. The code then generates a macro-pipelined datapath to run the network topology. We have implemented an instance of our architecture template for the first two layers of SqueezeNet [9]; we focus on SqueezeNet because it was designed with computational and memory efficiency in mind. The design was realized on a Xilinx Zynq-7000 FPGA [24], chosen for its low power footprint and embedded processor.

Table 1. SqueezeNet topology properties for the first two layers.
Fig. 11. Intra-kernel parallelism

Fig. 12. Inter-kernel parallelism

Fig. 13. Hybrid inter/intra-kernel parallelism

Table 1 presents the SqueezeNet architecture properties for the first two layers. Overall, the first layer, as the most compute-intensive layer, contains 96 kernels, each performing a 7 by 7 convolution, which translates to 49 MACs per kernel window. It also has three input channels, representing R, G, and B.

For evaluation, we use three different datapaths supported by our proposed architecture. Figures 11, 12 and 13 show these three implementations, respectively, built from the CPE and function-block integration presented in Sect. 4. Intra-kernel parallelism focuses on spatial parallelism among the MAC operations within each kernel. At the other extreme, inter-kernel parallelism focuses solely on spatial parallelism across the MAC operations of all 96 parallel kernels. In between, hybrid inter/intra-kernel parallelism aims to find a balance between the two.

5.2 Resource Utilization and Power Overhead

This section presents the resource utilization and power overhead for the three proposed configurations.

Fig. 14. First-layer dynamic power for different types of parallelism

Figure 14 shows the dynamic power of our proposed architecture when running the first layer under the three types of parallelism. The results were gathered for real-time processing of 30 frames per second at 227\(\,\times \,\)227 resolution. As the figure illustrates, intra-kernel parallelism achieves the lowest power, consuming only 135 mW of dynamic power. The hybrid parallelism is a close second, and inter-kernel parallelism has the highest power consumption. The static power of the entire FPGA is about 180 mW, leading to an overall power consumption of 315 mW.

Fig. 15. Absolute FPGA resource demand for the first layer across design points.

Figure 15 presents the absolute resource consumption for the three design points, and Fig. 16 presents the relative resource utilization on the Xilinx Zynq across the design points. Overall, intra-kernel parallelism has the lowest utilization, except for LUTs used as memory: in intra-kernel parallelism, the 2D-line buffers are not mapped directly to BRAMs but instead to LUTs, due to the extra read ports needed. Although intra-kernel parallelism performs best for the first layer of SqueezeNet, the remaining layers may be better served by different forms of parallelism, depending on their inter- and intra-kernel data-sharing patterns, which are directly driven by the network topology.

Fig. 16. Relative resource utilization on Xilinx Zynq across design points

5.3 System-Level Impact

In this part, we quantify the system-level benefits of computing the first two layers on the edge node. Figure 17 compares two scenarios: (1) computing the entire network on the edge server and (2) computing across the edge node and edge server (edge-node+edge-server). Figure 17b compares the execution time: overall, the node+server cooperative computation achieves a 32% improvement in performance. Figure 17a compares the network communication traffic: node+server cooperative computation reduces the communication and network traffic by more than 3x.

Fig. 17. Network traffic and execution time comparison between the edge-server and edge-node+edge-server scenarios

5.4 Comparison Against GPUs

This section compares our proposed architecture (implemented on the Zynq FPGA) against a state-of-the-art mobile GPU, the Nvidia Jetson TX2 [26]. Figure 18 compares both execution time (the latency for processing a single frame) and power consumption, on a logarithmic scale. Figure 18a shows that our architecture has a latency of 0.24 ms, while the mobile GPU solution imposes a latency of 31.4 ms. Figure 18b shows that our proposed architecture offers considerably lower power consumption than the mobile GPU: for processing 30 frames per second at a resolution of 227\(\,\times \,\)227, the GPU consumes 7.5 W, whereas our architecture, implemented on the Zynq FPGA, consumes only 0.315 W.

Our proposed architecture thus consumes about 24x less power than the Nvidia Jetson TX2 GPU while delivering roughly 130x lower latency. Because it is a dataflow machine, it only operates when the streaming pixels of a new frame are available: it processes an entire frame in 0.24 ms and then stays in standby mode until the next frame's streaming data arrives. As a result, our architecture can perform real-time processing even at much higher frame rates, such as 60 fps and 120 fps.
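
A quick back-of-the-envelope check of this headroom, using only the 0.24 ms per-frame latency reported above:

```python
# Frame-time budgets versus the measured 0.24 ms per-frame latency.
latency_ms = 0.24
for fps in (30, 60, 120):
    budget_ms = 1000.0 / fps
    print(f"{fps} fps: budget {budget_ms:.2f} ms, "
          f"utilization {100 * latency_ms / budget_ms:.1f}%")
```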

Fig. 18. Power and performance comparison against the Nvidia Tegra TX2 GPU

6 Conclusions

In conclusion, this paper proposed a novel architecture template for real-time, low-power execution of Convolutional Neural Networks at the edge. The proposed architecture is primarily targeted at FPGAs and offers a configurable macro-pipeline datapath for scalable direct convolution over streaming pixels. It is an example of a hybrid solution across edge nodes and edge servers for realizing compute-intensive deep learning applications. The proposed architecture reduces the network traffic and execution time of the overall application while maintaining the flexibility to map to any standard CNN topology. Future work includes supporting full network topology acceleration at the edge, supporting non-standard CNNs, and developing a workflow for mapping them efficiently to different FPGA boards.