
Accelerating Deformable Convolution Networks with Dynamic and Irregular Memory Accesses

Published: 18 July 2023


Abstract

Deformable convolution networks (DCNs), proposed to address image recognition with geometric or photometric variations, typically involve deformable convolution that convolves on arbitrary locations of input features. The locations change with different inputs and induce considerable dynamic and irregular memory accesses that cannot be handled by classic neural network accelerators (NNAs). Moreover, the bilinear interpolation (BLI) operation, which is required to obtain deformed features in DCNs, also cannot be deployed on existing NNAs directly. Although a general-purpose processor (GPP) seated alongside classic NNAs can process the deformable convolution, the processing on the GPP can be extremely slow due to its limited parallel computing capability and the massive additional data movement. To address the problem, we develop a DCN accelerator on top of existing NNAs to support both the standard convolution and the deformable convolution. Specifically, for the dynamic and irregular accesses in DCNs, we divide both the input and output features into tiles and build a tile dependency table (TDT) to track the irregular tile dependency at runtime. With the TDT, we further develop an on-chip tile scheduler to handle the dynamic and irregular accesses efficiently. In addition, we propose a novel mapping strategy to enable parallel BLI processing on NNAs and apply layer fusion techniques for more energy-efficient DCN processing. According to our experiments, the proposed accelerator achieves orders of magnitude higher performance and energy efficiency compared to typical computing architectures including ARM, ARM+TPU, and GPU, with only a 6.6% chip area penalty relative to a classic NNA.


1 INTRODUCTION

Deformable convolution network (DCN) [1], a new category of neural networks, is proposed to address the neural network model accuracy degradation caused by geometric and photometric variations, such as lighting and rotation, that occur in many practical applications like medical imaging. DCNs typically sample arbitrary locations of the input features for the convolution such that objects with different scales or deformations can be captured. The sampling patterns of deformable convolution can be learned and calculated using an additional convolution layer. With the unique deformable convolution, DCNs have shown superior performance on many vision tasks such as object detection [1, 2, 3, 4], semantic segmentation [1, 5, 6, 7], and classification [8, 9, 10]. For instance, the authors in Reference [1] demonstrated that the prediction accuracy of the proposed DCN increases from 70% to 75% on the image semantic segmentation dataset CityScapes. Significant prediction accuracy improvements have also been observed in human motion recognition [11, 12], action detection [13, 14], and intelligent medical monitoring and treatment [15, 16].

Despite the great advantages, each deformable convolution operation in DCNs needs an additional convolution-based index calculation and bilinear interpolation (BLI) to obtain the deformed features on top of a standard convolution, so it is both computing- and memory-intensive and requires substantial acceleration for widespread deployment. Nevertheless, DCNs cannot be deployed on conventional neural network accelerators mainly for the following two reasons. First, deformable convolution convolves on arbitrary locations of the input features instead of fixed sliding windows, as depicted in Figure 1. The locations, i.e., the indices into the input features, are generated at runtime and cause both dynamic and irregular accesses to the memory, which cannot be fitted to conventional neural network accelerators that target regular memory accesses and data flows. Second, DCNs use a standard convolution to calculate the indices, but the calculated indices are usually not integers and cannot be used to retrieve the feature data directly. Typically, a BLI algorithm is utilized to approximate the features from the nearest original input features. This step is also not supported in conventional neural network accelerators. An intuitive solution to execute DCNs is to conduct the deformable convolution on a general-purpose processor (GPP), which is usually seated alongside a neural network accelerator, while deploying the rest of the normal neural network operations in DCNs on the neural network accelerator. However, GPPs, especially embedded processors with limited parallel processing capability, are inefficient for the bilinear interpolation, and the large number of irregular memory accesses and data transfers between the GPP and the neural network accelerator also degrades the DCN execution efficiency dramatically.

Fig. 1.

Fig. 1. Deformable convolution: (a) regular sliding window in a standard convolution, (b) irregular sampling of a deformable convolution, and (c) deformable convolution processing.

Recently, there have also been works proposing to revise DCN models to fit existing neural network accelerators. The authors in Reference [17] proposed to replace the bilinear interpolation algorithm with a simple rounding strategy and to restrict the sampling locations to avoid the buffering problems induced by dynamic memory accesses. Similarly, the authors in Reference [18] also proposed to modify the DCN models to reduce the receptive field size substantially so that the sampling locations are limited to a small region, which avoids dynamic memory accesses across the whole input feature map. Although these approaches are demonstrated to be effective on existing neural network accelerators with minor model accuracy loss, they essentially impose hardware constraints on the model design and particularly limit its use in scenarios that are sensitive to model accuracy loss. In addition, they require time-consuming retraining and training data that may not be available to end users. Therefore, we investigate the computing of DCNs and seek to implement the entire unmodified deformable convolution on top of a unified neural network accelerator directly.

To implement the entire DCNs on a unified accelerator and reuse the conventional neural network accelerator as much as possible, we revisit a typical neural network accelerator architecture mainly for the new irregular feature sampling and the BLI required by DCNs. For the dynamic and irregular feature accesses, we observe that the input data required by the deformable convolution outputs is imbalanced and some of the input features are utilized more than others. More details can be found in Section 3. With this observation, we propose to divide the input and output features into smaller tiles and build a tile dependency table (TDT) that keeps a record of all the required input tile IDs of each output tile with runtime tracking. On top of the TDT, we further schedule the output tile execution such that the buffered tiles are reused as much as possible and the overall memory access efficiency can be improved. For the BLI, we convert it to multiple small vector-based dot products that can be mapped in parallel to the 2D computing array in typical neural network accelerators efficiently with a weight stationary data flow [19]. In addition, we fuse the BLI and the following convolution to further reduce the intermediate data transmission between on-chip buffers and the external memory. With the proposed redesign on top of a conventional neural network accelerator, the entire deformable convolution can be implemented on the revised accelerator efficiently. According to our experiments on a set of DCNs, it achieves orders of magnitude higher performance and energy efficiency when compared to typical computing architectures including ARM, ARM+TPU, and GPU, while incurring only minor additional hardware resource consumption compared to a conventional neural network accelerator.

The rest of the article is organized as follows. In Section 2, we introduce typical deformable convolutional networks and formulate them into a unified computing model. Meanwhile, we briefly review prior work on redesigning neural network accelerators for new types of neural network models. In Section 3, we characterize the computing patterns and memory accesses of deformable convolution. In Section 4, we present the detailed design and optimizations of an accelerator for DCNs on top of a classical neural network accelerator. In Section 5, we evaluate the performance and energy efficiency of the accelerator and compare it with typical embedded computing platforms. In Section 6, we conclude this work.


2 BACKGROUND AND RELATED WORK

2.1 Deformable Convolutional Networks

Deformable convolution may sample arbitrary locations of the input features for convolution such that it can capture objects with different scales or deformations. This unique feature makes it attractive in visual recognition tasks with geometric variations such as lighting and rotation. There have been many deformable convolution architectures proposed recently [1, 14]. They typically include two standard convolution operations. The first convolution calculates the indices for the input feature sampling while the second convolution convolves on the sampled features. Usually, a BLI operation is used to bridge the two convolution operations. It approximates the input features based on the non-integer indices generated by the first convolution and provides the resulting features as inputs to the second convolution. The structures of DCNs differ mainly in the index reuse and can be roughly divided into two categories. The first category of DCNs has a unique index for each data in the feature plane, and the indices are reused across the different channels [14]. The second category of DCNs also has the indices shared across the feature channels, but it has a unique index for each data in each convolution window [1]. Basically, the same data in the feature plane has different indices when it is located in different convolution windows. The second category of DCNs requires a larger convolution to calculate more sampling indices and produces more deformed features than the first category. The two categories of DCNs are abbreviated as DCN-I and DCN-II, respectively.
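
For clarity, the two categories imply different index (offset) tensor shapes. The following sketch illustrates them in a PyTorch-style NCHW layout; the tensor names and shapes are our illustration of the description above, not the exact layout used in References [1, 14].

```python
import torch

# Illustrative shapes only; K is the kernel size and H_out/W_out are the
# output spatial dimensions of the deformable convolution.
N, H_out, W_out, K = 1, 56, 56, 3

# DCN-I: one (x, y) index per location in the feature plane, reused across
# channels and across the K x K positions of each convolution window.
offsets_dcn1 = torch.zeros(N, 2, H_out, W_out)

# DCN-II: one (x, y) index per location and per kernel position, still shared
# across the feature channels (2 * K * K index channels in total).
offsets_dcn2 = torch.zeros(N, 2 * K * K, H_out, W_out)
```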

2.2 Unified Deformable Convolution Model

The different deformable convolutions can be represented with a unified model as formulated in Equations (1)–(3). Basically, a convolution operation determines the deformed locations, i.e., the indices of the input features, in the first step. This step is formulated in Equation (1), where \(\boldsymbol {X}\), \(\boldsymbol {W_1}\), \(\boldsymbol {b_1}\), and \(\boldsymbol {y^{^{\prime }}}\) refer to the input feature, weight, bias, and output of the first-stage convolution, respectively. As each convolution window across the input feature has an independent index and input features at different channels share the same index, a 3D (\(L \times H \times D\)) tensor is utilized for \(\boldsymbol {y^{^{\prime }}}\). H and D stand for the numbers of convolution windows in the X dimension and Y dimension, respectively, and \(L=2 \times K \times K\) stands for the indices at the different window locations and axis dimensions. Suppose \(\alpha _{m}\) and \(\beta _{n}\) are the corresponding coordinates on the X axis and Y axis, respectively, located at the same position within the same convolution window. \(\alpha _{m}\) and \(\beta _{n}\) are generally not integers, so they cannot be used to retrieve the input features directly for the deformable convolution. To that end, a bilinear interpolation approach is utilized in the second step to calculate the deformed features using the neighboring features around the location (\(\alpha _{m}\), \(\beta _{n}\)). The calculated feature data \(x_{c,\alpha _{m},\beta _{n}}^{^{\prime }}\) can be obtained using Equation (2), where \(F_{BLI}(.)\) refers to the bilinear interpolation function, c is the channel index of the feature data, \(\Delta \alpha _{m}=\alpha _{m}- \lfloor \alpha _{m} \rfloor\), and \(\Delta \beta _{n}=\beta _{n}- \lfloor \beta _{n} \rfloor\). \(x_{c,\alpha _{m},\beta _{n}}^{^{\prime }}\) is an element of the deformed feature \(\boldsymbol {X^{^{\prime }}}\). A vivid description of the BLI function can be found in Figure 2.
When the input features are retrieved and organized according to the deformed indices, the deformable convolution can be obtained using a standard convolution over the reorganized features as shown in Equation (3) where \(\boldsymbol {W_2}\), \(\boldsymbol {b_2}\), and \(\boldsymbol {y}\) refer to weights, bias, and output feature of the second convolution operation in a deformed convolution: (1) \(\begin{equation} \begin{aligned}\boldsymbol {y^{^{\prime }}}=\boldsymbol {X} \ast \boldsymbol {W_1} + \boldsymbol {b_1}, \end{aligned} \end{equation}\) (2) \(\begin{equation} \begin{aligned}x_{c,\alpha _{m},\beta _{n}}^{^{\prime }}&= F_{BLI}(x_{\left\lfloor \alpha _{m} \right\rfloor ,\left\lfloor \beta _{n} \right\rfloor },x_{\left\lfloor \alpha _{m} \right\rfloor ,\left\lceil \beta _{n} \right\rceil },x_{\left\lceil \alpha _{m} \right\rceil ,\left\lfloor \beta _{n} \right\rfloor },x_{\left\lceil \alpha _{m} \right\rceil ,\left\lceil \beta _{n} \right\rceil },\Delta \alpha _{m},\Delta \beta _{n})\\ &=\left(1-\Delta \alpha _{m} \right)\left(1-\Delta \beta _{n} \right)x_{\left\lfloor \alpha _{m} \right\rfloor ,\left\lfloor \beta _{n} \right\rfloor }+\left(1-\Delta \alpha _{m} \right)\Delta \beta _{n}x_{\left\lfloor \alpha _{m} \right\rfloor ,\left\lceil \beta _{n} \right\rceil }+\Delta \alpha _{m}\left(1-\Delta \beta _{n} \right)x_{\left\lceil \alpha _{m} \right\rceil ,\left\lfloor \beta _{n} \right\rfloor }\\ &\;\;\;\;+\Delta \alpha _{m}\Delta \beta _{n}x_{\left\lceil \alpha _{m} \right\rceil ,\left\lceil \beta _{n} \right\rceil }, \end{aligned} \end{equation}\) (3) \(\begin{equation} \boldsymbol {y}=\boldsymbol {X^{^{\prime }}} \ast \boldsymbol {W_2} + \boldsymbol {b_2}. \end{equation}\)
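
To make the unified model concrete, the sketch below evaluates Equation (2) for a single deformed feature in one channel plane. It is a functional reference only; the variable names are ours and the code does not reflect the accelerator implementation.

```python
import math
import numpy as np

def bli_sample(x_c: np.ndarray, alpha: float, beta: float) -> float:
    """Bilinear interpolation (Equation (2)) of the channel plane x_c at the
    non-integer location (alpha, beta)."""
    a0, a1 = math.floor(alpha), math.ceil(alpha)
    b0, b1 = math.floor(beta), math.ceil(beta)
    da, db = alpha - a0, beta - b0  # fractional parts: delta alpha, delta beta
    return ((1 - da) * (1 - db) * x_c[a0, b0] +
            (1 - da) * db * x_c[a0, b1] +
            da * (1 - db) * x_c[a1, b0] +
            da * db * x_c[a1, b1])
```

Applying bli_sample for every channel c and every index pair (\(\alpha _{m}\), \(\beta _{n}\)) produced by the first convolution yields the deformed feature \(\boldsymbol {X^{^{\prime }}}\), after which Equation (3) is a standard convolution.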

Fig. 2.

Fig. 2. Bilinear interpolation.

2.3 Neural Network Accelerator Redesigning

The great success of neural networks in massive domains of applications has inspired considerable efforts devoted to developing neural network accelerators [20, 21, 22, 23, 24, 25]. In spite of these notable efforts, newer network operations proposed for higher performance may go beyond the capability of existing neural network accelerators. Although it is usually possible to offload the unsupported operations to the attached GPPs while leaving the rest of the conventional neural network operations on the accelerator, the performance of the co-designed implementation may drop dramatically due to the massive data communication between the GPP and the accelerator. Also, complex operations offloaded to GPPs may still become the performance bottleneck due to the insufficient computing capability of GPPs and degrade the overall neural network execution. Therefore, neural network accelerators are usually redesigned to meet the requirements of the new neural network operations on top of the existing accelerators. For instance, unified neural network accelerators have been proposed to perform the deconvolution used in generative adversarial networks in addition to the conventional convolution [26, 27, 28, 29]. Novel accelerators are developed to support 3D neural networks in References [30, 31, 32]. The authors in Reference [33] proposed to add a bilinear interpolation calculation module to an existing ReRAM neural network accelerator to enable in-situ DCN calculation.

Inspired by prior works, deformable convolution networks that cannot be fitted to existing neural network accelerators have also attracted efforts for specialized accelerator designs [17, 18], as illustrated in Section 1. Nevertheless, these works mainly proposed to limit the random sampling in deformable convolution to ensure better data locality and smaller memory access footprints. Although the limitation is beneficial to the hardware design, it compromises the model accuracy, which is the major design goal of DCNs. Table 1 reveals the model accuracy comparison between native DCNs [1] and the DCNs with additional sampling constraints proposed in References [17, 18]. We use ShuffleNet V2, Faster R-CNN, and VGG16 for the comparison. Twelve epochs, 100 epochs, and 20 epochs are used for the training of Faster R-CNN, ShuffleNet V2, and VGG16, respectively. It can be observed that there is generally a 1% to 2% accuracy loss, which is non-trivial. In addition, the sampling constraints are coupled with the hardware designs, so the model training must take the hardware design parameters into consideration, which also blurs the hardware and software boundary and complicates the model development. Unlike prior works, we aim to develop a DCN accelerator that enables native deformable convolution efficiently without any accuracy penalty or model training constraints.

Table 1.

Network | Dataset | [17] | [18] | Native DCNs [1] | Quantized DCNs
ShuffleNet V2 | COCO | 36.8 | \(\setminus\) | 38.4 | 37.9
ShuffleNet V2 | VOC | 63.1 | \(\setminus\) | 64.4 | 64.2
Faster R-CNN | COCO | \(\setminus\) | 60.8 | 61.8 | 61.1
VGG16 | IMAGENET | 82.6 | 83.2 | 84.3 | 84.1
VGG16 | CIFAR10 | 88.7 | 89.4 | 90.2 | 90.2

Table 1. Accuracy Comparison of Native DCNs and Constrained DCNs

In addition, since existing neural network accelerators are mostly fixed point and DCNs need to be quantized to take advantage of the hardware acceleration, we apply 8-bit fixed-point quantization to the native DCNs with the approach proposed in Reference [34] and evaluate the influence of quantization on the model accuracy. The quantized model accuracy is presented in the last column of Table 1. It can be observed that 8-bit fixed-point quantization has negligible influence on the prediction accuracy of DCNs.
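
As a rough illustration of the kind of post-training quantization involved, the sketch below applies symmetric per-tensor 8-bit quantization; this is a minimal example and not necessarily the exact scheme of Reference [34].

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor 8-bit quantization: returns int8 values and a scale."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map the quantized values back to floating point for accuracy evaluation."""
    return q.astype(np.float32) * scale
```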


3 OBSERVATION

Since deformable convolution involves many irregular memory accesses, which dramatically affect the processing efficiency, we investigate the memory accesses of a typical deformable convolution in this section. We take the third convolution layer of VGG16 as the basis of a typical deformable convolution operation. As the memory accesses vary with different inputs, we randomly selected 2,000 images from ImageNet and averaged the memory accesses for the investigation.

Fig. 3.

Fig. 3. Memory access characterization of the deformable convolution.

Figure 3(a) shows the distribution of the input feature accesses. Unlike standard convolution, which usually has nearly uniform utilization of the input features, deformable convolution has distinct utilization of the different input features. With the \(3 \times 3\) kernel used in the convolution, each input feature would be utilized around 9 times in a standard convolution. For the corresponding deformable convolution, the access distribution shows a dramatic difference. Around 15% of the features are utilized more than 12 times, which takes up around 25% of the feature accesses. In contrast, more than 22% of the features are utilized fewer than 6 times. Since neural networks can usually be tiled and the memory accesses can be performed at the granularity of a tile, we further analyze the input feature tile access distribution. In this experiment, we had both the input feature map and the output feature map divided into 25 tiles as an example. The tile access distribution is shown in Figure 3(b). It can be observed that the input tile reuse still shows notable variation.

In addition, we also evaluate the data locality of DCN processing with a trace-based cache simulator [35]. The trace is generated based on the basic DCN processing. We evaluate a small cache and a moderate cache. The small cache has a 32 KB 4-way L1 cache and a 512 KB 16-way L2 cache. The moderate cache has the same 32 KB L1 cache and a 2 MB 16-way L2 cache. The cache line size is 64 B and least recently used (LRU) replacement is applied. In this experiment, we evaluate the 3rd, 5th, 8th, and 13th layers of the VGG16-based deformable convolution independently and compare them against the corresponding standard convolution layers. The comparison is shown in Figure 4. It can be seen that the deformable convolution layers with additional random accesses have a much higher cache miss rate than the corresponding convolution layers given a limited cache size, while little difference can be observed when the cache can accommodate the entire input data, as expected. Moreover, given the 512 KB L2 cache, we notice that smaller layers such as layer 13 have a higher cache miss rate than larger layers like layer 3. This is mainly because BLI is the major source of cache misses and the percentage of BLI operations with irregular memory accesses in smaller deformable convolution layers is higher than that in larger deformable convolution layers.

In summary, the input features are not evenly accessed, so some of the input features are more likely to be reused than the others. The imbalanced memory accesses in DCNs further induce worse data locality and eventually a higher cache miss rate, which lowers the DCN performance substantially. In this case, scheduling the order of the output feature calculation and optimizing the order of the input feature accesses can potentially improve the data locality of DCNs and enhance the performance and energy efficiency of the DCN processing.

Fig. 4.

Fig. 4. Cache miss rate comparison between VGG16-based deformable convolution layers and convolution layers: (a) 512 KB 16-way L2 cache; (b) 2 MB 16-way L2 cache.


4 DCN ACCELERATOR ARCHITECTURE

4.1 Overall Accelerator Architecture

Deformable convolution is the major barrier that hinders the deployment of DCNs on existing neural network accelerators. Therefore, adapting the deformable convolution to existing neural network accelerators is key to accelerating DCNs. As formulated in Section 2.2, deformable convolution consists of three processing stages: a convolution, a BLI, and a second convolution. Since the convolution operations in DCNs can be deployed on existing neural network accelerators directly, the major challenge of DCN acceleration is to optimize the BLI operation, which samples the input features according to the irregular indices calculated by the first convolution and conducts the BLI calculation on the sampled input features. Since the indices depend on the input features and thus change at runtime, the BLI sampling leads to complicated memory access patterns. To address this dynamic and irregular memory access problem, we divide the BLI processing into tiles and track the tile dependency at runtime with a TDT. On top of the TDT, we have a tile scheduler to decide the order of the output tile execution and the input tile loading at the same time, such that the tiles loaded to the on-chip buffers can be fully utilized and the accesses to the external memory can be reduced. As for the BLI calculation, we reorganize it into multiple small vector-based dot products and have the processing performed in parallel on top of the 2D computing array in the neural network accelerator for the sake of higher performance.

Fig. 5.

Fig. 5. DCN accelerator overview. The blocks filled with grey are added specifically for the deformable convolution processing, while the rest of the blocks remain the same as in a conventional neural network accelerator with a 2D computing array.

The proposed DCN accelerator architecture is shown in Figure 5. The uncolored components belong to a classic neural network accelerator, while the components filled with grey are designed specifically for the deformable convolution. The entire accelerator is generally controlled with a sequence of instructions compiled from the target neural networks. The instructions are stored in the instruction buffer and decoded at runtime to generate control signals for the entire accelerator. Under this control, neural network operations are mapped and executed on the regular 2D computing array. For the convolution operations, weights from different filters are streamed to the different columns of the computing array from top to bottom in parallel, while input features are streamed from left to right in different rows according to the output stationary data flow proposed in Reference [19].

The deformable convolution is divided into three dependent operations. The first convolution operation starts when inputs and weights are ready in the on-chip buffers. The outputs of the first convolution are indices and will be utilized to sample the input features. Since they are not integers and cannot be utilized to retrieve the input features directly, we have them stored in an index buffer and have an address converter to obtain the four neighboring integer indices. Meanwhile, the BLI coefficients as proposed in Section 4.2 are generated with a coefficient calculation block at the same time. At this point, we can start the BLI calculation with the retrieved input features and BLI coefficients by taking advantage of the 2D computing array of the accelerator. The outputs of the BLI are essentially the deformed features and will be utilized as inputs to the second convolution. Since the address conversion and the BLI calculation are conducted in a pipelined manner, the output buffer and the index buffer are separated to avoid conflicts. After the BLI calculation, the deformed features will be stored and swapped as inputs of the computing array for the second convolution. When the features or weights exceed the on-chip buffers, the BLI and the second convolution need to be tiled and fused to avoid intermediate data exchange through the external memory. In addition, we have a TDT to keep a record of the tile dependency based on the generated indices and have a runtime tile scheduler to optimize the ordering of the output tile execution and input tile loading based on the TDT to enhance the on-chip data reuse and memory access efficiency.

4.2 BLI Implementation

To implement BLI, we have an address converter module that converts the original non-integer indices to the neighboring integer addresses of the input feature buffer and a coefficient calculation block that produces the BLI coefficients at the same time. Each deformed feature requires four coefficients \(\eta , \mu , \theta , \gamma\), and they can be calculated according to Equation (2). The BLI coefficients are stored in the weight buffer while the converted buffer addresses are used to retrieve features from the input buffer directly. The retrieved features and the coefficients read from the weight buffer will then be fed to the 2D computing array following a standard weight stationary data flow [19] for the BLI computing. Details of the processing will be illustrated in the rest of this section.

BLI Mapping: Each deformed feature depends on four neighboring input features, so its calculation can be viewed as a vector-based dot product. One of the vectors contains the four BLI coefficients and the other contains the four neighboring input features, as shown in Figure 6. Each deformed feature calculation can be mapped to four neighboring PEs organized as a cluster with a weight-stationary data flow. Since the BLI coefficients are shared among the different input feature channels, they can be distributed to the PEs in the same cluster and broadcast to the different clusters. The corresponding four input features in the same channel will be streamed in parallel for the multiplication among the PEs in the same cluster. Features from different channels will be distributed to the different clusters of the computing array for higher throughput, but additional wires from wide input feature buffers to the PEs across the computing array are required accordingly. The clustered computing array on top of the original 2D regular computing array is shown in Figure 7. Unlike the outputs of the conventional computing array, which are aligned in columns, the outputs of the clustered computing array are extracted and aligned by cluster. Therefore, a DEMUX is added to the output port of each PE cluster to extract the output from the computing array when BLI is mapped.
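
Functionally, the cluster-level mapping amounts to one 4-element dot product per channel with the coefficients broadcast, as sketched below; the array geometry and data movement are abstracted away and the names are ours.

```python
import numpy as np

def bli_on_clusters(neighbors: np.ndarray, coeffs: np.ndarray) -> np.ndarray:
    """neighbors: shape (C, 4), the four neighboring input features of one
    deformed location for each of the C channels (one channel per PE cluster).
    coeffs: shape (4,), the BLI coefficients (eta, mu, theta, gamma) held
    stationary in each cluster and broadcast across clusters.
    Returns the C deformed features, one per cluster."""
    # Each row is an independent 4-element dot product, i.e., one PE cluster.
    return neighbors @ coeffs
```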

Fig. 6.

Fig. 6. BLI mapping strategy.

Fig. 7.

Fig. 7. Clustered PE array for parallel BLI calculation. The design is built on top of a conventional 2D computing array in typical neural network accelerators. When the DEMUXs select 0, it is configured as a normal 2D computing array and can be used for standard convolution. When the DEMUXs select 1, each PE cluster in the design can be used for a BLI output calculation.

Input Feature Layout: To make the best use of the entire computing array, the neighboring input features must be fed to the computing array continuously. However, when the input features are sequentially stored in a single-port buffer, the four features located in different rows and columns of the input feature map cannot be loaded from the on-chip buffer in a single cycle simply with a wider read port. A four-port on-chip buffer can meet the computing requirement, but it is extremely resource-consuming in terms of both power and chip area. To address the problem, we modify both the input feature layout and the structure of the input buffer as shown in Figure 8. Since the four features for the BLI processing of a deformed output feature are located in adjacent rows and columns, we separate the input features into four partitions based on the feature coordinate parity in the feature map, so the four features will always be located in different partitions. Accordingly, we have the buffer divided into four banks and each bank accommodates an input feature partition. Thus, the four features required by the BLI processing of any deformed feature can be loaded in a single cycle. In addition, we have the input features stored in channel-major order and widen the port of each buffer bank such that features of multiple channels are read at the same time for all the different PE clusters.
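
A behavioural sketch of the parity-based partitioning is given below; the bank numbering and the flattened per-bank layout are illustrative assumptions, not the exact buffer addressing of the accelerator.

```python
import numpy as np

def bank_of(row: int, col: int) -> int:
    """Map a feature coordinate to one of the four banks by row/column parity."""
    return (row % 2) * 2 + (col % 2)

def split_into_banks(feature_plane: np.ndarray):
    """Scatter a 2D feature plane into four banks. For a non-integer sampling
    location, the four floor/ceil neighbors always differ in row or column
    parity, so they fall into four different banks and can be read together."""
    banks = [[], [], [], []]
    rows, cols = feature_plane.shape
    for r in range(rows):
        for c in range(cols):
            banks[bank_of(r, c)].append(feature_plane[r, c])
    return [np.array(b) for b in banks]
```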

Fig. 8.

Fig. 8. Input feature layout and input buffer organization.

Address Converter: To fetch the neighboring features for the BLI, we need to calculate the buffer addresses of the four input features based on the non-integer indices. The basic idea is to obtain the four neighboring integers of the non-integer feature indices first and then deduct the base index of the four features to calculate the on-chip buffer addresses of the required features. Essentially, it is a conversion from 3D feature map indices to 1D on-chip buffer indices, and the higher-dimension indices including the channel index and the height index need to be scaled accordingly as formulated in Equation (4). \(\alpha _{m}\) and \(\beta _{n}\) denote the original non-integer feature indices, i.e., coordinates in the 2D feature plane. \(index_{lb}\), \(index_{rb}\), \(index_{lt}\), and \(index_{rt}\) denote the four buffer addresses of the features located at the left bottom, right bottom, left top, and right top, respectively. \(\boldsymbol {A}\) denotes the number of PEs in the computing array of the neural network accelerator, \(\boldsymbol {H}\) denotes the height of the input feature map, \(\boldsymbol {c}\) denotes the channel number of the input feature, and \(T_{0}\) denotes the base index of the four neighboring features. As the address conversion in different channels is the same, the formulation only illustrates the conversion in the 2D feature plane. To enable runtime BLI, we have a specialized address converter module implemented. It can be easily pipelined as shown in Figure 9. The generated indices will be aligned and sent to the different input buffer banks to retrieve the corresponding four features for the BLI calculation: (4) \(\begin{equation} \left\lbrace \begin{matrix}index_{lb}=\left(\left\lfloor \left\lfloor \beta _{n} \right\rfloor /2 \right\rfloor \times j+\left\lfloor \left\lfloor \alpha _{m} \right\rfloor /2 \right\rfloor \right) \times i-T_{0}\\ index_{rb}=\left(\left\lfloor \left\lfloor \beta _{n} \right\rfloor /2 \right\rfloor \times j+\left\lfloor \left\lceil \alpha _{m} \right\rceil /2 \right\rfloor \right) \times i-T_{0}\\ index_{lt}=\left(\left\lfloor \left\lceil \beta _{n} \right\rceil /2 \right\rfloor \times j+\left\lfloor \left\lfloor \alpha _{m} \right\rfloor /2 \right\rfloor \right) \times i-T_{0}\\ index_{rt}=\left(\left\lfloor \left\lceil \beta _{n} \right\rceil /2 \right\rfloor \times j+\left\lfloor \left\lceil \alpha _{m} \right\rceil /2 \right\rfloor \right) \times i-T_{0} \end{matrix}\right. \; \; \; \; \; \begin{pmatrix}i=\left\lceil c/(A/4) \right\rceil \\ j = H/2 \end{pmatrix}. \end{equation}\)

Fig. 9.

Fig. 9. Address converter and coefficient calculation block.

Coefficient Calculation: To enable runtime BLI, the coefficients also need to be calculated at runtime, and they are formulated in Equation (5) according to Equation (2). The four coefficients can be reused across the different channels, so the formulation only illustrates the calculation in the 2D feature plane. We notice that the multiplication result \(\Delta \alpha _{m}\Delta \beta _{n}\) is required by all four coefficient calculations, so it is calculated first. Then, the rest of the coefficient calculation can be conducted with only addition and subtraction. The pipelined architecture is shown on the left of Figure 9. The coefficients will be stored in the weight buffer for the BLI calculation according to the BLI mapping: (5) \(\begin{equation} \left\lbrace \begin{aligned} \eta &=\left(1-\Delta \alpha _{m} \right)\left(1-\Delta \beta _{n} \right)\\ \mu &=\left(1-\Delta \alpha _{m} \right)\Delta \beta _{n}\\ \theta &=\Delta \alpha _{m}\left(1-\Delta \beta _{n} \right)\\ \gamma &=\Delta \alpha _{m}\Delta \beta _{n} \end{aligned}\right. \end{equation}\).
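
A functional model of the address converter and the coefficient calculation block in Figure 9 is sketched below; it follows Equations (4) and (5), with the parameters passed in explicitly and the bank-local layout simplified. The function names are ours.

```python
import math

def bli_coefficients(alpha: float, beta: float):
    """Equation (5): compute the BLI coefficients from the fractional parts.
    The product gamma is computed first and reused, so the remaining
    coefficients only need additions and subtractions."""
    da, db = alpha - math.floor(alpha), beta - math.floor(beta)
    gamma = da * db
    eta = 1 - da - db + gamma   # (1 - da) * (1 - db)
    mu = db - gamma             # (1 - da) * db
    theta = da - gamma          # da * (1 - db)
    return eta, mu, theta, gamma

def convert_address(alpha: float, beta: float, c: int, A: int, H: int, T0: int):
    """Equation (4): buffer addresses of the four neighbors (lb, rb, lt, rt).
    c: channel number of the input feature, A: number of PEs,
    H: input feature height, T0: base index of the four neighboring features."""
    i = math.ceil(c / (A / 4))
    j = H // 2
    def addr(b_int: int, a_int: int) -> int:
        return ((b_int // 2) * j + (a_int // 2)) * i - T0
    return (addr(math.floor(beta), math.floor(alpha)),   # index_lb
            addr(math.floor(beta), math.ceil(alpha)),    # index_rb
            addr(math.ceil(beta), math.floor(alpha)),    # index_lt
            addr(math.ceil(beta), math.ceil(alpha)))     # index_rt
```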

4.3 Runtime Tile Scheduling

To address the dynamic and irregular memory access problem in deformable convolution, we propose to track the data dependency at runtime and optimize the execution with runtime scheduling based on the tracked data dependency. The dependency tracking and scheduling optimization will be illustrated in the rest of this sub section.

Tile Dependency Tracking: To track the dynamic memory accesses, we need a dependency table that records all the required input features for each deformable convolution output feature. Due to the limited on-chip buffer in the accelerator, neural network processing is usually tiled, and the dependency table is constructed at the granularity of a tile accordingly. The tile dependency table is abbreviated as TDT. The dependency of a deformable convolution output tile on the input feature tiles can be inspected with the deformed feature indices, as described in Figure 10. Assume both the input and output features are divided into fixed \(5 \times 5\) tiles. The feature indices, i.e., \(\alpha _{m}\) and \(\beta _{n}\), are compared to the different tile boundary indices, and the comparison result vectors can be used to determine the row index and the column index of the dependent tile.

Fig. 10.

Fig. 10. Tile dependency table update.

As shown in Figure 10, the comparison result vector for \(\alpha _{m}\) is (1,1,0,0), which means \(\alpha _{m}\) is between 0.4H and 0.6H. With a decoder, we can obtain the row index of the dependent input tile; in this example, the row index is 2. Similarly, we can also obtain the column index of the dependent input tile, which is 1 in this example. Given the row index and the column index, we can determine the dependent input tile index, which is 11 in this example, as highlighted in red. With this index, we can further determine the dependency bit vector, in which each bit refers to the dependency on the corresponding input tile. If the corresponding input tile is required, then its bit is set to 1, while the remaining bits are set to 0. By continuously inspecting all the deformed features required by an output tile and ORing the resulting bit vectors, we can obtain the entire tile dependency vector of the output tile. The tile dependency table is constructed right after the first convolution of the deformable convolution. Although all the coefficients required for the tile tracking can be generated based on the total tile size and configured at runtime, the maximum number of tiles remains limited by the hardware component setups such as the TDT size and the dependency vector length.
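
The TDT update can be modelled in software as follows. This is a behavioural sketch of Figure 10 assuming a \(T \times T\) tiling of an \(H \times W\) input plane; the function and variable names are ours.

```python
def tile_dependency_vector(indices, H: float, W: float, T: int) -> int:
    """Build the dependency bit vector of one output tile.
    indices: iterable of (alpha, beta) deformed locations required by the tile.
    The input plane is split into T x T tiles; bit k of the result is set to 1
    if input tile k is required."""
    tile_h, tile_w = H / T, W / T
    dep = 0
    for alpha, beta in indices:
        row = min(int(alpha // tile_h), T - 1)  # boundary comparison + decode
        col = min(int(beta // tile_w), T - 1)
        dep |= 1 << (row * T + col)             # OR in the one-hot tile bit
    return dep
```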

Tile Scheduling: Input tiles loaded to the on-chip buffers can be reused among the different output tile calculations, but both the ordering of the input tile loading and the ordering of the output tile execution can affect the tile reuse, especially when the input buffer is limited and some of the input tiles have to be replaced. The input tile utilization varies greatly during the DCN processing, as observed in Section 3, which further aggravates the influence of the ordering of the input tile loading and the output tile execution. To address the above problem, we propose a unified tile scheduling algorithm that handles both the output tile scheduling and the input tile scheduling based on the TDT that is updated at runtime.

The proposed scheduling algorithm is presented in Algorithm 1. It includes an output tile scheduling procedure and an input tile scheduling procedure based on the output tile scheduling result. The output tile scheduling essentially selects an output tile that can reuse the input tiles that are already loaded and stored in the on-chip buffer. When the on-chip buffer is empty, it simply selects the output tile that requires the most input tiles, which are more likely to be reused. When the output tile is selected, we need to determine the loading order of its dependent input tiles. Although it is possible to sort all the dependent input tiles based on their potential reuse, the input tile reuse is expensive to estimate at runtime. In this work, we have the dependent input tiles divided into three parts, as illustrated in the \(input\_tile\_scheduling(.)\) function. The first part is the input tiles that are already stored in the on-chip buffers. They will be scheduled first to ensure the on-chip data reuse. This part can be determined by comparing the input tile on-chip status bit vector OC and the dependency bit vector \(B[nextID]\). The second part is the tiles that will be reused by the next output tile calculation. They will be loaded last such that they can reside in the on-chip buffer for reuse. The next output tile is obtained with the procedure \(output\_tile\_scheduling(.)\). By comparing the current output tile dependency bit vector and the next output tile dependency bit vector, we can determine the overlapped input tiles that can be reused, excluding the tiles that are already stored in the on-chip buffer. The rest of the input tiles will be scheduled between the first part and the second part. As the input tile reuse is already considered in the scheduling, a first-in first-out (FIFO) strategy is used for the input tile replacement for efficient hardware implementation.
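
A software model of the two scheduling procedures is sketched below. OC is the on-chip status bit vector and B[t] is the dependency bit vector of output tile t, as in the text; the helper names and the encoding of bit vectors as Python integers are our own simplifications of Algorithm 1.

```python
def popcount(v: int) -> int:
    """Number of set bits, mimicking the NZ bit counter."""
    return bin(v).count("1")

def output_tile_scheduling(pending, B, OC: int) -> int:
    """Pick the pending output tile that reuses the most input tiles already on
    chip; with an empty buffer, pick the tile that needs the most input tiles."""
    if OC == 0:
        return max(pending, key=lambda t: popcount(B[t]))
    return max(pending, key=lambda t: popcount(B[t] & OC))

def input_tile_scheduling(B_cur: int, B_next: int, OC: int):
    """Order the input tile loads of the current output tile: (1) tiles already
    on chip, (2) the remaining tiles, (3) tiles reused by the next output tile,
    which are loaded last so that they stay resident for reuse."""
    on_chip = B_cur & OC
    reused_next = B_cur & B_next & ~OC
    remaining = B_cur & ~OC & ~B_next
    order = []
    for part in (on_chip, remaining, reused_next):
        order += [k for k in range(part.bit_length()) if (part >> k) & 1]
    return order
```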

The proposed tile scheduling is implemented with customized hardware rather than a software scheduling algorithm on CPUs to ensure efficient tile-based execution. The tile scheduling module is shown in Figure 11. It mainly depends on the TDT to select the next output tile for execution on the computing array in the accelerator. The basic idea is to choose the output tile that has the most dependent input tiles overlapped with those required by the current output tile. Thus, the dependency bit vector of the current output tile is ANDed with the bit vectors of all the unexecuted output tiles. The AND results are sent to a non-zero (NZ) bit counter module that mainly consists of an adder tree to count the number of non-zero bits. The counts then pass through a pipelined comparator to determine the maximum value. The corresponding output tile has the most input tiles overlapped with those of the current output tile, so it will be scheduled for execution next. Instead of having the output tile scheduling and the execution conducted sequentially, we adopt a pre-scheduling strategy that performs the next output tile scheduling in parallel with the current output tile execution. Since the execution does not have to wait for the immediate scheduling result, more complex scheduling algorithms can be implemented. When the next output tile is selected, we schedule the dependent input tiles. The input tile scheduling mainly depends on three hardware-friendly bit-wise operations, which divide the input tiles into three parts as discussed in Algorithm 1. By inspecting the non-zero bit numbers of the three resulting bit vectors with the corresponding NZ bit counters, we can determine the IDs of the input tiles in each partition and push them into three independent queues. As each queue has a different scheduling priority, the queues can be drained sequentially, and the input tile scheduling is completed when all the queues are empty.

Fig. 11.

Fig. 11. Bit vector-based runtime tile scheduling. The grey blocks are mainly used for the output tile scheduling while the light blue blocks are mainly used for the input tile scheduling. The rest of the blocks are shared by both the output tile scheduling and the input tile scheduling.

4.4 BLI and Convolution Fusion

We notice that massive data movement is required when the different processing stages of the deformable convolution are performed sequentially due to the limited on-chip buffer. Inspired by neural network fusion techniques [36], we fuse the different processing stages such that the intermediate data can be reused via on-chip buffers without additional external memory accesses. Basically, we have the input data of the upstream processing stage tiled. When a tile of the output data is obtained in the upstream processing stage, it will be used by the downstream processing stage immediately by simply swapping the input buffer and the output buffer. Compared to stage-by-stage processing, the fused processing on top of the tiling avoids transferring the intermediate data to and back from the external memory, which is beneficial to both the performance and the energy efficiency. Since the deformed indices are usually small in size compared to the feature data and can be buffered on chip directly, we mainly tile and fuse the second and third processing stages in practice.
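
The fused execution can be summarized with the following sketch; bli_tile and conv_tile stand for the tiled BLI stage and the tiled second convolution, respectively, and are placeholders rather than accelerator APIs.

```python
def fused_bli_conv(input_tiles, indices_per_tile, bli_tile, conv_tile):
    """Run BLI and the second convolution back to back for each tile, so the
    deformed features stay in the on-chip buffers (conceptually, the output
    buffer of the BLI stage is swapped to become the convolution input)."""
    outputs = []
    for tile, idx in zip(input_tiles, indices_per_tile):
        deformed = bli_tile(tile, idx)       # upstream stage output tile
        outputs.append(conv_tile(deformed))  # consumed immediately downstream
    return outputs
```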


5 EXPERIMENT

5.1 Experiment Setup

Hardware Platforms: The proposed DCN accelerator is implemented in Verilog and synthesized with Synopsys Design Compiler under the TSMC 40 nm library. It works at 800 MHz. The configurations of the DCN accelerator are shown in Table 2. Processing elements in the accelerator adopt 8-bit fixed-point arithmetic. We also have DCNs implemented on a set of different architectures including an ARM processor (ARM), ARM+TPU, GPU, the DCN Accelerator (DCNA), MEDCN [37], and DSEDCN [38]. MEDCN and DSEDCN are also customized DCN accelerators, but they target FPGAs, which makes a direct comparison difficult. In this work, we utilize the GOPS reported in their papers and scale it to the clock frequency used in DCNA for a fair performance comparison. The ARM processor is an ARM-A7@900 MHz equipped with 1 GB DRAM (DDR3), which is the core of the Raspberry Pi 3. The GPU is a 256-core NVIDIA Pascal GPU with 8 GB GDDR5 memory, which is the core of the Nvidia TX2. Experiments on the ARM processor and the GPU were implemented with PyTorch 1.3 on the real platforms, i.e., Raspberry Pi and TX2, respectively. Experiments for ARM+TPU were conducted in a mixed manner. The second stage of the deformable convolution is not supported by the TPU and was performed on the ARM processor instead, while the rest of the neural network was performed on the TPU and evaluated with Scale-Sim [39]. The configurations of the TPU architecture are the same as those used in the DCN accelerator. In addition, both the ARM processor and the TPU are equipped with 1 GB DRAM. The average power of the ARM processor is 1.3 W and its idle power is 0.3 W. To evaluate the power consumption of DRAM, we accumulate the power consumption of the different memory operations, such as Activation (ACT), Read (RD), Write (WR), and Background (BG), based on Table 3 according to Micron's power calculators [40].
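
The DRAM energy is accumulated from the per-operation power numbers in Table 3. A simplified version of this accounting is sketched below; the time bookkeeping is our own simplification of Micron's power calculator [40].

```python
# Power numbers (mW) taken from Table 3; the active durations (s) per operation
# class are assumed to be collected during simulation.
POWER_MW = {"ACT": 63.7, "RD": 52.1, "WR": 52.1,
            "READ_IO": 32.7, "WRITE_ODT": 136.1, "BG": 67.7}

def dram_energy_joules(active_time_s: dict, total_time_s: float) -> float:
    """Sum the energy of each operation class plus background energy."""
    energy_mj = sum(POWER_MW[op] * active_time_s.get(op, 0.0)
                    for op in ("ACT", "RD", "WR", "READ_IO", "WRITE_ODT"))
    energy_mj += POWER_MW["BG"] * total_time_s
    return energy_mj / 1000.0  # mW * s = mJ, convert to J
```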

Table 2.

# of PEs | In Buf | Out Buf | Weight Buf | Index Buf | Inst Buf
16 \(\times\) 32 | 128 KB | 256 KB | 256 KB | 32 KB | 64 KB

Table 2. Accelerator Parameters

Table 3.

ACT | RD | WR | READ I/O | Write ODT | BG
63.7 mW | 52.1 mW | 52.1 mW | 32.7 mW | 136.1 mW | 67.7 mW

Table 3. Power Consumption of the Different Memory Operations

Neural Network Benchmark: To evaluate the proposed DCNA, we use two typical neural network models, VGG19 [41] and SegNet [42], as our benchmark. Deformable convolution can be used to replace any convolution in neural networks, but the replacement configurations can lead to different trade-offs between computation and model accuracy. In this case, we set three typical deformable convolution configurations for each model, denoted as VGG19/SegNet-3, VGG19/SegNet-8, and VGG19/SegNet-F. As the convolution layers close to the output layer are usually smaller, we place the deformable convolution from the output layer toward the input layer of the neural networks to minimize the deformable-convolution-induced computation. VGG19/SegNet-3 and VGG19/SegNet-8 represent that the last three and the last eight convolution layers of VGG19/SegNet are replaced with deformable convolution layers, respectively. VGG19/SegNet-F represents that all the convolution layers are replaced with deformable convolution layers. Details of the benchmarks are summarized in Table 4.

Table 4.

Network | # of deformable Conv | # of Conv | Kernel types
VGG19-3 | 3 | 13 | 3
VGG19-8 | 8 | 8 | 3
VGG19-F | 19 | 0 | 3
SegNet-3 | 3 | 13 | 3
SegNet-8 | 8 | 8 | 3
SegNet-F | 16 | 0 | 3

Table 4. Neural Network Benchmark

Fig. 12.

Fig. 12. Normalized (to ARM) performance of DCNs on different computing architectures.

Table 5.

Structure | VGG19-3 | VGG19-8 | VGG19-F | SegNet-3 | SegNet-8 | SegNet-F
DCN-I | 10.2 | 11.6 | 70.6 | 18.7 | 25.7 | 123.2
DCN-II | 11.6 | 15.1 | 84.2 | 23.2 | 31.2 | 155.2

Table 5. Execution Time (s)

5.2 Performance Evaluation

The performance of the DCN execution on the different computing architectures is normalized to that on the ARM processor and shown in Figure 12, while the wall time of the execution on the ARM processor is detailed in Table 5. In general, DCNA achieves 515\(\times\) and 621\(\times\) higher performance on DCN-I and DCN-II, respectively, on average compared to a general ARM processor. DCN-II requires more sampling locations in the deformable convolution operations, so it involves more computation and random accesses, which leads to a larger execution time. Accordingly, a higher performance speedup is achieved for DCN-II on DCNA. On the other hand, we notice that DCNA achieves a much higher performance speedup on VGG19/SegNet-F compared to the other two configurations (i.e., VGG19/SegNet-3 and VGG19/SegNet-8) with fewer deformable convolution operations in the neural networks. The main reason is that the deformable convolution is rather challenging for the ARM processor due to the irregular memory accesses. The execution of deformable convolution operations on the ARM processor dominates the total DCN execution time and becomes the performance bottleneck of DCNs. In contrast, convolution with regular memory accesses is usually intensively optimized on ARM processors in PyTorch, so the performance speedup achieved on customized accelerators is relatively lower. When DCNs are deployed on the ARM+TPU architecture, only the convolution operations can be accelerated with the TPU while the deformable convolution is still executed on the ARM processor. According to Amdahl's law, the deformable convolution remains the performance bottleneck. Therefore, the performance speedup of the ARM+TPU architecture over the ARM processor is rather low, especially for VGG19/SegNet-F. Unlike the ARM+TPU architecture, the GPU in TX2 can implement the entire DCNs, and the deformable convolution operations can also benefit from the GPU parallel processing due to its powerful support for general tensor operations. Thus, significant performance speedup is achieved compared to the ARM+TPU architecture. DCNA, which is built on top of a conventional neural network accelerator and has customized circuit designs for both the standard convolution and the new deformable convolution, outperforms the GPU and exhibits a 2.21\(\times\) performance speedup on average. Similar to DCNA proposed in this work, MEDCN also adopts the layer fusion technique to merge the second convolution and the deformed feature generation and reduce the memory accesses. The major difference between DCNA and MEDCN is the way to address the irregular memory accesses in DCNs. MEDCN mainly utilizes a specialized register array to resolve the on-chip bank conflicts induced by the irregular memory accesses. In contrast, DCNA proposes an odd/even input buffer organization suited to the parallel BLI data flow to address the on-chip data conflicts, and it has additional tile scheduling based on pre-computed indices to exploit the input feature locality. Hence, DCNA outperforms MEDCN in general. DSEDCN is built on top of MEDCN. In addition, the DCN is divided into different layers that are allocated to different groups. The overall throughput of DSEDCN is improved through load balancing and global pipelining, while the locality problem remains unresolved.

Fig. 13.

Fig. 13. Energy consumption of DCNs on four computing architectures.

Table 6.

Structure | VGG19-3 | VGG19-8 | VGG19-F | SegNet-3 | SegNet-8 | SegNet-F
DCN-I | 15.9 | 17.1 | 55.8 | 29.1 | 34.3 | 99
DCN-II | 17.3 | 20.1 | 65.6 | 34.5 | 40.8 | 122.8

Table 6. Energy (J)

5.3 Energy Consumption Evaluation

The energy consumption of DCNs on the four computing architectures normalized to the baseline ARM processor is presented in Figure 13, while the actual energy consumption of the ARM processor is shown in Table 6. In particular, Figure 13 shows both the total energy consumption and the energy consumption distribution. Generally, DCNA with customized hardware acceleration shows the lowest energy consumption, which is 612\(\times\) lower than that of the ARM processor. The energy benefit is attributed to both the much smaller execution time brought by the performance acceleration, as illustrated in Figure 12, and the lower power consumption. For the ARM+TPU architecture, although the TPU can accelerate the convolution operations with little energy, considerable time and energy are still consumed by the deformable convolution operations on the ARM processor. The GPU on the Nvidia Jetson TX2 can accelerate both the convolution and the deformable convolution, so it greatly reduces the execution time, but its energy consumption remains \(9 \times\) higher on average than DCNA due to the much higher power consumption. Moreover, we notice that the percentage of the DRAM energy consumption on VGG19/SegNet-F is larger than that on VGG19/SegNet-3 on DCNA. The main reason is that VGG19/SegNet-F with more deformable convolution involves many irregular memory accesses, which lowers the memory access efficiency. In particular, DCNA handles the irregular memory accesses at the granularity of a tile. Usually only a portion of the data in a tile is required and much of the data remains unused even though the dependency is considered by the on-chip tile scheduler. In addition, some of the tiles may have to be repeatedly loaded due to the limited on-chip buffer and the irregular data reuse. Therefore, the memory access efficiency is much lower than that of a standard convolution with regular memory access patterns. The lower memory access efficiency eventually leads to higher DRAM energy consumption.

5.4 Chip Area Evaluation

The baseline neural network accelerator can be roughly divided into the on-chip Data Buffer (input feature/output feature/weight/bias buffer), the PE Array, and the Original Control Logic, while DCNA requires additional components including the Index Buffer, the Tile Dependency Table, and the Added Control Logic. The chip area of the different components is presented in Figure 14. It can be seen that DCNA induces only 6.6% additional chip area compared to the baseline design. Among the added hardware blocks, the index buffer, which needs to store the generated feature indices of a deformable convolution operation, takes up the most chip area. In contrast, the additional control logic and the tile dependency table consume negligible chip area. Since the feature indices generated in deformable convolution are usually reused among the different channels, they are much smaller in volume than the input/output features and weights. Hence, the index buffer size is much smaller compared to the data buffer in the baseline neural network accelerator.

Fig. 14.

Fig. 14. Chip area of the baseline neural network accelerator and DCNA.

5.5 Optimization Evaluation

Tile Scheduling: To evaluate the influence of the proposed tile scheduling on the DCN performance, we compare it with a naive implementation without bit vector-based dependency tracking (W/O bit vector) and an implementation with bit vector-based tracking but without tile scheduling (W/ bit vector + W/O scheduling). The naive implementation executes the output features sequentially and loads all the dependent input tiles as needed due to the lack of the overall dependency information. The implementation with bit vector-based tracking but without tile scheduling sequentially executes all the output tiles, but it loads the dependent input tiles of an entire output tile rather than a single output feature based on the corresponding TDT. The loaded input tiles will also be used for the calculation of all the features in the output tile such that the loaded tiles can be reused as much as possible. The proposed implementation considers both the bit vector-based tracking and the tile scheduling (W/ bit vector + W/ scheduling). It optimizes the ordering of both the output tile execution and the input tile loading for more efficient data reuse. The comparison is shown in Figure 15. It can be seen that VGG19/SegNet-F, with all the convolution layers replaced with deformable convolution, benefits most from the tile-based dependency tracking and scheduling. In contrast, VGG19/SegNet-3, with only three small convolution layers replaced with deformable convolution, exhibits a marginal performance speedup. This is mainly due to two reasons. First, the computation of the deformable convolution operations takes up only a small portion of the entire neural network computation, so there is little space left for performance improvement. Second, the sizes of the deformable convolution operations are small and the data including input features, weights, and output features can mostly be fully buffered. Hence, the tiles can be reused without any scheduling and the proposed tile scheduling shows minor performance improvement in this case. The performance improvement on DCN-I and DCN-II also differs. This is mainly caused by the fact that DCN-II involves more random sampling and thus benefits more from the DCNA acceleration in general.

Fig. 15.

Fig. 15. Influence of tile scheduling on the DCN performance.

Fig. 16.

Fig. 16. Influence of tile scheduling on the DCN energy consumption.

On top of the performance improvement, we also evaluate the influence of the tile scheduling on the energy consumption of the DCN execution. The experiment result is shown in Figure 16. Again, it can be seen that VGG19/SegNet-F with the most deformable convolution operations benefits the most and shows the least energy consumption, while VGG19/SegNet-3 with small deformable convolution operations has little optimization space for the tile scheduling. Generally, the energy reduction is attributed to both the reduced execution time according to Figure 15 and the lower power consumption brought by the reduced memory accesses. As the performance improvement is already discussed in the prior subsection, we mainly investigate the memory access reduction in this subsection. The total memory accesses issued by DCNA during the DCN processing are shown in Figure 17. By comparing the implementation without bit vector-based dependency tracking and the implementation with the bit vector but no scheduling, we observe that the bit vector-based tile dependency tracking removes substantial memory accesses. This is mainly achieved by avoiding repeatedly loading the same input tiles required by the calculation of different output features in the same output tile. The scheduling further reduces the memory accesses by inspecting the input tile reuse among the different output tiles according to the tile dependency table. The experiment shows that the proposed scheduling reduces the memory accesses by 40.7% on VGG19/SegNet-F on average compared to the implementation with only bit vector-based dependency tracking.

Fig. 17. Influence of tile scheduling on the number of DCN memory accesses.

Tile Sizing: Tile size is an important design parameter that determines the granularity of the tile dependency tracking and scheduling. Since most of the memory access latency in DCNA can be overlapped with the computing latency, the tile size mainly affects the memory access efficiency and ultimately the DRAM energy consumption, and has little influence on the performance. We take VGG19/SegNet-F, which contains the most deformable convolution operations, as an example and analyze the DRAM energy consumption under different tile sizes. The experiment result is shown in Figure 18. The DCN processing with the smallest tile size benefits most from DCNA and consumes the least DRAM energy. The main reason is that a smaller tile size allows finer-grained tile dependency tracking and scheduling, which improves the on-chip buffer utilization and the DRAM access efficiency. In contrast, when the tile size is large, the dependent input features of each output tile can spread across all the input tiles. In this case, all the input tiles have to be repeatedly loaded for each output tile calculation, leaving little optimization space for the proposed tile dependency tracking and scheduling.
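
The effect of the tile size can be illustrated with a simple Monte-Carlo estimate (the image size, kernel size, and fully random sampling below are illustrative assumptions; real DCN offsets are usually more local, and the numbers are not DCNA measurements): as the tile grows, a single output tile issues more samples and therefore depends on nearly all input tiles, which removes the reuse opportunities that the scheduler exploits.

    import random

    def fraction_of_input_tiles_needed(in_hw=256, tile=32, k=3, trials=10):
        """Estimate the fraction of all input tiles that one output tile depends
        on when sampling locations are uniformly random. One output tile covers
        tile*tile outputs, each sampling k*k arbitrary locations."""
        tiles_per_dim = in_hw // tile
        total_tiles = tiles_per_dim ** 2
        samples = tile * tile * k * k
        acc = 0.0
        for _ in range(trials):
            touched = {(random.randrange(in_hw) // tile,
                        random.randrange(in_hw) // tile) for _ in range(samples)}
            acc += len(touched) / total_tiles
        return acc / trials

With in_hw=256, a tile size of 8 leaves a sizable share of the 1,024 input tiles untouched by each output tile, whereas a tile size of 64 makes every output tile depend on all 16 input tiles, so each input tile is reloaded for every output tile and scheduling cannot help.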

Fig. 18. Normalized DRAM energy consumption under different tile sizes.

Layer Fusion: To reduce the intermediate data transmission to and from DRAM, we propose to fuse the processing stages in each deformable convolution operation. The influence of the layer fusion on the energy consumption is shown in Figure 19. The fusion reduces the energy consumption by more than 20% on VGG19/SegNet-F with the DCN-II structure. The main reason is that the memory access time in the two DCNs cannot be fully overlapped with the computation time due to the large deformable convolution operations involved, so the fusion, which avoids large intermediate data transmission, improves the DCN performance in addition to the memory access efficiency. Unlike VGG19/SegNet-F, the deformable convolution operations in the remaining scenarios are relatively small, and the fusion only reduces the memory accesses while the performance improves little. As the DRAM energy consumption accounts for only around 20% of the entire DCN energy consumption according to Figure 13, the energy reduction brought by the reduced memory accesses is limited. In particular, for VGG19/SegNet-3, the intermediate data can mostly be fully buffered on chip and there is little memory access optimization space left for the fusion. Therefore, the fusion shows little energy reduction.
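
A rough traffic model makes the fusion benefit concrete (all sizes below are illustrative and do not correspond to the DCNA buffer parameters): without fusion, the computed indices and the BLI-deformed features are written to DRAM by one stage and read back by the next, whereas with fusion they stay in on-chip buffers.

    def intermediate_dram_bytes(samples, channels, fused, bytes_per_value=1):
        """Intermediate DRAM traffic for one deformable convolution under a
        simple model: each intermediate tensor is written once and read once
        when the stages are not fused, and never leaves the chip when fused."""
        index_bytes = samples * 2 * bytes_per_value            # (y, x) per sample
        deformed_bytes = samples * channels * bytes_per_value  # gathered features
        return 0 if fused else 2 * (index_bytes + deformed_bytes)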

Fig. 19. Influence of layer fusion on the DCN energy consumption.


6 CONCLUSION

DCNs, which have been demonstrated to be effective in many practical scenarios with geometric or photometric image variations, include deformable convolution operations in addition to conventional neural network operations. The deformable convolution operations require random sampling over the entire input feature maps and thus incur considerable irregular memory accesses as well as BLI operations, which cannot be handled by existing neural network accelerator architectures. In this work, we revisit the conventional neural network accelerator architecture by introducing a runtime tile-based data dependency tracking and scheduling mechanism to address the irregular memory accesses and optimize the data reuse in DCNs. At the same time, we reorganize the BLI operations to fit them onto the 2D computing array for parallel processing. Finally, we fuse the different processing stages in each deformable convolution operation to enable on-chip data reuse and reduce the intermediate data transmission via DRAM. According to our experiments on representative neural networks with different deformable convolution configurations, the proposed DCNA, which supports both the standard convolution and the deformable convolution, achieves 45\(\times\)–546\(\times\) performance speedup over the ARM+TPU architecture that relies on the ARM processor to handle the deformable convolution. Compared to a GPU that can execute the entire DCNs, DCNA shows 3\(\times\) performance speedup and 18.6\(\times\) energy reduction.

Footnotes

  1. “offset” is used in many DCNs to represent the relative distance to the sliding window positions of a standard convolution. However, it is inconvenient to retrieve features with random offsets in hardware, so the offsets need to be converted to indices of the features instead. Since the conversion is trivial, we use index in both the algorithm description and the hardware description to keep the notation consistent across the article.
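
For illustration, the conversion and the subsequent feature retrieval can be sketched as follows (the zero-padding convention for out-of-range neighbors and the function names are assumptions for this sketch, not the exact DCNA datapath).

    import math

    def offset_to_index(py, px, dy, dx):
        """Convert a sliding-window position (py, px) plus a learned offset
        (dy, dx) into a fractional sampling index."""
        return py + dy, px + dx

    def bilinear_sample(feature, y, x):
        """BLI of a 2D feature map at the fractional index (y, x); out-of-range
        neighbors contribute zero under the assumed padding convention."""
        h, w = len(feature), len(feature[0])
        y0, x0 = math.floor(y), math.floor(x)
        val = 0.0
        for yy, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
            for xx, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
                if 0 <= yy < h and 0 <= xx < w:
                    val += wy * wx * feature[yy][xx]
        return val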


REFERENCES

[1] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. 2017. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 764–773.
[2] Zeyu Cao, Xiaorun Li, and Liaoying Zhao. 2019. Object detection in VHR image using transfer learning with deformable convolution. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS’19). IEEE, 326–329.
[3] Chen Zhang and Joohee Kim. 2019. Object detection with location-aware deformable convolution and backward attention filtering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’19). 9444–9453.
[4] Dong Zhang, Lan Li, Zizhong Zhu, Shangang Jin, Weizhe Gao, and Ce Li. 2019. Object detection algorithm based on deformable convolutional networks for underwater images. In Proceedings of the 2nd China Symposium on Cognitive Computing and Hybrid Intelligence (CCHI’19). 274–279.
[5] Liuyuan Deng, Ming Yang, Hao Li, Tianyi Li, Bing Hu, and Chunxiang Wang. 2019. Restricted deformable convolution-based road scene semantic segmentation using surround view cameras. IEEE Trans. Intell. Transport. Syst. (2019).
[6] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV’18). 801–818.
[7] Yuwen Xiong, Renjie Liao, Hengshuang Zhao, Rui Hu, Min Bai, Ersin Yumer, and Raquel Urtasun. 2019. UPSNet: A unified panoptic segmentation network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’19). 8810–8818.
[8] Yingyu Diao, Jingzhou Chen, and Yuntao Qian. 2020. Multi-label remote sensing image classification with deformable convolutions and graph neural networks. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS’20). IEEE, 521–524.
[9] Natasha Lu and Mandy Goenawan. 2018. DeformSketchNet: Deformable Convolutional Networks for Sketch Classification. Retrieved from http://cs230.stanford.edu/files_winter_2018/projects/6940505.pdf.
[10] Siew Cheng Lai, Hung Khoon Tan, and Phooi Yee Lau. 2021. 3D deformable convolution for action classification in videos. In Proceedings of the International Workshop on Advanced Imaging Technology (IWAIT’21). International Society for Optics and Photonics, 117660R.
[11] Xiao Sun, Bin Xiao, Fangyin Wei, Shuang Liang, and Yichen Wei. 2018. Integral human pose regression. In Proceedings of the European Conference on Computer Vision (ECCV’18). 529–545.
[12] Junwu Weng, Mengyuan Liu, Xudong Jiang, and Junsong Yuan. 2018. Deformable pose traversal convolution for 3D action and gesture recognition. In Proceedings of the European Conference on Computer Vision (ECCV’18). 136–152.
[13] Gedas Bertasius, Lorenzo Torresani, and Jianbo Shi. 2018. Object detection in video with spatiotemporal sampling networks. In Proceedings of the European Conference on Computer Vision (ECCV’18). 331–346.
[14] Khoi-Nguyen C. Mac, Dhiraj Joshi, Raymond A. Yeh, Jinjun Xiong, Rogerio R. Feris, and Minh N. Do. 2018. Locally-consistent deformable convolution networks for fine-grained action detection. arXiv preprint arXiv:1811.08815.
[15] Ziqiang Li, Hong Pan, Yaping Zhu, and A. K. Qin. 2020. PGD-UNet: A position-guided deformable network for simultaneous segmentation of organs and tumors. In Proceedings of the International Joint Conference on Neural Networks (IJCNN’20). 1–8.
[16] Marina Pominova, Ekaterina Kondrateva, Maksim Sharaev, Alexander Bernstein, Sergey Pavlov, and Evgeny Burnaev. 2019. 3D deformable convolutions for MRI classification. In Proceedings of the 18th IEEE International Conference on Machine Learning and Applications (ICMLA’19). 1710–1716.
[17] Qijing Huang, Dequan Wang, Yizhao Gao, Yaohui Cai, Zhen Dong, Bichen Wu, Kurt Keutzer, and John Wawrzynek. 2019. Algorithm-hardware co-design for deformable convolution. In Proceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS’19). IEEE, 48–51.
[18] Saehyun Ahn, Jung-Woo Chang, and Suk-Ju Kang. 2020. An efficient accelerator design methodology for deformable convolutional networks. In Proceedings of the IEEE International Conference on Image Processing (ICIP’20). 3075–3079.
[19] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In Proceedings of the ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA’16). IEEE, 367–379.
[20] Jorge Albericio, Patrick Judd, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, and Andreas Moshovos. 2016. Cnvlutin: Ineffectual-neuron-free deep neural network computing. ACM SIGARCH Comput. Architect. News 44, 3 (2016), 1–13.
[21] Michael Bromberger, Pascal Bastian, Jan-Philip Bergeest, Christian Conrad, Vincent Heuveline, Karl Rohr, and Wolfgang Karl. 2016. FPGA-accelerated Richardson-Lucy deconvolution for 3D image data. In Proceedings of the IEEE 13th International Symposium on Biomedical Imaging (ISBI’16). IEEE, 132–135.
[22] Ying Wang, Jie Xu, Yinhe Han, Huawei Li, and Xiaowei Li. 2016. DeepBurning: Automatic generation of FPGA-based learning accelerators for the neural network family. In Proceedings of the 53rd Annual Design Automation Conference. ACM, 1–10.
[23] Chen Zhang, Guangyu Sun, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason Cong. 2018. Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks. IEEE Trans. Comput.-Aided Design Integr. Circ. Syst. (2018).
[24] Cheng Liu, Cheng Chu, Dawen Xu, Ying Wang, Qianlong Wang, Huawei Li, Xiaowei Li, and Kwang-Ting Cheng. 2021. HyCA: A hybrid computing architecture for fault-tolerant deep learning. IEEE Trans. Comput.-Aided Design Integr. Circ. Syst. 41, 10 (2021), 3400–3413.
[25] Dawen Xu, Cheng Chu, Qianlong Wang, Cheng Liu, Ying Wang, Lei Zhang, Huaguo Liang, and Kwang-Ting Cheng. 2020. A hybrid computing architecture for fault-tolerant deep learning accelerators. In Proceedings of the IEEE 38th International Conference on Computer Design (ICCD’20). IEEE, 478–485.
[26] Dawen Xu, Kaijie Tu, Ying Wang, Cheng Liu, Bingsheng He, and Huawei Li. 2018. FCN-engine: Accelerating deconvolutional layers in classic CNN processors. In Proceedings of the International Conference on Computer-Aided Design. ACM, 22.
[27] Amir Yazdanbakhsh, Kambiz Samadi, Nam Sung Kim, and Hadi Esmaeilzadeh. 2018. GANAX: A unified MIMD-SIMD acceleration for generative adversarial networks. In Proceedings of the 45th Annual International Symposium on Computer Architecture. IEEE Press, 650–661.
[28] Jiale Yan, Shouyi Yin, Fengbin Tu, Leibo Liu, and Shaojun Wei. 2018. GNA: Reconfigurable and efficient architecture for generative network acceleration. IEEE Trans. Comput.-Aided Design Integr. Circ. Syst. 37, 11 (2018), 2519–2529.
[29] Xinyu Zhang, Srinjoy Das, Ojash Neopane, and Ken Kreutz-Delgado. 2017. A design methodology for efficient implementation of deconvolutional neural networks on an FPGA. arXiv preprint arXiv:1705.02583.
[30] Kartik Hegde, Rohit Agrawal, Yulun Yao, and Christopher W. Fletcher. 2018. Morph: Flexible acceleration for 3D CNN-based video understanding. In Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’18). IEEE, 933–946.
[31] Hai Wang, Mengjun Shao, Yan Liu, and Wei Zhao. 2017. Enhanced efficiency 3D convolution based on optimal FPGA accelerator. IEEE Access 5 (2017), 6909–6916.
[32] Hongxiang Fan, Ho-Cheung Ng, Shuanglong Liu, Zhiqiang Que, Xinyu Niu, and Wayne Luk. 2018. Reconfigurable acceleration of 3D-CNNs for human action recognition with block floating-point representation. In Proceedings of the 28th International Conference on Field Programmable Logic and Applications (FPL’18). IEEE, 287–2877.
[33] C. Chu, Fan Chen, D. Xu, and Y. Wang. 2021. RECOIN: A low-power processing-in-ReRAM architecture for deformable convolution. In Proceedings of the Great Lakes Symposium on VLSI (GLSVLSI’21).
[34] Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S. Modha. 2019. Learned step size quantization. In Proceedings of the International Conference on Learning Representations.
[35] Python Cache Hierarchy Simulator. (n.d.). Retrieved from https://github.com/RRZE-HPC/pycachesim.
[36] Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder. 2016. Fused-layer CNN accelerators. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’16). IEEE, 1–12.
[37] Yue Yu, Jiapeng Luo, Wendong Mao, and Zhongfeng Wang. 2021. A memory-efficient hardware architecture for deformable convolutional networks. In Proceedings of the IEEE Workshop on Signal Processing Systems (SiPS’21). IEEE, 140–145.
[38] Yuan Meng, Hongjiang Men, and Viktor Prasanna. 2022. Accelerator design and exploration for deformable convolution networks. In Proceedings of the IEEE Workshop on Signal Processing Systems (SiPS’22). IEEE, 1–6.
[39] Ananda Samajdar, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar Krishna. 2018. SCALE-Sim: Systolic CNN accelerator simulator. arXiv preprint arXiv:1811.02883.
[40] Micron. (n.d.). Retrieved from https://www.micron.com/support/tools-and-utilities/power-calc.
[41] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[42] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. 2017. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39, 12 (2017), 2481–2495.
