FPGA-based systolic deconvolution architecture for upsampling

A deconvolution accelerator is proposed to upsample an n × n input to a 2n × 2n output by convolving with a k × k kernel. The architecture avoids the insertion and padding of zeros and thus eliminates the redundant computations, achieving high resource efficiency with a reduced number of multipliers and adders. The architecture is systolic and governed by a single reference clock, enabling modules to be placed sequentially to form a pipelined decoder framework. The proposed accelerator is implemented on a Xilinx XC7Z020 platform and achieves a performance of 3.641 giga operations per second (GOPS) with a resource efficiency of 0.135 GOPS/DSP when upsampling a 32 × 32 input to a 256 × 256 output using a 3 × 3 kernel at 200 MHz. Furthermore, its high peak signal-to-noise ratio of almost 80 dB illustrates that the upsampled outputs of the bit-truncated accelerator are comparable to IEEE double-precision results.


INTRODUCTION
For the past decade, Deep Neural Networks (DNNs) have been effectively employed in various applications of computer vision (Dongseok et al., 2019; Chen et al., 2014), speech recognition (Han et al., 2017) and image segmentation (Ronneberger, Fischer & Brox, 2015). Most of these applications concentrate on classification and segmentation problems. Convolutional layers form the primary modules of these DNNs, where stacks of kernels are convolved with the input images to generate feature maps, which are subsequently passed through pooling and rectification layers to identify the dominant features (Ma et al., 2016). The sequence of convolution, rectification and pooling operations is repeated until denser features are acquired from a larger receptive field. Finally, the feature maps are flattened and presented to a fully connected layer which provides a classification score. Over the years, researchers have attempted to implement a few notable DNNs on hardware, such as AlexNet and VGG-16 (Lu et al., 2020), using fewer resources while delivering higher throughput (Di et al., 2020; Lu et al., 2020). In general, these methods suffer from a common problem related to the pooling layer, which gathers information from a larger receptive field but loses the significant spatial coordinates from which the information was obtained. To overcome this problem, DNN architectures incorporating encoder and decoder modules have been proposed; amongst them, the U-Net proposed by Ronneberger, Fischer & Brox (2015) is the most popular model and is mainly used for segmentation applications. In the U-Net architecture, the feature maps that are downsampled in the encoder framework are later upsampled in the decoder stages.
Furthermore, the decoder module of the U-Net and its variants include skip connections along with transpose convolution, also referred to as upsampler or deconvolution modules, to generate segmentation results of resolution equivalent to the input resolution (Ronneberger, Fischer & Brox, 2015).
Although many hardware implementations have been produced for the encoder module (which is similar to the VGG-16 architecture (Lu et al., 2020)), there are very few implementations of the decoder module, which involves the bottleneck associated with the transpose convolution operation. One of the earliest deconvolution implementations on hardware was proposed by Zhang et al. (2017), where reverse looping and stride hole skipping mechanisms respectively ensure efficient deconvolution through the selection of input blocks based on the output space and the removal of fractional addresses within the looping procedures. The deconvolution accelerator used C-based Vivado HLS libraries, where loop unrolling and pipelining techniques were introduced to exploit parallelism on a Zynq-7000 series FPGA. Dongseok et al. (2019) presented a lightweight CNN segmentation processor that includes: (i) dilated convolutions (insertion of virtual zeros within the kernel elements) for normal convolutions; (ii) transpose convolutions (insertion of virtual zeros within the feature maps) for enlargement of the feature maps; and (iii) a region of interest (ROI) based selection algorithm to enhance the throughput of the segmentation model. Dongseok et al. (2019) reported that their model, when tested on a segmentation application, reduced the operational cost by 86.6% and increased the throughput (GOPS) by 6.7 times. Lu et al. (2020) introduced the Fast Winograd Algorithm (FWA) to reduce the arithmetic complexity involved in the convolution operations and thereby improve the performance of CNN implementations on FPGAs. The FWA exploits the structural similarity of the input feature maps and transforms the convolution operations into Element-Wise Multiplication Manipulation (EWMM), which reduces the number of multiplications while increasing the number of additions. Di et al. (2020) extended the use of the FWA to transposed convolution implementations on FPGA, where the feature maps presented to the TransConv module were extended (by padding and introducing zeros between the elements) and decomposed into four smaller sub-blocks. By applying the FWA in parallel to these sub-blocks, the convolution output was obtained through element-wise multiplication of the input elements with the corresponding kernel coefficients. A performance improvement of 8.6 times was reported. However, the method is inefficient since the FWA is suitable only for small kernels (Shi et al., 2019).
A reconfigurable generative network accelerator (GNA) with flexible bit widths for both inputs and kernel weights was proposed by Yazdanbakhsh et al. (2018). Inter- and intra-processing element (PE) processing and cross-layer scheduling mechanisms are engaged to support the computations in the convolution, deconvolution and residual blocks. The inclusion of the dual convolution mapping method (where convolutions are associated with the outputs and deconvolutions are mapped to the inputs) efficiently balances the PE workload in the convolution and deconvolution modules, and improves PE utilization by 61% when compared to traditional methods. The GNA reported 409.6 giga operations per second (GOPS) at 200 MHz with 142 mW power consumption. A convolution and deconvolution architecture capable of generating segmentation outputs close to real time was presented by Liu et al. (2018). The deconvolution module does not require the addition of zeros between the input elements and produces upsampled outputs through a series of operations, viz: (i) multiplication of a single input pixel with the kernels; (ii) addition of the overlapped outputs; and (iii) removal of the outputs along the borders. An automatic hardware mapping framework based on MATLAB and C scripts was employed to select the best design parameters, which were then used to generate synthesizable HDL code for implementation on the Xilinx Zynq board. A U-Net architecture was implemented and its performance was compared with GPU and CPU implementations; it achieved the best power and energy performance, with its speed second only to the GPU implementation. Chang & Kang (2018) presented a massively parallelized deconvolution accelerator, referred to as the TDC method, obtained by transforming the deconvolution operator into four sparse convolutions.
To avoid the overlapping summation problem, the height and width of the input images have to be chosen so that the generated output blocks do not overlap. The method also has a load imbalance problem caused by the weights of the decomposed sparse convolution filters. Later, in Chang, Kang & Kang (2020), the same authors optimized the TDC by rearranging the filters, which enabled the DCNN accelerator to achieve better throughput. When implemented using the C-based Vivado HLS tool, the optimized TDC achieved 108 times greater throughput than the traditional DCNN.
We propose an FPGA-based scalable systolic deconvolution architecture (for different n × n inputs and k × k kernels) with a reduced number of multipliers and adders, requiring no additional padding or insertion of zeros between the inputs. Our contributions are as follows:
1. We present a Register Transfer Level (RTL) deconvolution architecture capable of upsampling an n × n input to a 2n × 2n output when convolved with a k × k kernel. The proposed module can be used standalone or readily connected in a pipeline to represent the decoder framework of the U-Net or a deconvolutional CNN. We present upsampled outputs for the intervals 32 × 32 to 64 × 64, 64 × 64 to 128 × 128 and 128 × 128 to 256 × 256, and compare the bit-width-truncated FPGA results with double-precision MATLAB outputs.
2. The proposed architecture is systolic and governed by a single reference clock. After an initial latency, an upsampled element is obtained at every clock pulse, which is then streamed to the next stage of the pipeline for further processing. A pipelined version capable of generating a 256 × 256 output from a 32 × 32 input using a 3 × 3 kernel requires only 826.55 µs when operating at a frequency of 200 MHz.
3. The proposed architecture is coded in Verilog HDL and is hence free of the additional overheads associated with mapping CPU-based algorithms directly to FPGAs. Moreover, the deconvolution architecture comprises simple hardware structures such as shift register blocks, counters, comparators and FIFOs, and can thus be extended to provide upsampled outputs by convolving with different kernel sizes. We also present the relevant equations to upsample n × n to 2n × 2n using 5 × 5 and 7 × 7 kernels, and in 'Hardware Implementation of the Upsampling Pipeline' we present the hardware implementation of upsampling a random 32 × 32 matrix to 256 × 256 using 3 × 3 filters.
This paper is organized as follows.
'Upsampling Techniques' introduces the upsampling techniques used in deep networks. 'Deconvolution Hardware Architecture' presents the implementation of 4 × 4 to 8 × 8 deconvolution architecture. 'Design of Experiments' presents the experiments related to bit width requirements. 'Analysis of the Deconvolution Accelerator' discusses the required computation time, computation complexity and comparison results with other deconvolution architectures. 'Hardware Implementation of the Upsampling Pipeline' illustrates the implementation of the upsampling pipeline and finally 'Conclusion' summarizes our contributions.

UPSAMPLING TECHNIQUES
The following upsampling methods are used in deep networks: (i) interpolation techniques (Lee & Yoon, 2010); (ii) max unpooling (Shelhamer, Long & Darrell, 2016); and (iii) transpose convolution (Chang, Kang & Kang, 2020). Interpolation may be performed by K-nearest neighbours, bilinear or bicubic interpolation, or bed of nails. The first two methods introduce new samples either through direct copying or through a distance-based weighted averaging of the neighbouring inputs. With bed of nails, upsampling is performed by inserting zeros at all positions other than those of the copied input elements. The max unpooling operator introduced in the decoder pipeline acts opposite to the max pooling operation of the encoder framework: during the forward pass, at each max pooling operation, the positional indices of the maximum values are stored; later, during decoding, upsampling is performed by mapping the inputs at each stage to the corresponding coordinates, with the rest filled with zeros. This technique is employed in SegNet (Badrinarayanan, Kendall & Cipolla, 2017), where the coordinates of the maximum values of the feature maps obtained during the forward pass are used for the unpooling process during the decoding stages. The above techniques, though simple and efficient, have a fixed relationship between input and output and are therefore independent of the associated data. Hence they find less usage in deep networks, where generalization through learning from the inputs is a fundamental requirement.
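These fixed (non-learned) upsampling rules are easy to sketch in software. The toy example below, using plain Python and our own helper names, shows nearest-neighbour copying and bed-of-nails zero insertion on a 2 × 2 input:

```python
# Toy sketches of two fixed upsampling rules on a 2x2 -> 4x4 example.

def nearest_neighbour(x):
    """Each input pixel is copied into a 2x2 block of the output."""
    n = len(x)
    return [[x[i // 2][j // 2] for j in range(2 * n)] for i in range(2 * n)]

def bed_of_nails(x):
    """Input pixels are kept at the even coordinates; all other positions are zero."""
    n = len(x)
    out = [[0] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            out[2 * i][2 * j] = x[i][j]
    return out

x = [[1, 2],
     [3, 4]]
print(nearest_neighbour(x))  # [[1,1,2,2],[1,1,2,2],[3,3,4,4],[3,3,4,4]]
print(bed_of_nails(x))       # [[1,0,2,0],[0,0,0,0],[3,0,4,0],[0,0,0,0]]
```

Both rules are purely positional, which is precisely why they cannot adapt to the data.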
In recent years, many deep learning architectures employ transposed convolution for deconvolution. Transpose convolution can be regarded as the process of recovering the dimensions of the initial feature map, with no guarantee of recovery of the actual inputs since it is not an inverse of the convolution operation. Upsampling using transpose convolution can be achieved by: (i) a sparse convolution matrix (SCM) (Liu et al., 2015); or (ii) fractionally strided convolutions (FSC) (Zhang et al., 2017; Yazdanbakhsh et al., 2018; Chang & Kang, 2018; Di et al., 2020). In SCM-based upsampling, the 2D convolution process can be regarded as the multiplication of an SCM with an input image I. The convolution operation for an 8 × 8 input image with a 5 × 5 kernel, giving a 4 × 4 valid convolution output O, is expressed as SCM16×64 × I64×1 = O16×1, where the SCM encodes the spatial positions of the kernel as it slides across the image, with k(0,0), k(0,1), k(0,2), ..., k(4,4) denoting the kernel values at the corresponding positions. I64×1 is the flattened input that enables the matrix multiplication and O16×1 denotes the flattened output, which is finally reshaped to O4×4. The numbers of rows and columns of the SCM depend on the numbers of output and input elements, respectively. Using the above relations, the backward pass which recovers the input resolution (4 × 4 to 8 × 8) is trivial: transposing the SCM gives SCMT64×16 × O16×1 = I64×1. SCM or SCMT, which contains the positional coordinates of the kernel, defines the forward or transpose convolution.
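The SCM construction can be checked numerically. The sketch below is our own code and uses a smaller 4 × 4 input with a 3 × 3 kernel (so the matrix stays printable) rather than the paper's 8 × 8 / 5 × 5 case; it builds the SCM, applies it for the forward valid convolution, and applies its transpose for the backward pass:

```python
# SCM-based convolution / transposed convolution on a small 4x4 / 3x3 case.

def build_scm(n, kernel):
    """Rows index the valid-convolution outputs; columns index the flattened input."""
    k = len(kernel)
    m = n - k + 1                               # valid output side
    rows = []
    for oi in range(m):
        for oj in range(m):
            row = [0.0] * (n * n)
            for ki in range(k):
                for kj in range(k):
                    row[(oi + ki) * n + (oj + kj)] = kernel[ki][kj]
            rows.append(row)
    return rows

def matvec(mat, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]

kernel = [[0, 1, 0], [1, 2, 1], [0, 1, 0]]
scm = build_scm(4, kernel)                      # 4 x 16 here (16 x 64 in the text)
i_flat = list(range(1, 17))                     # flattened 4x4 input
o_flat = matvec(scm, i_flat)                    # forward valid convolution
scm_t = [list(col) for col in zip(*scm)]        # 16 x 4 transpose
back = matvec(scm_t, o_flat)                    # transpose convolution (backward pass)
print(o_flat, len(back))
```

The transpose maps the 4 outputs back onto the 16 input positions, illustrating how SCMT defines the deconvolution, though the recovered values are not the original inputs.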
The traditional convolution process can also be employed to upsample an n × n input to a 2n × 2n output by convolving with a k × k kernel (Kk×k). As the kernel is strided across the input, the convolution operator has to provide contributions associated only with the elements present within the k × k window. Thus, to maintain the connectivity pattern and obtain interpolated outputs, it is convenient to introduce zeros between the input elements before convolution. This procedure introduces fractional-level convolution, commonly referred to as FSC.
To upsample an input image In×n, an intermediate extended image El×l is created by: (i) insertion of (s − 1) zeros between the input elements; (ii) padding zeros (p) around the boundaries; and (iii) padding zeros (a) along the bottom and right edges of the input In×n. Table 1 summarizes the description of all the parameters and Fig. 1 illustrates El×l, where a = (n + 2p − k) mod s and p = (k − 1)/2. Next, El×l is convolved with the corresponding kernel Kk×k to obtain the upsampled output Om×m, i.e., Om×m = El×l ⊛ Kk×k, where ⊛ denotes the valid convolution operation, l = (2 × n − 1) + a + 2p and m = 2n = s × (n − 1) + a + k − 2p. To upsample I2×2 using K3×3, p = 1, a = 1, l = 6 and m = 2n = 4, i.e., O4×4. Thus, FSC can be readily employed to upsample an n × n input to a 2n × 2n output. Both SCM and FSC, when used for upsampling, require the introduction of zeros (either in SCM or in E), and Table 2 illustrates the number of zeros added for different upsampling intervals.
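A minimal software model of FSC follows the construction above: insert s − 1 zeros between elements, pad p zeros around the boundary and a along the bottom/right, then take a plain valid convolution. The function name and the s = 2 default are our own:

```python
# FSC upsampling sketch: build the extended image E, then valid-convolve it.

def fsc_upsample(x, kernel, s=2):
    n, k = len(x), len(kernel)
    p = (k - 1) // 2
    a = (n + 2 * p - k) % s
    l = s * (n - 1) + 1 + a + 2 * p          # side of the extended image E
    e = [[0] * l for _ in range(l)]
    for i in range(n):
        for j in range(n):
            e[p + s * i][p + s * j] = x[i][j]
    m = l - k + 1                            # valid-convolution output side
    return [[sum(e[i + ki][j + kj] * kernel[ki][kj]
                 for ki in range(k) for kj in range(k))
             for j in range(m)] for i in range(m)]

x = [[1, 2], [3, 4]]
k3 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
y = fsc_upsample(x, k3)
print(len(y), len(y[0]))   # 4 4 -> a 2n x 2n output, as expected
```

For this I2×2 / K3×3 case the model reproduces the stated parameters p = 1, a = 1, l = 6 and m = 4.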
Thus, when implemented on hardware, the redundant operations (due to the zeros) consume large amounts of resources, which generally lowers performance. However, when compared across different upsampling intervals, SCM requires exponential padding of zeros along the rows and columns; thus, like many hardware implementations (Di et al., 2020; Chang, Kang & Kang, 2020), we use the FSC technique to upsample the inputs. Though the proposed method, like Liu et al. (2018) and Chang, Kang & Kang (2020), employs four convolution patterns for upsampling, it efficiently decomposes the kernels into four simple, efficient and independent equations that avoid the redundant zeros required for FSC-based upsampling.

DECONVOLUTION HARDWARE ARCHITECTURE
To upsample an n × n input to a 2n × 2n output using FSC requires dilation of the input, as explained in the previous section. However, inserting and padding zeros is not viable in practical hardware implementations. Thus the proposed architecture consists of the following modules:
1. A shift register (SR) module used for temporary buffering of the streamed inputs. The input passes through a series of flip-flops (FFs), FF1 to FFn, in a systolic manner governed by a common reference clock.
2. PEs used to compute the interpolated outputs by multiplying the inputs from the shift registers with the stored kernel coefficients.
3. A Data Control Module (DCM), which consists of two control switches (CSW1 and CSW2) and four FIFOs arranged in parallel. CSW1 facilitates the temporary storage of PE outputs and CSW2 enables the systolic streaming of the upsampled results.
The lengths of the FIFOs and the SR module depend on the kernel size and the upsampling interval, i.e., 4 × 4 to 8 × 8 or 8 × 8 to 16 × 16, etc., and Table 3 illustrates the size requirements for different kernels and upsampling intervals.
As the input data progresses at a prescribed data rate into the SR module of the deconvolution accelerator, the PEs multiply the input data with the corresponding kernel coefficient. The control switches of the DCM then enable efficient storage, retrieval and streaming of the upsampled data.

Overview of 4 × 4 to 8 × 8 deconvolution architecture
To upsample a 4 × 4 input to an 8 × 8 output using FSC, a temporary extended image E of size 10 × 10 is created by inserting zeros between the input elements (shown as white grids in Fig. 1), padding around the boundaries (shown as red grids) and along the right and bottom edges (shown as cyan grids). As the 3 × 3 kernel slides across E, the output is computed from four computational patterns expressed in colours: pink, blue, yellow and green. For example, when the kernel is placed at the top left corner of E, the output O1, shown as the pink grids of the output image O8×8, is computed by multiplying the input d1 with the central element k5 of the kernel, i.e.,
O1 = d1 × k5. (4)
Likewise, progressing with a stride of 1 along the row followed by the column, the interpolated elements corresponding to the 8 × 8 output are obtained from the 4 × 4 input. For example, when the kernel is strided along the row and column, the blue and yellow grids of O8×8 give the interpolated outputs O2 and O3, i.e.,
O2 = d1 × k4 + d2 × k6, (5)
O3 = d1 × k2 + d5 × k8. (6)
Similarly, the green grid denoted by O4 computes the output
O4 = d1 × k1 + d2 × k3 + d5 × k7 + d6 × k9. (7)
Figures 2A-2D illustrate the four computation patterns, where k1, k2, k3, ..., k9 respectively correspond to the 3 × 3 kernel coefficients 1, 2, 3, ..., 9, and d1, d2, d3, ..., d16 respectively denote the 4 × 4 input elements 1, 2, 3, ..., 16. Thus, by extending the 4 × 4 input and employing Eqs. (4) to (7), we can compute the required 8 × 8 upsampled outputs (the accompanying MATLAB code compares the upsampled outputs obtained from Eqs. (4) to (7) with the FSC results). The deconvolution architecture to upsample a 4 × 4 input to an 8 × 8 output by convolving with a 3 × 3 kernel is shown in Fig. 1 and, according to Table 3, the architecture requires: (i) an SR module of length 5 to allow buffering and enable computations to be performed in parallel; (ii) 4 PEs to compute Eqs. (4) to (7); (iii) 4 FIFOs, each of length 16, to store the upsampled outputs; and (iv) a DCM comprising multiplexers and 4 counters (count1, count2, count3, count4) for indexing the rows and columns of the input and output, respectively.
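Assuming the pattern indexing of Figs. 1 and 2 (input elements d row-major, kernel coefficients k1...k9 row-major, zeros outside the input borders), the four computation patterns of Eqs. (4) to (7) can be modelled behaviourally. This is a software sketch of the arithmetic, not the RTL:

```python
def decomposed_upsample(x, k):
    """Four-pattern upsampling: x is n x n, k is the 3x3 kernel (k1..k9 row-major)."""
    n = len(x)
    d = lambda i, j: x[i][j] if 0 <= i < n and 0 <= j < n else 0  # zeros outside
    out = [[0] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            out[2*i][2*j]     = d(i, j) * k[1][1]                        # O1, Eq. (4)
            out[2*i][2*j+1]   = d(i, j) * k[1][0] + d(i, j+1) * k[1][2]  # O2, Eq. (5)
            out[2*i+1][2*j]   = d(i, j) * k[0][1] + d(i+1, j) * k[2][1]  # O3, Eq. (6)
            out[2*i+1][2*j+1] = (d(i, j) * k[0][0] + d(i, j+1) * k[0][2]
                                 + d(i+1, j) * k[2][0] + d(i+1, j+1) * k[2][2])  # O4, Eq. (7)
    return out

y = decomposed_upsample([[1, 2], [3, 4]], [[1, 2, 3], [4, 5, 6], [7, 8, 9]])
for row in y:
    print(row)
```

No zero is ever multiplied here: each output element touches only the nonzero input elements that fall under its window, which is the source of the operation savings over FSC.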
The length of the SR module is based on the kernel size and the input resolution. In general, the length of the SR module (NumSR) is given by NumSR = ((k − 1)/2) × n + (k − 1)/2. For I4×4 and K3×3, the length of the SR module is 5. Furthermore, the length of each FIFO is fixed at n × n; since the input is 4 × 4, the FIFOs have a length of 16.
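The sizing rules above can be captured in two small helpers (the names are ours):

```python
def sr_length(n, k):
    """Num_SR = ((k-1)/2) * n + (k-1)/2; for I4x4 and K3x3 this gives 5."""
    return (k - 1) // 2 * n + (k - 1) // 2

def fifo_length(n):
    """Each of the four FIFOs holds n * n upsampled elements."""
    return n * n

print(sr_length(4, 3), fifo_length(4))   # 5 16
```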
The PEs are hardware wired for a particular upsampling interval and kernel size, and execute in parallel to compute one of Eqs. (4) to (7). For example, PE 1 receives input from SR 1 and PE 2 receives inputs from both SR 1 and D 0 . The input and output connections of each PEs and their associated kernel coefficients are shown Fig. 3, where SR 1 , SR 2 , SR 4 and SR 5 are respectively the outputs of the flip flops FF 1 , FF 2 , FF 4 and FF 5 of the SR module.
To explain the operation of the module we use the same inputs and kernel coefficients as shown in Fig. 1, and the timing diagram of the generation of the outputs for the first 24 clock cycles is shown in Fig. 4. Once the signal De is enabled, the deconvolution accelerator is active and the input data (signal D0 in the timing diagram) enters the SR module and propagates forward through FF1 to FF5 at the positive edge of the clock. At time T = t2, PE1 and PE2 simultaneously receive their inputs from the SR module (SR1 and D0), which are then multiplied with the corresponding kernel coefficients of K3×3 to produce the outputs O1 and O2, respectively, i.e.,
O1 = SR1 × k5, (8)
O2 = SR1 × k4 + D0 × k6. (9)
Subsequently, as the input data advances between clocks T = t3 and T = t6, employing just PE1 and PE2, the upsampled elements of the first row (Row1) of O8×8 are computed. Due to the zero padding at the rightmost boundary of the extended image, the last computation within PE2 requires just the multiplication SR1 × k4. This is achieved by employing a counter (count2) to track the column indices and notify the multiplexer, as shown in Fig. 3B. The architectures of PE1 and PE2 are shown in Figs. 3A and 3B, respectively.
To compute the upsampled elements of Row2 and Row3, PE3 and PE4 operate in parallel along with PE1 and PE2. At clock T = t6, all the PEs simultaneously receive their inputs (D0, SR1, SR4 and SR5) from the SR module, which are then multiplied with the corresponding kernel coefficients to simultaneously produce the respective outputs. Figures 3C and 3D illustrate the architectures of PE3 and PE4, where
O3 = SR5 × k2 + SR1 × k8, (10)
O4 = SR5 × k1 + SR4 × k3 + SR1 × k7 + D0 × k9. (11)
Here, O3 and O4 represent the outputs of PE3 and PE4, respectively. The availability of input data at every clock cycle and the parallel execution of the PEs enable the deconvolution accelerator to compute all 16 interpolated outputs of Row2 and Row3 of O8×8 within 4 clock cycles, i.e., between T = t7 and T = t10. As the input data proceeds into the deconvolution module, the elements of Row4 to Row7 are computed in a similar fashion. Finally, to compute Row8 of O8×8 (the row index is traced using count1), only PE3 and PE4 execute in parallel using Eqs. (10) and (11), as shown in Fig. 5. The FIFO write and read commands are synchronised with the input clock of the accelerator module, and a series of controls generated by the DCM enables effective writing and streaming of the upsampled outputs from the FIFOs.

DCM of 4 × 4 to 8 × 8 deconvolution architecture
The DCM is shown in Fig. 6 and consists of two control switches, CSW1 and CSW2, that assist in the generation of the FIFO write and read commands, enabling temporary storage and retrieval of the data. CSW1 and CSW2 are controlled by counters count1 and count3, which track the row indices of the input and the output, respectively. The FIFO write cycle is as follows:
1. To store Row1 of O8×8: initially count1 = 0 and CSW1 = 0; PE1 and PE2 execute in parallel with their outputs stored in FIFO1 and FIFO2, respectively, while FIFO3 and FIFO4 are write disabled.
2. To store Row2 to Row7 of O8×8 (beginning T = t7): count1 increments from 1 to 3 and CSW1 = 1; PE1, PE2, PE3 and PE4 execute in parallel and all the FIFOs are write enabled. PE3 and PE4 are connected to FIFO1 and FIFO2, whereas PE1 and PE2 are linked to FIFO3 and FIFO4. The FIFO inputs are interchanged to simplify reading of the outputs during the read cycle.
3. Finally, for Row8 of O8×8: count1 = 4 and CSW1 = 1; only PE3 and PE4 execute in parallel and their outputs are connected to FIFO1 and FIFO2.
The read operation is managed by CSW2, and the Read signal is asserted β clock cycles after De = 1, where β = θ + FIFOdelay. Here θ (refer to 'Computation time of single Deconvolution Accelerator') represents the delay before a valid sample is available at the output of the PEs, and normally FIFOdelay = 2 clock cycles. Thus, to upsample 4 × 4 to 8 × 8 using a 3 × 3 kernel, we set β = 4 (θ = 2; for details refer to 'Computation time of single Deconvolution Accelerator'). Once Read is asserted, count3 and count4 respectively track the number of rows and columns of O8×8, and the data is read from the FIFOs using separate signals (Fr1, Fr2, Fr3 and Fr4) controlled by the line control (LC) and transfer control (TF) signals, as shown in Fig. 6.
With LC = 1 or 0, and based on the rising edge of TF, the data is read from the corresponding FIFO in an orderly manner, i.e.,
Fr1 = !TF && LC, (12)
Fr2 = TF && LC, (13)
Fr3 = !TF && !LC, (14)
Fr4 = TF && !LC, (15)
where ! and && denote the logical NOT and logical AND operations, respectively. The FIFO read cycle is as follows:
1. Initially, to read Row1 of O8×8: count3 = 0, LC = 1 and TF toggles every clock cycle. The generated read signals Fr1 and Fr2, from Eqs. (12) and (13), control the read operations of FIFO1 and FIFO2, respectively.
2. To read Row2 to Row8 of O8×8: starting at T = t13, count3 increments from 1 to 7, LC toggles on each update of count3 and TF toggles every clock cycle, as shown in Fig. 4. When LC = 0, the computed results are read from FIFO3 and FIFO4 using Eqs. (14) and (15); when LC = 1, FIFO1 and FIFO2 are enabled for reading. Note that count3 is controlled by the column counter count4, which counts from 0 to 2n − 1.
The read cycle of the DCM introduces a delay (DCMdelay) of 3 clock cycles before the outputs are streamed in a systolic manner regulated by the reference clock. The proposed deconvolution architecture can be extended to various upsampling intervals by simply extending the number of FFs within the SR module. The number of PEs remains the same but their inputs differ. The PE equations for different upsampling intervals and kernel sizes are given in Table A1.
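Assuming the three companions of Eq. (12) follow the same !TF/TF and LC/!LC pattern (Fr2 = TF && LC, Fr3 = !TF && !LC, Fr4 = TF && !LC), the read selects can be modelled and checked for mutual exclusion:

```python
def read_selects(tf, lc):
    """FIFO read selects; Eq. (12) is from the text, the other three are the
    assumed companions following the same !TF/TF and LC/!LC pattern."""
    fr1 = (not tf) and lc              # Eq. (12): Fr1 = !TF && LC
    fr2 = tf and lc                    # assumed: Fr2 = TF && LC
    fr3 = (not tf) and (not lc)        # assumed: Fr3 = !TF && !LC
    fr4 = tf and (not lc)              # assumed: Fr4 = TF && !LC
    return fr1, fr2, fr3, fr4

# exactly one FIFO is read-enabled for each (TF, LC) phase
for tf in (False, True):
    for lc in (False, True):
        assert sum(read_selects(tf, lc)) == 1
```

Under these assumptions, LC pairs the FIFOs by output row and TF interleaves within a row, so exactly one FIFO drives the output stream per clock.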

DESIGN OF EXPERIMENTS
The proposed deconvolution accelerator was implemented on the Xilinx XC7Z020 FPGA using the Hardware Description Language Verilog. The behavioural and structural models were analyzed, simulated and synthesized using Xilinx Vivado 2017.4. For the experiments, we chose kernels of size 3 × 3, 5 × 5 and 7 × 7; image resolutions of 32 × 32, 64 × 64 and 128 × 128; and a clock frequency of 200 MHz.

Kernel bit width
At the positive edge of the clock, the deconvolution accelerator receives a stream of 8-bit pixels which propagates through the shift register and PEs. The inputs are multiplied with the corresponding kernel coefficients and the results are stored in the FIFOs. For hardware implementations, fixed point is the natural choice of data representation due to its simplicity and lower usage of hardware resources. Thus, the floating-point kernel coefficients are converted to fixed point using a scaling factor of 2^f and expressed as (f + 1)-bit values within the FPGA. The optimum f is chosen by comparing metrics such as the Root Mean Square Error (RMSE) and the Peak Signal-to-Noise Ratio (PSNR) for different choices of 2^f against the corresponding IEEE double-precision output. Table 4 illustrates the results, where the kernel coefficients were drawn from a distribution over the range −1 to +1 using the Keras tool (He et al., 2015). Initially, when f = 7, 8 and 9, the RMSE is high, but as the precision (bit width of the kernel) increases, the PSNR improves and the RMSE drops, suggesting that the fixed-point calculations are comparable to their floating-point counterparts. A scaling factor of 2^11 gives an acceptable PSNR of 78.52 dB (Rao et al., 1990) with a low RMSE of 0.0303, indicating that the fixed-point result is close to the IEEE double-precision one. Increasing the bit width above 12 resulted in no significant improvement in PSNR and therefore the kernel bit width was set to 12 bits (f = 11 plus 1 sign bit). For example, a kernel value of (0.13250548)10 is first multiplied by 2048 (2^11) and the result (271.37122304)10 is rounded to (271)10; its 11-bit fixed-point representation along with 1 sign bit, (000100001111)2, is used to represent the filter coefficient.
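The quantisation step can be reproduced directly; the example below reworks the 0.13250548 kernel value from the text:

```python
F = 11                                  # fractional bits; 1 sign bit gives 12-bit words

def quantise(w):
    """Scale a kernel weight from [-1, 1) by 2^F and round to the nearest integer."""
    return round(w * (1 << F))

q = quantise(0.13250548)                # 0.13250548 * 2048 = 271.371... -> 271
print(q, format(q, '012b'))             # 271 000100001111
```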

PEs output bit width
To illustrate that the deconvolution architecture produces upsampled outputs with considerable accuracy, we compare the upsampled outputs at different upsampling intervals (from 32 × 32 to 256 × 256) with the corresponding MATLAB outputs. For a realistic comparison, an image with a flat Power Spectral Density (PSD) (e.g., white noise) was chosen as the input, and PSNR and RMSE were used as the evaluation metrics. Based on the experimental results of the previous section, the input and kernel bit widths were set to 10 bits and 12 bits, respectively. The output bit width of the PEs was varied between 8 and 12 bits and the upsampled results of the deconvolution accelerator were compared with the corresponding MATLAB outputs. Table 5 shows the results, from which it can be inferred that a 10-bit output is sufficient since the PSNR averages more than 58 dB across all upsampling intervals. Increasing the bit width further produced no significant increase in PSNR but considerably increased the hardware usage; therefore, the choice of 10-bit upsampled outputs is reasonable. With the kernel and input widths set to 12 bits and 8 bits, the computation within the PEs (which includes both multiplication and addition) produces upsampled outputs of at most 22 bits; the upsampled elements are therefore shifted right by 11 bits, and the 9 most significant bits (MSBs) together with the sign bit are stored in the respective FIFOs. The shift operation compensates for the earlier 2^11 scaling of the kernel coefficients.
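The comparison metrics are standard; the helper pair below (our own code, assuming an 8-bit peak value of 255 in the PSNR definition) shows how an RMSE of 0.0303 corresponds to the reported PSNR of roughly 78.5 dB:

```python
import math

def rmse(ref, out):
    """Root mean square error between reference and test sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ref, out)) / len(ref))

def psnr(ref, out, peak=255.0):
    """PSNR in dB, assuming an 8-bit peak value of 255."""
    e = rmse(ref, out)
    return float('inf') if e == 0 else 20 * math.log10(peak / e)

# an RMSE of 0.0303 against a 255 peak corresponds to roughly 78.5 dB
print(round(psnr([0.0], [0.0303]), 2))
```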

Comparison of upsampled results for different kernel sizes obtained from a trained U-Net model
We compare the outputs of the deconvolution accelerator with the MATLAB versions for various input sizes, using kernel coefficients obtained from a trained U-Net model and natural images obtained from various datasets. First, we upsampled a random image of size 32 × 32 to resolutions 64 × 64, 128 × 128 and 256 × 256 using a 3 × 3 kernel with maximum and minimum values of 0.7219356 and −0.64444816. The kernel coefficients obtained from the corresponding decoder framework of the U-Net are stored in a register in 12-bit fixed-point representation (as explained in 'Kernel bit width') and the upsampled results of the previous stage are provided as inputs to the current stage. Figure 7A illustrates the upsampled images at each stage of the pipeline (32 to 256). Tables 6 and 7 respectively show the corresponding performance scores and resource usage. Furthermore, Table 8 reports the resource usage of individual deconvolution units employing 3 × 3 kernels. Next, the cameraman and natural images are examined over similar interpolation intervals.
To illustrate that the proposed model can be extended to different kernel sizes, we also compare the results against those of Chang, Kang & Kang (2020). Thus the 10-bit deconvolution accelerator indeed produces upsampled outputs comparable to the MATLAB results.

ANALYSIS OF THE DECONVOLUTION ACCELERATOR

Computation time of single Deconvolution Accelerator
The total computation time (Ttotal) required, in terms of clock cycles, for upsampling is given by
Ttotal = θ + TCT, (16)
where TCT is the time required to obtain 2n × 2n samples from an n × n input and θ denotes the delay before a valid sample is available at the output of the PEs. TCT is obtained as follows:
1. To compute Row1 of the 2n × 2n output, PE1 and PE2 execute in parallel n times.
2. To compute Row2n of the 2n × 2n output, PE3 and PE4 execute in parallel n times.
3. To compute rows Row2 to Row2n−1 of the 2n × 2n output, PE1, PE2, PE3 and PE4 operate in parallel in batches, represented by N, with each batch executing n times.
Therefore
TCT = (2 + N) × n, (17)
where n denotes the input size and N is given by
N = (2n − 2)/2. (18)
The denominator indicates that 2 rows of the 2n × 2n output are computed when all the PEs execute in parallel. The initial delay θ depends on k and is given by
θ = ⌈k/2⌉, (19)
where ⌈·⌉ denotes the ceiling operation. Figure 8 illustrates Ttotal and Table 9 tabulates θ, TCT and Ttotal for different upsampling intervals and kernels. Thus, using the 3 × 3 kernel to upsample 4 × 4 to 8 × 8 (substitute k = 3 in Eq. (19)), the first valid result at the output of the PEs (PE1 and PE2) is obtained after a delay of two clock cycles (i.e., θ = 2). Subsequently, PE1 and PE2 execute 4 times in parallel to compute the samples of Row1. For Row2 to Row7, all the PEs execute 4 times in parallel in 3 pipelined batches (N = 3, as computed using Eq. (18)). Finally, for Row8, PE3 and PE4 again execute 4 times in parallel. The time-to-read (TR) the upsampled elements is 2n × 2n cycles for an n × n input, since the upsampled elements are streamed in a systolic manner (1 output per clock cycle) with reference to the common clock.
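Taking the forms reconstructed here for Eqs. (16) to (19), i.e., Ttotal = θ + TCT, TCT = (2 + N)n, N = (2n − 2)/2 and θ = ⌈k/2⌉, the cycle counts can be tabulated in a few lines:

```python
import math

def theta(k):
    """Initial PE latency (Eq. (19)): ceil(k / 2)."""
    return math.ceil(k / 2)

def n_batches(n):
    """Pipelined batches for the middle rows (Eq. (18)): (2n - 2) / 2."""
    return (2 * n - 2) // 2

def t_ct(n):
    """Cycles to compute all 2n x 2n samples (Eq. (17)): first row, last row, N batches."""
    return (2 + n_batches(n)) * n

def t_total(n, k):
    """Total cycles including the initial latency (Eq. (16))."""
    return theta(k) + t_ct(n)

# the 4x4 -> 8x8, 3x3-kernel example from the text
print(theta(3), n_batches(4), t_ct(4), t_total(4, 3))
```

For the worked example this reproduces θ = 2 and N = 3 as stated in the text.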

Computation time for the Pipelined architecture
The DCM allows separate read and write control of the FIFOs, and thus the upsampled elements of the deconvolution accelerator can be readily streamed to the next stages: 2n × 2n to 4n × 4n, 4n × 4n to 8n × 8n and so on, representing a pipelined architecture similar to the decoder module of the U-Net. The computation time of the pipelined (T_P) deconvolution framework is given by

T_P = D × (β + DCM_delay) + T_R, (20)

where D denotes the number of upsampling intervals, T_R (time-to-read) is T_R = (2^D × n)^2, DCM_delay = 3, and β is the delay before the read signal (Read) is asserted (refer to 'DCM of 4 × 4 to 8 × 8 deconvolution architecture'). To upsample 32 × 32 to 256 × 256 using K_5×5, T_P is computed by substituting D = 3, β + DCM_delay = 8 (β = θ + FIFO_delay; refer to Table 9 for θ, and to 'DCM of 4 × 4 to 8 × 8 deconvolution architecture' for FIFO_delay and DCM_delay) and T_R = 65536 cycles ((2^3 × 32)^2) in Eq. (20). Thus, T_P = 65560 clock cycles (3 × 8 + (2^3 × 32)^2). Furthermore, if a clock frequency of 50 MHz is considered, the T_P of the three-stage pipelined deconvolution module upsampling 32 × 32 to 256 × 256 is 1311.2 µs (65560 × 0.02 µs), achieving a frame rate of approximately 763 fps (frames per second). Figure 8 illustrates T_P for a two-stage pipelined deconvolution framework (n × n to 4n × 4n).
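The pipelined latency computation above can be checked numerically (a sketch; the parameter names are ours):

```python
def pipeline_cycles(n, D, beta, dcm_delay=3):
    """T_P of Eq. (20): D stage start-up delays plus the streamed read-out."""
    T_R = (2 ** D * n) ** 2          # one upsampled sample per clock cycle
    return D * (beta + dcm_delay) + T_R

# 32x32 -> 256x256 over D = 3 stages with beta + DCM delay = 8 per stage.
cycles = pipeline_cycles(n=32, D=3, beta=5)
print(cycles)            # 65560
print(cycles * 0.02)     # latency in microseconds at 50 MHz (0.02 us/cycle)
```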

Comparison of computation complexity of the proposed architecture with other deconvolution architectures
The total number of operations (multiplications and additions) required to complete the upsampling process represents the computational complexity of the model. For the proposed architecture, the numbers of multiplications OP_mul and additions OP_add required to upsample n × n to 2n × 2n using a k × k kernel are given by the corresponding equations, and the total operations OP_total is their sum. Table 10 shows OP_mul, OP_add and OP_total for various upsampling intervals and kernel sizes. When compared with existing architectures (refer to Table 10), where the total operations are computed as k²n² + 2k(k − s)(n − s) + (k² − s²)(n − 2)² for Liu et al. (2018) and (2k² − 1) × n² for Zhang et al. (2017) and Yan et al. (2018), the proposed deconvolution architecture reduces the required operations by a maximum of 20%. We attribute this reduction to the pipelined structure of the architecture, which executes either 2 or 4 PEs in parallel per clock cycle to produce the interpolated outputs. The throughput in GOPS is computed from OP_total, the required clock cycles and the operating frequency Freq. From Table 11, it is evident that the proposed architecture uses fewer operations and therefore fewer hardware resources to upsample. Furthermore, the proposed architecture achieves its best resource efficiency of 0.309 GOPS/DSP at 200 MHz, and requires the fewest clock cycles to upsample a 128 × 128 input to 256 × 256 across all considered architectures. We attribute the improvement to the hardware design, which reduces the number of operations, yielding a maximum operation saving of 23% (comparing against the OP_total of Di et al. (2020)) and hence lower usage of hardware resources. Furthermore, the proposed deconvolution accelerator achieves GOPS = 3.641 and GOPS/DSP = 0.135 for the pipelined 32 × 32 to 256 × 256 architecture.

Extension of the proposed Deconvolution Accelerator
Although traditional U-Nets are based on 3 × 3 kernels (Shvets & Iglovikov, 2018), a few architectures employ either 5 × 5 (Chang, Kang & Kang, 2020) or 7 × 7 (Badrinarayanan, Kendall & Cipolla, 2017) kernels in their encoder-decoder pipeline. Thus, to allow reusability of the architecture, we present in Table A1 equations for different upsampling intervals for 3 × 3, 5 × 5 and 7 × 7 kernels. The number of PEs is the same, but the lengths of the SR module and the FIFOs differ (refer to Table 3). Thus, by rewiring the inputs to the PEs, different upsampling intervals using different kernel sizes are obtained. Figure 9A illustrates the upsampling pipeline in which a 32 × 32 random input is upsampled to a 256 × 256 output on a ZYNQ AX7020 FPGA board. Here, to avoid computational overheads, the 8-bit 32 × 32 input was initialized in ROM and systolically presented to the deconvolution accelerator pipeline, as shown in Fig. 9B. The upsampling result of each layer (64 × 64 and 128 × 128), along with the final 256 × 256 output, is shown on the display screen (Fig. 9A). The complete upsampling pipeline required approximately 1311 µs when executed at a 50 MHz clock frequency. Xilinx IP cores, namely Block ROM (https://docs.xilinx.com/v/u/Yy8V_830YccMjYlS44XWXQ) and the RGB to DVI Video Encoder (employing an HDMI interface) (https://www.xilinx.com/support/documentation/application_notes/xapp495_S6TMDS_Video_Interface.pdf), were used for initialization of the inputs and display of the upsampled outputs.

CONCLUSION
We present an FSC-based systolic deconvolution architecture capable of upsampling an n × n input to a 2n × 2n output using a k × k kernel. The standalone (128 × 128 to 256 × 256) and pipelined (32 × 32 to 256 × 256) versions, implemented with 3 × 3 kernels on a Xilinx XC7Z020 platform, achieved performances of 2.781 GOPS and 3.641 GOPS with resource efficiencies of 0.309 GOPS/DSP and 0.135 GOPS/DSP, respectively. When compared with other deconvolution architectures, the proposed architecture requires the least number of operations (a saving of 23%), which results in lower usage of hardware. Furthermore, the high PSNR value demonstrates that the 10-bit upsampled results of the deconvolution accelerator are comparable to IEEE double-precision outputs. In addition, the proposed architecture is highly scalable (the lengths of the FIFOs and SR module change but the number of PEs remains the same) to suit different upsampling intervals.

Table A1 Appendix: Equations for extending the deconvolution accelerator to different upsampling intervals (n × n to 2n × 2n) based on different kernel sizes.