Acceleration of convolutional neural network based diabetic retinopathy diagnosis system on field programmable gate array

ABSTRACT


INTRODUCTION
Diabetic retinopathy (DR) is a medical condition that causes vision loss due to elevated blood glucose levels. At least 783 million people are predicted to have diabetes by 2045 [1]. Elevated blood glucose causes a hormonal imbalance that produces eye fatigue; when blood pressure and glucose levels are high, the capillaries and veins in the retina are damaged, impeding blood flow and resulting in DR. It can lead to blindness if treatment is not received in time. According to medical specialists, early detection of DR helps avoid severe vision loss [2]. DR should therefore be monitored on a regular basis; a basic checkup could be performed twice a year. However, managing this condition is a time-consuming treatment process that relies on monitoring blood glucose levels.
The traditional technique of assessment and analysis to discover morphological abnormalities in the eyes makes the diagnostic procedure complicated and time consuming. An ophthalmologist examines the patient's color fundus images and then assesses the patient's condition. Most of the time, patients are not treated on time due to a lack of facilities. When DR is detected early, its progression can be controlled, preventing irreparable vision loss. Manual diagnosis of this condition can often result in misdiagnoses. Automated techniques allow early DR detection, which has a huge public health benefit in terms of preventing blindness. A fundus camera is commonly used in automated systems, connected to another device that runs the diagnosis algorithm; these devices can be personal computers (PCs) or smartphones (based on central processing units (CPUs) or graphics processing units (GPUs)). By integrating the algorithm within ophthalmic imaging equipment (using a field programmable gate array (FPGA) system on chip (SoC)), we present a higher-performance embedded DR diagnosis system. Figure 1 presents the three different approaches for DR diagnosis. Deep neural networks (DNNs), especially convolutional neural networks (CNNs), have established themselves as the most popular strategy in the current scenario, with higher performance in a variety of domains, surpassing traditional machine learning approaches particularly in image analysis and processing [3]. This task may now be completed quickly and easily using advanced deep learning (DL) techniques such as CNNs in a hardware-based solution. Nevertheless, the implementation of CNNs on hardware platforms remains challenging. CNNs usually impose large demands on computation and memory consumption, given that the number of operations and parameters increases with the complexity of the model architecture. Therefore, performance strongly depends on the resources of the hardware target. This growing complexity of CNNs has made researchers consider optimization strategies to make them more hardware friendly and design hardware accelerators to enhance their performance on embedded systems.
In the first part of this study, we used transfer learning to design our model. We preprocessed the input data, color fundus photographs, to reduce undesirable noise in the images. No DR, mild DR, moderate DR, severe DR, and proliferative DR are the five DR stages detected by the model. Extensive experiments and comparisons with other research work show that the proposed method is effective.
In the second part, to obtain a medical diagnostic device for DR, we implemented the designed CNN model on an embedded platform while applying optimization techniques. In addition, we accelerated our inference model on the FPGA platform using a hardware intellectual property (IP) core. To further demonstrate the advantages of such a hardware architecture, we also implemented our model on a Google GPU and a Google tensor processing unit (TPU).
CNNs have widely proven efficient in a large number of applications related to computer vision [4], [5] and medicine [6], [7]. CNNs have been used in recent studies to automate the detection and classification of DR [8]. Depending on the classification method employed, these studies are categorized as binary classification to detect DR and multi-level classification to assess the proper stage.
Xu et al. [9] used a tiny CNN architecture to automatically classify the color fundus images in the Kaggle dataset as normal or DR images, with an accuracy of 94.5%. Hajabdollahi et al. [10] modified the original visual geometry group 16 (VGG16) network and used it for DR binary classification; the model achieved 93.89% accuracy. Furthermore, they proposed a hierarchical pruning strategy to simplify the CNN structure, resulting in a 1.89% accuracy loss.
Kwasigroch et al. [11] offered an automatic detection approach for DR based on a CNN and demonstrated that CNNs may be successfully used to handle this kind of difficult task. The obtained model was tested on two tasks based on the examination of retinal images: the first is the detection of DR and the second is the determination of its stage. The model had an accuracy of around 82% in diagnosing DR and 51% in determining its stage. Shaban et al. [12] proposed a deep CNN architecture that categorizes DR into three classes: no DR, moderate DR (patients with mild or moderate non-proliferative diabetic retinopathy (NPDR)), and severe DR (patients with severe NPDR or late-stage proliferative diabetic retinopathy (PDR)), with an accuracy of 89%. Automated analysis of retinal color images has advantages such as increased reliability and coverage of screening programs, reduced barriers to access, and earlier detection and treatment. In the domain of retinal image analysis, all top solutions used CNNs to identify signs of DR in retinal images.
From the hardware perspective, Ghani et al. [13] presented an automated retinal fundus image detection approach for glaucoma and DR based on an artificial neural network (ANN). The proposed tiny classifier was implemented on the Artix-7 FPGA platform, yielding high performance (100% accuracy and nearly 400 ns of latency). Washburn et al. [14] used a fuzzy c-means clustering segmentation method based on the RAPIDS compute unified device architecture (CUDA) machine learning libraries (cuML) for the processing and segmentation of input retinal images. The authors proposed a hardware-based system to help ophthalmologists with the mass screening of DR. On the NVIDIA Jetson, the system performed high-speed detection of lesions on input retinal images (300 ns latency).
Pritha et al. [15] implemented AlexNet for retinal disease detection on an FPGA. The proposed CNN model categorizes retinal images into four classes (optical disk cartridge (ODC), diabetic nephropathy (DN), central serous retinopathy (CSR), and normal image) in 2 seconds. Compared to software implementations of DR detection solutions, the results on hardware platforms demonstrate enhanced performance in terms of latency and energy efficiency, with the flexibility of an embedded system. In our work, we implemented a large CNN model with multi-level classification to accurately identify the proper stages of DR with high performance on different platforms. This paper is structured as follows: the proposed approach and the adopted method with different experiments are detailed in section 2. Results are evaluated in section 3. Finally, a summary of this research output and potential future work is presented in section 4.

METHOD
The proposed pipeline for designing, implementing, and accelerating a DR system is shown in Figure 2. First, a CNN model dedicated to DR detection has been designed, validated, and trained. In this step, the ResNet50 structure has been updated, and the pre-trained network parameters are fine-tuned using preprocessed color fundus photography as input data. Then, optimization techniques have been applied to the model to make it more efficient and hardware-friendly. Finally, it has been deployed and accelerated on the Zynq FPGA using the Xilinx data processing unit (DPU). The different steps of the proposed approach are presented in the following subsections.

Retinal images are acquired using different imaging conditions and equipment, which results in very mixed image quality. Subtle signs of retinopathy at an early stage can be easily masked in a low-contrast, blurred, or low-resolution image. Analysis of a low-quality image may produce unreliable results, with the system labeling an image as normal while lesions are present. That is why image quality is a very important factor for the sensitivity of a DR detection system.
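As a concrete illustration of the kind of normalization that preprocessing can apply, the sketch below performs a simple min-max contrast stretch on a grayscale image in plain Python. This is a hypothetical example of one common step; it is not a reproduction of the paper's actual preprocessing pipeline, and `contrast_stretch` is an illustrative helper name.

```python
# Illustrative min-max contrast stretch: remap pixel intensities so the
# darkest pixel becomes `lo` and the brightest becomes `hi`. A hedged
# sketch only; the paper's exact preprocessing is not reproduced here.

def contrast_stretch(img, lo=0, hi=255):
    flat = [p for row in img for p in row]
    mn, mx = min(flat), max(flat)
    if mx == mn:
        # Flat image: no contrast to stretch.
        return [[lo for _ in row] for row in img]
    scale = (hi - lo) / (mx - mn)
    return [[round(lo + (p - mn) * scale) for p in row] for row in img]

img = [[50, 100], [150, 200]]
print(contrast_stretch(img))  # [[0, 85], [170, 255]]
```

A low-contrast fundus image whose intensities occupy only part of the dynamic range is expanded to the full range, making subtle lesions easier for both clinicians and models to distinguish.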
For our application, we used a dataset consisting of a large set of high-resolution retina images provided by Kaggle [16]. A clinician has rated each image on a scale of 0 to 4 according to the international clinical diabetic retinopathy (ICDR) severity scale: i) 0 is no DR; ii) 1 is mild DR; iii) 2 is moderate DR; iv) 3 is severe DR; and v) 4 is proliferative DR. Figure 3 shows a representative sample of each class. We used 3,662 images for training and 1,992 images for testing. The training data is distributed over this scale as shown in Figure 4.

Model architecture
With the advances in CNNs, it has become possible to design and train models to automatically discover subtle local features without the need for manual annotation of individual lesions. The neural network used in this work is a CNN with a deep layered structure that combines nearby pixels into local features and then progressively aggregates those into global hierarchical features. The model does not explicitly detect lesions; instead, it learns to recognize them based on local features. It was trained on a classification task and produces a score across the five classes described above, which indicates the presence and severity of DR.
The model architecture used in this work originates from the ResNet50 network, since it is highly reliable when processing microscopic and complicated images and it solves the vanishing gradient problem. ResNet's central concept is to introduce an "identity shortcut connection" that skips one or more layers to minimize information loss (Figure 5). Each basic block is a sequence of Conv2D layers, each followed by a batch normalization layer and a ReLU activation function, with an identity shortcut added at the end.
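The identity shortcut can be sketched in a few lines. The toy block below uses small dense (linear) transforms in plain Python in place of the Conv2D/batch-norm layers of the real ResNet50, so this is an illustration of the residual connection only, not the paper's network: the block computes F(x) and adds the unmodified input x back before the final activation.

```python
# Toy residual block: output = ReLU(F(x) + x), where F is a stand-in for
# the Conv2D + batch-norm stack of a real ResNet basic block.

def relu(v):
    return [max(0.0, x) for x in v]

def linear(W, v):
    # W is a list of weight rows; v is the input vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def residual_block(x, W1, W2):
    """F(x) = W2 @ relu(W1 @ x); identity shortcut adds x back."""
    f = linear(W2, relu(linear(W1, x)))
    return relu([fi + xi for fi, xi in zip(f, x)])

# With all-zero weights F(x) = 0, so the block reduces to relu(x):
# the shortcut lets information pass through untouched, which is what
# mitigates vanishing gradients in very deep networks.
x = [1.0, -2.0, 3.0]
W_zero = [[0.0] * 3 for _ in range(3)]
print(residual_block(x, W_zero, W_zero))  # [1.0, 0.0, 3.0]
```

The zero-weight case makes the design rationale visible: even when a block's learned transform contributes nothing, the input still flows through the shortcut, so stacking many blocks cannot erase the signal.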

Transfer learning
Transfer learning is a methodology in which a model that has been trained for one task is repurposed for another. Recently, it has gradually become a prominent topic in both academia and industry [17], [18]. Transfer learning focuses on transferring knowledge (features and weights) across domains. Most classification applications based on CNNs make increasing use of the transfer learning technique.
The main idea is to pre-train a model on a big dataset and then fine-tune it on a specific target dataset. The parameters obtained from the first model are used as the initialization of the second model. Yosinski et al. [19] demonstrated that transferred parameter initialization is better than random parameter initialization. Since then, transfer learning has been applied to a variety of tasks in different fields. In addition to providing the capability to reuse already constructed models, transfer learning has several benefits: it speeds up training and can also result in a more accurate and effective model in most cases. In our study, the transfer learning technique was used to speed up training and improve model performance. A ResNet50 model pre-trained on the large ImageNet dataset is used as the starting point and then refined on our Kaggle dataset of retina images.
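The pretrain-then-fine-tune idea can be shown numerically. In the toy sketch below, a frozen matrix stands in for "pre-trained" feature-extraction layers, and only a small logistic-regression head is trained on a made-up target dataset. All weights and data here are invented for illustration; the paper fine-tunes an ImageNet-pretrained ResNet50 on fundus images, not this toy.

```python
# Toy transfer learning: frozen "pre-trained" features + trainable head.
import math

# Stand-in for pre-trained weights; never updated during fine-tuning.
W_FROZEN = [[0.5, -0.2, 0.1, 0.0],
            [0.0, 0.3, -0.1, 0.4],
            [0.2, 0.0, 0.6, -0.3]]

def features(x):
    # Frozen feature extractor (linear + ReLU), applied but not trained.
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W_FROZEN]

# Trainable head: logistic regression on top of the frozen features.
w_head, b_head = [0.0, 0.0, 0.0], 0.0

def predict(x):
    z = sum(w * f for w, f in zip(w_head, features(x))) + b_head
    return 1.0 / (1.0 + math.exp(-z))

# Tiny made-up target dataset: (input, binary label).
data = [([1, 0, 0, 0], 1), ([0, 1, 0, 0], 0),
        ([0, 0, 1, 0], 1), ([0, 0, 0, 1], 0)]

for _ in range(300):                 # fine-tune the head only
    for x, y in data:
        g = predict(x) - y           # d(log-loss)/d(logit)
        f = features(x)
        for j in range(len(w_head)):
            w_head[j] -= 0.5 * g * f[j]
        b_head -= 0.5 * g

print([round(predict(x)) for x, _ in data])  # [1, 0, 1, 0]
```

Only the head's few parameters are updated, which is why fine-tuning converges far faster than training a full network from random initialization; in practice some or all backbone layers may also be unfrozen at a lower learning rate.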

Hardware acceleration
After testing and validating the optimized CNN inference on a CPU, we move to a hardware implementation on FPGA in search of more speedup and energy efficiency. Indeed, we concentrate on accelerating our CNN inference on a heterogeneous FPGA, including embedded cores, using a specifically designed hardware accelerator. Such heterogeneous SoCs provide effective platforms for embedded, complex CNN-based applications. We are mainly interested in implementing the CNN inference on the Xilinx Zynq FPGA using the DL DPU. Its architecture exploits the cooperation between the processing system (PS) and the programmable logic (PL) in Xilinx Zynq devices. The Xilinx deep neural network development kit (DNNDK) [20] is also explored to significantly simplify the implementation of optimized CNN models by automatically mapping them onto FPGA platforms. This tool can accelerate the mapping of the CNN inference phase onto Xilinx hybrid CPU-FPGA SoCs through an easy-to-use C/C++ programming interface.

Development board
Experiments have been carried out on the Xilinx ZedBoard, illustrated in Figure 6. It is based on the Xilinx Zynq-7000 All Programmable SoC. The Zynq architecture provides a PL (FPGA), a PS, and a large number of communication interfaces. A dual-core advanced RISC machine (ARM) Cortex-A9 processor with a variety of peripheral connectivity interfaces is included in the PS. The ZedBoard has proven efficient for rapid prototyping and proof-of-concept development [21].

Data processing unit overview
Xilinx released the DPU [22] as an IP core to improve CNN inference performance on hybrid FPGAs. It is a programmable engine optimized for CNNs and their demanding computing functions. It uses a specialized instruction set and supports a great variety of CNNs. The DPU IP can be implemented in the PL as a co-processor connected directly through an AXI interconnect to the main multicore ARM processor of Zynq-7000 SoC or Zynq UltraScale+ MPSoC devices. The CNN model and its massive computations are executed in parallel in the PL. Layer functions not supported for some CNN models are traditionally handled by the processor. A high-performance scheduler, an instruction fetch unit, a hybrid computing array, and a global memory pool are all part of the DPU. Figure 7 presents the hardware architecture of the DPU. The DPU relies on quantization and a high clock frequency in the digital signal processor (DSP) slices. Quantization is required and can be applied using the DNNDK tool. DSP slices are clocked at a very high frequency using a DSP double data rate (DDR) technique [22], which uses a 2× frequency domain to increase peak performance. The DPU requires instructions to implement the model; these instructions are generated by the deep neural network compiler (DNNC), where optimizations are made.
To control the operation of the computing engine, the DPU fetches these instructions from off-chip memory. For high performance, data is buffered in on-chip memory and reused as much as possible to save memory bandwidth. A deep pipeline design is used for the computing engine. The processing elements (PEs) in Xilinx devices make full use of fine-grained building blocks, including multipliers, adders, and accumulators.

Design and implementation flow
The deployment of a DNN onto Xilinx DPU platforms requires several steps, as illustrated in Figure 8. In this part, the Xilinx Vivado design flow has been followed. To configure the PL part of the Zynq-7000 and integrate the DPU IP, the block design is created first. We mainly need the PS and the DPU IP; other IPs are instantiated to meet the hardware requirements for the proper functioning of the system. Some configuration of the DPU is required in order to match the CNN architecture.
After connecting all the IP components, the VHSIC hardware description language (VHDL) files are created so that the Vivado synthesizer can interpret and manage the block design. After the implementation phase, the design is completely routed on the FPGA. A bitstream is generated and exported to be used as the base hardware for software development. The AXI4 interconnect serves as a vital communication link between the PS and the PL for bidirectional data transfer. The PS accesses the input images stored on the SD card and then feeds them to the DPU during inference; data can be visualized on external displays connected directly to the PS (via universal asynchronous receiver transmitter (UART) and secure shell (SSH)).

In the second part, we used PetaLinux to generate a custom Linux image that incorporates the DL aspect of the project and the Xilinx DNNDK tool. DNNDK is an integrated framework that offers a complete range of toolchains for inference with the DPU, including compression, compilation, deployment, and profiling. Among the components of DNNDK are the deep compression tool (DECENT) and the DNNC. The quantizer DECENT is used for model quantization with INT8 precision. The compiler DNNC generates the DPU instructions related to the implementation of the model and memory access. This results in an executable and linkable format (ELF) file, which is then compiled together with other custom C/C++ program instructions that control the other tasks of the DL algorithm, such as image loading, visualization, and other preprocessing tasks. The CPU and DPU ELF files are then compiled into a single file that is loaded onto the board. Then, the CNN inference tasks are initialized in the ARM processor and assigned to the DPU. Figure 8 presents the flow for hardware and software implementation of a CNN on FPGA using the DPU IP.
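The effect of INT8 quantization as performed by a quantizer like DECENT can be illustrated with a minimal sketch: real-valued weights are mapped to 8-bit integer codes with a per-tensor scale factor and recovered approximately by multiplying back. This is a generic symmetric-quantization illustration, not the DECENT algorithm itself.

```python
# Hedged sketch of symmetric per-tensor INT8 quantization: weights are
# encoded as integers in [-128, 127] plus one float scale, cutting
# weight storage to 8 bits and enabling integer arithmetic on the DPU.

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.42, -1.27, 0.003, 0.9]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
print(q)                                          # [42, -127, 0, 90]
print(max(abs(a - b) for a, b in zip(w, w_hat)))  # rounding error <= scale / 2
```

The worst-case per-weight error is half the quantization step, which is why a well-scaled INT8 model typically loses only a small amount of accuracy relative to its float32 original, as observed for the deployed model here.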

Data processing unit configuration and hardware resource utilization
In this section, we evaluate different configurations of the DPU to provide a more comprehensive analysis. A variety of DPU architectures are available and can be easily adapted to different embedded boards. The DPU can be configured with a variety of convolution architectures based on the parallelism of the convolution unit; the DPU IP architectures range from B512 to B4096 [22]. The DPU convolution architecture has three dimensions of parallelism: pixel parallelism, input channel parallelism, and output channel parallelism. Each architecture requires different programmable logic resources, and larger architectures with more resources can achieve higher efficiency. The B4096 architecture, for example, makes use of all available hardware resources in terms of block RAMs (BRAMs) and DSPs, resulting in a peak performance of 4,096 operations per cycle. Some configurable parameters in the DPU IP can be used to optimize resource utilization and customize features. Depending on the amount of programmable logic resources available, different configurations for DSP and BRAM utilization can be selected. In addition, a single DPU IP may have a maximum of three cores. Multiple DPU cores can be used to achieve higher efficiency, at the cost of consuming more programmable logic resources.
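Since the "B" number names operations per clock cycle, peak throughput for each architecture follows from a one-line calculation. The 100 MHz clock below is an illustrative assumption for the sake of the arithmetic, not a figure reported in this paper (actual DPU clock rates depend on the device and configuration).

```python
# Back-of-the-envelope peak throughput for the DPU "B" architectures:
# the name encodes ops/cycle (B512 = 512, ..., B4096 = 4096), so
# peak GOPS = ops_per_cycle * clock. 100 MHz is an assumed clock.

def peak_gops(ops_per_cycle, clock_mhz):
    return ops_per_cycle * clock_mhz * 1e6 / 1e9  # giga-operations/second

for arch in (512, 800, 1024, 1152, 4096):
    print(f"B{arch}: {peak_gops(arch, 100):.1f} GOPS at 100 MHz")
```

This makes the resource/performance trade-off concrete: at the same assumed clock, B4096 offers eight times the peak throughput of B512, which is why it also consumes correspondingly more DSP and BRAM resources.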

Implementation on data processing unit
To implement a CNN inference on an FPGA using the DPU IP and Xilinx tools, we follow the steps presented in Figure 9. First, the model is trained and prepared on a PC using the TensorFlow (TF) framework. Second, the Xilinx DNNDK is used to quantize and compile the model and to generate the DPU instructions. Then, the CNN inference tasks are initialized and assigned to the DPU by the PS. Comparisons with other research studies are presented in Table 2 to better evaluate the performance. Our proposed model outperforms other works in terms of accuracy for multi-level classification. High accuracy is critical in practically all medical applications, especially in detecting DR and assessing its stage. A misdiagnosed DR can lead to a lack of therapy and further progression of the illness, which can cause blindness.

Table 2. Different model performance for DR classification
Method            | Classification type   | Accuracy (%)
Inception V3 [23] | Multi-level (5 class) | 63.23
VGG-D [11]        | Binary (DR/No DR)     | 81.70
VGG-D [11]        | Multi-level (5 class) | 51
CNN [24]          | Multi-level (5 class) | 74.04
CNN [12]          | Multi-level (3 class) | 89
VGG16 [10]        | Binary (DR/RDR)       | 93.80
ResNet50 (Ours)   | Multi-level (5 class) | 92.90

Furthermore, we ran our DR detection model on the Google Colaboratory GPU and TPU platforms. The GPU is an NVIDIA Tesla K80 with 2 CPU cores, 12 GB of RAM, and up to 4.1 teraflops (TFLOPS) of performance, while the TPU was a TPUv2 (8 cores) with an Intel Xeon (4 cores) CPU and 16 GB of RAM. As the results in Table 3 show, our model performed well on the Google GPU computational platform. Furthermore, the GPU performed better than the TPU in every case. This is because TPU setup takes some time when compiling the model and distributing the data across the clusters, so the first iteration always takes time.

Hardware acceleration
The inference phase of the DR detection system based on a complex CNN model has been implemented successfully on a limited-resource FPGA, the Xilinx Zynq-7000 SoC. In order to test and validate the impact of deploying the hardware Xilinx DPU accelerator on inference performance, the different DPU architectures supported on the board have been tested, looking for good performance with less hardware resource utilization. The best peak performance is achieved with the B1152 architecture. The CNN model with 8-bit quantization embedded in the Zynq-7000 predicts DR and evaluates its stage on a single input image in 26.2 ms while consuming only 3.237 watts. More details are presented in Table 4. With the B1024 DPU architecture, we achieved comparable performance with more area savings. However, results showed that using the B800 and B512 DPU types decreased performance by 50% (nearly the GPU performance).
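Two figures of merit follow directly from the reported numbers: the 26.2 ms latency corresponds to roughly 38 images per second, and combining latency with the 3.237 W power draw gives the energy spent per inference.

```python
# Derived throughput and energy-per-inference from the reported
# Zynq-7000 results (26.2 ms latency, 3.237 W power draw).

latency_s = 26.2e-3                    # seconds per image (reported)
power_w = 3.237                        # watts (reported)

throughput = 1.0 / latency_s           # images per second
energy_mj = power_w * latency_s * 1e3  # millijoules per inference

print(f"{throughput:.1f} images/s, {energy_mj:.1f} mJ per inference")
```

At roughly 38 images per second and under 85 mJ per image, the embedded system comfortably exceeds the needs of a screening workflow while staying within a battery-friendly power budget.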
Furthermore, the results presented in Table 5 show that our model inference on the hybrid FPGA device outperforms the software implementations on CPU (PC), Google GPU, and Google TPU, at the cost of a small loss in accuracy. The FPGA SoC has become a beneficial hardware device to replace the GPU for complex CNN implementations thanks to its powerful parallel processing and low power consumption. Hence, the hybrid Zynq-7000 FPGA device remains a viable candidate for implementing the complex CNN model used in the DR detection system. The obtained results show the efficiency of our system in detecting DR stages. Despite its high performance, our system cannot be directly compared to other related work, considering the differences in task, DL model, and hardware platform (Table 6).
Subsequently, the whole system's performance could be improved with some enhancements in the DPU configuration. Therefore, we tested the use of two DPU cores with the four possible DPU architectures (B1152, B800, B1024, and B512). We noticed that the use of hardware resources (DSPs and RAM) was doubled, which exceeds the resources available on our ZedBoard. The utilization of the DPU is limited by the computational capability and hardware resource capacity of the board. To improve performance and create a more powerful embedded DR detection system, this work can be further developed and implemented on a larger-resource FPGA SoC that supports the use of multiple DPU cores.

CONCLUSION
This paper has demonstrated the feasibility of obtaining an embedded DR diagnostic instrument by implementing a complex CNN model on an FPGA SoC to detect DR stages. Using color fundus photography as input data, a CNN model for the detection of DR and its stages depending on severity is presented. Our model performs well, with 92.9% accuracy for five-class classification on the Kaggle dataset. The model has been optimized and deployed on embedded platforms to obtain a medical diagnostic device. Our system can reduce the time of manual diagnosis and be more effective than automated solutions that use a fundus camera linked to AI models running on CPUs or GPUs to diagnose DR. The emphasis of this paper has also been on accelerating the complex and heavy CNN inference phase adapted for the DR application. In this work, the DPU accelerator has been used for the hardware (HW) implementation of our DR detection system on FPGA-based SoCs. The implementation process of our CNN model on the DPU has been studied and tested, and it shows that, thanks to the Xilinx DPU IP and related tools, the model can be optimized, deployed, and tested easily on the FPGA SoC. The possible DPU architectures and hardware resources have been investigated on the Zynq-7000 FPGA to improve the inference performance of our CNN model. Our embedded system, in a tiny smart SoC, outperforms the GPU in terms of latency and power consumption. This work can be further developed and deployed on a larger UltraScale FPGA SoC using multiple DPU cores in order to achieve better performance and obtain a more efficient embedded DR detection system.

Figure 1 .
Figure 1. Different DR diagnosis techniques

Figure 2 .
Figure 2. Pipeline of the proposed system for DR detection

Figure 3 .
Figure 3. Images from each of the five DR classes

Figure 6 .
Figure 6. ZedBoard [21]

Figure 8 .
Figure 8. Hardware and software implementation flow

Figure 9 .
Figure 9. Workflow for CNN implementation on DPU



Table 1 .
Different model performance for DR classification

Table 3 .
DR detection model performance summary

Table 5 .
Implementation results on Zynq-7000 FPGA and other platforms

Table 6 .
Performance comparison of similar implemented research work related to DR