Automatic zoning for retinopathy of prematurity with a key area location system

Retinopathy of prematurity (ROP) usually occurs in premature or low birth weight infants and has been an important cause of childhood blindness worldwide. Diagnosis and treatment of ROP are mainly based on stage, zone and disease, where the zone is more important than the stage for serious ROP. However, due to the great subjectivity and difference of ophthalmologists in the diagnosis of ROP zoning, it is challenging to achieve accurate and objective ROP zoning diagnosis. To address it, we propose a new key area location (KAL) system to achieve automatic and objective ROP zoning based on its definition, which consists of a key point location network and an object detection network. Firstly, to achieve the balance between real-time and high-accuracy, a lightweight residual heatmap network (LRH-Net) is designed to achieve the location of the optic disc (OD) and macular center, which transforms the location problem into a pixel-level regression problem based on the heatmap regression method and maximum likelihood estimation theory. In addition, to meet the needs of clinical accuracy and real-time detection, we use the one-stage object detection framework Yolov3 to achieve ROP lesion location. Finally, the experimental results have demonstrated that the proposed KAL system has achieved better performance on key point location (6.13 and 17.03 pixels error for OD and macular center location) and ROP lesion location (93.05% for AP50), and the ROP zoning results based on it have good consistency with the results manually labeled by clinicians, which can support clinical decision-making and help ophthalmologists correctly interpret ROP zoning, reducing subjective differences of diagnosis and increasing the interpretability of zoning results.


Introduction
Premature infants are a special group of newborns, and the smaller the gestational age and the lighter the birth weight, the more likely to be complicated with other diseases, the worse the prognosis, which is an important cause of neonatal death.It is worth noting that retinopathy of prematurity (ROP) is a leading cause of childhood blindness worldwide, accounting for about 6%∼18% of the causes of blindness in children [1][2][3][4], and it refers to the abnormal proliferation of retinal blood vessels in infants due to the incomplete development of retinal blood vessels, leading to retinal contraction, detachment and eventually blindness induced by some factors, which is a kind of ocular atrophy serious blinding eye disease [5].The multi-center study of early treatment for retinopathy of prematurity (ET-ROP) has found that 68% of newborns with birth weight less than 1251 g had ROP of different degrees [6], of which patients with mild ROP can naturally subside without treatment in general, while patients with severe ROP generally need treatment and may develop into permanent blindness [7,8].
In recent years, with the general establishment and improvement of neonatal intensive care units and the significant improvement of neonatal life care technology, more and more low birth weight premature infants survive, making the incidence of ROP on the rise [9].According to the statistics of the World Health Organization (WHO), in 2010 alone, there were about 184700 premature infants with different degrees of ROP worldwide, and about 32% of them needed treatment, of which only 40% received treatment and more than 10% of them would suffer from blindness or severe visual impairment [7].Visual impairment not only seriously affects the survival and quality of life of children, but also brings serious economic burden to families and society.Therefore, early diagnosis, effective prevention and timely treatment of ROP are very important.At present, the prevention and treatment mode of ROP at home and abroad is that ophthalmologists usually use wide-angle digital retinal imaging (RetCam) to carry out fundus examination on high-risk premature infants, and then diagnose, analyze and formulate appropriate treatment plans based on the international classification of ROP (ICROP) [10,11].ROP can be divided into three zones according to the location of ROP lesion defined in ICROP, which can reflect the severity of the disease and is considered to be more important than stage for severe ROP [12].Figure 1 shows some examples of ROP fundus image and Table 1 lists the detailed definition of Zone I to Zone III.In addition, many ophthalmologists have insufficient experience and strong subjectivity in the diagnosis process, resulting in high inconsistency of ROP zoning results.Therefore, it is of great significance to realize automatic ROP zoning recognition based on objective ROP zoning definition.The OD and macular are located in the green and blue boxes respectively, where the green "+" and blue "+" indicate the OD and macular center.The red box is the location of ROP lesion.

Table 1. Definition of Zone I to III of ROP
A circular area with a radius of twice the distance from the center of the optic disc to the fovea of the macula.

II
An annular area with a radius of the distance from the optic disc to the nasal serrated margin except Zone I.

III
The remaining crescent shaped areas outside Zone I and II vitreous.
With the rapid development of artificial intelligence (AI) technology and medical technology, more and more image processing methods have been applied to retinal image processing [13][14][15].As the most extensive branch of AI, the deep learning (DL) approaches based on convolutional neural network (CNN) has shown a good application prospect in the field of medical image analysis.In addition, to reduce the workload of ophthalmologists and improve the efficiency of ROP diagnosis, researchers have successively proposed many computer-aided ROP diagnosis systems based on DL approaches.For example, Worrall et al. realized plus disease detection based on the ImageNet pre-trained GoogLeNet and the approximate Bayesian posterior probability, which is the first fully automatic ROP detection system, and also the first attempt to use the deep convolutional neural network (DCNN) to diagnose plus disease [16].Mao et al. developed an automatic diagnosis and quantitative analysis system for plus disease based on vessel segmentation, and the diagnostic sensitivity and specificity were 95.1% and 97.8% respectively [17].In addition, Sivakumar et al. used semi-supervised learning framework to achieve the classification of normal and plus disease [18].In addition to the identification of plus disease, there are many studies on ROP screening and grading.For example, an automatic ROP detection system called "DeepROP" was proposed, which first diagnosed the image as normal or ROP for ROP identification, and then classified the ROP image as minor or severe for ROP grading [19].The sensitivity of the system for ROP identification and ROP grading was 96.64% and 88.46% respectively, indicating that although the prospective work of ROP identification was good, the accuracy of ROP grading needed to be improved.Another DL based automated diagnostic tool called "DIAROP" was proposed for the early diagnosis of ROP ("normal" and "ROP") [20].Similarly, Wang et al. and Huang et al. used transfer learning and some mainstream convolutional neural networks (Inception-V3, ResNet50, VGG19, etc.) to achieve ROP screening and grading [21,22], while Wu et al. developed a DL system to predict the incidence and severity of ROP according to the fundus images and clinical features [23].Although the above studies have achieved ROP screening and grading, they have not achieved ROP staging and zoning.Recently, some researchers have begun to focus on the studies of ROP staging and zoning.For example, Chen et al. used full convolutional network (FCN) and multi-instance learning (MIL) to realize ROP staging ("normal" and stage 1 to 4 ROP) and then based on this, they introduced attention mechanism to guide the MIL pool to focus on the ROP characteristics at different stages and further improve the staging performance [24,25] [29], and then based on this, they proposed an improved U-Net architecture, combining squeeze-and-excitation (SE) module and attention gates (AG) module to realize the segmentation of boundary line/ridge line and blood vessel in ROP fundus image [30], where the location of macular is based on experience and the detection of Zone III is limited [31,32].In addition, Vijay et al. proposed an intelligent system based on DL to automatically detect OD and retinal vessels to realize the identification of ROP Zone I [33].Although the research can realize the identification of ROP Zone I, it has two limitations: (1) It cannot identify the ROP of Zone II and Zone III.(2) The scope of Zone I is obtained through experience and is not the standard ROP zoning definition.Different from the above studies, we develop a KAL system to focus on the automatic ROP zoning strictly based on the definition of Zone I to III in this paper.Our main contributions are summarized as follows: (1) A lightweight residual heatmap network (LRH-Net)) is proposed for the location of OD and macular center, which can transform the location problem into a pixel-level regression problem based on the heatmap regression method and maximum likelihood estimation theory.
(2) To meet the needs of clinical accuracy and real-time detection, one-stage object detection framework Yolov3 is used to achieve ROP lesion location.
(3) Comprehensive experiments are conducted to evaluate the performance of the proposed KAL system and the experimental results demonstrate that our KAL system can achieve excellent performance on the location of OD and macular center and ROP lesion.Meanwhile, the ROP zoning results based on the proposed KAL system have good consistency with the zoning results manually labeled by clinicians, indicating that our KAL system can provide an important reference for objective ROP zoning diagnosis clinically.

Related work
Key point location: Key point location is one of the important links in the field of image processing, aiming at locating the key points with strong semantics of the objects of interest in the image, such as the corners of the mouth, nose and eyes in the face image or the bones and joints in the human body image.Taking face key point location as an example, the main location models can be divided into traditional methods based on manual features and deep learning methods.Traditional methods mainly include active shape models (ASMs) [34], supervised gradient method (SDM) [35], cascaded pose regression (CPR) [36], etc, which mainly use manually designed features to fit the location of face key points.In recent years, with the rapid development of DL, the DL based key point location algorithms have been widely used and their performance has exceeded the traditional methods, which is mainly due to the strong feature extraction ability of deep neural network (DNN) [37][38][39][40].Generally, the key point location based on DL can be divided into coordinate regression method (CRM), binary segmentation method (BSM) and heatmap regression method (HRM).The coordinate regression method is to regard the key point location as a numerical coordinate regression problem, which directly predicts the coordinate value of the target point.The binary segmentation method refers to transforming the location problem into a binary segmentation problem, expanding the coordinate label of a single pixel into a local region, and taking the segmented mask center as the target point position.Different from the above two methods, the heatmap regression method usually includes three steps: (1) Generate a two-dimensional (2D) heatmap with the same size as the input image based on the key points position.(2) Training regression network to predict the heatmap.
(3) The predicted heatmap is post-processed to obtain the position coordinates of key points.In addition to the three methods mentioned above, in some practical scenarios with low requirements for location accuracy, the object detection models such as Fast-RCNN [41] and Yolo series algorithms [42] can also be used for key point location, which take the center of the detected rectangle as the target point.Specifically, the goal of coordinate regression model is to regress the offset of the target point relative to the image origin (0,0).However, due to the limitation of the receptive field of the model, the prediction error of long-distance offset is usually large.Secondly, in the process of model training, the coordinate regression method is directly based on numerical coordinates to calculate the loss function.Compared with the binary segmentation method and the heatmap regression method, the constraints provided in the supervised learning are less, resulting in low efficiency of parameter updating and slow convergence speed in the process of model training.In addition, in terms of interpretability, compared with binary segmentation method and heatmap regression method, the way of directly regressing the coordinate value leads to the lack of interpretability of the model.Similarly, compared with the coordinate regression method, although the binary segmentation method has certain advantages in model optimization and interpretability, the accuracy of location often depends on the accuracy of segmentation.If the target area is not segmented, it will lead to location failure.The heatmap regression model is to directly regress the probability value that the pixel at each position in the image belongs to the target point.Therefore, in the process of model optimization, the output probability value of all pixels in the image participates in the calculation of the loss function, thus improving the update efficiency of model parameters and the convergence speed of the model.In addition, similar to the binary segmentation method, the heatmap regression method also realizes the optimal learning from image to image, which facilitates the data augmentation operation of the input image and the label image simultaneously, such as rotation, flipping, etc., which can improve the robustness of the model [37].Therefore, we study the conversion from RetCam fundus image to probability map based on heatmap regression method to realize the location of optic disc and macular center.
Object detection algorithm: With the rapid development of computer vision, object detection technology has been greatly developed as a basic task in the field of computer vision, and has been widely used in face detection, pedestrian detection, and pathological area detection, etc. [43][44][45].The purpose of object detection is to find all interested objects in the image and determine their categories and positions, which can be divided into manual features based traditional methods and DL based methods.Traditional methods usually use some shallow features or manually designed features, and need to select appropriate features for each object detection task.Although the traditional object detection algorithms have achieved good performance, their adaptability, mobility and generality are poor for complex and diverse scenes, while DL based object detection algorithms have been widely used due to its strong feature extraction ability of DNN and their performance has exceeded the traditional methods.The CNN based object detection algorithms can be divided into two-stage object detection algorithm and one-stage object detection algorithm.Among them, the former first performs feature extraction, then performs region proposal (RP), and finally obtains the corresponding classification/location results.The typical two-stage object detection algorithm is Fast-RCNN [42], and its accuracy is generally high, but its speed is slow, so it is difficult to achieve real-time detection.Unlike the two-stage object detection algorithm, the one-stage object detection algorithm does not need RP, but directly extracts features from the network to predict the category and location of objects, thus greatly improving the detection efficiency.The typical one-stage object detection algorithm is Yolo series algorithms [42,46], which divide the image into several grids, then generate a priori box based on the anchor mechanism, and only need one step to generate a detection box, greatly improving the prediction speed.Therefore, to meet the clinical accuracy and real-time requirements, we exploit one-stage object detection framework Yolov3 to achieve ROP lesion location.

Overview
In this paper, we develop a new KAL system based on CNN as shown in Fig. 2, which consists of a key point location network named LRH-Net and an object detection network named ROP lesion location network (ROP-LLNet).It can be seen from Fig. 2 that LRH-Net can locate the OD and macular center in RetCam fundus image, while ROP-LLNet can locate the ROP-related lesion, and finally realize the ROP zoning recognition by observing the location of the ROP lesion according to the definition of ROP zoning in the ICROP as shown in Table 1.

Lightweight residual heatmap network
The previous research has shown that compared with coordinate regression method and binary segmentation method, heatmap regression method reduces the complexity of the key point location task and has advantages in model optimization, result visualization and location accuracy [37].However, to meet the real-time requirements, the traditional heatmap regression network only estimates the key point on a single coarse scale of 1/4 or 1/2 of the input image, which cannot meet the requirements of high accuracy [40].In response to the above problems, we design a lightweight residual heatmap network (LRH-Net) combining CNN and heatmap regression method to realize the location of the OD and macular center in RetCam fundus images as shown in Fig. 3.It can be seen from Fig. 3 that the proposed LRH-Net is mainly composed of feature encoder, feature decoder, skip connection and post-processing unit.The feature encoder is used to extract spatial features from the input fundus image, while the feature decoder is used to construct the heatmap from the encoded features.Skip connection is embedded between the encoder and decoder to supplement the loss of fine information caused by down-sampling, retain the spatial information of the original image, and make the model more accurate and robust, while the post-processing unit is applied to convert the output heatmap G ′ (x, y) into numerical coordinate (x ′ f , y ′ f ).Feature encoder: We design a lightweight backbone network as an encoder based on depthwise separable convolution (DSC) [47] and residual structure [14] to extract the hierarchical features of RetCam fundus images.Similar but different from original ResNet18, we use DSC to replace 3 × 3 convolution in the residual block of original ResNet18 as shown in Fig. 4(a), where the DSC is composed of depthwise convolution (DC) and pointwise convolution (PC).
Different from conventional convolution, depthwise convolution means that a convolution kernel is only responsible for one channel, and a channel is only convolved by a convolution kernel, where the number of channels of the feature map generated in this process is exactly the same as the number of channels of the input feature map, while the pointwise convolution is very similar to the conventional convolution, and its convolution kernel size is 1 × 1×M (M is the number of channels in the previous layer), which weights and sums the feature maps obtained from the convolution of each channel in the depth direction to generate a new feature map.Based on the above analysis, the essence of DSC is to realize the relationship between pixels and the connection between channels in two steps.The connection between pixels of the same channel is first considered by DC, and then each channel is connected by PC to achieve the purpose of the original convolution operation, while greatly reducing the number of parameters and improving the computing efficiency.In addition, considering that there were only three skip connections between the encoder and decoder in the original LinkNet [48], it will cause the output feature map of the last decoder module to fail to conduct the skip connection, so that the supplementary information of the corresponding encoder layer cannot be obtained.To alleviate this problem, the proposed LRH-Net uses bilinear interpolation to add an upsampling operation before the feature extractor of LinkNet, and then obtains a feature map with input image size and 64 channels by using a convolution of 7 × 7 × 64, which is used to achieve skip connection with the output feature map of the last decoder module, thus achieving the fusion of shallower detailed information and deeper semantic information, reducing the details lost in the shallow layer of the encoder, and further accelerating the model convergence speed.Feature decoder: To restore the high-level semantic feature maps generated by the feature encoder, we use four decoder modules in the decoder path, which is similar to the decoder of LinkNet [48] and consists of a series of transpose convolution (TC), batch normalization (BN) and ReLu activation functions.Figure 4(b) shows the decoder module.In addition, the feature map is restored to the original input image with 64 channels after the last decoder module, so we exploit 1 × 1 to obtain the output heatmap G ′ (x, y).
Skip connection: Compared with original UNet [13], our LRH-Net uses addition operation to replace concatenation operation for skip connection to improve network performance.It is worth noting that there are two main advantages to achieve skip connection directly by adding the encoder features and decoder features of the corresponding layer: (1) When the network is deep, deconvolution cannot realize the recovery of image details, and the skip connection can recover part of the information lost due to the down-sampling of the encoder.(2) When the lost information is relearned through the addition operation, the channel reduction scheme does not add additional parameters, which ensures the lightweight of the model, reduces the amount of calculation and processing time, and thus accelerates the model training.
Post-processing unit: After model training, LRH-Net outputs the corresponding heatmap G ′ (x, y), which needs to be converted to the corresponding numerical coordinate (x ′ f , y ′ f ) by post-processing unit.Common post-processing methods include the maximum value method (MVM), threshold center method (TCM), and Gaussian distribution based maximum likelihood estimation (MLE).The maximum value method is the most common method to convert the heatmap into numerical coordinates, which is realized by extracting the position of the peak value in the heatmap, and takes the index corresponding to the maximum pixel value in the heatmap as the corresponding key point coordinates, as shown in Eq. ( 1).
Although the maximum value method is easy to conduct, its final result is only determined by a single pixel.When there are burrs and spikes in the prediction results, errors will occur, so the robustness of this method is not good.In view of the single information reference problem of the maximum value method, the threshold center method first extracts the pixel position set R corresponding to the predicted heatmap that is greater than the threshold value, such as 0.5, and then calculates the average value of all element coordinates in the set R, as shown in Eq. ( 2) and (3).Compared with the maximum value method, threshold center method considers a wider range of pixel values, which reduces the prediction error to a certain extent.However, this method gives the same weight to all elements when calculating the mean value and does not consider the output probability of corresponding elements, so there are still some problems.
where R represents the set of pixel positions with pixel values greater than the threshold value of 0.5 in the predicted heatmap, N represents the number of elements in the set R, x i and y i are the coordinate value corresponding to each element in the set R.
It is worth noting that in view of the problems of the threshold center method, the maximum likelihood estimation method estimates the distribution parameters from a statistical perspective, which regards the pixel value at each location as the probability density of that location.According to the maximum likelihood estimation theory, the coordinates of the target point can be estimated by Eq. ( 4) and (5).Different from the threshold center method, the maximum likelihood estimation method constructs a Gaussian distribution, while the essence of the threshold center method is uniform distribution.Considering that under ideal conditions, the predicted heatmap follows Gaussian distribution, we use the maximum likelihood estimation method as the post-processing method to obtain the coordinates of the predicted OD or macular center (x ′ f , y ′ f ).
where p i is the probability value corresponding to each element in the set R.

ROP lesion location network
Figure 5 shows the overview of ROP lesion location network (ROP-LLNet), which is based on one-stage object detection framework Yolov3.It can be observed from Fig. 5 that ROP-LLNet consists of feature extractor, feature fuser and detector.Firstly, we use DarkNet53 as the feature extractor to extract the spatial features of the input image.Then, the features of the middle layer, the middle and lower layer and the top layer extracted by the feature extractor DarkNet53 are fused using the feature pyramid network (FPN) to obtain three effective features of different sizes.Finally, the prediction results are obtained by the detector.Specifically, the structure of DarkNet53 is similar to ResNet, which removes the last fully connected layer and consists of 52 convolution layers.The first 3 × 3 convolution is mainly used to increase the number of channels, obtain more effective feature maps without changing the size of the input image, and expand the receptive field of the feature map, while the second 3 × 3 convolution with steps of 2 is mainly used for down-sampling to reduce the amount of parameters and computation, and then stack multiple residual blocks to gradually realize feature extraction.For the feature fuser, multi-scale feature fusion is used to sample the deep feature map and concatenate it with the shallow feature map.The top-down + bottom-up feature fusion is realized, and multiple feature maps with different scales and different receptive fields are output, which can reduce the number of network parameters on the premise of ensuring accuracy.In addition, the detector is composed of three detection heads, each of which consists of Conv + BN + LeakReLu (CBL) module and classification layer of 1 × 1 convolution and outputs corresponding classification and positioning results.It is worth noting that we use Yolov3 with 1 × 1 convolution to replace the fully connected layer for classification, mainly based on the following two reasons: (1) The input scale of the fully connected layer is limited, while the convolution layer only needs to limit the number of input and output channels, which is flexible and convenient to use.(2) The full-connected layer outputs one-dimensional or two-dimensional features, so when the feature map is input to the full-connected layer, it needs to be flattened, which will destroy the spatial information of the feature map to a certain extent.Unlike the fully connected layer, the convolution layer outputs three-dimensional features, which can greatly preserve the spatial information of the corresponding feature map of original image and is more conducive to outputting the corresponding channel value in the output space when matching the positive sample.In particular, considering that if the model is trained from the beginning, the randomness of the initialization weight is large, which will lead to the poor feature extraction and slow convergence speed.To accelerate the network convergence and promote feature extraction, we use the pre-training weight on the VOC dataset to initialize the Yolov3.

Loss functions
As illustrated in Fig. 2, our KAL system consists of a key point location network named LRH-Net and an object detection network named ROP-LLNet.It can be seen from Fig. 3 that the proposed LRH-Net takes the fundus images as input and outputs the predicted heatmap, converting location problem into pixel-level probability regression problem.For this regression problem, we use pixel-level mean square error (MSE) as the loss function to optimize the model.The loss function is defined as follows: where N represents the number of samples.y i and y ′ i are the ground truth and the corresponding predicted results.
In addition, our ROP-LLNet takes the fundus images as input and outputs the classification and location prediction results of the ROP lesion.Therefore, the total loss function consists of three parts: confidence loss (L conf ), location loss (L loc ) and category loss (L cls ).The total loss function combined is defined as follows: where S 2 represents the number of grids, and B represents the number of candidate boxes generated by each grid.if there are lesions in the image, is the opposite.α loc and α noobj are super parameters, which are set to 5 and 0.5 respectively.σ represents the sigmoid function.

Experiments and results
In this section, we first introduce the experimental dataset in detail.Then, the experimental setup is described, including the experimental strategy and parameter settings in the training phase and evaluation metrics in the testing phase.Finally, the experimental results are presented in detail.

Dataset
To evaluate the performance of the proposed KAL system, we collect 1246 fundus images for the location of OD and macular center and 1692 fundus images for the location of ROP lesion from Children's Hospital of Soochow University, respectively.All images were collected by professionals using RetCam3 in 2019, with the resolution of 1600 × 1200 × 3. The collection and analysis of image data were approved by the Institutional Review Board of the Children's Hospital of Soochow University and adhered to the tenets of the Declaration of Helsinki.An informed consent was obtained from the guardians of each subject to perform all the imaging procedures.The data is labeled by an attending ophthalmologist with more than 10 years of ROP clinical experience under the supervision of two chief ophthalmologists.All fundus images are randomly divided into training set, validation set and test set according to subjects, which are shown in Table 2.For convenience, the dataset of the location of OD and macular center is called D 1 , while the dataset of the location of ROP lesion is called D 2 .
where A is a parameter used to normalize the pixel value of G(x, y) to [0,1], σ 2 is the variance and is used to control the spread of the corresponding heatmap.As shown in Fig. 6, the larger the variance, the higher the spread of the corresponding heatmap.Parameter settings: The proposed KAL system is performed on the public platform Pytorch.A NVIDIA RTX3090 GPU with 24GB memory is used to train the model with back-propagation algorithm by minimizing the loss function as illustrated in Eq. ( 6) and (7).Specifically, the LRH-Net uses the Adam with an initial learning rate of 0.0002 and weight decay of 0.0005 as optimizer.The batch size and epoch are set to 16 and 200, respectively.For ROP-LLNet, we use the Adam as optimizer, where the initial learning rate and weight decay are set to 0.0001 and 0.0005 respectively.The batch size and epoch are set to 16 and 100, respectively.For the sake of fairness, all experiments in this study maintain the same super-parameter and environment configuration.
Evaluation metrics: To comprehensively evaluate the performance of different methods for the location of the OD and macular center, the Mean Euclidean Distance Error (MEDE) [37] and the Normalized Euclidean Distance Error (NEDE) [35] are used.Both are defined as follows: where N is the number of samples, (x i , y i ) and (x ′ i , y ′ i ) represent the true coordinate and corresponding predicted coordinate in the i-th fundus image.EDE i is the Euclidean Distance Error of the i-th fundus image, while Diag(I) is the diagonal length of input image.It can been observed from Eq. ( 14) and ( 15) that the smaller the MEDE and NEDE, the closer the predicted coordinate are to the real coordinate.In addition, similar to Ref. [40] and taking into account the requirements of medical image on positioning accuracy, if the NEDE is greater than 0.02, it means that the target point positioning fails, and if the NEDE is less than 0.01, it means that the target point positioning error is small, that is, the positioning accuracy is high.
For the ROP-LLNet, we first use the average precision at IOU threshold of 0.5 (AP 50 ) and accuracy (ACC) to evaluate its performance.In addition, considering the actual clinical application requirements, we also calculate the frames per second (FPS) that the model can process.Among them, ACC is defined as follows: ACC = N sucess N Total (17) where N sucess and N total represent the number of images successfully located and the total number of images.

Experimental results for key point location
From the above analysis, we can see that for the location of the OD and macular center, MEDE is first used as the evaluation index, and then NEDE is exploited to measure the accuracy of the location.The quantitative results are shown in Table 3, Table 4, Table 5, and Table 6, and the numbers highlighted in bold indicate the best performance.In this section, we will give the detailed experimental results and the corresponding analysis.Comparison of different methods: Section 2 theoretically analyzes the advantages of heatmap regression method.To further verify and compare the effectiveness of different location methods, we have conducted a series of comparative experiments as shown in Table 3 and Table 4.As can be seen from Table 3 and Table 4, the coordinate regression method has the worst performance, followed by the binary segmentation method, and the heatmap regression method has achieved better performance.It is worth noting that the proposed LRH-Net has achieved the best performance in the location of OD and macular center.Specifically, it can be observed from Table 3 and Table 4 that the MEDE of our LRH-Net is 6.13 pixels and 17.03 pixels respectively, and the number of localization failures is 0 and 14 according to NEDE.Firstly, compared with the coordinate regression method, the performance of our LRH-Net has been greatly improved and the model complexity and computational cost are much lower.The MEDE of OD and macular center location has been reduced by 42.54 pixels and 34.76 pixels respectively, the number of failed fundus images has been reduced by 116 and 105, and the number of well positioned fundus images has increased from 35 and 32 to 233 and 173, respectively.In addition, for the location of OD center, we compare the segmentation method on a typical image segmentation network named UNet [13] and a specialized OD segmentation network for fundus image of premature infants named AFENet [49].The experimental results show that compared with the coordinate regression method, the segmentation method has achieved a smaller MEDE, but the positioning accuracy often depends on the accuracy of the segmentation.For fundus images of premature infants, image quality and contrast are usually poor due to incomplete retinal structure development, illumination and eye movement during the shooting process.For such fundus images, the segmentation method will lead to the phenomenon that the target area is not correctly segmented, which will lead to the failure of localization.However, our method with less model parameters and computational cost can achieve more accurate positioning (see the parameters and GFLOPs in Table 3), indicating that the proposed method is superior to the segmentation method.
It is worth noting that we also compare our LRH-Net with the lightweight U-shaped network LinkNet [48] based on the heatmap regression method.The experimental results show that the performance of LinkNet is superior to the coordinate method and the segmentation method, and the suboptimal positioning performance is achieved, while our LRH-Net further improves the positioning accuracy, and the model complexity and computational cost are far lower than LinkNet (see the parameters and GFLOPs in Table 3).These results show the effectiveness of the proposed LRH-Net in this study.
Comparison of heatmap with different variances: Variance σ 2 is used to control the spread of the Gaussian heatmap.The larger the σ 2 , the stronger the spread of the heatmap.Figure 6 shows the heatmaps corresponding to different variances σ 2 .To verify the influence of different variances on positioning accuracy, we have conducted a series of comparative experiments as shown in Table 5.As can be seen from Table 5, the minimum positioning errors of the OD and macular center are about 3200 and 6400 for σ 2 respectively, with MEDE of 6.13 and 17.03.Meanwhile, it can be observed from Table 5 that with the increase of variance σ 2 , the positioning error first decreases and then increases, especially for the location of the macular center, the possible reasons are: (1) When the variance is small, only part of the pixels around the target point contribute to the calculation of training and test errors, which may be similar to the category imbalance problem in the classification task, resulting in the failure of the model to achieve better convergence.(2) Similarly, when the variance is too large, due to the large pixel intensity around the target point, there is a certain redundancy area, and the constraint of the label is reduced, which may lead to the low sensitivity of the loss function change, that is, the training efficiency of the model is low.Therefore, it is very important for the heatmap regression method to select the appropriate variance to construct the Gaussian heatmap according to the size of the image and the range of the features around the target point.
Comparison of different post-processing methods: Section 3 theoretically analyzes the advantages and disadvantages of three different post-processing methods.To further verify the impact of different post-processing methods on positioning accuracy, a series of experiments are conducted as shown in Table 6.As can be seen from Table 6, the performance of the maximum value method is the worst, followed by the threshold center method, and the maximum likelihood estimation method achieves the best positioning accuracy.Compared to the maximum value method and the threshold center method, the MEDE of the maximum likelihood estimation method used in this paper for the location of OD center has decreased by 8.55 and 0.16 pixels respectively, and the MEDE for the location of macula has decreased by 3.03 and 2.23 pixels respectively.The above experimental results show that different post-processing methods have a very important impact on the positioning accuracy, and the maximum likelihood estimation method achieves the best performance, which can effectively reduce the prediction error of the model.

Experimental results for ROP lesion location
To comprehensively evaluate the performance of ROP-LLNet, we first evaluate its positioning performance with AP 50 and ACC, and then calculate its speed evaluation index FPS.Table 7 gives the quantitative results.As can be seen from Table 7, our ROP-LLNet has achieved 93.08% for AP 50 and 96.75% for ACC, respectively.In addition, to further verify the accuracy of ROP-LLNet in predicting the location of ROP lesion, the visualization results of the prediction and its corresponding ground truth for ROP lesion location on the test dataset are given, as shown in Fig. 7.As can be observed from Fig. 7, our ROP-LLNet can accurately predict the location of ROP lesions with high confidence.It is worth noting that the result of the speed evaluation index FPS also shows that the one-stage object detection framework can meet the clinical real-time requirements.

Results of clinical validation
Realizing objective and accurate ROP zoning recognition according to the definition of ROP zoning in ICROP can assist ophthalmologists to assess the severity of diseases and provide quantitative analysis for the evaluation and monitoring of treatment effect.To further verify the comprehensive performance of the ROP zoning of the KAL system proposed in this paper, we have conducted clinical validation on the ROP zoning data of the Children's Hospital of Soochow University, mainly including the following four steps: (1) The proposed KAL system is used to locate the OD and macular center and the ROP lesion, and the coordinates of the OD and macular center are obtained by the maximum likelihood estimation method.(2) The Euclidean distance between the OD and macular center is calculated.(3) Draw the corresponding area according to the ROP zoning definition.(4) Observe the location of lesion for ROP zoning recognition.Based on the above four steps, we can get the final zoning results.We compare the zoning results obtained by using the KAL system with the results manually labeled by ophthalmologists.Table 8 shows the confusion table between the results of our method and those manually labeled by ophthalmologists.It can be seen from Table 8 that the ROP zoning results obtained by KAL system are highly consistent with the results manually labeled by ophthalmologists, and the Kappa coefficient is 0.8440.In addition, if the results manually labeled by ophthalmologists are taken as the ground truth, the ROP zoning accuracy of KAL system is 93.83%.The above results further demonstrate the effectiveness and clinical applicability of the method proposed in this paper.

Conclusion and discussion
ROP zoning can reflect the severity of the disease, and is one of the important factors in formulating an appropriate treatment plan.However, due to the strong subjectivity and great difference of ophthalmologists in the diagnosis of ROP zoning, accurate ROP zoning is still a challenging task.In this study, to solve these problems, a new KAL system consisting of LRH-Net and ROP-LLNet is proposed to assist clinicians to achieve rapid, objective and accurate ROP zoning diagnosis.Firstly, aiming at the advantages and disadvantages of the existing key point location methods and to achieve the balance between real-time and high-accuracy, we propose a heatmap regression based lightweight network named LRH-Net to achieve accurate location of OD and macular center.Then, to meet the needs of clinical accuracy and real-time detection, one-stage object detection framework Yolov3 is used to achieve ROP lesion location.Finally, we conduct a series of experiments to validate the effectiveness of the proposed method.
Compared with the coordinate regression method and binary segmentation method, our LRH-Net can effectively improve the positioning accuracy of the OD and macular center with low model complexity.Meanwhile, the experimental results also show that ROP-LLNet can accurately predict the location of ROP lesions with high confidence and fast speed.In addition, the clinical validation results of ROP zoning also show that the KAL system proposed in this study has a good consistency with the zoning results based on the manually labeled by clinicians, showing the clinical application value.
In conclusion, the proposed KAL system holds promise for objective and accurate ROP zoning in fundus images and can reduce the workload of clinicians and improve work efficiency.We believe that our KAL system can also provide important reference for evaluating the severity of ROP-related disease, formulating treatment plans and evaluating and monitoring the treatment effect.And it is expected to be applied to other key point localization and lesion localization tasks, such as the diagnosis of age-related macular degeneration, which requires further exploration and verification.
Although the proposed KAL system has shown good performance on existing ROP zoning clinical validation dataset, there are still some limitations.The evaluation of clinical validation is based on limited data.In addition, due to limitations in data diversity, experimental evaluations are mainly conducted on data collected from the same center.However, in practical clinical environments, there are significant differences in data distribution and imaging quality among different centers, which may lead to performance degradation when the network trained on data from a specific center is directly applied to data obtained from another center.Therefore, in the near future, we will collect more fundus images of premature infants from multiple centers to further explore and verify the effectiveness of the proposed KAL system.

Fig. 1 .
Fig. 1.Examples of ROP fundus image.The OD and macular are located in the green and blue boxes respectively, where the green "+" and blue "+" indicate the OD and macular center.The red box is the location of ROP lesion.
. Similarly, Li et al. developed an automatic system based on DCNN to classify the image into normal and stage 1 to 3 ROP [26].Tong et al. used ResNet101 and Faster-RCNN for the detection of ROP staging and plus disease [27].In our previous work, aiming at the complex pathological features and the clinical diagnosis characteristics of ROP staging, a new three-stream feature fusion and ordinal classification network (TFF-OCNet) was proposed to realize ROP staging [28].It is worth noting that for ROP zoning, Zhao et al. used ResNet50 pre-trained on Microsoft COCO dataset to achieve the identification of ROP Zone I [12].Although the DL framework has achieved good results in the identification of ROP Zone I, it does not involve the identification of Zone II and Zone III.Recently, Agrawal et al. used U-Net with Hough circle transform to realize ROP zoning

Fig. 3 .
Fig. 3. Overview of the proposed LHR-Net."MLE" represents maximum likelihood estimation, while G ′ (x, y) and (x ′f , y ′ f ) are predicted heatmap and its corresponding numerical coordinate, respectively.The green dashed box represents the post-processing unit, which can convert the output heatmap G ′ (x, y) into numerical coordinate (x ′ f , y ′ f ).
x i and y i are the center coordinate of the real box, while x * i and y * i are the center coordinate of the prediction box.w i and h i are the width and height of the real box respectively, w * i and h * i are the width and height of the prediction box respectively.C i and C * i represent the confidence parameter values, p i (c) and p * i (c) indicate the classification parameter values.

Experimental strategy:
As shown in Table2, the data in this paper are randomly divided into training set, validation set and test set according to subjects.To reduce the computational cost, we adjust the fundus image size of the central point location and the lesion detection to 384 × 384 × 3 and 416 × 416 × 3 by bilinear interpolation.Many previous studies have shown that data augmentation is one of the effective strategies to increase the diversity of data distribution and can alleviate the over-fitting problem.Therefore, online data augmentation has been performed.For LHR-Net, random vertical and horizontal flip, random contrast, random brightness and random sharpness transformation are used in this study, while for ROP-LLNet, the random data augmentation operations used in this study include random rotation, random vertical and horizontal flip.In addition, for the location of the OD and macula center, the Gaussian heatmap corresponding to each fundus image needs to be established as the ground truth G(x, y) before the LRH-Net training.Given the coordinate (x f , y f ) of the OD or macular center, G(x, y) is defined as follows:

Fig. 6 .
Fig. 6.Comparison of heatmap with different variances.The green "+" and blue "+" are the OD and macular center respectively.(a) indicates the original fundus image, (b1) to (b3) and (c1) to (c3) represent the heatmaps of the OD and macular center with the variance of 400, 1600 and 6400 respectively.

Fig. 7 .
Fig. 7.The visualization results of the location of the ROP lesion, in which the red box and the green box represent the ground truth and the model prediction results, respectively.