ESDiff: a joint model for low-quality retinal image enhancement and vessel segmentation using a diffusion model

In clinical screening, accurate diagnosis of various diseases relies on the extraction of blood vessels from fundus images. However, clinical fundus images often suffer from uneven illumination, blur, and artifacts caused by equipment or environmental factors. In this paper, we propose a unified framework called ESDiff to address these challenges by integrating retinal image enhancement and vessel segmentation. Specifically, we introduce a novel diffusion model-based framework for image enhancement, incorporating mask refinement as an auxiliary task via a vessel mask-aware diffusion model. Furthermore, we utilize low-quality retinal fundus images and their corresponding illumination maps as inputs to the modified UNet to obtain degradation factors that effectively preserve pathological features and pertinent information. This approach enhances the intermediate results within the iterative process of the diffusion model. Extensive experiments on publicly available fundus retinal datasets (i.e. DRIVE, STARE, CHASE_DB1 and EyeQ) demonstrate the effectiveness of ESDiff compared to state-of-the-art methods.


Introduction
Blood vessels are important structures in retinal fundus images, and their morphological changes can be used not only to identify, classify, and diagnose various diseases but also to gain a better understanding of disease progression and evaluate the effectiveness of treatment. Researchers have also demonstrated that changes in retinal vessels can serve as an early screening indicator for certain brain and cardiovascular conditions [1][2][3]. Therefore, retinal vessel segmentation plays a fundamental role in the analysis of retinal images.
High-quality retinal images are crucial in the diagnosis of various eye diseases, including diabetes, retinal tears and detachment, glaucoma, age-related macular holes, and degeneration [4]. Accurate identification of vascular structures based on high-quality images enables doctors to make precise disease diagnoses. However, retinal image degradation caused by diseases, equipment limitations, or environmental factors is a common issue, with lesions, blur, and uneven illumination being the primary types of degradation, as shown in Fig. 1. The three images with uneven illumination show varying degrees of inconsistent lighting, and some regions appear excessively dim. The blurred image lacks clarity, likely due to motion during image capture, out-of-focus conditions, or issues with the imaging equipment. The artificial image contains artificially added pathological features. Consequently, enhancing low-quality retinal images is an essential requirement for many downstream tasks, directly or indirectly impacting the diagnosis of various retinal abnormalities.
Recently, researchers have proposed numerous deep learning-based algorithms for image enhancement and vessel segmentation, aiming to alleviate the burden on doctors [5,6]. Hu et al. proposed the SuperVessel algorithm, which takes low-resolution images as input and produces high-resolution, accurate vessel segmentation [7]. They employed a high-resolution auxiliary branch that provides detailed high-resolution features to enhance vessel segmentation. Alimanov et al. introduced a cycle-consistent generative adversarial network (CycleGAN) with a convolutional block attention module, together with a modified UNet model (CBAM-UNet) for retinal vessel segmentation, to address both low-quality retinal image enhancement and vessel segmentation [8]. The adversarial training in CycleGAN progressively guides the model towards realistic rendering, while the attention module enhances the feature extraction capability. However, when applied to retinal image generation, CycleGAN has some disadvantages: 1) It requires expensive computational resources and simultaneous training of a generator-discriminator pair. 2) There is a risk of losing pathological lesions and introducing non-existent vessels, which can disrupt vascular and pathological structures.
In recent years, various diffusion models have gained significant attention in the field of computer vision [9,10], such as the Denoising Diffusion Probabilistic Model (DDPM) [11]. Diffusion models can be categorized into two main groups: unconditional diffusion models and conditional diffusion models. Unconditional diffusion models take noise sampled from a Gaussian distribution as input to generate images and videos. Conditional models typically combine sampled noise with embedded features from one modality to generate output in other modalities. Specifically, DDPM has achieved remarkable outcomes in several sub-tasks of image generation, including text-guided image synthesis [12]. Furthermore, owing to its strong performance relative to other generative models such as GANs, DDPM has gained significant popularity in recent research. However, no existing work has yet explored the use of diffusion models to integrate low-quality retinal image enhancement and vessel segmentation tasks.
In this manuscript, we propose a novel fundus image enhancement and vessel segmentation framework based on a diffusion model to address the challenge of removing redundant information in low-quality images. The novelties and contributions of the proposed method are as follows:
• To the best of our knowledge, ESDiff is the first joint model for low-quality retinal image enhancement and vessel segmentation based on a diffusion model.
• We propose the vessel mask-aware diffusion model (VMDM) to pursue high-quality images, with vessel segmentation as an auxiliary task. A mask prediction head placed after the last layer of the denoiser predicts the refined mask.
• We propose a degradation module (DM) to extract the degradation factors through a modified UNet and embed them into the diffusion model. This allows us to eliminate the degradation factors and iteratively optimize the intermediate results at each time step.
• Comprehensive experiments on publicly available datasets demonstrate the superiority of ESDiff over existing state-of-the-art methods using image enhancement and vessel segmentation-related metrics.

Retinal fundus image enhancement
The methods for enhancing retinal images can be broadly categorized into traditional handcrafted non-parametric methods and data-driven methods based on deep learning [13].
Non-parametric methods based on manual priors utilize traditional image processing techniques and prior knowledge to enhance the quality of fundus images. Examples include fundus image enhancement algorithms based on frequency filtering [14], histogram equalization [15], and spatial filtering [16].
In order to achieve data-driven image enhancement, researchers have proposed two main approaches: degradation modeling-based enhancement and unpaired image translation-based enhancement. The degradation modeling-based method focuses on accurately modeling the degradation process to restore degraded images to their original quality. In contrast, the unpaired image translation-based method maps images to visually more appealing domains without explicitly modeling the degradation process. Commonly used image quality enhancement methods are fully supervised and require high-quality reference images corresponding to the inputs during the training phase [17]. On the other hand, enhancement methods based on unpaired image translation rely on adversarial training. For example, Zhao et al. introduced a CycleGAN-based network to achieve unpaired retinal image deblurring [18,19].

Retinal vessel segmentation
In recent years, with the rise of deep learning, retinal vessel segmentation has made great progress [20]. Deep networks can automatically learn features with a large number of learnable parameters, thereby enabling more accurate segmentation of retinal vessels. Guo et al. further optimized the neural network using a deeply supervised scheme and compared the vascular probability maps obtained by each convolution layer with the ground truth [21]. Fu et al. used a multi-scale and multi-level network structure and treated retinal blood vessel segmentation as a boundary detection task to improve the segmentation performance [22]. They added side-output layers to the network to learn a rich hierarchical representation and adopted a conditional random field to model the long-range interactions between pixels.
With the rise of diffusion models, many researchers have begun to apply them to fundus blood vessel segmentation. Kim et al. proposed a diffusion adversarial representation learning model (DARL), in which DDPM is used to learn the distribution of the vessel background so that potential vessel structures can be easily identified and the foreground vessel region extracted [23]. Building on this, Kim et al. proposed a contrastive diffusion adversarial representation learning (C-DARL) model. The model consists of a diffusion module and a generation module to learn the distribution of multi-domain vascular data and learn a more realistic vessel representation based on the contrastive loss of the vessel mask [24]. Alimanov et al. developed the retinal tree (ReTree) dataset, which includes retinal images and corresponding vascular trees, along with a DDPM-based segmentation network trained on images from the ReTree dataset [25]. They created a two-stage DDPM that generates vascular trees from random numbers drawn from a standard normal distribution. Subsequently, this model is guided to generate retinal images from the given vascular tree and random distribution. However, few researchers have studied how to segment blood vessels from fundus images affected by uneven illumination, blur, or artifacts, which limits clinical application.

Low-quality fundus image enhancement and vessel segmentation using a diffusion model
The overall network architecture is shown in Fig. 2. ESDiff utilizes a diffusion model to achieve both low-quality image enhancement and vessel segmentation.

Preliminaries
The Denoising Diffusion Probabilistic Model (DDPM) belongs to a category of generative models capable of producing samples from Gaussian noise [11]. It accomplishes this by learning the data distribution via a Markov process. DDPM consists of two processes: a forward process and a reverse process. The forward process is a Markov chain in which each state is obtained by sampling from a Gaussian distribution conditioned on the previous state, and the reverse process represents the joint distribution of the data. The forward diffusion path gradually converts the data $x_0$ into Gaussian noise over a total of $T$ steps:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$

This can be thought of as the next state being obtained by adding a small amount of Gaussian noise with variance schedule $\beta_t$ to the current state. The state at time step $t$ can also be calculated directly from the initial state $x_0$. Therefore, the distribution of the noisy target $x_t$ given the initial state $x_0$ is expressed as:

$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\right)$$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.

The reverse process is a generation step that can also be approximated by a Gaussian distribution. Starting from a standard Gaussian distribution, it iteratively performs denoising over $T$ time steps to generate the corresponding image. DDPM is trained to approximate the reverse diffusion process:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

Recent studies have shown that the variance schedule of the reverse step can be learned using neural networks in the same way as the forward process [11]. The parameterization for predicting the mean $\mu_\theta(x_t, t)$ is as follows:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)$$

To learn the denoising process, we first generate a sample $x_t \sim q(x_t \mid x_0)$ by adding Gaussian noise $\epsilon$ to $x_0$ and then train a model $\epsilon_\theta(x_t, t)$ to predict the added noise. When the means of the distributions are equated, the training objective can be further simplified to:

$$L_{simple} = \mathbb{E}_{x_0, \epsilon, t} \left[ \left\| \epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ t\right) \right\|^2 \right]$$
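As a minimal sketch of the forward process and the mean parameterization described above, the following pure-Python snippet samples $x_t$ directly from $x_0$ and computes $\mu_\theta$ from a noise estimate. The linear schedule with $\beta$ ranging from 1e-4 to 0.02, the function names, and the flat-list image representation are illustrative assumptions, not details taken from this paper.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances beta_1..beta_T (assumed schedule, not from the paper)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) for a flat list of pixel values; t is 1-indexed."""
    a_bar = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e for x, e in zip(x0, eps)]
    return xt, eps

def predicted_mean(xt, eps_pred, t, betas, alpha_bars):
    """Mean of the reverse step computed from a noise prediction eps_pred."""
    beta, a_bar = betas[t - 1], alpha_bars[t - 1]
    coef = beta / math.sqrt(1.0 - a_bar)
    return [(x - coef * e) / math.sqrt(1.0 - beta) for x, e in zip(xt, eps_pred)]

betas = linear_beta_schedule(1000)      # T = 1000, matching the number of timesteps used
alpha_bars = cumulative_alphas(betas)
x0 = [0.2, 0.5, 0.8]                    # toy "image" with three pixels
xt, eps = q_sample(x0, 500, alpha_bars, random.Random(0))
```

With a perfect noise prediction at $t = 1$, `predicted_mean` returns $x_0$ exactly, which is a convenient sanity check for an implementation of the parameterization.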

Vessel mask-aware diffusion model
In this study, obtaining high-quality fundus images through image enhancement is crucial for accurate blood vessel segmentation. These enhanced images serve as references for precisely identifying the location of blood vessels. To address this, we propose a novel joint task approach that integrates image enhancement and blood vessel segmentation, aiming to simultaneously obtain enhanced images and accurate segmentation results. In contrast to previous image enhancement methods that rely on conditional diffusion models [26], our proposed approach utilizes a vessel mask-aware diffusion model. The model gradually generates high-quality fundus images, and it improves both image quality and the accuracy of blood vessel segmentation by refining the mask. We begin by reviewing the previous conditional diffusion model, which obtains high-quality images through repeated refinement. This model utilizes a stochastic iterative denoising process that adapts the denoising diffusion probabilistic model, allowing it to learn a conditional reverse process $p_\theta(x_{0:T} \mid y)$ without modifying the diffusion process $q(x_{1:T} \mid x_0)$ for $x$. During the training phase, we sample triplets $(x_0, y, \tilde{v})$ from the joint distribution $q(x, y, \tilde{v})$. The mask $\tilde{v}$ can be obtained by using matched filtering to extract the edges of the blood vessels and thus the entire blood vessel structure [27].
Algorithm 1 summarises the training phase of the proposed method, in which we learn the reverse process of the vessel mask-aware diffusion model; here $x_{t-1}$ denotes the intermediate image obtained during the diffusion process at each time step $t$. We can sample the intermediate $x_t$ directly from the high-quality image $x_0$ and the noise $\epsilon$ by marginalizing the Gaussian diffusion process. We employ a UNet architecture similar to the denoiser $\epsilon_\theta$ of [10]. The denoiser $\epsilon_\theta$ takes the low-quality image $y$, the intermediate variable $x_t$, and the time step $t$ as input to predict the noise map $n_t$ and the refined vessel mask $v_t$:

$$(n_t, v_t) = \epsilon_\theta(y, x_t, t)$$

Because the generation of high-quality retinal images relies strongly on vessel mask information, and the enhancement of low-quality retinal images is closely correlated with the corresponding vessel mask, we propose a model that jointly performs high-quality image prediction and vessel mask refinement. At each time step $t$, we incorporate a mask prediction head into the architecture of $\epsilon_\theta$. This additional component, consisting of a 1 × 1 convolutional layer followed by a Sigmoid function, predicts the refined mask from the output of the last layer of $\epsilon_\theta$. Following [11], the conditional denoiser learns to predict the noise, and the diffusion objective function is defined on this noise prediction. Furthermore, during the training stage, by utilizing pairs of high-quality and low-quality images, we can use the ground truth (GT) vessel mask as a reference to constrain the mask refinement process, ensuring its coherence and accuracy. Because paired datasets for training are scarce, we describe how the paired data are obtained in Section 4.1.
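To make the mask prediction head concrete, here is a minimal pure-Python sketch of a 1 × 1 convolution followed by a Sigmoid. A 1 × 1 convolution with a single output channel is a per-pixel weighted sum over the feature channels. The function names, the nested-list feature layout `[C][H][W]`, and the toy weights are assumptions for illustration; the actual head operates on the denoiser's last-layer features.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real value to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def mask_head(features, weights, bias):
    """1x1 convolution (one output channel) followed by a sigmoid.

    features: nested lists shaped [C][H][W]
    weights:  list of C channel weights (hypothetical learned parameters)
    returns:  nested lists shaped [H][W] with values in (0, 1)
    """
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    mask = []
    for i in range(height):
        row = []
        for j in range(width):
            # Per-pixel linear combination across channels, then squashing.
            z = bias + sum(weights[c] * features[c][i][j] for c in range(channels))
            row.append(sigmoid(z))
        mask.append(row)
    return mask

# Toy example: 2 feature channels on a 2 x 2 grid.
features = [
    [[1.0, -1.0], [0.5, 0.0]],
    [[2.0, 0.0], [-0.5, 0.0]],
]
refined_mask = mask_head(features, weights=[1.0, 0.5], bias=0.0)
```

Because the output lies in (0, 1), it can be compared directly against a binary ground truth mask or thresholded to obtain a segmentation.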
In particular, the ground truth vessel mask $v_{GT}$ can be obtained by binarizing the residual mapping between the low-quality image and the corresponding high-quality image:

$$v_{GT} = \begin{cases} 1, & x - y > 0.1 \\ 0, & \text{otherwise} \end{cases}$$
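The binarization rule can be sketched in a few lines of Python. This is an illustrative example that assumes single-channel images stored as nested lists with intensities scaled to [0, 1]; the function name `vessel_mask_gt` is hypothetical.

```python
def vessel_mask_gt(x, y, threshold=0.1):
    """Binarize the residual x - y pixel by pixel.

    x: high-quality image, y: low-quality image (same shape, values assumed in [0, 1]).
    A pixel is set to 1 where the residual exceeds the threshold, else 0.
    """
    return [
        [1 if xv - yv > threshold else 0 for xv, yv in zip(x_row, y_row)]
        for x_row, y_row in zip(x, y)
    ]

x = [[0.9, 0.5], [0.3, 0.2]]    # toy high-quality image
y = [[0.5, 0.45], [0.3, 0.35]]  # toy low-quality image
v_gt = vessel_mask_gt(x, y)     # residuals: 0.4, 0.05, 0.0, -0.15
```

Only the first pixel exceeds the 0.1 threshold here, so the resulting mask is `[[1, 0], [0, 0]]`.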
By combining the above losses, we obtain the hybrid objective function $L_{total}$ in Eq. (11), which guides the training of the denoiser $\epsilon_\theta$ in our VMDM; $\lambda$ is the weighting coefficient used to balance the influence of each term.
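For orientation, one way such a hybrid objective could be written is sketched below. This is an assumption-laden illustration, not the paper's Eq. (11): the names $L_{diff}$ and $L_{mask}$, the choice of norms, and the placement of $\lambda$ on the mask term are all hypothetical.

```latex
% Hypothetical sketch of a hybrid objective (not the paper's exact Eq. (11)).
% L_diff: noise-prediction term; L_mask: mask-refinement term supervised by v_GT.
\begin{aligned}
L_{diff}  &= \mathbb{E}_{x_0, y, \epsilon, t}\left[ \left\| \epsilon - n_t \right\|_2^2 \right], \\
L_{mask}  &= \mathbb{E}_{x_0, y, t}\left[ \left\| v_t - v_{GT} \right\|_1 \right], \\
L_{total} &= L_{diff} + \lambda\, L_{mask}.
\end{aligned}
```

Under this reading, a larger $\lambda$ places more emphasis on mask refinement relative to noise prediction.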

Degradation model
We utilize the method proposed by Ye et al. for estimating the illumination map [28]. The low-quality fundus image and its corresponding illumination map are concatenated as inputs to generate a high-quality fundus image. To retain essential features from the original fundus image, we incorporate a skip connection that connects the input and output of the modified UNet architecture. Furthermore, in addition to preserving precise details, the symmetric expanding path in the UNet integrates the illumination attention mechanism.
Degraded color fundus images can be considered as a combination of clean images and degradation factors. Therefore, the description of degraded fundus images can be simplified as follows:

$$y = x + d_y \tag{12}$$

where $x$ represents the corresponding clean image, $y$ represents the degraded low-quality image, and $d_y$ represents the degradation factors associated with $y$. Given the degraded low-quality image $y$, the restoration of the high-quality image $x'$ becomes:

$$x' = y - d_y \tag{13}$$

The key issue is to identify the degradation factors hidden in $y$. Rather than directly inputting the low-quality image into the network to eliminate degradation factors and obtain an enhanced image, our network is designed with a dedicated branch that explicitly extracts the degradation factors. Given a low-quality image $y$, the opposite degradation factors $-d_y$ can be extracted as $-d_y = U(y)$, where $U$ represents the modified UNet architecture shown in Fig. 3. The overall restoration network $G$ can be expressed as follows:

$$G(y) = y + U(y) \tag{14}$$

This can be accomplished by introducing a skip connection between the input $y$ and the output of $U(y)$. We incorporate this process into the iteration of the diffusion model to refine the result at each time step $t$. Eq. (14) is therefore modified so that the obtained $x_{t-1}$ serves as the input for the next time step of the diffusion process. The enhancement process maintains the structural integrity and spatial coherence of the important retinal features, ensuring they are retained with minimal displacement or distortion.
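The residual formulation of the restoration network can be illustrated with a small Python sketch. It assumes images are flat lists of floats and replaces the modified UNet with a hypothetical oracle `oracle_U` that returns the exact opposite degradation; a trained network would only approximate this.

```python
def restore(y, U):
    """Restoration with a skip connection: output = y + U(y).

    y: degraded image as a flat list of floats
    U: callable standing in for the modified UNet; it should return -d_y
    """
    correction = U(y)
    return [yv + cv for yv, cv in zip(y, correction)]

# Toy setup: a clean image x and known degradation factors d, so y = x + d.
x = [0.6, 0.4, 0.7]
d = [-0.2, 0.1, -0.3]
y = [xv + dv for xv, dv in zip(x, d)]

def oracle_U(image):
    """Hypothetical perfect estimator that returns the opposite degradation -d."""
    return [-dv for dv in d]

restored = restore(y, oracle_U)  # equals x up to floating-point error
```

Predicting the correction rather than the clean image lets the skip connection carry the original content through unchanged, which is consistent with the stated goal of preserving features from the input fundus image.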

Datasets
Due to the lack of paired low-quality retinal images, high-quality retinal images, and their corresponding high-resolution vessel masks for training our model, we utilized a pre-trained data-driven degradation model proposed by Shen et al. [29] to degrade the retinal images from the DRIVE, STARE, and CHASE_DB1 datasets during the training phase. The DRIVE dataset consists of 40 retinal images with a resolution of 565 × 584 pixels [30]. The CHASE_DB1 dataset comprises 28 retinal images with a resolution of 999 × 960 pixels [31]. The STARE dataset contains a total of 20 images with a resolution of 700 × 605 pixels [32]. To tackle the problem of limited training data, we employed horizontal and vertical flipping as well as rotation for data augmentation on all three datasets, resulting in training sets of 160, 96, and 160 images, respectively. The testing set contains a total of 64 images, including 40 images from the DRIVE dataset, 8 images from the STARE dataset, and 16 images from the CHASE_DB1 dataset.
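The flip and rotation augmentations can be sketched as follows. The example applies the same transform to an image and its vessel mask so that the pair stays aligned. The 90-degree rotation, the inclusion of the unmodified pair, and all function names are assumptions for illustration, since the rotation angles and exact augmentation pipeline are not specified here.

```python
def hflip(img):
    """Mirror each row (horizontal flip)."""
    return [row[::-1] for row in img]

def vflip(img):
    """Reverse the row order (vertical flip)."""
    return [row[:] for row in img[::-1]]

def rot90_cw(img):
    """Rotate 90 degrees clockwise (assumed angle, for illustration only)."""
    return [list(col) for col in zip(*img[::-1])]

def identity(img):
    """Return an unmodified copy."""
    return [row[:] for row in img]

def augment_pair(image, mask):
    """Apply each transform to the image and its mask together."""
    ops = [identity, hflip, vflip, rot90_cw]
    return [(op(image), op(mask)) for op in ops]

image = [[1, 2], [3, 4]]
mask = [[0, 1], [1, 0]]
pairs = augment_pair(image, mask)
```

Applying identical geometric transforms to both members of a pair matters for segmentation, because a mask that is flipped differently from its image no longer labels the correct pixels.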
The EyeQ dataset consists of 12,543 training images and 16,249 testing images, which are divided into three categories based on their quality: 16,817 images of good quality, 6,435 images of usable quality, and 5,540 images of reject quality [33]. All images have a fixed resolution of 800 × 800. We randomly selected 2,441 "reject" images for the testing set.

Evaluation metrics
We evaluated the image enhancement performance of the model using Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) [34]. Note that paired images are required to compute PSNR and SSIM. PSNR quantifies the extent to which an image is affected by noise, while SSIM measures the quality of an image relative to the original uncompressed or undistorted image. For both metrics, higher values represent better results. We evaluated the performance of the model on the vessel segmentation task using Accuracy (Acc), Sensitivity (Se), Specificity (Sp), Area under the ROC curve (AUC), and F1-score.
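For reference, the standard definitions of PSNR and of the confusion-matrix-based segmentation metrics can be sketched in pure Python. The snippet assumes flattened images with intensities in [0, 1] and binary label lists; the function names are illustrative, and SSIM and AUC are omitted because they require more machinery than a short example allows.

```python
import math

def psnr(reference, enhanced, max_value=1.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized flat images."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, enhanced)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def segmentation_metrics(pred, target):
    """Acc, Se, Sp and F1 from binary predictions and binary ground truth."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    return {
        "Acc": (tp + tn) / len(target),
        "Se": tp / (tp + fn) if tp + fn else 0.0,   # true positive rate
        "Sp": tn / (tn + fp) if tn + fp else 0.0,   # true negative rate
        "F1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,
    }

score = psnr([0.5, 0.5], [0.4, 0.6])  # MSE = 0.01, so about 20 dB
metrics = segmentation_metrics(
    pred=[1, 1, 0, 0, 1, 0, 0, 0],
    target=[1, 0, 0, 0, 1, 1, 0, 0],
)
```

In the toy example there are 2 true positives, 4 true negatives, 1 false positive, and 1 false negative, giving Acc = 0.75, Se = 2/3, Sp = 0.8, and F1 = 2/3.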

Implementation details
The proposed method was implemented using PyTorch and trained on an NVIDIA RTX A30 GPU. For training and testing, we resized all input images to 512 × 512. During the training phase, data augmentation was applied to the dataset, including random horizontal flipping, vertical flipping, random cropping, and random rotation. The model was trained using the Adam optimizer with an initial learning rate of $1 \times 10^{-5}$. We used 1000 timesteps (T). Training was performed for 150 epochs. We evaluated overfitting by conducting 5-fold cross-validation. In the experiments, we set the hyperparameter λ = 0.5. The detailed hyperparameter settings can be found in Section 4.7.

In the segmentation comparison, our method recognizes more vessel regions than U-Net++ and CBAM-UNet. Additionally, when compared to CS-Net and SCS-Net, ESDiff enhances the segmentation integrity of retinal vessels. However, DE-DCGCN-EE surpasses our method in the clarity of thin vessels, preserving intricate details and accurately capturing thin, multi-scale, and irregularly curved vessels. We attribute this gap to the fact that we used low-quality images as input for our framework, whereas the other methods processed images acquired directly in high quality. Although the first step of our method involves image enhancement, there are still disparities between the enhanced results and images captured under ideal conditions, leading to potential limitations, especially in capturing extremely thin blood vessels. Overall, our ESDiff method provides clear segmentation not only for thick vessels but also for thin vessels. Furthermore, our method exhibits fewer false positive vessels, resulting in less noisy and clearer segmentations.

We conducted further comparisons to demonstrate the superiority of our method over SCR-Net and DASQE in retinal image enhancement. DASQE and SCR-Net have little effect on regions with uneven illumination. While SCR-Net enhances the brightness of retinal images, some regions with low illumination remain in the results. Such results from DASQE and SCR-Net may lead to clinical misdiagnosis. In contrast, our proposed ESDiff can effectively correct uneven illumination, preserving the details and natural appearance of retinal images.

Quantitative analysis
We conducted quantitative comparisons between our proposed method and other deep learning-based approaches. The results of the image enhancement experiment are displayed in Table 1, with the best outcomes highlighted in bold. The quality of fundus images degrades due to various diseases and lighting conditions, presenting challenges for accurate segmentation and diagnosis. As shown in Table 1, CycleGAN exhibits lower metric values in the image enhancement task compared to the other seven models. On the EyeQ dataset, our ESDiff method outperforms all the compared methods in terms of PSNR and SSIM, with values of 19.27 and 0.864, respectively. Our method's PSNR is 0.50 higher than that of OTT-GAN, and its SSIM is 0.086 higher than that of DASQE.

Table 2 presents the quantitative results comparing the segmentation performance of the original UNet, CBAM-UNet, CS-Net, U-Net++, SCS-Net, DE-DCGCN-EE, and our proposed model on the DRIVE, STARE, and CHASE_DB1 datasets. The table demonstrates that our proposed ESDiff achieves the highest Se and Acc values among the compared methods on the DRIVE dataset. Here, Se represents the proportion of pixels correctly identified as blood vessels within the actual vascular region. Although ESDiff's F1-score is 0.10% lower than that of DE-DCGCN-EE, the difference remains within an acceptable range. On the STARE dataset, our method attains the highest Acc and F1-score among these methods, with values of 97.77% and 84.65%, respectively. On the CHASE_DB1 dataset, the Se of SCS-Net is 0.26% lower than that of our method. ESDiff also achieves the best Sp and Acc values, reaching 98.66% and 97.97%, respectively. The improvement in other metrics, such as F1-score, further demonstrates the better performance of our proposed ESDiff in retinal blood vessel segmentation.

Ablation study
We conducted ablation studies on the EyeQ dataset to evaluate the effectiveness of the designed modules within the ESDiff framework for image enhancement. Additionally, we evaluated the effectiveness of vessel segmentation through ablation studies on the DRIVE, STARE, and CHASE_DB1 datasets.
Since blood vessel segmentation and image enhancement are closely related, higher-quality image enhancement leads to more accurate segmentation. The degradation model preserves most of the important features from the original fundus images, and the degradation factors of low-quality images are determined. These degradation factors are incorporated into the iterative process to guide the refinement of the intermediate output at each time step t. After integrating the degradation model, the enhanced images highlight the vascular structure more clearly, the degraded regions become clearer, and more thin vessels can be segmented. The qualitative comparisons of the ablation studies of the degradation model are shown in Fig. 5, where magnified details of specific blood vessels are provided. As indicated in Table 3, on the EyeQ dataset, the image enhancement metrics PSNR and SSIM increased by 1.62 and 0.103, respectively. These improvements demonstrate the effectiveness of the degradation model in enhancing image quality. Table 4 presents the results of ablation experiments on the DRIVE, STARE, and CHASE_DB1 datasets. Integrating the degradation model into the network improved the vessel segmentation metrics on the DRIVE dataset: Acc increased to 0.970, Se to 0.858, Sp to 0.985, AUC to 0.979, and F1-score to 0.828. On the STARE and CHASE_DB1 datasets, the addition of the degradation model also yielded favorable results in various evaluation metrics. In conclusion, the inclusion of the degradation model effectively enhances the quality of retinal fundus images and improves the accuracy of vessel segmentation.

As shown in Table 5, when the model trained on the STARE dataset is tested on the DRIVE dataset, our proposed ESDiff achieves the best Se, AUC, and F1 values, at 79.53%, 97.71%, and 80.34%, respectively. Specifically, compared to the other six methods, ESDiff obtains the highest Se and improves it by 0.71% over DE-DCGCN-EE. Because thin blood vessels are scarce in the ground truth of the STARE dataset, models trained on it do not segment thin blood vessels effectively when tested on the DRIVE dataset; ESDiff alleviates this issue. When the model trained on the DRIVE dataset is tested on the STARE dataset, ESDiff obtains the best Sp and Se values, at 98.56% and 80.41%, respectively; in particular, Sp is improved by 0.16% compared with CBAM-UNet. However, the Acc of our method is 1.54% lower than that of DE-DCGCN-EE. The AUC of our model on the STARE dataset is 98.10% and the F1-score is 80.11%, demonstrating the satisfactory overall segmentation performance of our method.

Hyperparameter setting
To investigate the setting of the hyperparameter λ in the loss function (11), we trained our model with different values of λ. Table 6 and Table 7 present the quantitative evaluation results on the validation dataset for each hyperparameter setting.
During the training phase, we experimented with three different values for λ: λ = 0.8, λ = 0.5, and λ = 0.3. For retinal image enhancement, the model performs best with λ = 0.5. For retinal vessel segmentation, our model achieves satisfactory results compared to other state-of-the-art methods when λ = 0.5. Considering these results, we report the performance of our model trained with λ = 0.5.

Discussion
In this manuscript, we propose a joint model based on the diffusion model (ESDiff) for retinal image enhancement and vessel segmentation tasks. Uneven illumination, blurriness, low contrast, and irregularly curved vessels are evident challenges in retinal vessel segmentation, making it difficult to accurately identify and segment retinal vessel pixels. Moreover, vessel segmentation models trained on low-quality retinal fundus datasets often fail to achieve ideal results. To perform fundus image enhancement and blood vessel segmentation simultaneously, we propose VMDM to pursue high-quality images, with vessel segmentation as an auxiliary task. A mask prediction head placed after the last layer of the denoiser predicts the refined mask. According to the evaluation metrics in Table 1 and the visualization results, the ESDiff model has better image enhancement ability than other state-of-the-art methods. Additionally, for low-quality fundus images characterized by uneven illumination and blur, we propose DM to extract degradation factors. This is achieved through a modified UNet, and the extracted factors are then embedded into the diffusion model. Some state-of-the-art methods [29,31,35-39] have focused on preserving the structural information of the retinal fundus during image enhancement by improving the understanding of retinal fundus images. As shown in Table 3, ablation experiments verify that the addition of DM improves fundus image enhancement.
To evaluate the segmentation performance, we conducted extensive experiments on three datasets. The results in Table 2 show that our proposed method performs better overall, with metrics mostly higher than those of other methods. Although there are joint models that have accomplished both fundus enhancement and vessel segmentation [8], our method outperforms them on the evaluation metrics for both tasks. Furthermore, as shown in Fig. 4, ESDiff excels in preserving intricate details and capturing thin and irregularly curved vessels when compared to existing vessel segmentation methods [41][42][43]. This suggests that our proposed method can assist in identifying retinal vessel abnormalities, offering valuable support to healthcare professionals in disease diagnosis and potentially reducing the workload of clinical experts.
Although ESDiff performs well on most low-quality fundus images, there is still room for improvement, especially in the case of extremely thin blood vessels. As shown in Table 2, DE-DCGCN-EE [44] exhibits higher metrics than our method, and it achieves more complete segmentation of thin blood vessel details across the three publicly available datasets. We attribute this gap to the fact that our framework takes low-quality images as input, whereas the other methods process images that were acquired at high quality. Although the first step of our method is image enhancement, the enhanced results still differ from images captured under ideal conditions, which limits performance, especially in capturing extremely thin blood vessels. In summary, our proposed joint model, based on the diffusion model, can output high-quality images and the corresponding vessel segmentation results from low-quality input images.

Conclusion
In this manuscript, we propose a framework based on the diffusion model, called ESDiff, to jointly pursue high-quality image prediction and vessel mask refinement. To enhance low-quality images and segment vessels, we introduce the Vessel Mask Diffusion Model (VMDM) and treat mask refinement as an auxiliary task of the diffusion generator. Additionally, we propose a degradation model (DM) that utilizes a modified UNet to obtain degradation factors and preserve feature details, which are then embedded into the diffusion model. At each time step t, the intermediate results of ESDiff are iteratively refined to eliminate degradation factors and generate more accurate retinal images. Finally, through comparison and ablation studies of image enhancement and vessel segmentation, we demonstrate that our proposed method outperforms other approaches in addressing the challenges of uneven illumination and blurring. The method is effective for segmenting blood vessels in low-quality retinal images and is of significant value for disease diagnosis.

Fig. 1. Degraded fundus images. From top to bottom and left to right: original high-quality image, followed by unevenly illuminated, blurred, and artificial fundus images.

Fig. 2. Overall framework of ESDiff. x represents a high-quality image, y represents a low-quality image, and ṽ represents the vessel mask corresponding to the original low-quality image, obtained by matched filtering.

Figure 4 displays the original retinal blood vessel image and the results of our method along with six other competing methods: UNet, CBAM-UNet, CS-Net, U-Net++, SCS-Net, and DE-DCGCN-EE. Partially enlarged details of each image are included. The figure shows that UNet fails to resolve microvessels and crucial intersecting vessels. In comparison, our method recognizes more vessel regions than U-Net++ and CBAM-UNet. Additionally, when compared to CS-Net and SCS-Net, ESDiff improves the segmentation integrity of retinal vessels. However, DE-DCGCN-EE surpasses our method in the clarity of thin vessels, preserving intricate details and accurately capturing thin, multi-scale, and irregularly curved vessels. We attribute this gap to the fact that our framework takes low-quality images as input, whereas the other methods process images that were acquired at high quality. Although the first step of our method is image enhancement, the enhanced results still differ from images captured under ideal conditions, which limits performance, especially in capturing extremely thin blood vessels. Overall, our ESDiff method provides clear segmentation not only for thick vessels but also for thin vessels. Furthermore, our method produces fewer false positive vessels, resulting in less noisy and clearer segmentations.
We conducted further comparisons to demonstrate the superiority of our method over SCR-Net and DASQE in retinal image enhancement. DASQE and SCR-Net have little effect on regions of uneven illumination. While the SCR-Net method enhances the

Fig. 4. Visual comparison of our proposed ESDiff with six other state-of-the-art methods for vessel segmentation on the DRIVE, STARE, and CHASE_DB1 datasets.

Fig. 5. Comparison of ablation studies for the degradation model.