Controllable editing via diffusion inversion on ultra-widefield fluorescein angiography for the comprehensive analysis of diabetic retinopathy

A comprehensive understanding of the progression of diabetic retinopathy (DR) can be achieved by combining multiple indicators that support clinical decision making and effective disease management. However, the diversity of DR complications makes it challenging to automatically analyze the different kinds of information within an image. This study aims to establish a deep learning system that examines multiple DR-related metrics in ultra-widefield fluorescein angiography (UWFA) images. We have developed a unified model based on image generation that transforms input images into corresponding disease-free versions. Because training relies only on image-level supervision, the model substantially reduces the manual effort required for clinical application. Furthermore, the quality of our generated images is significantly higher than that of competing methods.


Introduction
The exponential increase in the number of diabetic patients over the past few decades poses a crucial global health threat [1]. The latest International Diabetes Federation edition (10th edition, 2022) indicates that 1 in 10 people are living with diabetes [2,3]. Diabetic retinopathy (DR) is a common complication of diabetes and one of the leading causes of irreversible blindness. Early detection and intervention are therefore crucial to improving quality of life. The effect of DR on the retina is clearly evident in retinography results. Fluorescein angiography (FA) is considered the gold standard for evaluating vascular changes that occur in various stages of DR [4][5][6]. The Early Treatment Diabetic Retinopathy Study (ETDRS) recommends classifying DR into three categories: no apparent retinopathy (normal), non-proliferative diabetic retinopathy (NPDR), and proliferative diabetic retinopathy (PDR). The NPDR group is further divided into mild, moderate, and severe NPDR [7][8][9]. The ETDRS introduced a reliable stereoscopic method based on 7 standard fields (7-SF) for the detection and classification of DR. However, capturing 7-SF images requires skilled photographers and pharmacological pupil dilation, which is burdensome for both institutions and patients. Ultra-widefield fluorescein angiography (UWFA) has a wider imaging range and has demonstrated strong clinical potential in identifying a subset of patients with increased risk of progression in DR severity [10][11][12][13]. Figure 1 shows the appearance of three severity levels of DR on UWFA together with 7-SF montages. In the NPDR case, bright microaneurysms are widely distributed, and 7-SF cannot completely capture this characteristic. In the PDR case, large ischemic areas (non-perfusion areas) are located near the right side, while 7-SF only captures the central perspective.
Early analytical studies relied on the manual division of lesion areas in UWFA [14][15][16][17]. However, these methods require a significant number of qualified ophthalmologists. Automated methods that detect lesions in medical images accurately and efficiently are therefore essential for prompt treatment. Deep learning techniques have shown strong capabilities in ophthalmology and achieved robust performance in lesion recognition in UWFA [18][19][20][21]. Although deep learning methods can automate the analysis of medical images accurately and efficiently, they rely heavily on large datasets with pixel-level annotations. In addition, each of these models can only be applied to particular biomarkers.
Recently, artificial intelligence generated content (AIGC) has attracted widespread attention due to its promising performance. Diffusion probabilistic models (DPMs) have shown impressive promise in image generation [22]. DPMs are characterized by strong training stability, high diversity of latent spaces, and high quality of generated images. Text-guided image generation models can produce images that align with the content described in the text [23][24][25]. Inversion-based image editing methods can generate new image content based on directed modifications of prompts [26][27][28]. However, the characteristics of medical images are often not comprehensible to foundation models, necessitating the design of a dedicated deep learning model for the evaluation of UWFA images.
To mitigate the aforementioned limitations, we propose a text-controlled deep learning model to automatically evaluate multiple factors related to DR. The primary contributions of our algorithm are summarized as follows: (1) We propose a unified model capable of multi-faceted evaluation of DR in UWFA, including classification, biomarker localization, and index estimation. (2) We introduce a two-stage stable diffusion tuning method that considers both image and text information simultaneously. (3) Using text-guided diffusion inversion techniques, we create controllable edits of UWFA images, generating disease-free images for DR analysis. (4) Compared with state-of-the-art generative methods, the proposed method achieves superior results.

Methods
In this section, we introduce the data acquisition standards and the proposed generation-based algorithm. The algorithm comprises the tuning of stable diffusion using image-level category information and the execution of controllable edits on input images based on the tuned model.

Acquisition and grading
A retrospective chart review was performed for all patients who underwent Optos UWFA (Optos 200Tx, Dunfermline, Scotland, United Kingdom) following requests for examination due to diabetes. More than 5000 DR patients with UWFA examinations between 2015 and 2020 from the eye center of Renmin Hospital of Wuhan University (Wuhan, China) were retrospectively reviewed. For each UWFA case, one image from the early phase (30 seconds to 1 minute) and one from the late phase (5 to 7 minutes) were selected. Institutional review board approval was obtained from the Ethics Committee of Renmin Hospital of Wuhan University for the analysis of anonymized data, and the study was conducted in accordance with the principles of the Declaration of Helsinki.
DR grading follows the ETDRS standard, which distinguishes Normal (no apparent retinopathy), NPDR, and PDR. Exclusion criteria were as follows: eyes with refractive media opacity that interfered with the quality of the peripheral retinal image, preretinal hemorrhage that obscured fluorescence, and treated eyes. Eyes were also excluded if the quality of their UWFA images prevented a reliable evaluation.

Proposed tuning algorithms
We construct an algorithm for tuning the generative model, and the generated results are used for a comprehensive evaluation of the DR indicators. As shown in Fig. 2, our proposed method, which is built on the backbone of stable diffusion [23], comprises two key components: a pair of tuned condition encoders and a tuned latent diffusion model. In contrast to the original stable diffusion, which only uses CLIP [29] text encoding as a guiding condition, and to previous image editing methods, our approach uses a fusion of image and text features as the guiding condition, thereby making fuller use of both the image content and the semantic information. In addition, we tune the image encoder during the training stage, eliminating the need to retrain image-specific parameters during the inference stage. This results in a fast and efficient image editing method.
Specifically, we encode the eye side ("left" and "right"), the imaging period ("early" and "late"), and five categories: three basic gradings ("normal", "npdr", and "pdr"), one category to determine the presence of DR ("dr"), and another to determine whether the disease is proliferative ("np"). More detailed texts are listed in Fig. 2(a). Given the relevance between the content of UWFA images and the nine aforementioned texts, we consider all the texts as inputs. Using InfoNCE [30] as the label-level contrastive loss, we match the image with the texts consistent with its labels. In the standard InfoNCE form, this loss can be written as:

$$\mathcal{L}_{\text{label}} = -\sum_{i} \log \frac{\exp(z \cdot t_i^{+} / \tau)}{\sum_{j} \exp(z \cdot t_j / \tau)},$$

where $z$ represents an image feature yielded by the image encoder, $t_j$ and $t_i^{+}$ represent text features and matched text features yielded by the text encoder, and $\tau$ is a temperature parameter that controls the sharpness of the distribution. Furthermore, to ensure that the features extracted by the image encoder retain crucial information related to image content but independent of classification, we have designed an instance-level contrastive loss of the same form:

$$\mathcal{L}_{\text{instance}} = -\log \frac{\exp(z \cdot z^{+} / \tau)}{\sum_{j} \exp(z \cdot z_j / \tau)},$$

where $z^{+}$ represents the image feature that originates from the same patient as $z$ (one of the two comes from the early phase and the other from the late phase), and $z_j$ represents the image features yielded by the image encoder. We employ prompt tuning [31] to train the CLIP image and text encoders. In each self-attention layer, we freeze the original model parameters and introduce a set of trainable parameters as prompts for parameter-efficient training.

Fig. 2 note: "NALL" indicates that there is no category filled in the text model.
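The contrastive objective described here can be illustrated with a minimal sketch of an InfoNCE-style loss in plain Python. The function name `info_nce`, the use of a dot product as the similarity, the averaging over positives, and the temperature value 0.07 are assumptions made for illustration and are not taken from the paper.

```python
import math


def info_nce(anchor, positives, candidates, tau=0.07):
    """Mean InfoNCE loss of `anchor` against each positive.

    `candidates` must contain the positives as well as the negatives.
    Similarity is a plain dot product (an assumption for this sketch).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(anchor, c) / tau for c in candidates]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    losses = [log_denominator - dot(anchor, p) / tau for p in positives]
    return sum(losses) / len(losses)


# Toy example: an image feature and three text features.
image_feature = [1.0, 0.0]
text_features = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
loss_matched = info_nce(image_feature, [text_features[0]], text_features)
loss_mismatched = info_nce(image_feature, [text_features[1]], text_features)
```

The loss is small when the labeled text is the most similar candidate and grows when a mismatched text is treated as the positive. The same function applies to the instance-level case by passing image features of the same patient as positives.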
For the latent diffusion, we utilize a fusion of text features and image features as guiding conditions. As shown in Fig. 2(b), the image and text are processed through the pre-trained CLIP image encoder and text encoder to obtain the corresponding features. Subsequently, two trainable mapping networks independently align the image and text features to the same dimension. In this work, we utilize a 2-layer multi-layer perceptron (MLP) as the mapping network. We employ a cross-attention mechanism to fuse the mapped features, and the fused features maintain the same shape as the condition features in the original stable diffusion, which only uses a text encoder. Specifically, we generate the query q using the mapped text features and establish the key k and value v using the mapped image features. The cross-attention fusion process can be represented as:

$$z_{\text{fusion}} = \mathrm{Softmax}\left(\frac{q k^{\top}}{\sqrt{d}}\right) v,$$

where $z_{\text{fusion}}$ represents the fused features, Softmax(·) represents the Softmax function, and d represents the dimension of the features. During the training phase, we randomly select texts that match the images from the nine templates of the text encoder and one text without category information (i.e., "a ffa of a fundus") to serve as the text conditions. In this study, we employ low-rank adaptation (LoRA) [32] to tune the cross-attention layers of the diffusion model, enabling it to generate corresponding UWFA images under the proposed condition encoding scheme.
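As an illustration of the fusion step, the following is a minimal single-head sketch of scaled dot-product cross-attention in plain Python, with queries taken from text features and keys and values from image features. The function names and the toy inputs are hypothetical, and learned projection matrices are omitted for brevity; it is not the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(q, k, v):
    """Softmax(q k^T / sqrt(d)) v for q: n_q x d, k: n_k x d, v: n_k x d_v."""
    d = len(q[0])
    fused = []
    for query in q:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d) for key in k]
        weights = softmax(scores)
        fused.append([
            sum(weights[j] * v[j][c] for j in range(len(v)))
            for c in range(len(v[0]))
        ])
    return fused


# Toy example: three text tokens attend over two image tokens.
text_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_attention(text_tokens, image_tokens, image_tokens)
```

Because the queries come from the text, the output has one row per text token, which is why the fused condition can keep the shape that a text-only condition would have.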

Controllable image editing
During the inference phase, we first transform the image to be edited, together with the text without category information, into a noise image via forward diffusion. Then, using a text containing the target category as the condition, we use diffusion inversion to generate the edited image in a controllable manner. As shown in Fig. 2(b), the difference between the edited image and the input image illustrates the distinctions between the controlled category and the category of the input image. In this study, we set the controlled category to "normal", thus generating images without disease. From the difference relative to the input image, we can qualitatively locate biomarkers and quantitatively evaluate the DR-associated indexes. Since the guiding features obtained by the proposed method encapsulate both rich image content information and text semantic information, the edited images adhere to both the original image structure and the new category characteristics, ensuring that category-independent areas remain as unchanged as possible. Previous text-guided methods often cause significant structural changes to images through diffusion inversion, resulting in numerous areas of interference in the difference maps. By introducing image features into the guiding conditions, we can not only edit the image category more accurately but also compute more precise indexes from the difference maps.

Implementation details
We employ SDv1-5 as the pre-trained model for stable diffusion [23] with a resolution of 512 × 512, where the text encoder is CLIP-large-patch14 [29]. The images and texts are encoded into 1024-dimensional and 768-dimensional features respectively, and both are projected to 768 dimensions for the computation of the contrastive learning loss. We tune the image and text encoders of CLIP using prompts with lengths of 64 and 16 respectively. For the fusion module, we employ two MLPs as the mapping layers, each consisting of two linear layers and one ReLU activation layer, aligning the feature dimension to 768. We train CLIP for 500 epochs with a learning rate of 0.0005 and a batch size of 8. We train stable diffusion for 20000 steps with a learning rate of 0.0001 and a batch size of 1. The dataset is divided into five folds for cross-validation experiments, with the split performed at the individual level so that no individual appears in more than one fold.
All experiments were run on an Intel Xeon CPU E5-1650 v4 @ 3.60GHz and an NVIDIA GeForce RTX 3090 24GB. Model development was performed with Ubuntu 20.04, Python 3.11, torch 2.0.0, and diffusers 0.22.0 from https://huggingface.co.

Results
In this section, we present the results of our DR analysis system based on the proposed method. We generate disease-free edited images and distinguish their categories. We qualitatively identify abnormal areas and quantitatively assess auxiliary diagnoses, in addition to evaluating the correlations between ischemia and leakage relative to the severity of DR.

Dataset
In this retrospective examination, 280 DR cases (171 NPDR, 109 PDR) from 194 treatment-naive patients (110 NPDR, 84 PDR) were included. Additionally, 119 normal cases without any fundus disease, from 103 individuals, were included for evaluation purposes. In total, 399 cases from 297 individuals were used for model training and testing. The two eye sides were evenly represented, with 205 cases of OD (51.4%) and 194 cases of OS (48.6%). More detailed demographic data are provided in Table 1.

Classification using CLIP
In this study, we encode the category information as texts for CLIP, thereby enabling classification through the comparison of similarity between image features and text features. For the imaging information, the classification accuracies for eye side and imaging period are 97.49% and 100.00% respectively. The brightness of the fluorescence and the exposure intensity might make it challenging to distinguish between early and late periods in some cases. However, the model can accurately identify the left and right eyes, indicating that it has high discriminatory power for basic imaging information. We evaluate the classification accuracy for one three-class grading and two binary gradings. The three-class grading differentiates between normal, NPDR, and PDR, achieving a classification accuracy of 93.98%. We further assess the binary classifications for the presence of DR and for whether it is proliferative, which reach accuracies of 99.25% and 95.24% respectively. The onset of DR is typically accompanied by the emergence of biomarkers, making it relatively easy to identify. However, the distinctions in image manifestation between NPDR and PDR are more subtle, and as the disease becomes more severe there are more pathological areas, which makes classification more difficult. Nevertheless, the model still achieves recognition rates above 90%.
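Classification by comparing image and text features can be sketched as follows: the predicted label is the one whose text feature is most similar to the image feature. Cosine similarity, the function names, and the toy three-dimensional features are assumptions for illustration; real CLIP features would come from the tuned encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(image_feature, text_features):
    """Return the label whose text feature best matches the image feature.

    `text_features` maps a label (e.g. "normal") to its feature vector.
    """
    return max(text_features, key=lambda label: cosine(image_feature, text_features[label]))


# Toy features standing in for encoded grading texts.
grading_texts = {
    "normal": [1.0, 0.0, 0.0],
    "npdr": [0.0, 1.0, 0.0],
    "pdr": [0.0, 0.0, 1.0],
}
predicted = classify([0.1, 0.9, 0.2], grading_texts)
```

Restricting the dictionary to a subset of labels (for example only the two eye sides, or only the two imaging periods) gives the corresponding sub-task with the same function.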

Disease-free image editing
The deep learning model predicts disease-free images and diagnoses. The edited images resemble normal angiography and preserve the vascular structure of the authentic image, which means that lesion biomarkers are replaced by normal-appearing tissue. We report the results of our method and two state-of-the-art image editing methods: a CycleGAN-based image translation method [33,34] and a diffusion-based method named DDIM inversion [35].
For qualitative analysis, Fig. 3 illustrates the edited images generated by models trained on the UWFA dataset. The CycleGAN-based method, owing to the cycle consistency loss used during training, ensures that the structure of the generated images aligns with the real images. However, GAN-based methods suffer from inadequate output diversity, leading to local repetition in the generated images and potential continuity problems. The method based on diffusion inversion can generate high-quality images that align with controllable text conditions. However, because of the absence of image-level constraints, and despite using the noise obtained by diffusing the real image as the initial state, minor changes occur in the structure of the generated images (such as the number of vessels and the location of the optic disc) compared with the real images. Additionally, the brightness consistency of the generated images is not as good as that of other methods. These shortcomings interfere with further quantitative analysis. The differences between methods become more evident in cases with more severe lesions. In Fig. 3(c), CycleGAN, owing to its insufficient generative capability, misidentifies the leakage area as the optic disc, while the lack of image content information in DDIM inversion results in a significant change in the brightness of the generated images. In Fig. 3(d), insufficient global guidance prevents CycleGAN from successfully converting the bleeding area, while the lack of image content information in DDIM inversion causes changes to the vasculature layout. In contrast, our method repairs the lesion area as accurately as possible in both cases while keeping the remaining content stable.
For quantitative analysis, we use three metrics to measure the quality of the generated images from different perspectives: structural similarity (SSIM), peak signal-to-noise ratio (PSNR), and normalized mean squared error (MSE). SSIM measures structural similarity, PSNR measures pixel-level similarity, and MSE measures the residual error between images. Table 2 presents a comparison of the results from the three models. To achieve the desired generative effects, the default output resolution is set to 256 × 256 for the CycleGAN-based method and to 512 × 512 for the diffusion-based methods. For a fair comparison, we provide scores at both resolutions. Although CycleGAN employs the cycle consistency loss to ensure local structural alignment, it yields the lowest SSIM because of texture repetition and structural recognition errors, which result from its inferior generative capability. DDIM inversion can generate text-guided images with layouts similar to the real images. However, the content of the generated results does not strictly align with the real images, leading to worse PSNR and MSE scores. Since our method takes into account both image and text features, the generated images maintain a good balance between fidelity to the real image and editability. This makes our results more suitable for the requirements of comprehensive DR analysis than those of the other methods.
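For readers who want to reproduce this kind of comparison, the three metrics can be sketched in plain Python for images stored as nested lists with intensities in [0, 1]. This is a simplified illustration under stated assumptions: the SSIM below is computed from global image statistics with the commonly used constants (0.01 and 0.03 times the peak value), whereas the usual SSIM averages over local windows, and the paper does not specify its exact normalization for MSE.

```python
import math


def _flatten(image):
    return [pixel for row in image for pixel in row]


def mse(a, b):
    """Mean squared error between two equally sized images."""
    x, y = _flatten(a), _flatten(b)
    return sum((p - q) ** 2 for p, q in zip(x, y)) / len(x)


def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    error = mse(a, b)
    return math.inf if error == 0 else 10 * math.log10(peak ** 2 / error)


def global_ssim(a, b, peak=1.0):
    """SSIM from whole-image statistics (a simplification of windowed SSIM)."""
    x, y = _flatten(a), _flatten(b)
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((p - mean_x) ** 2 for p in x) / n
    var_y = sum((q - mean_y) ** 2 for q in y) / n
    cov = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y)) / n
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    return ((2 * mean_x * mean_y + c1) * (2 * cov + c2)) / (
        (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    )


# Toy 2 x 2 images: the edited image differs in a single pixel.
real = [[0.0, 0.5], [0.5, 1.0]]
edited = [[0.0, 0.5], [0.5, 0.9]]
scores = (global_ssim(real, edited), psnr(real, edited), mse(real, edited))
```

Higher SSIM and PSNR and lower MSE indicate that the edited image stays closer to the real one.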

Biomarkers localization
Figure 4 shows three representative cases of normal, NPDR, and PDR with early and late periods. Vascular leakage appears as bright spots of varying sizes in UWFA. The late angiogram shows greater accumulation of the contrast agent, so vascular leakage stands out more clearly against other tissues than in the early angiogram. On the other hand, ischemic areas (non-perfusion areas) appear as patches of low brightness compared with normal tissues. The ischemic areas, which appear as regions lacking blood flow, are typically bordered by adjacent vessels. In early angiography, the diffusion of the contrast agent has little effect on the clarity of blood vessels. Therefore, the boundary of ischemic areas is more accurately defined than in late angiography. The difference between the real image and the generated image offers valuable guidance for identifying lesions. An automated algorithm based on thresholding and local brightness is used to extract brighter areas indicating vascular leakage and darker areas indicating ischemic areas. Given a set of difference results for early and late angiograms, we suggest qualitatively assessing biomarker localization through vascular leakage in late images and ischemic areas in early images.
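The idea of reading lesions off the difference map can be sketched as follows. This simplified example applies a single global threshold to the signed difference between the real and the edited image; the threshold value 0.2 and the function name are hypothetical, and the local-brightness component of the algorithm described above is not modeled.

```python
def localize(real, edited, threshold=0.2):
    """Return (leakage_mask, ischemia_mask) as nested lists of 0/1.

    Pixels much brighter in the real image than in the disease-free edit are
    marked as leakage; pixels much darker are marked as ischemia.
    """
    leakage, ischemia = [], []
    for real_row, edited_row in zip(real, edited):
        diff_row = [r - e for r, e in zip(real_row, edited_row)]
        leakage.append([1 if d > threshold else 0 for d in diff_row])
        ischemia.append([1 if d < -threshold else 0 for d in diff_row])
    return leakage, ischemia


# Toy 2 x 2 example: one bright lesion pixel and one dark lesion pixel.
real_image = [[0.9, 0.5], [0.1, 0.5]]
edited_image = [[0.5, 0.5], [0.5, 0.5]]
leakage_mask, ischemia_mask = localize(real_image, edited_image)
```

In line with the text, the leakage mask would typically be taken from the late-phase pair and the ischemia mask from the early-phase pair.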

Related indexes
The ischemic index and the leakage index are defined as the percentage of the total retina that is composed of ischemic areas and areas of vascular leakage, respectively [36]. For each image, the square of the distance between the fovea and the optic disc serves as a biological standard for mapping image size to real retinal size. Based on the biomarker localization masks, the ratios of abnormal areas are directly proportional to the two indexes mentioned above. In this paper, we use these ratios as equivalents of the ischemic index and the leakage index. Figure 5(a) and Fig. 5(b) show the box-and-whisker plots of the ischemic index and leakage index, respectively. The independent-samples t-test indicates that both the ischemic index (P < 0.0001) and the leakage index (P < 0.0001) are significantly associated with the severity of DR.
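As a small worked example of the ratio used in place of the indexes, the sketch below computes the percentage of retinal pixels covered by a lesion mask. The function name and the explicit retina mask are assumptions for illustration; the paper's scaling by the fovea to optic disc distance is not modeled here.

```python
def area_index(lesion_mask, retina_mask):
    """Percentage of retinal pixels that are covered by the lesion mask.

    Both masks are nested lists of 0/1 with identical shape.
    """
    retina_pixels = sum(sum(row) for row in retina_mask)
    if retina_pixels == 0:
        raise ValueError("retina mask is empty")
    lesion_pixels = sum(
        l * r
        for lesion_row, retina_row in zip(lesion_mask, retina_mask)
        for l, r in zip(lesion_row, retina_row)
    )
    return 100.0 * lesion_pixels / retina_pixels


# Toy example: one lesion pixel within a four-pixel retina.
ischemic_index = area_index([[1, 0], [0, 0]], [[1, 1], [1, 1]])
```

Lesion pixels that fall outside the retina mask are ignored, so the value stays between 0 and 100.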

Fig. 5. The statistical analysis results of ischemic index and leakage index. As the severity of DR increases, both (a) ischemic index and (b) leakage index also tend to increase, as indicated by higher median, upper, and lower quantiles for both indexes. (c) The first ring, known as the macular ring, is centered on the fovea and extends from the macula to the optic disc in radius. Each successive ring is at an equal distance from the center. The area was measured for each of the biomarker sites within the five rings. (d) Analysis of dimensionality reduction of regionally graded indexes. There are discernible differences in the spatial distribution between individuals without DR and those with DR (both NPDR and PDR).

Multivariate analysis
We use a concentric ring template to measure the topographic areas of UWFA images [37]. An example is shown in Fig. 5(c), and the regional multifactor analysis, using t-SNE dimensionality reduction, is presented in Fig. 5(d). We performed a linear multi-factor analysis on the reduced-dimensional features. The accuracy of the linear hyperplane in identifying individuals with and without DR after dimensionality reduction is 98.45%, which is comparable to the accuracy of the neural network for image recognition. Normal samples are typically distributed within a small range. On average, the distribution of PDR samples is farther from the normal samples than that of the NPDR samples.
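The regional measurement can be illustrated with a minimal sketch that counts lesion pixels in concentric rings around the fovea. It assumes equally wide rings whose width equals the fovea to optic disc distance and a default of five rings; the function name, the pixel-coordinate convention, and the toy values are hypothetical.

```python
import math


def ring_areas(mask, fovea, ring_width, n_rings=5):
    """Count lesion pixels of `mask` in each concentric ring around `fovea`.

    `fovea` is a (row, col) pixel position; ring i covers distances in
    [i * ring_width, (i + 1) * ring_width). Pixels beyond the last ring are ignored.
    """
    counts = [0] * n_rings
    for r, row in enumerate(mask):
        for c, value in enumerate(row):
            if not value:
                continue
            distance = math.hypot(r - fovea[0], c - fovea[1])
            ring = int(distance // ring_width)
            if ring < n_rings:
                counts[ring] += 1
    return counts


# Toy 5 x 5 mask with three lesion pixels and the fovea at the center.
toy_mask = [
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
per_ring = ring_areas(toy_mask, fovea=(2, 2), ring_width=1.5, n_rings=3)
```

The resulting per-ring counts, computed separately for ischemia and leakage masks, form the kind of regional feature vector that a dimensionality reduction such as t-SNE could then take as input.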

Discussions
This paper presents a method for controllable editing of UWFA images for comprehensive DR analysis, based on diffusion inversion. In this section, we further discuss the clinical advantages of UWFA images in DR analysis, visualize more edited results, highlight the benefits of the unified model, and summarize the conclusions of a comprehensive DR analysis. Additionally, we discuss several limitations of this study and outline future research directions.

Comparison of visual range
DR is a complication of diabetes, characterized by lesions in numerous retinal regions. FA is a crucial tool for assessing DR [12]. While conventional FA provides a 30-degree field of view that only covers 5% of the retina, the ETDRS introduced the 7-SF system, expanding the view to nearly 75 degrees and covering around 34% of the retinal surface [13,38]. Increasingly, studies have emphasized the need to assess the peripheral retina for DR severity and progression risk [39][40][41]. Providing a 200-degree viewing angle that covers up to 82% of the retina, UWFA has emerged as a promising technology that enables complete evaluation of both the posterior pole and the peripheral retina in one image. Figure 6 illustrates a comparison of the visual range of ETDRS and UWFA by overlapping.
Research has shown significant vascular changes in the peripheral retina using UWFA [42,43]. The diagnosis and management of DR are largely dependent on the detection and assessment of lesions, with a reported increase in the severity grading of DR due to additional peripheral lesions in more than 10% of cases [13,44]. Ischemic areas are more prevalent in peripheral regions (three to ten times larger than in the posterior pole retina) and correlate closely with DR severity [45,46]. Vascular leakage also relates to diabetic macular edema (DME) severity, with varying associations found in different retinal regions [47,48]. Recent studies combining ischemic and leakage indices suggest a positive correlation between DR severity and peripheral retinal lesions [49,50]. Additional findings highlight that lesions such as hemorrhages, microaneurysms, and neovascularization beyond 7-SF are frequently associated with an increased risk of progression of DR [51].

Controllable image editing
Based on DPMs, AIGC produces high-quality and diverse results. For the analysis of medical images, DPMs can be useful in various tasks, including classification [52,53], anomaly detection [54,55], and segmentation [56,57]. Controllable image editing obtains new images by modifying text prompts through diffusion inversion. However, because language models cannot comprehend many medical terms and the definitions of biomarkers across medical images are highly specialized, general-purpose text-guided DPMs for generating medical images are lacking. In this work, we use both image and category information as guiding conditions for stable diffusion, allowing the image editing results to retain as much category-independent content as possible.
Specifically, using the same text as in the forward diffusion process, i.e., the text without category information, for diffusion inversion results in generated images identical to the input images. Using expected categories as controlling conditions during the reverse process yields generated images edited to the expected categories. By interpolating between the features of the input and target texts, one can obtain intermediate images that lie between the characteristics of the two images. Figure 7 presents four examples of interpolation, in which we uniformly performed five interpolations on the text features, from the texts without category information to the texts of the expected categories. The interpolation results were then used in the diffusion inversion process to obtain the intermediate images. The smoothly transitioning intermediate images demonstrate that the proposed method is capable of editing only category-related regions while maintaining consistency in other areas, and that it has a strong recognition capability for category information.
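The interpolation of text features can be sketched as a linear blend between the feature of the text without category information and the feature of the target text. Whether the five interpolation points include both endpoints is not stated in the paper; the sketch below assumes that they do, and the function name and toy vectors are hypothetical.

```python
def interpolate_features(source, target, steps=5):
    """Return `steps` vectors evenly spaced from `source` to `target`, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    blended = []
    for i in range(steps):
        alpha = i / (steps - 1)
        blended.append([(1 - alpha) * s + alpha * t for s, t in zip(source, target)])
    return blended


# Toy two-dimensional features for the neutral and the target text.
neutral_feature = [0.0, 2.0]
target_feature = [4.0, 0.0]
conditions = interpolate_features(neutral_feature, target_feature, steps=5)
```

Each interpolated vector would then serve as the text condition for one run of the reverse process, giving a sequence of images that moves gradually from the input toward the target category.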

Unified automatic approach
The traditional process of image analysis relied on manual statistics to investigate pathology. Experienced experts can make accurate judgments, but the rapidly growing volume of data makes this process inefficient. Therefore, speed and automation have become critical requirements for medical image analysis methods [36,58]. Recently, deep learning models have shown impressive performance in the evaluation of medical images [18,19,21,59]. DR may cause multiple types of lesions, whereas existing algorithms can only analyze single-level factors individually. Figure 8 compares two automated procedures. In the past, a comprehensive evaluation of DR required multiple models. With our method, we obtained multiple results: diagnoses, localization of ischemic areas, and localization of vascular leakage. Based on the localization maps, the ischemic index and leakage index were obtained to assess the severity of DR. The cost of training affects the effectiveness of the model. For fully supervised models, such as the segmentation models shown in Fig. 8, accurate lesion contours were used as the ground truth during the training stage, and skilled researchers annotated lesion areas at the pixel level. The irregularity of DR lesions and the high resolution of UWFA make comprehensive annotation challenging. In this study, we used only image-level text for supervision. The image-level annotations were shared between the CLIP training stage and the latent diffusion training stage.
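As an illustration of how localization maps could be derived without pixel-level labels, the sketch below compares an input image with its disease-free edited version and thresholds the difference. This is an assumed simplification, not the authors' exact procedure: the function name `localize_by_difference`, the threshold value, and the rule that darker-than-edited regions in the early phase indicate ischemia while brighter-than-edited regions in the late phase indicate leakage are all hypothetical choices made for this example.

```python
import numpy as np

def localize_by_difference(real, edited, threshold=0.2):
    """Return (darker, brighter) boolean masks from real minus edited.

    Assumption: intensities are scaled to [0, 1]; the threshold is a
    hypothetical value that would need tuning on real data.
    """
    real = np.asarray(real, dtype=np.float64)
    edited = np.asarray(edited, dtype=np.float64)
    diff = real - edited
    darker = diff < -threshold   # candidate ischemic area (early phase)
    brighter = diff > threshold  # candidate leakage area (late phase)
    return darker, brighter

# Toy 3x3 example: the edited image is uniform, the real one has two deviations.
edited_image = np.full((3, 3), 0.5)
real_image = edited_image.copy()
real_image[0, 0] = 0.1  # darker than the disease-free version
real_image[1, 1] = 0.9  # brighter than the disease-free version

ischemia_mask, leakage_mask = localize_by_difference(real_image, edited_image)
print(ischemia_mask.sum(), leakage_mask.sum())
```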

Comprehensive analysis of DR
The ischemic index is the percentage of the visible retina that is ischemic, while the leakage index is the percentage of the ROI (peripheral or panretinal) that shows leakage. Numerous studies have confirmed a correlation between these indices and the severity of DR [36,48,60]. Various methods have been proposed to calculate these indices, from manual pixel counting to automatic segmentation using software platforms and deep learning techniques [14,15,37,58]. These methods have been largely successful but cannot compute other metrics. We evaluated the two indices using the predictions of our model and found them consistent with previous independent studies. As shown in Fig. 5(a) and Fig. 5(b), the average values of both the ischemic index and the leakage index rose as severity increased. Further multivariate analysis is presented in Fig. 5(d), where the two indices computed over multiple regions are treated as high-dimensional features. The dimensionality reduction results showed stronger discrimination than the single indices, which indicates that the spatial distribution of the lesions contributes to severity. The normal samples were almost completely separated from the DR samples, whereas the distributions of the NPDR and PDR samples partially overlapped. Indeed, NPDR is a developmental stage between normal and PDR, and certain lesions may exhibit more severe symptoms even in the absence of neovascularization.
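The index definitions above amount to an area ratio between a lesion mask and a region mask. The following minimal sketch shows that arithmetic, together with a ring-shaped ROI of the kind used for regional grading. The function names `area_index` and `ring_mask`, the pixel-based radii, and the toy masks are hypothetical; in practice the masks would come from the model's localization maps.

```python
import numpy as np

def area_index(lesion_mask, roi_mask):
    """Percentage of the ROI area that is covered by the lesion mask."""
    lesion = np.asarray(lesion_mask, dtype=bool)
    roi = np.asarray(roi_mask, dtype=bool)
    roi_area = roi.sum()
    if roi_area == 0:
        raise ValueError("ROI mask is empty")
    return 100.0 * np.logical_and(lesion, roi).sum() / roi_area

def ring_mask(shape, center, r_inner, r_outer):
    """Boolean mask of pixels with r_inner <= distance to center < r_outer."""
    yy, xx = np.indices(shape)
    dist = np.hypot(yy - center[0], xx - center[1])
    return (dist >= r_inner) & (dist < r_outer)

# Toy example: a 10x10 visible retina with a 5x5 lesion in one corner.
visible = np.ones((10, 10), dtype=bool)
lesion = np.zeros((10, 10), dtype=bool)
lesion[:5, :5] = True

panretinal_index = area_index(lesion, visible)

# Regional indices over two concentric rings (hypothetical radii in pixels).
rings = [ring_mask(visible.shape, (5, 5), 0, 3),
         ring_mask(visible.shape, (5, 5), 3, 6)]
regional_indices = [area_index(lesion, ring & visible) for ring in rings]
print(panretinal_index, regional_indices)
```

Stacking the regional values of both indices into one vector per eye gives the kind of multi-region feature that can then be passed to a dimensionality reduction method.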

Limitations and future work
Our model was implemented only on the UWFA dataset. Lesions have clear manifestations in UWFA images and can be visually distinguished. However, adverse effects of FA may include fluorescein allergy and long-term mydriasis. Color fundus photographs (CFP) do not require fluorescein, involve less exposure, and result in fewer cases of mydriasis, making them more convenient for patients. Benefiting from non-invasive image acquisition and high axial resolution, optical coherence tomography angiography (OCTA) can provide wide-field images at various disease stages and more sensitive early disease indicators [5,[61][62][63]. In future work, the proposed approach could be evaluated on datasets from these other modalities.
The CLIP and stable diffusion models we used were pretrained on natural images and texts and then tuned on our UWFA dataset. While we have already achieved encouraging generation results, the lack of extensive pretraining on a large number of medical images limits the potential of our method. The CLIP model achieved lower accuracy on the 3-grade classification task and on the task of detecting whether PDR was present than on the other classification tasks. NPDR and PDR present considerable confusing information at more severe stages of the disease, leading to classification errors. Additionally, how well CLIP recognizes images and text determines the quality of its encoded features, which in turn affects the quality of the control conditions for image generation. In the future, training our method on larger-scale medical datasets may yield even better performance.

Conclusion
In this study, we designed a pipeline to automatically obtain comprehensive factors related to DR using UWFA. We developed a diffusion-based model that jointly incorporates image features and text features to controllably edit input images into a disease-free style. By comparing the real images with the edited ones, the lesions were automatically identified and located. The ischemic area in the early phase is used to compute the ischemic index, and vascular leakage in the late phase is measured to determine the leakage index. The experimental results showed that our approach achieves satisfactory classification and localization outcomes, with the quality of the generated images surpassing that of state-of-the-art comparative methods.

Fig. 1. Comparison of the UWFA and 7-SF. The area within the yellow outline is the range of the 7-SF field in UWFA. The early phase images were obtained between 30 seconds and 1 minute, while the late phase images were obtained between 5 and 7 minutes.

Text prompts: "a left fundus", "a ffa of a right fundus", "an early ffa of a fundus", "a late ffa of a fundus", "a ffa of a normal fundus", "a ffa of a dr fundus", "a ffa of a npdr fundus", "a ffa of a pdr fundus", "a ffa of a np fundus".

Fig. 2. Pipeline of the proposed deep learning based DR analysis system. The blue areas indicate parameters that are frozen during training, while the orange parts indicate parameters that are updated during training. (a) We use the prompt tuning strategy to train the image and text encoders of CLIP. (b) We fuse image features and text features to train stable diffusion, and controllably edit images during the inference stage. "NALL" indicates that no category is filled in the text.

Fig. 3. Qualitative comparisons. We present results from different methods. Four examples with early and late images are shown in the first row. (a) and (b) show the results of two NPDR examples. (c) and (d) show the results of two PDR examples.

Fig. 4. Localization of biomarkers. Three examples with early and late images are shown in the first row: (a) Normal, (b) NPDR and (c) PDR. The generated edited images are shown in the second row. The localization of vascular ischemic areas (blue) and the localization of leakage areas (red) are shown in the third row.

Fig. 5. The statistical analysis results of the ischemic index and leakage index. As the severity of DR increases, both (a) the ischemic index and (b) the leakage index also tend to increase, as indicated by higher median, upper, and lower quantiles for both indices. (c) The first ring, known as the macular ring, is centered on the fovea and extends from the macula to the optic disc in radius. Each successive ring is at an equal distance from the center. The area was measured for each of the biomarker sites within the five rings. (d) Analysis of dimensionality reduction of regionally graded indices. There are discernible differences in the spatial distribution between individuals without DR and those with DR (both NPDR and PDR).

Fig. 5(a) and Fig. 5(b) show the box-and-whisker plots of the ischemic index and leakage index,

Fig. 6. A pair of examples of visual range. The blue area in the early phase is the predicted ischemic area, and the red area in the late phase is the predicted vascular leakage. Seven circles whose radius is the distance between the optic disc and the fovea (denoted as D) simulate the extent of the 7-SF in the UWFA and are represented by white masks. Two concentric rings with radii of 2D and 4D are shown in green.

Fig. 7. Visualization of intermediate images. We perform linear interpolation between the original text labels of the input images and the texts used to edit the images, and present the generated images at different interpolation ratios.

Fig. 8. The training and testing process of two automated procedures. Separate training of multiple single models can evaluate multiple metrics. Our method only needs to train one text-supervised model to comprehensively evaluate multiple metrics.