Harnessing Artificial Intelligence for Enhanced Renal Analysis: Automated Detection of Hydronephrosis and Precise Kidney Segmentation

Take Home Message Recent advances in artificial intelligence allow automated detection of hydronephrosis in ultrasound images with high accuracy. These methods pave the way for automated standardized diagnosis and treatment of patients with renal colic.


Introduction
Hydronephrosis is a major criterion in diagnosing acute renal colic [1]. It indicates the possible presence of obstruction (eg, urolithiasis) or reflux in the upper urinary tract, and appears on ultrasound as hypoechogenic fluid in both the renal pelvis and the calyces. With 750 000 cases per year in Germany, urolithiasis significantly impacts the population's quality of life and socioeconomic factors [2]. According to Smith-Bindman et al [1], utilizing ultrasonography as the initial diagnostic tool in suspected nephrolithiasis cases reduces the necessity for computed tomography scans, resulting in lower radiation exposure and no increase in serious adverse events, pain, or hospital visits. A substantial variation in the treatment of renal colic has also been documented: imaging and blood tests are performed in about half of patients, urinalysis is not performed in one-fifth of patients, and antibiotics are incorrectly prescribed to one-fourth of patients [3]. Based on these data, a standardized and automated initial triage of patients, incorporating various clinical and hematological information, including the potential identification of hydronephrosis through ultrasound evaluations of the kidneys, has the potential to mitigate radiation exposure, improve diagnostic accuracy, and enhance subsequent treatment of renal colic. Recent advances in deep learning neural networks (NNs), imaging techniques, and computational capabilities have facilitated image-based pattern recognition within different datasets. These models have already been used for predictive pattern discovery [4,5] and in the image-based detection of various urological pathologies [6,7].
Using an ultrasound database of urological organs, an NN can successfully be trained to distinguish the kidneys from other urological organs and automatically detect the presence of hydronephrosis.
In cases of renal colic, our clinical experience suggests that the presence of kidney dilation, in conjunction with other clinical and laboratory features, plays a pivotal role in diagnostic and therapeutic decision-making processes, placing more emphasis on its existence rather than its extent.
As part of our endeavor to develop software that aids in triage and offers automated therapeutic recommendations for patients with renal colic in the emergency room before contact with doctors, we have created an NN designed to detect kidney dilation automatically.

Patients and methods
We created a local image database of anonymized ultrasound images from three different ultrasound devices from our department (Fig. 1).

Image preprocessing and NNs
Image preprocessing and image augmentation are important steps in machine learning, as they help standardize images and reduce noise [8]. These steps also improve the ability of models to learn relevant information from the database [9]. To simulate the conditions of the ultrasound examination, in which the probe can take different rotation positions, we applied random image augmentation to our dataset and rotated the images. Moreover, the contrast of ultrasound images is affected by the uneven distribution of the ultrasound waves caused by the position of the kidney at different depths or by the interposition of different structures between the probe and the organ (ie, adipose tissue, liver, spleen, etc.). Accordingly, to simulate reality, we further augmented the dataset with different degrees of brightness.
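The rotation and brightness augmentation described above can be sketched as follows. This is a minimal pure-Python illustration of the idea only; the quarter-turn rotations and the 0.6-1.4× brightness range are hypothetical choices, not the study's parameters, and the actual pipeline used PyTorch's random augmentations.

```python
import random

def rotate90(img):
    """Rotate a 2D image (list of pixel rows) by 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

def adjust_brightness(img, factor):
    """Scale pixel intensities and clamp to the 8-bit range [0, 255]."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in img]

def augment(img, rng=random):
    """Randomly rotate (0-3 quarter turns) and vary brightness (0.6x-1.4x)."""
    for _ in range(rng.randrange(4)):
        img = rotate90(img)
    return adjust_brightness(img, rng.uniform(0.6, 1.4))

img = [[10, 200], [30, 40]]
assert rotate90(img) == [[30, 10], [40, 200]]
assert adjust_brightness(img, 2.0) == [[20, 255], [60, 80]]
```

In the real pipeline, arbitrary rotation angles (not just quarter turns) better mimic the free-hand probe positions.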
The image preprocessing and NN training were done using the Python PyTorch library [10]. The images were first cropped, resized (224 × 224 pixels), and normalized. Subsequently, data augmentation was performed using PyTorch random augmentations; the method is based on the RandAugment method proposed by Cubuk et al [11]. The augmentations included random horizontal flips, rotations with a random angle, and random applications of filters with a default probability of 50%. For image classification, we opted to use NN architectures present within the state-of-the-art (SotA) algorithms [12][13][14]. Consequently, the NNs created were based on the ResNet [12], AlexNet [13], and GoogLeNet [14] architectures. Some minor changes were made to the architectures and their parameters before training them with our dataset. First, the AlexNet architecture was implemented with 0.5 dropout layers and the ReLU activation functions. Furthermore, the input was adapted for grayscale images by reducing the input channels to 1 instead of the 3 normally used for RGB images. Second, the GoogLeNet architecture was also adapted for grayscale images, and dropouts of 0.2 and 0.7 were used for the main and auxiliary output layers, respectively. Finally, the ResNet architectures used within this work were based on the implementations offered by the PyTorch torchvision package [15], adapted for grayscale images. Every NN was trained using the cross-entropy loss and the Adam optimizer with a learning rate of LR and a weight decay of 0.0005.
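The cross-entropy loss used to train every network can be illustrated in isolation. This is a minimal pure-Python sketch of the standard softmax cross-entropy definition, not PyTorch's implementation:

```python
from math import exp, log

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, target):
    """Cross-entropy loss for one sample: -log p(target class)."""
    return -log(softmax(logits)[target])

# A confident correct prediction yields a small loss ...
assert cross_entropy([5.0, -5.0], 0) < 0.01
# ... and a confident wrong prediction a large one.
assert cross_entropy([5.0, -5.0], 1) > 5.0
```

Minimizing this loss over the training set drives the networks toward assigning high probability to the correct class (hydronephrosis vs no hydronephrosis).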
As metrics for evaluation, the F1 score and the accuracy of the trained NN models were calculated within an unseen data test set. Both the accuracy and the F1 score are metrics commonly used within machine learning to assess model performance. These are functions of the true negatives (TNs), true positives (TPs), false negatives (FNs), and false positives (FPs) of the test data samples. The accuracy is the number of correctly labeled data over the total number of data samples, while the F1 score is calculated as shown in Supplementary Figure 1.
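Both metrics follow directly from the TP/TN/FP/FN counts; a minimal sketch with illustrative counts (the numbers below are examples, not the study's results):

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of correctly labeled samples over all samples."""
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical test set of 200 samples:
assert accuracy(90, 95, 5, 10) == 0.925
assert abs(f1_score(90, 5, 10) - 0.9231) < 1e-3
```

The F1 score is preferred over raw accuracy when the classes are imbalanced, since it ignores the true negatives.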

Automated kidney segmentation
For the segmentation models, each image data sample was accompanied by its corresponding mask, represented by a red contour surrounding the area of interest, that is, the kidney. Therefore, the masks were extracted and transformed into binary images within the preprocessing of the images. Subsequently, the images and their masks were cropped, resized, and normalized as in the classification datasets. Furthermore, random data augmentation, similar to the approach used in the hydronephrosis detection dataset, was applied to the segmentation data samples [11]. The Deeplabv3 models [10] were chosen from the SotA to perform semantic segmentation on our dataset, specifically the ResNet50 and ResNet101 architectures. During implementation, we found that the pretrained models available within the torchvision [10] package drastically outperformed the models trained from scratch. Consequently, the images were transformed to RGB to be compatible with the pretrained models. Our models were initialized using the pretrained ones as a starting point and further trained with our dataset. To assess the model training performance, the image pixel-wise precision and accuracy were calculated. Each pixel of the output image was categorized as TP, TN, FN, or FP with respect to the original mask. Then, for the validation, we calculated the accuracy, precision, Dice, Jaccard, recall, and average symmetric surface distance (ASSD) scores of the segmentation networks on 50% of the test dataset.
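The pixel-wise metrics above can be computed directly from flattened binary masks. A minimal sketch with toy masks follows (ASSD, a boundary-distance metric, is omitted here):

```python
def seg_metrics(pred, truth):
    """Pixel-wise segmentation metrics for flat binary masks (1 = kidney).

    Each pixel is counted as TP, TN, FP, or FN against the ground-truth mask.
    """
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    tn = sum(1 for p, t in zip(pred, truth) if not p and not t)
    return {
        "accuracy": (tp + tn) / len(pred),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "dice": 2 * tp / (2 * tp + fp + fn),
        "jaccard": tp / (tp + fp + fn),
    }

# Toy 2x2 masks, flattened row by row:
m = seg_metrics([1, 1, 0, 0], [1, 0, 1, 0])
assert m["dice"] == 0.5
assert abs(m["jaccard"] - 1 / 3) < 1e-9
```

Note that the Dice and Jaccard scores ignore the true negatives, so they are not inflated by the large background area surrounding the kidney.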
Explainable artificial intelligence (XAI) algorithms were used to analyze the output of our models. To this end, we used the Captum Python library [16] to implement the XAI algorithms. First, the outputs were analyzed using the integrated gradient method proposed by Sundararajan et al [17]. It is a representation of the integral of gradients with respect to inputs along a dimension of the input. Next, the gradient Shapley Additive Explanations (SHAP) were calculated. Gradient SHAP calculates the expected values of gradients at random points between a baseline and the input with Gaussian noise [18]. Furthermore, the noise tunnel attribution method was applied to the data samples, where the attribution is computed multiple times while incrementally adding Gaussian noise to the sample (Fig. 2) [19]. Finally, an occlusion-based method was used, where contiguous patches of the image are replaced with the baseline and the output difference is computed. The resulting images represent a heatmap of the most significant patches of the original data sample [20].
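The occlusion-based method can be illustrated without Captum. In this minimal sketch, a toy scoring function stands in for the trained NN (the actual analysis used Captum's occlusion attribution on the AlexNet output); the patch size and baseline value are illustrative assumptions:

```python
def occlusion_map(image, score_fn, patch=2, baseline=0):
    """Slide a patch x patch window over the image, replace it with the
    baseline value, and record the drop in the model's score: large drops
    mark regions the prediction depends on."""
    h, w = len(image), len(image[0])
    base_score = score_fn(image)
    heat = [[0.0] * w for _ in range(h)]
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = [row[:] for row in image]       # copy, then occlude one patch
            for di in range(i, min(i + patch, h)):
                for dj in range(j, min(j + patch, w)):
                    occluded[di][dj] = baseline
            drop = base_score - score_fn(occluded)
            for di in range(i, min(i + patch, h)):
                for dj in range(j, min(j + patch, w)):
                    heat[di][dj] = drop
    return heat

# Toy "model" whose score is just the mean intensity (a hypothetical stand-in):
score = lambda img: sum(map(sum, img)) / 16
img = [[4, 4, 0, 0]] * 4                   # bright left half, dark right half
heat = occlusion_map(img, score)
assert heat[0][0] > heat[0][3]             # occluding the left half hurts the score more
```

The resulting heatmap is the same kind of output shown in Figure 2D: patches whose removal changes the prediction most appear most intense.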

Software and hardware
Our analyses were based on Python 3.10 (Python Software Foundation, Wilmington, DE, USA) and were built in PyTorch. All analyses were performed on a computer with two Intel Xeon Gold 6248R processors (Intel, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 3090 GPU with 24 GB of RAM (NVIDIA, Santa Clara, CA, USA).

Ultrasound devices
The ultrasound equipment employed to acquire the images comprised BK (Herlev, Denmark) 5000 (N1) and Specto (N2) models and a GE (Solingen, Germany) Logiq P9 (N3). The images were all two dimensional and were captured using B-mode ultrasound with the help of convex abdominal probes (2-5 MHz) while the patients were lying on their back.

Ethics
This study followed German data regulations and the Declaration of Helsinki, and was approved by our local ethical committee (EK 22-360).

Results
A total of 723 ultrasound images were collected from three different ultrasound devices (see Table 1). Of these, 523 were used as the training set and 200 as the test set. Most images in the training set were recorded with the BK 5000 device (N1), followed by the BK Specto (N2) and the GE Logiq P9 (N3). All six models were trained to achieve the best F1 score, using the Adam optimizer for 1500 epochs, except GoogLeNet, which was trained for 750 epochs. GoogLeNet achieved the highest training accuracy of 100%, while ResNet50 and ResNet152 obtained the highest validation accuracy of 98.07%. Overall, the models exhibited high training accuracies but varied in validation accuracies and training durations. These training characteristics are presented in Supplementary Table 1.
In the test phase, we analyzed the performance metrics of our six deep learning models in terms of test accuracy, recall, precision, F1 score, and the difference in F1 score compared with AlexNet. AlexNet exhibited the highest F1 score of 98.95%, a test accuracy of 99.52%, a recall of 99.3%, and a precision of 98.6%, setting the baseline F1 score difference as 0. AlexNet_v2 and ResNet152 deviated in F1 score from AlexNet by 1.71% and 2.35%, respectively. ResNet101 achieved the closest F1 score to AlexNet, with only a 0.7% difference. All models showcased high performance, with test accuracies ranging from 96.15% to 99.52%, and ResNet152 achieved a recall of 100%. The confusion matrix indicating the evaluation of all models is presented in Table 2.
Three images were misclassified by the AlexNet (one image with no hydronephrosis as hydronephrosis and two images with hydronephrosis as no hydronephrosis; Fig. 3). Figure 3A shows a hypoechogenic structure in the middle of the renal calyx system, which could be misinterpreted as a dilated middle calyx. Figure 3B was misinterpreted as normal, although with a low probability (65%) of a correct diagnosis. Figure 3C had unclear margins of the kidney, which eventually misled the diagnosis.
In order to understand and interpret the predictions of the models, we used four different types of filters (activation maps).The activation maps for the AlexNet model highlight the focus of the NN when it analyzes the image for hydronephrotic characteristics (Supplementary Fig. 2).
For the segmentation task, we trained two fully convolutional networks: deeplabv3_resnet50 and deeplabv3_resnet101. DeeplabV3 with ResNet50 achieved a test accuracy of 90.21% and a Dice score of 94.74. In comparison, DeeplabV3 with ResNet101 scored slightly better in test accuracy at 90.37%, with a nearly identical precision of 99.28% and a marginally lower Dice score of 94.48. The confusion matrix of the NNs in the test phase is shown in Supplementary Table 2.
Two examples of automated segmentation with the deeplabv3_resnet50 are presented in Figure 4.

Discussion
In addition to utilizing deep learning techniques for predictive purposes, such as estimating the success rate of stone removal with shockwave lithotripsy and anticipating the need for readmission to the intensive care unit following ureteroscopy, artificial intelligence (AI) models have made significant advancements in risk stratification and the field of image recognition [4][5][6][21][22][23][24]. These models have demonstrated state-of-the-art performance in image recognition and segmentation due to their robust and accurate classification capabilities [25,26]. The application of deep learning for image segmentation in computed tomography or magnetic resonance imaging has already been reported for the prostate, bladder, kidney, and urolithiasis [27][28][29][30]. This study illustrates the feasibility of employing AI for the recognition of ultrasound images depicting kidneys with hydronephrosis. Smail et al [7] evaluated the efficacy of a convolutional neural network model with five convolutional layers, trained to categorize prenatal hydronephrosis according to the 5-point Society for Fetal Urology classification system. This model achieved an overall accuracy of 94%. Lien et al [31] adapted three deep learning models (U-Net, Res-UNet, and UNet++) and assessed their accuracy in detecting hydronephrosis through ultrasound images. The most effective model (Res-UNet) reached an accuracy of 94.6%.
Our AlexNet model accurately classified ultrasound images as not hydronephrotic or hydronephrotic, with an overall accuracy of 98.5%. To our knowledge, this is the highest accuracy of an NN in the detection of hydronephrosis. The ability of the AI model to rapidly analyze large volumes of imaging data can save time, improve the efficiency of the diagnostic process, and reduce interobserver variability in the interpretation of ultrasound images.
Nonetheless, the use of AI raises concerns about the transparency and accountability of the decision-making process. Therefore, this model integrated elements of XAI to determine which features of the image contribute most to the NN output (Supplementary Fig. 2). By presenting the suspicious area of the organ, we ensure the model's transparency and transfer the decision-making to trained medical personnel.
A further section of our analysis is the automated segmentation of the kidney in ultrasound images as a key part of computer-aided diagnosis systems. This feature represents the extraction of representative regions from the ultrasound image (in our case, kidneys) to better describe the target organ and improve diagnostics. Chen et al [32] developed a fully automatic kidney segmentation method with a deep NN architecture, which segments the kidneys from images of different quality classes with a total accuracy of 98.8%. Here, we developed a pixel-wise segmentation model, which reached a Dice coefficient of 94.7%. As the main purpose of our model was the detection of hydronephrosis, we reached sufficient accuracy to detect the kidneys and, subsequently, the specific characteristics of this pathology. Nonetheless, future work is needed to improve the total segmentation accuracy of our model.
The potential limitations of our study should be considered carefully, as these may affect the analysis performance of the models. First, we used still images and not a series of images of the whole organ, which may limit the diagnostic capabilities of the algorithm. Thus, the model's accuracy could be increased further by using a series of images. Second, this study is exclusively focused on a single kidney pathology: hydronephrosis. The exclusion of images depicting various pathologies, such as cysts, venous dilations, large cavities in junctional syndromes, or tumors, along with the decision not to analyze all available ultrasound modes/images (B mode, M mode, Doppler, three-dimensional [3D] ultrasound data, cine clips, multimodal ultrasound images, and 3D images), may impact the hydronephrosis detection rate by introducing potential FPs or FNs. Nonetheless, in an end-to-end AI scenario for kidney analysis using ultrasound images, our model's role becomes pivotal in hydronephrosis detection. Specifically designed to interpret hydronephrosis, our model stands out as the preferred choice, complementing other models that focus on different entities or pathologies. Another limitation is the number of images used to train the models. Although we have tried to consider all possible variations of the kidney locations, we could not consider all possible angles and degrees of brightness of the ultrasound findings. We aimed to overcome this limitation with in silico image augmentation.

Conclusions
In summary, our model reached high accuracy in differentiating between not hydronephrotic and hydronephrotic kidneys. After further training and improvements, this deep learning model could be integrated into AI-aided imaging diagnostic tools and also standardize the diagnosis and treatment of renal colic.

Fig. 1 - Comprehensive visual summary depicting the paper workflow structure: real ultrasound kidney images were collected with three different ultrasound devices (N1, N2, and N3) from patients with hydronephrosis (class 1) and no hydronephrosis (class 2), and used to train six neural networks (AlexNet, AlexNet_v2, ResNet50, ResNet101, ResNet152, and GoogLeNet) for disease classification (hydronephrosis or no hydronephrosis) and two neural networks (deeplabv3_resnet50 and deeplabv3_resnet101) for the segmentation task. Afterward, the neural networks were tested on unseen images, with the best F1 score in the disease classification reached by AlexNet and the best Dice coefficient in the segmentation task by deeplabv3_resnet50. Using specific filters on the AlexNet, it was shown on which part of the image the neural network focuses for disease classification. AI = artificial intelligence.

Fig. 2 - Example of all filters applied to one ultrasound image for explainable artificial intelligence with the AlexNet model: (A) integrated gradients, (B) noise tunnel attribution, (C) gradient SHAP, and (D) occlusion-based attribution. For Fig. 2A-C, the focus of the network is shown in black (the more intense the color, the more relevant the area for the diagnosis of hydronephrosis); for Fig. 2D, the focus of the network is shown in white (likewise, the more intense the color, the more relevant the area). SHAP = Shapley Additive Explanations.

European Urology Open Science 62 (2024) 19-25

Fig. 4 - Manually and automatically segmented kidney images: (A) kidney with hydronephrosis and (B) kidney without hydronephrosis. Yellow indicates manual segmentation, blue indicates automated segmentation with the deeplabv3_resnet50 fully convolutional network, and the overlapped image shows both manual and automated segmentation.

Table 2 -
Confusion matrix for the neural networks trained on the dataset