Deep-learning-aided forward optical coherence tomography endoscope for percutaneous nephrostomy guidance

Abstract: Percutaneous renal access is the critical initial step in many medical settings. To obtain the best surgical outcome with minimum patient morbidity, an improved method for access to the renal calyx is needed. In this study, we built a forward-view optical coherence tomography (OCT) endoscopic system for percutaneous nephrostomy (PCN) guidance. Porcine kidneys were imaged to demonstrate the feasibility of the imaging system. Three porcine kidney tissue types (renal cortex, medulla, and calyx) could be clearly distinguished in the OCT endoscopic images owing to their morphological and structural differences. To further improve guidance efficacy and reduce the learning burden on clinical doctors, a deep-learning-based computer-aided diagnosis platform was developed to automatically classify the OCT images by renal tissue type. Convolutional neural networks (CNN) were developed with labeled OCT images based on the ResNet34, MobileNetv2, and ResNet50 architectures. Nested cross-validation and testing was used to benchmark the classification performance with uncertainty quantification over 10 kidneys, demonstrating robust performance despite substantial biological variability among kidneys. ResNet50-based CNN models achieved an average classification accuracy of 82.6% ± 3.0%. The classification precisions were 79% ± 4% for cortex, 85% ± 6% for medulla, and 91% ± 5% for calyx; the classification recalls were 68% ± 11% for cortex, 91% ± 4% for medulla, and 89% ± 3% for calyx. Interpretation of the CNN predictions highlighted the discriminative characteristics of the three renal tissue types in the OCT images. These results validate the technical feasibility of using this novel imaging platform to automatically recognize renal tissue structures ahead of the PCN needle during PCN surgery.


Introduction
Percutaneous nephrostomy (PCN) was first described in 1955 as a minimally invasive, x-ray guided procedure in patients with hydronephrosis [1]. PCN needle placement has since become a valuable medical resource for minimally invasive access to the renal collecting system for drainage, urine diversion, and as the first step of percutaneous nephrolithotomy (PCNL) surgery. Optical coherence tomography (OCT) provides micrometer-scale, cross-sectional imaging of tissue; with high-speed scanning and data processing, three-dimensional (3D) images of the detected sample formed by numerous cross-sectional images can be obtained in real time [46,48]. Because of the differences in tissue structure among renal cortex, medulla, and calyx, OCT has the potential to distinguish different renal tissue types. Due to its limited penetration in biological tissues (1-2 mm), kidney studies using OCT have mainly focused on the renal cortex [44,45,49-51]. OCT can be integrated with fiber-optic catheters and endoscopes for internal imaging applications [52-54]. For example, endoscopic OCT imaging has been demonstrated in the human GI tract [55-58] to detect Barrett's esophagus (BE) [59,60], dysplasia [61], and colon cancer [62-64]. In a previous study, our lab developed a portable hand-held forward-imaging endoscopic OCT needle device for real-time epidural anesthesia guidance [65]. This endoscopic OCT setup holds promise for PCN guidance.
Given the enormous accumulation of images and the inter- and intra-observer variation of subjective interpretation, computer-aided automatic methods have been utilized to classify these data accurately and efficiently [66,67]. In automated OCT image analysis, convolutional neural networks (CNN) [66,68,69] have been demonstrated to be promising in various applications, such as hemorrhage detection in retina versus cerebrum and tumor tissue segmentation [67,68,70-73].
Herein we demonstrated a forward OCT endoscope system to image the kidney tissues lying ahead of the PCN needle during PCN surgery. Images of renal cortex, medulla, and calyx were obtained from ten porcine kidneys using our system. A tissue type classifier was developed using the ResNet34, ResNet50, and MobileNetv2 CNN architectures. Nested cross-validation and testing [74-76] was used for model selection and performance benchmarking, accounting for the large biological variability among kidneys through uncertainty quantification. The predictions of the CNN models were interpreted to identify the important regions in representative OCT images that the CNNs used for classification.

Experimental setup
In our project, the OCT endoscope was built based on a swept-source OCT (SS-OCT) system. Figure 1 shows the schematic of our forward endoscopic OCT system. The light source of the SS-OCT system has a center wavelength of 1300 nm and a bandwidth of 100 nm [39]. The wavelength-swept frequency (A-scan) rate is 200 kHz with ∼25 mW output power. As shown in Fig. 1, the laser output was first split by a fiber coupler (FC) into two paths carrying 3% and 97% of the total power, respectively. The 3% portion was delivered into a Mach-Zehnder interferometer (MZI), which generated a frequency-clock signal for triggering the OCT sampling procedure and provided it to the data acquisition (DAQ) board. The remaining 97% passed through a circulator, in which light travels in only one direction: light entering port 1 exits only from port 2. It was then evenly split into the reference arm and sample arm of a fiber-based Michelson interferometer. Backscattered light from both arms formed interference fringes at the FC, which were transmitted to the balanced detector (BD). The interference fringes from different depths received by the BD were encoded with different frequencies. The output signal from the BD was further transmitted to the DAQ board and computer for processing. Cross-sectional information can be obtained through the Fourier transform of the interference fringes [65].
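The final step above, recovering a depth profile (A-scan) from the detected fringes, can be sketched in a few lines. This is a minimal illustration rather than the actual acquisition code: it assumes the fringe samples are already linear in wavenumber (the role of the MZI clock signal), and the window choice and synthetic single-reflector fringe are our own.

```python
import numpy as np

def reconstruct_ascan(fringe, window=np.hanning):
    """Convert one interference fringe (1D array) to a depth profile."""
    fringe = fringe - fringe.mean()           # remove the DC term
    fringe = fringe * window(len(fringe))     # suppress FFT side lobes
    depth_profile = np.abs(np.fft.fft(fringe))
    return depth_profile[: len(fringe) // 2]  # keep the positive-depth half

# Example: a synthetic fringe from a single reflector yields one peak
# whose position encodes the reflector depth.
k = np.arange(1024)
fringe = np.cos(2 * np.pi * 100 * k / 1024)   # reflector at depth bin ~100
ascan = reconstruct_ascan(fringe)
print(int(np.argmax(ascan)))                  # peak near bin 100
```

In a real SS-OCT pipeline this runs once per A-scan; stacking A-scans across the lateral scan produces the 2D cross-sectional image.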
A gradient-index (GRIN) rod lens with a diameter of 1.3 mm was stabilized in front of the galvanometer scanning mirror (GSM). The endoscope has a diameter of 1.3 mm, a length of 138.0 mm, and a view angle of 11.0°. The total outer diameter (O.D.), including the GRIN rod lens and the protective steel tubing, is around 1.65 mm. The proximal GRIN lens entrance of the endoscope was placed close to the focal plane of the objective lens. The GRIN lens preserves the spatial relationship between the entrance and the output (distal end) and further to the sample. Therefore, one- or two-directional scanning can be readily performed on the proximal GRIN lens surface to create 2D or 3D images. In addition, an identical GRIN rod lens was placed in the light path of the reference arm to compensate for light dispersion and extend the reference arm length. Polarization controllers (PCs) were placed in both arms to decrease the background noise. The system had an axial resolution of ∼11 µm and a lateral resolution of ∼20 µm in tissue. The lateral imaging field-of-view (FOV) was around 1.25 mm. The system sensitivity, measured using a silver mirror with a calibrated attenuator, was optimized to 92 dB.

Data acquisition
Ten fresh porcine kidneys were obtained from a local slaughterhouse. The cortex, medulla, and calyx of the porcine kidneys were exposed and imaged in the experiment. Renal tissue types can be identified from their anatomic appearance. The OCT endoscope was placed against different renal tissues for image acquisition. To mimic the clinical situation, we applied some force while imaging the ex-vivo kidney tissues to generate tissue compression. 3D images of 320×320×480 pixels on the X, Y, and Z axes (Z denotes the depth direction) were obtained with a pixel size of 6.25 µm on all three axes. Therefore, the size of the original 3D images is 2.00 mm×2.00 mm×3.00 mm. For every kidney sample, we obtained at least 30 original 3D OCT images for each tissue type, and each 3D tissue scan took no more than 2 seconds. Afterwards, the original 3D images were separated into 2D cross-sectional images as shown in Fig. 2. Since the GRIN lens is cylindrical, the 3D OCT images obtained were also cylindrical in shape. Therefore, not all the 2D cross-sectional images contained the same structural signal of the kidney. Only the 2D images with sufficient tissue structural information (cross-sectional images close to the center of the 3D cylindrical structures) were subsequently selected and utilized for image preprocessing. At the end of imaging, tissues of the cortex, medulla, and calyx of the porcine kidneys were excised and processed for histology to compare with the corresponding OCT results. The tissues were fixed with 10% formalin, embedded in paraffin, sectioned (4 µm thick), and stained with hematoxylin and eosin (H & E) for histological analysis. Images were taken with a Keyence BZ-X800 microscope. Sectioning and H & E staining were carried out by the Tissue Pathology Shared Resource, Stephenson Cancer Center (SCC), University of Oklahoma Health Sciences Center.
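The volume-to-slice step can be sketched as follows. The 50% central window and the axis convention are illustrative assumptions; the study's actual criterion was retaining cross-sections with sufficient tissue structural information near the cylinder center.

```python
import numpy as np

def central_slices(volume, keep_fraction=0.5):
    """Return the central 2D cross-sections along the Y axis of an X x Y x Z volume."""
    n_y = volume.shape[1]
    half = int(n_y * keep_fraction / 2)
    center = n_y // 2
    # Slices near the center of the cylindrical scan carry the most signal.
    return [volume[:, y, :] for y in range(center - half, center + half)]

volume = np.zeros((320, 320, 480))   # X, Y, Z dimensions as in the paper
slices = central_slices(volume)
print(len(slices), slices[0].shape)  # 160 central slices of 320 x 480
```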
The hematoxylin (cat# 3801571) and eosin (cat# 3801616) were purchased from Leica Biosystems, and the staining was performed using a Leica ST5020 Automated Multistainer following the H & E staining protocol at the SCC Tissue Pathology core.

Fig. 2. Illustration of data acquisition and processing steps
Although the three tissue types showed different imaging features for visual recognition, it takes time and expertise for doctors to differentiate them during surgery. To improve efficiency, we developed deep-learning methods for automatic tissue classification based on the imaging data. Figure 2 shows the overall process of data acquisition and processing. In total, ten porcine kidneys were imaged in this study. For each kidney, 1,000 2D cross-sectional images were obtained for cortex, medulla, and calyx, respectively. To facilitate analysis and speed up deep-learning processing of the OCT images, a custom MATLAB algorithm was designed to recognize the surface of the kidney tissue in the 2D cross-sectional images. It automatically cropped the images from 320×480 to 235×301 pixels. Therefore, all the 2D cross-sectional images have the same dimensions and cover the same FOV before deep-learning processing.
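A Python stand-in for the surface-recognition crop described above might look like the following. The output window matches the 235×301 size in the text, but the intensity-threshold detection rule and the centered lateral crop are our assumptions; the original custom MATLAB algorithm may differ.

```python
import numpy as np

def crop_below_surface(image, out_rows=301, out_cols=235, threshold=None):
    """Crop a fixed-size window starting at the detected tissue surface."""
    if threshold is None:
        threshold = image.mean() + image.std()       # simple intensity cutoff
    rows_with_signal = np.flatnonzero((image > threshold).any(axis=1))
    top = int(rows_with_signal[0]) if rows_with_signal.size else 0
    top = max(0, min(top, image.shape[0] - out_rows))  # keep window in bounds
    col0 = (image.shape[1] - out_cols) // 2            # center the lateral crop
    return image[top:top + out_rows, col0:col0 + out_cols]

# Synthetic 480 (depth) x 320 (lateral) B-scan with the surface at row 50.
img = np.zeros((480, 320))
img[50:200, :] = 1.0
print(crop_below_surface(img).shape)  # (301, 235)
```

Fixing the crop to the detected surface ensures that every image presented to the CNN covers the same physical FOV regardless of where the tissue sits in the raw frame.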
Pre-trained ResNet50 and MobileNetv2 models on the ImageNet dataset [80] were imported from the Keras library [81]. The output layer of the models was changed to one containing 3 softmax output neurons for cortex, medulla, and calyx. The input images were preprocessed by resizing to 224 × 224 resolution, replicating the input channel to 3 channels, and scaling the pixel intensities to [-1, 1]. Model fine-tuning was conducted in two stages as described in [82]. First, the output layer was trained with all the other layers frozen. The stochastic gradient descent (SGD) optimizer was used with a learning rate of 0.2, a momentum of 0.3, and a decay of 0.01. Then, the entire model was unfrozen and trained. The SGD optimizer with Nesterov momentum was used with a learning rate of 0.01, a momentum of 0.9, and a decay of 0.001. Early stopping with a patience of 10 and a maximum of 50 epochs was used for the pre-trained ResNet50. Early stopping with a patience of 20 and a maximum of 100 epochs was used for MobileNetv2.
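The two-stage fine-tuning can be sketched in tf.keras as below. This is a minimal sketch, not the study's actual training script: `weights=None` is used here to keep the example self-contained (the study loaded `weights="imagenet"`), the `train_ds`/`val_ds` datasets are placeholders you must supply, and the learning-rate decay terms from the text are omitted because that optimizer argument was removed in recent Keras versions.

```python
import tensorflow as tf

# Backbone with the classification head removed; the study used
# weights="imagenet" here (None avoids the large download in this sketch).
base = tf.keras.applications.ResNet50(weights=None, include_top=False,
                                      pooling="avg", input_shape=(224, 224, 3))
outputs = tf.keras.layers.Dense(3, activation="softmax")(base.output)
model = tf.keras.Model(base.input, outputs)

# Stage 1: train only the new 3-class softmax head.
base.trainable = False
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.2, momentum=0.3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=...)

# Stage 2: unfreeze everything and fine-tune end to end with Nesterov SGD.
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01,
                                                momentum=0.9, nesterov=True),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
early = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
# model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=[early])
```

Freezing the backbone first lets the randomly initialized head converge without disturbing the pre-trained features; the second stage then adapts the whole network to the OCT domain.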
The ResNet34 and ResNet50 architectures were also trained from randomly initialized weights. ResNet34 [77] was obtained from [82]. The mean pixel of the training dataset was used to center the training, validation, and test datasets. The input layer was modified to accept the single channel of the OCT images, and the output layer was changed for the classification of the three tissue types. For ResNet50, the SGD optimizer with Nesterov momentum was used with a learning rate of 0.01, a momentum of 0.9, and a decay of 0.01. ResNet50 was trained with a maximum of 50 epochs, early stopping with a patience of 10, and a batch size of 32. For ResNet34, the Adam optimizer was used with a learning rate of 0.001, beta1 of 0.9, beta2 of 0.9999, and epsilon of 1E-7. ResNet34 was trained with a maximum of 200 epochs, early stopping with a patience of 10, and a batch size of 512.

Nested cross-validation and testing
A nested cross-validation and testing procedure [74,76,83] was used to estimate the validation performance and the test performance of the models across the 10 kidneys with uncertainty quantification. The procedure is outlined as follows.
In the 10-fold cross-testing, one kidney was selected in turn as the test set. In the 9-fold cross-validation, the remaining nine kidneys were partitioned 8:1 between the training set and the validation set. Each kidney contained a total of 3000 images, including 1000 images for each tissue type. The validation performance of a model was tracked based on its classification accuracy on the validation kidney. The classification accuracy is the percentage of correctly labeled images out of all 3000 images of a kidney.
The 9-fold cross-validation loop was used to compare the performance of ResNet34, ResNet50, and MobileNetv2, and optimize the key hyperparameters of these models, such as pre-trained versus randomly initialized weights, learning rates, and number of epochs. The model configuration with the highest average validation accuracy was selected for the cross-testing loop. The cross-testing loop enabled iterative benchmarking of the selected model across all 10 kidneys, giving a better estimation of generalization error with uncertainty quantification.
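The nested loop described above can be sketched compactly. This is an illustrative outline, not the repository code: `train`, `evaluate`, the per-kidney `data` dict, and the candidate `configs` are placeholders.

```python
def nested_cv(kidneys, configs, train, evaluate, data):
    """Leave-one-kidney-out cross-testing with an inner
    leave-one-kidney-out cross-validation for model selection."""
    results = []
    for test_k in kidneys:                       # 10-fold cross-testing loop
        inner = [k for k in kidneys if k != test_k]

        def mean_val_acc(cfg):                   # 9-fold cross-validation loop
            accs = []
            for val_k in inner:                  # each kidney validates once
                train_ks = [k for k in inner if k != val_k]
                model = train(cfg, [data[k] for k in train_ks])
                accs.append(evaluate(model, data[val_k]))
            return sum(accs) / len(accs)

        best = max(configs, key=mean_val_acc)    # pick best-validating config
        model = train(best, [data[k] for k in inner])
        results.append(evaluate(model, data[test_k]))  # score on unseen kidney
    return results                               # 10 test-fold accuracies

# Toy usage: 10 kidneys, two hypothetical configurations.
kidneys = list(range(10))
res = nested_cv(kidneys, ["cfg_a", "cfg_b"],
                train=lambda cfg, ds: cfg,
                evaluate=lambda model, ds: 0.9 if model == "cfg_b" else 0.5,
                data={k: None for k in kidneys})
print(len(res))  # 10 test-fold results
```

Because every test kidney is held out of both training and model selection, the spread of the 10 returned accuracies quantifies the uncertainty due to inter-kidney variability.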
Gradient-weighted Class Activation Mapping (Grad-CAM) [84] was used to explain the predictions of a selected CNN model by highlighting the important regions in the image for the prediction outcome. The interpretation implementation on ResNet50 was based on [85].
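The core Grad-CAM computation can be illustrated independently of any specific framework. The sketch below assumes the last convolutional layer's `feature_maps` and the `gradients` of the class score with respect to them have already been extracted from the CNN; that extraction is framework-specific code omitted here.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap. feature_maps, gradients: arrays of shape (H, W, K)."""
    alphas = gradients.mean(axis=(0, 1))             # one weight per channel
    cam = np.tensordot(feature_maps, alphas, axes=([2], [0]))  # weighted sum
    cam = np.maximum(cam, 0)                         # ReLU keeps positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                        # normalize to [0, 1]
    return cam                                       # (H, W) heatmap

# Toy example with random activations/gradients of ResNet50's last-conv shape.
fmap = np.random.rand(7, 7, 2048)
grad = np.random.randn(7, 7, 2048)
heat = grad_cam(fmap, grad)
print(heat.shape)  # (7, 7); upsampled to 224 x 224 for overlay on the input
```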
All the model development was performed on the Schooner supercomputer at the University of Oklahoma and the Summit supercomputer at Oak Ridge National Laboratory. The computation on Schooner used five computational nodes, each with 40 CPU cores (Intel Xeon Cascade Lake) and 200 GB of RAM. The computation on Summit used up to 15 nodes, each with 2 IBM POWER9 processors and 6 NVIDIA Volta GPUs. The source code of the model training is available at https://github.com/thepanlab/FOCT_kidney.

Forward OCT imaging of different renal tissues
The imaging setup of our OCT endoscope is shown in Fig. 3(A). An adapter was used to stabilize the endoscope in front of the OCT scan lens kit. In the kidney sample shown in Fig. 3(A), the different tissue types can be visually distinguished: the renal cortex is the brown tissue on the edge of the whole kidney; the medulla can be recognized by its red renal pyramid structures distributed on the inner side of the cortex; and the calyx is featured by its obvious white structure in the central kidney. The three tissue types were imaged following the procedure described in Section 2.2.

Figures 3(B)-3(D) show representative 3D OCT images, 2D cross-sectional images, and histology results of the three renal tissues. They featured different imaging depths and brightness. The renal calyx had the shallowest imaging depth, but the tissue close to the surface showed the highest brightness and density. Cortex and medulla both presented relatively homogeneous tissue structures in the OCT images, and the imaging depth of medulla was larger than that of cortex. Furthermore, compared to cortex and medulla, calyx was characterized by horizontal stripes and a layered structure. The transitional epithelium and fibrous tissue in the calyx may explain the stripe-like structures and significantly higher brightness in comparison to the other two renal tissues. This is significant for PCN insertion, since the goal of PCN is to reach the calyx precisely. These imaging results demonstrated the feasibility of distinguishing renal cortex, medulla, and calyx with the endoscopic OCT system.

Table 1 shows the average validation accuracies and their standard errors for the pre-trained (PT) or randomly initialized (RI) model architectures after hyperparameter optimization. RI MobileNetv2 frequently failed to learn, so only the PT MobileNetv2 model was used here. The PT ResNet50 models outperformed the RI ResNet50 models in 6 of the 10 testing folds, which indicated only a small boost from the pre-training on ImageNet. For all 10 testing folds, the validation accuracies of the ResNet50 models were significantly higher than those of the MobileNetv2 and ResNet34 models. Thus, the characteristic patterns of the three kidney tissues may require a deep CNN architecture to recognize. The detailed results of the 9-fold cross-validation of RI ResNet34, PT MobileNetv2, RI ResNet50, and PT ResNet50 can be found in Supplementary Tables 1-4, respectively. Table 2 shows the test accuracy of the best-performing model in each of the 10 testing folds.
The output layer of the CNN models estimated three softmax scores that summed to 1.0 for the three tissue types. When the category with the highest softmax score was selected for an image (i.e., a softmax score threshold of 0.333 to make a prediction), the CNN model made a prediction for every image (100% coverage) at a mean test accuracy of 82.6%. This was substantially lower than the mean validation accuracy of 87.3%, suggesting overfitting to the validation set by the hyperparameter tuning and early stopping. The classification accuracy can be increased at the expense of lower coverage by raising the softmax score threshold, which allows the CNN model to make only confident classifications. When the softmax score threshold was raised to 0.5, 89.9% of the images on average were classified to a tissue type and the mean classification accuracy increased to 85.6% ± 3.0%. For the uncovered images, doctors can make a prediction with the help of other imaging modalities and their clinical experience.
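The coverage/accuracy trade-off described above amounts to abstaining whenever the top softmax score falls below a threshold. A minimal sketch, with illustrative synthetic scores (the variable names are placeholders, not the study's code):

```python
import numpy as np

def coverage_accuracy(scores, labels, threshold=0.5):
    """scores: (N, 3) softmax outputs; labels: (N,) true class indices."""
    top = scores.max(axis=1)
    pred = scores.argmax(axis=1)
    covered = top >= threshold                 # only confident predictions count
    coverage = covered.mean()
    accuracy = (pred[covered] == labels[covered]).mean() if covered.any() else float("nan")
    return coverage, accuracy

scores = np.array([[0.90, 0.05, 0.05],        # confident, correct
                   [0.40, 0.35, 0.25],        # below threshold: abstain
                   [0.34, 0.33, 0.33]])       # below threshold: abstain
labels = np.array([0, 1, 0])
cov, acc = coverage_accuracy(scores, labels, threshold=0.5)
print(cov, acc)
```

At threshold 0.333 every image is covered (the top score of a 3-way softmax is always at least 1/3); raising the threshold trades coverage for accuracy, as reported in the text.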

CNN development and benchmarking results
There was substantial variability in the test accuracy among different kidneys. While three kidneys had test accuracies higher than 92% (softmax score threshold of 0.333), the kidney in the sixth fold had the lowest test accuracy of 67.7%. Therefore, the current challenge in the image classification mainly comes from the anatomic differences among the samples. For instance, Figs. 4(A) and 4(B) show the receiver operating characteristic (ROC) curves of the prediction results from kidney No. 5 and No. 10 (ROC curves of all 10 kidneys in the 10-fold cross-testing can be found in the Supplementary data). The prediction for kidney 5 is clearly much more accurate than that for kidney 10. Our nested cross-validation and testing procedure was designed to simulate the real clinical setting, in which CNN models trained on one set of kidneys need to perform well on a new kidney unseen by the models until the operation. When a CNN model was trained on a subset of images from all kidneys and validated on a separate subset of images from all kidneys in cross-validation, as opposed to partitioning by kidneys, it achieved accuracies over 99%. This suggests that the challenge of image classification mainly stems from the biological differences between kidneys. The generalization of the CNN models across kidneys can be improved by expanding our dataset with kidneys of different ages or physical conditions to represent different structural and morphological features. Table 3 shows the average confusion matrix for the 10 kidneys in the 10-fold cross-testing with a score threshold of 0.333, along with the average recall and precision for each tissue type. Confusion matrices from the 10-fold cross-testing for each of the 10 kidneys are shown in the Supplementary data. Cortex was the most challenging tissue type to classify correctly and was often confused with medulla.
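The per-kidney confusion-matrix and one-vs-rest ROC analysis described above can be reproduced with scikit-learn. The labels and softmax scores below are a tiny synthetic example, not the study's data; `cortex=0, medulla=1, calyx=2` is the class encoding we assume.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve, auc

labels = np.array([0, 0, 1, 1, 2, 2])               # cortex=0, medulla=1, calyx=2
scores = np.array([[0.7, 0.2, 0.1], [0.3, 0.5, 0.2],
                   [0.1, 0.8, 0.1], [0.2, 0.6, 0.2],
                   [0.1, 0.2, 0.7], [0.2, 0.1, 0.7]])

pred = scores.argmax(axis=1)
cm = confusion_matrix(labels, pred)                 # rows: true, cols: predicted
fpr, tpr, _ = roc_curve(labels == 2, scores[:, 2])  # one-vs-rest ROC for calyx
area = auc(fpr, tpr)
print(cm)
print(area)
```

Here one cortex image is misclassified as medulla (mirroring the cortex/medulla confusion reported in Table 3), while the calyx scores separate perfectly, giving an AUC of 1.0 for that class.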
From the original images, we found that the penetration depths of medulla were much larger than those of cortex in seven of the ten imaged kidneys, while in the other three samples these differences were insignificant. This may explain the challenging classification between cortex and medulla. To better understand how a CNN model classified the different renal tissue types, class activation maps were generated to visualize heatmaps of class activation over input images. Figure 5 shows the class activation heatmaps for 3 representative images of each tissue type from the RI ResNet50 model and the PT ResNet50 model. The models and the representative images were selected from the fifth testing fold. The RI ResNet50 model performed the classification by paying more attention to the lower part of the cortex images, to both the lower part and near the upper part of the medulla images, and to the area of the calyx images with high intensity near the needle tip. The PT ResNet50 model focused on both the upper and lower parts of the cortex images, on the middle and/or lower parts of the medulla images, and on the region close to the needle tip in the calyx images. Compared to the RI ResNet50 model, the PT ResNet50 model shifted its attention closer to the signal regions. The class activation heatmaps provided an intuitive explanation of the classification basis for the two CNN models.

Discussion
We investigated the feasibility of an OCT endoscopic system for PCN surgery guidance. Three porcine kidney tissue types (cortex, medulla, and calyx) were imaged. These three kidney tissues show different structural features that can be further used for tissue type recognition. To increase the image recognition efficiency and reduce the learning burden on clinical doctors, we developed and evaluated CNN methods for image classification and recognition. ResNet50 outperformed ResNet34 and PT MobileNetv2, achieving an average classification accuracy of 82.6% ± 3.0%.
In the current study, the porcine kidney samples were obtained from a local slaughterhouse without control over sample preservation or time after death. Biological changes may have occurred in the ex-vivo kidneys, including the collapse of some nephron structures such as the renal tubules. This may make tissue recognition more difficult, especially the classification between cortex and medulla. Characteristic renal structures in the cortex can be clearly imaged by OCT in both well-preserved ex-vivo human kidneys and living kidneys, as previously reported [44,50,86] and verified in an ongoing study in our lab using well-preserved human kidneys. Additionally, the nephron structures distributed in the renal cortex and medulla are different [87]. These additional features in the renal cortex and medulla will improve the recognition of these two tissue types and increase the classification accuracy of our future CNN models when imaging in-vivo samples or well-preserved ex-vivo samples. The current study established the feasibility of automatic tissue recognition using CNNs and provided information for model selection and hyperparameter optimization in future CNN model development using in-vivo pig kidneys and well-preserved ex-vivo human kidneys.
For translating the proposed OCT probe into the clinic, we will assemble an endoscope of appropriate diameter and length into the clinically used PCN needle. In current PCN puncture, a trocar needle is inserted into the kidney. Since the trocar has a hollow structure, we can fix the endoscope within the trocar needle. Then our OCT endoscope can be inserted into the kidney together with the trocar needle. After the trocar needle tip arrives at the destination (such as the kidney pelvis), we will withdraw the OCT endoscope from the trocar needle and the remaining surgical steps can continue. During the whole puncture, no extra invasiveness will be caused. Since the needle keeps moving during the puncture, there will be tight contact between the needle tip and the tissue; therefore, blood (if any) will not accumulate in front of the needle tip. From our previous experience in the in-vivo pig experiment guiding epidural anesthesia with our OCT endoscope, the presence of blood is not a major issue [65]. The diameter of the GRIN rod lens used in this study is 1.3 mm. In a future study, we will further improve the current setup with a smaller GRIN rod lens that can fit inside the 18-gauge PCN needle clinically used in PCN puncture [88]. Furthermore, we will miniaturize the GSM device based on microelectromechanical system (MEMS) technology, which will enable ease of operation and is important for translating the OCT endoscope to clinical applications. The currently employed OCT system has a scanning speed of up to 200 kHz, so 2D tissue images in front of the PCN needle can be provided to surgeons in real time. With ultra-high-speed laser scanning and data processing systems, 3D images of the detected sample can be obtained in real time [46,48]. As the next step, we will acquire 3D images, which may further improve our classification accuracy because of their added information content. For example, Kruthika et al. detected Alzheimer's disease from MRI images and showed improved performance of a 3D Capsule Network (CapsNet) and 3D CNN over previous 2D approaches [89].
The CNN model training in this study used significant computational power. Each fold of the cross-validation took ∼25 minutes for RI ResNet50 and ∼45 minutes for PT ResNet50 using one NVIDIA Volta GPU. The 90 folds of the nested cross-validation for each model configuration were performed in parallel across multiple compute nodes on the Summit supercomputer. The inference was computationally efficient: it took ∼50 seconds using one NVIDIA Volta GPU to perform inference on 1000 images (i.e., ∼0.05 seconds per image on average), including model loading, image preprocessing, and the ResNet50 classification. The ResNet50 classification itself used ∼16 seconds for 1000 images, or ∼0.016 seconds per image. In the future, inference can be further accelerated through algorithm optimization and parallelization to make it more practical for surgical applications.