Connectivity-based deep learning approach for segmentation of the epithelium in in vivo human esophageal OCT images

Abstract: Optical coherence tomography (OCT) is used for diagnosis of esophageal diseases such as Barrett's esophagus. Given the large volume of OCT data acquired, automated analysis is needed. Here we propose a bilateral connectivity-based neural network for in vivo human esophageal OCT layer segmentation. Our method, connectivity-based CE-Net (Bicon-CE), defines layer segmentation as a combination of pixel connectivity modeling and pixel-wise tissue classification. Bicon-CE outperformed other widely used neural networks and reduced common topological prediction issues in tissues from healthy patients and from patients with Barrett's esophagus. This is the first end-to-end learning method developed for automatic segmentation of the epithelium in in vivo human esophageal OCT images.


Introduction
Esophageal cancer is the seventh most common cancer and the sixth most common cause of cancer mortality worldwide [1]. Early detection of precancerous esophagus and dysplasia can help reduce the morbidity and mortality of esophageal cancer [2]. Systematic biopsies can detect high-grade dysplasia [3] but are limited by their invasiveness and by blind sampling of tissue. Optical coherence tomography (OCT) allows non-invasive cross-sectional imaging of soft tissues [4] and has been used for examining the upper gastrointestinal tract [5][6][7][8][9]. Accurate interpretation of esophageal OCT images is essential for detection of dysplasia [10,11] and for diagnosis of esophageal diseases such as eosinophilic esophagitis [12] and Barrett's esophagus (BE) [13,14]. In BE, the normal stratified squamous epithelium of the esophagus is replaced by a specialized columnar epithelium [2,15]. However, due to the large volume of OCT images and patient-specific confounders such as tissue folding and mucus covering, analysis of OCT images by gastroenterologists is time-consuming and subjective. To enable high-throughput clinical use of OCT in esophageal cancer screening, an automated segmentation algorithm to accurately quantify tissue characteristics such as thickness and shape is needed.
Existing automated esophageal OCT segmentation algorithms fall into two main categories: traditional image processing-based methods and deep learning-based methods. Zhang et al. [16] and Gan et al. [17] proposed traditional image processing algorithms to alleviate speckle noise and improved graph searching-based methods for esophageal layer segmentation in guinea pigs. Wang et al. [18] combined a sparse Bayesian classifier with graph theory and dynamic programming [19] to segment esophageal layers, also in guinea pigs. Since these traditional image processing-based methods rely on predefined features, they are less reliable in dealing with variation among images.

Data acquisition
Subjects were recruited from patients undergoing routine care endoscopy at UNC Healthcare. Of 54 patients initially recruited, 30 were successfully imaged using OCT (784 B-scans total); of these 30 subjects, six had non-dysplastic BE and one had BE with low-grade dysplasia.
The imaging technique design is described in [7]. Briefly, a paddle-shaped probe that is attached externally to an endoscope provides cross-sectional spectral domain OCT images of the esophageal mucosa to supplement standard video endoscopy, in a form factor compatible with existing workflow and clinical practice.
Acquired OCT data were cropped manually around the region of interest (ROI; the esophagus) and were labeled independently by three graders experienced at evaluating OCT images. The annotations of the most experienced grader (Grader #1) served as the gold standard, and those of the other two graders were used to test inter-grader variability. We note that Graders #2 and #3 were trained by Grader #1 prior to labeling the images presented in this paper, using in vivo human esophageal OCT images that were not part of the dataset included in the manuscript. Since segmentation is challenging even for experts, the graders were asked to segment the epithelium only in regions in which they were confident about the accuracy of their annotation. Therefore, manual segmentations often did not span the entire B-scan. We refer to the horizontal range of an OCT B-scan that was manually segmented by a grader as that grader's "trainable interval". To avoid interference from unsegmented epithelium regions in network training, we cropped Grader #1's trainable intervals from the OCT ROI images and used only the resulting cropped images for training (Fig. 1(a)). In the testing phase, we used the original uncropped ROI images as the network input. However, the performance of different methods and graders was compared to the gold standard only over a specifically defined interval. We define two intervals for evaluation. The first is the trainable interval of Grader #1, which is used for comparing the Bicon models with the baseline methods in Section 3.3. The second is the graders' consensus interval (Fig. 1(b)), defined as the overlap of all three graders' trainable intervals; it is used in Sections 3.2 and 3.4 for experiments that include inter-grader analysis, to provide a fair comparison among graders. All OCT images and their corresponding manual annotations by the three graders used in this paper are available at [29].

Bilateral connectivity network with CE-Net backbone
Building upon our recent work [28], we constructed a bilateral connectivity network to fully model pixel connectivity along with pixel-wise tissue classification. Figure 2 provides an overview of the method. Our network contains three parts: a connectivity-based CE-Net [30] backbone, a bilateral voting (BV) module, and a region-guided channel aggregation (RCA) module. As we use CE-Net as the backbone of our model, we refer to it as Bicon-CE. For the loss function, we adopted a modified Bicon loss [28]. We describe each component in detail in the following sections.

Connectivity mask
Classic convolutional neural network (CNN)-based methods treat image segmentation as a pure pixel label assignment problem. We refer to these methods as pixel classification-based methods. Since this modeling strategy neglects inter-pixel relationships [28], it can result in inconsistent boundaries and topological issues in layer segmentation (examples in Section 3.3). We propose an alternative model to address these problems. Unlike general semantic segmentation tasks, in which a single class may contain multiple instances in a dataset, in esophagus layer segmentation each layer class contains only one simply connected region: the pixels from the same layer are all topologically connected. Therefore, strong inter-pixel coherence exists between pixels of the same layer. Inspired by this feature, we design our CNN such that it learns to classify image pixels in concert with modeling the connectivity between pixels from the same class. In the manually labeled binary masks, areas belonging to the epithelial layer are marked with 1 and are referred to as positive pixels, while all other pixels are marked with 0. We define two pixels as connected if and only if they are adjacent and both are positive. As shown in Fig. 3(a), given a pixel in a binary mask G_S, we find its 8 neighboring pixels (C1-C8) using the 8-neighborhood system [31]. Then, we construct an 8-entry connectivity vector for the center pixel in which each entry represents the connectivity between the center pixel and one neighboring pixel in a specific direction. Thus, given a binary mask G_S, we generate an 8-channel mask by deriving connectivity vectors for all of its pixels. We call this 8-channel mask the connectivity mask (G_C). For every two neighboring pixels in G_S, there are two specific paired elements in G_C that represent the mutual connectivity between them. We call these two elements a connectivity pair (Fig. 3(b)).
We use the connectivity mask as the label to model pixel connectivity for supervised learning.

Fig. 3. (a) Example of converting a pixel of G_S into a connectivity vector (yellow vector in G_C); the subfigure on top shows the conversion of pixel G_S(1,4). G_S is zero-padded at the boundaries, and G_C is obtained after all pixels of G_S are converted. (b) Example of a connectivity pair. The connectivity pair corresponding to the two green pixels in G_S is shown as the red boxed pair in G_C: G_C1(2,2) represents the top-left connectivity of G_S(2,2), and G_C8(1,1) represents the bottom-right connectivity of G_S(1,1).
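As a concrete illustration, the connectivity mask can be built from a binary label mask with a few lines of NumPy. The neighbor ordering below (C1-C8, with channel j opposite channel 9−j in 1-indexed terms) and the function name are our assumptions for illustration, not the authors' code:

```python
import numpy as np

# 8-neighborhood offsets (dy, dx) for channels C1..C8, ordered
# top-left, top, top-right, left, right, bottom-left, bottom, bottom-right.
# Any fixed order works as long as channel j and channel 9-j (1-indexed)
# point in opposite directions.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]

def connectivity_mask(g_s: np.ndarray) -> np.ndarray:
    """Convert a binary mask G_S of shape (H, W) into an 8-channel
    connectivity mask G_C of shape (8, H, W). Two pixels are connected
    iff they are adjacent and both are positive; G_S is zero-padded
    at the boundaries, so border pixels have no out-of-image neighbors."""
    h, w = g_s.shape
    padded = np.pad(g_s, 1)                # zero padding around the mask
    g_c = np.zeros((8, h, w), dtype=g_s.dtype)
    for ch, (dy, dx) in enumerate(OFFSETS):
        # Neighbor value in direction (dy, dx) for every pixel at once.
        neighbor = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
        g_c[ch] = g_s * neighbor           # 1 only if both pixels are positive
    return g_c
```

With this construction, the two elements of every connectivity pair are automatically equal in the label, which is exactly the property the bilateral voting module later enforces on predictions.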

Connectivity-based CE-Net
We used CE-Net [30] as our network's backbone. Like U-Net, CE-Net has a U-shaped encoder-decoder structure. To extract high-resolution feature maps, CE-Net includes a dense atrous convolution (DAC) block after its last encoder block. To maintain the multi-scale information from the DAC block, CE-Net includes a residual multi-kernel pooling (RMP) block. For single-class tasks like epithelial layer segmentation, the original output layer of CE-Net is a single-channel fully connected (FC) layer. To introduce the pixel connectivity information, we replace the output layer with an 8-channel FC layer and use connectivity masks as the training labels. By doing so, we construct a connectivity-based CE-Net which, given an OCT esophagus image, outputs an 8-channel map that we call the connectivity map (Conn map, C). Every pixel in C represents the unidirectional connection probability of a pixel in a specific connectivity pair, and every channel represents the unidirectional pixel connection probability in a specific direction.
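A minimal sketch of this output-layer change, with `backbone` standing in for the CE-Net encoder-decoder (not reproduced here) and `feat_channels` an assumed decoder width; the per-pixel FC layer is implemented as a 1×1 convolution:

```python
import torch
import torch.nn as nn

class ConnectivityHead(nn.Module):
    """Sketch of the 8-channel output head. Only the final layer differs
    from the original single-channel CE-Net output."""
    def __init__(self, backbone: nn.Module, feat_channels: int = 64):
        super().__init__()
        self.backbone = backbone
        # Original CE-Net head: nn.Conv2d(feat_channels, 1, kernel_size=1).
        # Connectivity-based head: one channel per neighbor direction C1-C8.
        self.out = nn.Conv2d(feat_channels, 8, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)                 # (B, feat_channels, H, W)
        return torch.sigmoid(self.out(feats))    # Conn map C: (B, 8, H, W)
```

Because only the last layer changes, the same wrapper applies unchanged to the U-Net and U-Net++ backbones used for Bicon-UNet and Bicon-UNet++.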

BV and RCA modules
After obtaining the Conn map, we enhance the coherence between neighboring pixels via the bilateral voting module (Fig. 4). In the BV module, we multiply the two elements in every connectivity pair and assign the resulting value to both elements, yielding a new map called the bilateral connectivity map (Bicon map, C̄):

C̄_j(x, y) = C_j(x, y) · C_{9−j}(x + a, y + b),   j = 1, …, 8,    (1)

where j is the j-th channel, and a, b ∈ {0, ±1} index the front-view (spatial-view) relative position of the two pixels in this connectivity pair. Every channel of C̄ represents the bidirectional pixel connection probability of a specific direction.

Fig. 4. The connectivity modeling process. In the BV module, every connectivity pair in C is multiplied to generate a new map (C̄). In the RCA module, a channel-wise aggregation function f is applied to every connectivity vector (highlighted in yellow) to generate a single-channel map.
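The bilateral voting step can be sketched in PyTorch as follows. The neighbor ordering and the pairing of channel index i with index 7−i (0-indexed, i.e., j with 9−j in 1-indexed terms) are indexing assumptions for illustration:

```python
import torch

# 8-neighborhood offsets (dy, dx) for channels 0..7; channel i and
# channel 7-i point in opposite directions.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]

def bilateral_voting(conn_map: torch.Tensor) -> torch.Tensor:
    """Given a Conn map C of shape (B, 8, H, W), multiply the two
    elements of every connectivity pair and assign the product to
    both, yielding the Bicon map of the same shape."""
    b, c, h, w = conn_map.shape
    bicon = torch.zeros_like(conn_map)
    for i, (dy, dx) in enumerate(OFFSETS):
        # Pair element of channel i at (y, x) lives in the opposite
        # channel 7-i at the neighbor location (y+dy, x+dx); shift that
        # channel so paired elements align spatially.
        opposite = conn_map[:, 7 - i]
        shifted = torch.zeros_like(opposite)
        ys = slice(max(dy, 0), h + min(dy, 0))    # source rows
        yd = slice(max(-dy, 0), h + min(-dy, 0))  # destination rows
        xs = slice(max(dx, 0), w + min(dx, 0))    # source cols
        xd = slice(max(-dx, 0), w + min(-dx, 0))  # destination cols
        shifted[:, yd, xd] = opposite[:, ys, xs]
        bicon[:, i] = conn_map[:, i] * shifted
    return bicon
```

Because the product is assigned to both members of a pair, the Bicon map is symmetric by construction: a connection is scored high only when both pixels agree on it.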
As stated in Section 2.3, connectivity is defined only for adjacent positive pixels. Thus, pixel connectivity and pixel positivity are closely correlated. If two neighboring pixels are positive, then they are connected; conversely, if we know two pixels are connected, then both are positive. Therefore, the probability of a pixel being connected with others is the probability of it being positive. This reverse inference is done in the RCA module (Fig. 4). Given the Bicon map C̄, we derive the overall connectivity probability map via a channel-wise aggregation function f:

S̄(x, y) = f(C̄_1(x, y), …, C̄_8(x, y); x, y),    (2)

where f is an adaptive aggregating operation that varies with location (x, y), i is the i-th channel, and C̄_i is one channel of C̄ showing the bidirectional connection probability of the i-th direction. S̄ is a single-channel map representing the overall aggregated probability of each pixel being positive.
Here we use two types of aggregation methods to define f, resulting in two single-channel output maps. In the first method, which differs from the method in [28], we take the maximum value among channels to construct the global map S̄_global:

S̄_global(x, y) = max_{i ∈ {1, …, 8}} C̄_i(x, y).    (3)

Using the maximum connection probability across channels, we enforce a high probability for the pixel at (x, y) of S̄_global to be positive as long as it is connected with at least one of its neighbors. This strategy encourages the pixel to focus on learning its connectivity with its most likely connected neighboring pixel, thus alleviating the effects of noisy pixels. Next, to emphasize the boundary of the epithelial layer, we use a second method, called edge-guided aggregation, which combines the channels differently at edge and non-edge locations. This yields a new map called the edge-decouple map, S̄_decouple, as we described previously [28]:

S̄_decouple(x, y) = f_edge(C̄_1(x, y), …, C̄_8(x, y)) if (x, y) ∈ P_edge, and S̄_global(x, y) otherwise,    (4)

where f_edge is the edge-specific aggregation of [28] and P_edge is the set of ground truth edge pixels, which are obtained from the connectivity mask [28]. Both S̄_global and S̄_decouple are used during the training process, and S̄_global is used as the final prediction in the testing phase.
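A sketch of the two aggregation variants: `rca_global` follows the channel-wise maximum rule directly, while `edge_agg` is a placeholder for the edge-specific aggregation described in [28], which we leave abstract here:

```python
import torch

def rca_global(bicon_map: torch.Tensor) -> torch.Tensor:
    """Global aggregation: the channel-wise maximum of the Bicon map
    (B, 8, H, W) gives S_global (B, 1, H, W). A pixel scores as positive
    when it is strongly connected to at least one neighbor."""
    return bicon_map.max(dim=1, keepdim=True).values

def rca_decouple(bicon_map: torch.Tensor,
                 edge_mask: torch.Tensor,
                 edge_agg) -> torch.Tensor:
    """Edge-guided aggregation: combine channels differently at edge and
    non-edge pixels. `edge_mask` is a boolean (B, 1, H, W) map of the
    ground truth edge set P_edge; `edge_agg` stands in for the
    edge-specific aggregation of [28]."""
    s_edge = edge_agg(bicon_map)                       # (B, 1, H, W)
    return torch.where(edge_mask, s_edge, rca_global(bicon_map))
```

In training, both maps are computed; at test time only `rca_global` is needed, thresholded at 0.5 for the final binary prediction.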

Loss function
As in [28], we define the overall loss function of our network as

L_total = L_decouple + L_con_const + L_dice.    (5)

The first term, L_decouple, is the edge-decoupled loss, which is the binary cross-entropy (BCE) loss between S̄_decouple and the ground truth segmentation mask G_S:

L_decouple = BCE(S̄_decouple, G_S).    (6)

L_con_const is the connectivity consistency loss. Unlike in [28], where this loss is defined as a weighted sum of BCE losses applied to both the Conn map (L_conmap) and the Bicon map (L_bimap), here we define the connectivity consistency loss only for the Conn map:

L_con_const = L_conmap = BCE(C, G_C).    (7)

We define L_con_const only for the Conn map because L_bimap automatically gives greater weight to boundary pixels and less weight to background regions [28]. This strategy works well in natural images because the inter-class difference between background and positive pixels is relatively large, while the intra-class difference between positive pixels is small. However, in in vivo human esophageal OCT images, due to the limited information in grey-level pixels, the lower image quality, and the complex characteristics of tissues, it is usually hard to observe a large overall inter-pixel difference between epithelial layer pixels and pixels of other layers. Therefore, we use L_con_const = L_conmap to give the same weight to positive pixels and background pixels. The third term in Eq. (5), L_dice, is the dice loss [32], defined as

L_dice = 1 − (2 Σ_{x=1}^{W} Σ_{y=1}^{H} S̄_global(x, y) G_S(x, y)) / (Σ_{x=1}^{W} Σ_{y=1}^{H} S̄_global(x, y) + Σ_{x=1}^{W} Σ_{y=1}^{H} G_S(x, y)),    (8)

where H and W are the height and width of the input image, respectively.
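A hedged PyTorch sketch of this loss; the equal weighting of the three terms, the use of the global map in the dice term, and the smoothing constant are our assumptions, with [28] giving the full formulation:

```python
import torch
import torch.nn.functional as F

def bicon_loss(conn_map: torch.Tensor,
               s_global: torch.Tensor,
               s_decouple: torch.Tensor,
               g_s: torch.Tensor,
               g_c: torch.Tensor) -> torch.Tensor:
    """Overall loss: edge-decoupled BCE + connectivity consistency BCE
    on the Conn map + dice loss. All inputs are probabilities/labels
    in [0, 1]; g_s is the binary mask, g_c the connectivity mask."""
    l_decouple = F.binary_cross_entropy(s_decouple, g_s)
    l_con_const = F.binary_cross_entropy(conn_map, g_c)  # L_conmap only
    eps = 1e-6                           # smoothing constant (assumed)
    inter = (s_global * g_s).sum()
    l_dice = 1 - (2 * inter + eps) / (s_global.sum() + g_s.sum() + eps)
    return l_decouple + l_con_const + l_dice
```

Dropping the Bicon-map BCE term means boundary and background pixels are weighted equally, matching the motivation given above for low-contrast in vivo OCT data.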

Training
We used 10-fold cross-validation, where for each fold we randomly chose 3 subjects that were not previously selected for testing, and used the remaining 27 subjects for training. There was no overlap between the training and testing sets. We pretrained the CE-Net backbone on ImageNet, as in the original CE-Net paper [30]. We trained Bicon-CE on cropped data, which contained only the trainable interval. We did not perform data augmentation during training and kept the original aspect ratio of the ROI images for training and testing. We used mini-batches with a batch size of 8 to train the network. To use mini-batches while keeping the original aspect ratio, for every batch we first found the maximum width (x_max) and height (y_max) of the images in the batch. Then, we zero-padded every image in the batch to the size (x_max, y_max). We used the Adam optimizer with (β1, β2) = (0.9, 0.999) and weight decay = 0.0001. We trained our network for 45 epochs in total. The learning rate was initially 2e-4 and was multiplied by 0.2 at the 30th epoch.
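The per-batch zero-padding scheme can be implemented as a `collate_fn` for a PyTorch `DataLoader`; this is a sketch of the padding logic described above, not the authors' code:

```python
import torch
import torch.nn.functional as F

def pad_collate(batch):
    """Zero-pad every (image, mask) pair in a mini-batch to the largest
    height/width in that batch, preserving each image's original aspect
    ratio (no resizing). Tensors are (C, H, W); padding is added on the
    right and bottom only."""
    y_max = max(img.shape[-2] for img, _ in batch)
    x_max = max(img.shape[-1] for img, _ in batch)
    imgs, masks = [], []
    for img, mask in batch:
        # F.pad order for the last two dims: (left, right, top, bottom).
        pad = (0, x_max - img.shape[-1], 0, y_max - img.shape[-2])
        imgs.append(F.pad(img, pad))
        masks.append(F.pad(mask, pad))
    return torch.stack(imgs), torch.stack(masks)
```

It would be passed as `collate_fn=pad_collate` to a `DataLoader` with `batch_size=8`, so each batch is padded only to its own maximum size rather than a dataset-wide one.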

Prediction
In the testing phase, we used uncropped ROI images as the network inputs. We obtained the final prediction by thresholding the output global mapS global at 0.5 (Fig. 2).

Evaluation metrics
We calculated the dice coefficient (DSC) [33] for each prediction as

DSC = 2TP / (2TP + FP + FN),    (9)

where TP is the number of true positive pixels, FP is the number of false positive pixels, and FN is the number of false negative pixels in the predicted binary map. DSC ranges from 0 to 1, where a higher value means the prediction is closer to the gold standard. While DSC is a commonly used metric in medical segmentation, it is not sensitive to small outliers. Therefore, we also calculated the mean total error, E_t, and the mean net error, E_n [34], of the predicted epithelial layer:

E_t = (k/N) Σ_{i=1}^{N} |T_pred(i) − T_gold(i)|,    (10)

E_n = (k/N) |Σ_{i=1}^{N} (T_pred(i) − T_gold(i))|,    (11)

where T_pred(i) and T_gold(i) are the predicted and gold standard epithelial layer thicknesses (in pixels) at the i-th B-scan column, N is the total number of B-scan columns, and k = 6.5 is the scaling factor for converting pixels to microns. To quantify tissue characteristics, we calculated the overall thickness of the predicted epithelial layer. To evaluate the statistical significance of results, we calculated p-values using the Wilcoxon signed-rank test, where p < 0.05 indicated statistical significance.
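These metrics can be computed directly from binary maps; the per-column thickness definition used for E_t and E_n below is our reading of [34]:

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, gold: np.ndarray) -> float:
    """DSC = 2*TP / (2*TP + FP + FN) for binary (H, W) maps."""
    tp = np.sum((pred == 1) & (gold == 1))
    fp = np.sum((pred == 1) & (gold == 0))
    fn = np.sum((pred == 0) & (gold == 1))
    return 2 * tp / (2 * tp + fp + fn)

def thickness_errors(pred: np.ndarray, gold: np.ndarray, k: float = 6.5):
    """Mean total error E_t and mean net error E_n in microns, from
    per-column layer thickness (positive-pixel count per B-scan column).
    k converts pixels to microns."""
    t_pred = pred.sum(axis=0).astype(float)   # thickness per column
    t_gold = gold.sum(axis=0).astype(float)
    diff = t_pred - t_gold
    e_t = k * np.mean(np.abs(diff))           # absolute errors: outliers count
    e_n = k * abs(np.mean(diff))              # signed errors can cancel out
    return e_t, e_n
```

Note the difference in sensitivity: in E_n an over-segmented column can cancel an under-segmented one, while E_t penalizes both, which is why the two are reported together.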

Comparison with alternative methods
We compared our approach with three widely used medical image segmentation models: U-Net, U-Net++ [35], and CE-Net (Table 1). Bicon-CE outperformed U-Net, U-Net++, and CE-Net across all listed metrics. The average DSC of our method was significantly higher (p < 0.0001) than that of CE-Net (by 5.1%), U-Net++ (by 5.2%), and U-Net (by 6.7%). The overall layer thickness values of Bicon-CE were closer to the gold standard (Grader #1) than those of the other methods. Both the net error and the total error of Bicon-CE segmentations were lower than those of U-Net, U-Net++, and CE-Net. In addition to Bicon-CE, we constructed two other Bicon models, Bicon-UNet and Bicon-UNet++, and reported their results; the analysis of these alternative Bicon models is included in Section 3.3. We also compared the labels of Graders #2 and #3 with those of the more experienced Grader #1 (Table 1, rows 2-3). Grader #2 could not confidently grade one of the subjects and therefore graded only 29 of the 30 subjects; Grader #3 graded all subjects. The results show variability and disagreement between human graders, reflecting the subjectivity and challenging nature of this task. None of the baseline models outperformed the manual segmentation results of Grader #3, while Bicon-CE significantly outperformed them, demonstrating the effectiveness of our model. Note that the results in Table 1 were evaluated over the graders' consensus intervals to provide a fair comparison between graders, as they had different trainable intervals.

Performance of connectivity modeling
In the previous section, we showed that Bicon-CE performed better than CE-Net, U-Net++, and U-Net. Here, we show that for the tested models: (1) connectivity modeling is superior to pixel classification-based modeling in general, independent of the backbone selection; and (2) our Bicon-based method is compatible with other image segmentation models. For this, we constructed Bicon-enhanced versions of U-Net (Bicon-UNet) and U-Net++ (Bicon-UNet++). In Table 2, we compare the Bicon-enhanced models with the corresponding baseline networks. We used Grader #1's trainable interval for evaluation since it reflects performance over a wider region. All three Bicon-enhanced models significantly (p < 0.0001) outperformed their corresponding baseline methods, indicating the effectiveness of the connectivity modeling method. Moreover, the extra computational cost of our method is negligible; for example, Bicon-UNet increased DSC by 6.3% compared to U-Net with only 455 extra parameters. These results also show that the connectivity modules are compatible with other pixel classification-based neural networks.
This connectivity modeling method reduces topological problems such as outlier predictions, disconnected predictions, and non-simply connected predictions (examples in Fig. 5). As shown in Fig. 5(a), both CE-Net and U-Net generated an outlier region and a non-simply connected region, which did not occur with Bicon-CE or Bicon-UNet. U-Net++ predicted an incorrect boundary, likely due to artifacts, while Bicon-UNet++ avoided it. In Fig. 5(b), likely due to the non-uniform intensity of the esophagus, none of the three baselines made a horizontally continuous prediction, but all three Bicon-enhanced models made continuous predictions that covered the layer region. In Fig. 5(c), likely due to strong ring artifacts, the baselines were negatively affected around the artifact area; however, all three Bicon-enhanced models made continuous predictions despite these artifacts. The feature space of Bicon-CE also demonstrates its ability to extract layered features while avoiding artifacts (see Supplement S1 for supporting content).
Comparisons with Bicon-UNet and Bicon-UNet++ demonstrate the effectiveness and efficiency of Bicon-CE. Compared to Bicon-UNet, Bicon-CE achieved a significantly lower net thickness error. Compared to Bicon-UNet++, Bicon-CE had significantly lower net and total thickness errors. As shown in Fig. 5, although all three Bicon models largely avoided the topological issues, Bicon-CE predicted a smoother boundary and a more continuous shape than the other two, even in the presence of strong artifacts. To show the efficiency of Bicon-CE, we report the processing speeds of all networks when tested on OCT B-scans of size 512×512 pixels in Table 2. Bicon-CE was faster than Bicon-UNet and Bicon-UNet++ even though it had more parameters than Bicon-UNet. Thus, we chose CE-Net as the backbone, as it enabled the extraction of high-resolution multi-level features while maintaining a fast processing speed.

Robustness analysis
Automated segmentation of clinical in vivo human esophageal OCT images is challenging not only due to imaging artifacts and noise, but also due to patient-specific outliers such as irregular tissue shape, mucus, and in-layer image intensity non-uniformity [36]. Examples of the robustness of our method are shown in qualitative comparisons between our method and others under different scenarios in Fig. 6: (a) mucus covering; (b) OCT discontinuity; (c-d) imaging artifacts; (e) low contrast imaging; (f) non-uniform intensity. Mucus is produced by glands in the esophageal lining to keep the passageway moist; however, due to variable thickness and scattering content, mucus does not always appear in OCT images. Therefore, it is important for the automated method to handle segmentation with and without mucus. As shown in Fig. 6 (a), neither U-Net nor CE-Net could accurately exclude mucus from the epithelial layer, while Bicon-CE made a precise prediction. Figure 6 (b) shows an example of OCT discontinuity, which can be caused by non-uniform refractive index in an overlying layer (such as a bubble) causing a sudden apparent change in depth of the tissue. Again, only Bicon-CE predicted an accurate segmentation. Figure 6 (c-d) show examples of two imaging artifacts: (c) a ring artifact caused by internal reflection in the probe, and (d) an artifact (top left) caused by the adhesive on the inside of the probe paddle window. Both U-Net and CE-Net were misled by these artifacts and gave erroneous segmentations, whereas Bicon-CE gave an accurate segmentation. Figure 6 (e) shows an example of low contrast imaging, caused inadvertently by a non-optimized selection of the OCT reference position. Again, only our method robustly handled this situation. Lastly, Fig. 6 (f) shows a case of non-uniform intensity (highlighted by yellow arrows), which may be due to a duct or vessel that was fluid-filled and weakly reflective. 
Both U-Net and CE-Net were affected by this non-uniformity in the tissue, whereas Bicon-CE made a continuous prediction. These results validate our motivation: by focusing more on inter-pixel relationships, Bicon-CE makes connected predictions and avoids topological errors. We further tested the robustness of the algorithm with respect to the human grading used for training and testing. In supplementary material S2, we used the majority-based markings of all graders as the gold standard; these experiments demonstrate that Bicon-CE still outperformed the other techniques and was close to human grading.

Clinical potential
Bicon-CE was robust under different clinical conditions, as shown in Section 3.4. Here we demonstrate the potential applicability of our segmentation model in detecting BE by assessing the performance of our method on images from diseased patients (six with non-dysplastic BE and one with BE with low-grade dysplasia) and healthy subjects. The results are summarized in Table 3. For both patient groups, our model outperformed the baseline neural networks and the human graders by achieving significantly higher DSC scores (p < 0.0001). Representative examples are visualized in Fig. 7, where Bicon-CE shows a smoother and more continuous prediction of the segmented layer. With the new low-cost imaging device [7], we were able to quantify the changes in epithelial layer thickness due to BE. The results in Table 3 show that the mean overall epithelial layer thickness of healthy subjects was significantly larger (p < 0.001) than that of subjects with BE. Compared to the baselines, Bicon-CE's estimated difference in mean layer thickness between the normal and BE subjects (28.2 µm) was closer to the gold standard (21.8 µm), while maintaining significantly lower (p < 0.0001) net error and total error. Although the BE-related change in epithelial layer thickness has not been proven in a large-scale randomized clinical trial, we believe this pilot observation can provide a potential guide for further studies of BE.
Lastly, we investigated the computational cost of our method for clinical application. In our experiments, the processing time for a single esophageal OCT ROI image in our dataset was 0.024 ± 0.005 seconds (median, 0.023 seconds) on an Ubuntu system with a GTX 2080Ti GPU running PyTorch, with data loaded from a solid-state drive. The average acquisition time for a full field-of-view OCT B-scan was ∼0.05 seconds. Thus, our method can potentially be utilized in the clinic for real-time segmentation of the epithelial layer.

Discussion
In this work we proposed a bilateral connectivity-based neural network to accurately segment the epithelial layer in in vivo human esophageal OCT images. This network, Bicon-CE, models the single-class segmentation task as a combination of pixel connectivity modeling and pixel-wise tissue classification. Bicon-CE significantly outperformed popular alternative segmentation models (U-Net, U-Net++, and CE-Net), and outperformed human graders. Connectivity-based versions of these models were superior to the baseline methods, indicating the general superiority of the connectivity modeling approach. The robustness of Bicon-CE was shown by testing it under different image artifacts and variants. The potential clinical application of Bicon-CE was shown by its ability to segment the epithelium in samples from patients with BE and to detect the potential thickness changes due to BE, suggesting that it can be used as part of a machine learning approach for accurate real-time monitoring of esophageal diseases.
To the best of our knowledge, Bicon-CE is the first end-to-end layer segmentation algorithm for in vivo human esophageal OCT images. Processing in vivo OCT images is more challenging than processing ex vivo OCT images due to the sometimes lower image quality of in vivo images and due to variation among images caused by imaging artifacts and disease conditions. In our study, manual segmentation of in vivo OCT data was challenging even for our most experienced grader, and the labeling by the three graders showed substantial disagreement. We defined a "trainable interval" over which the labels were of high confidence for training; as a result, we lost some information from the unlabeled regions. Our model could be further improved with better manual segmentation labels, with a larger dataset, or with data augmentation (e.g., by simulating the ring artifacts) in the training stage. Different methods can be used to generate the gold standard labels for training deep learning algorithms and assessing their performance. In Tables 1-3, we used the single expert Grader #1 (KKC), who has significant experience in assessing esophageal OCT images, as the gold standard. In addition, in supplementary material S2, we used the majority-based markings of all three graders as the gold standard. In both cases, Bicon-CE was shown to have superior performance. At the inference stage, while we evaluated the results only across the trainable interval, the predictions of our CNN were made on the entire image. Bicon-CE's predictions in these "non-trainable regions" were reasonable (see Fig. 6), suggesting that Bicon-CE has the potential to provide guidance to clinicians for interpreting data even in challenging areas.
There are a few possible avenues for improving our method. First, this work can be extended to multi-layer segmentation problems given a dataset that is labeled for multiple layers and with improvements in the model architecture. Technically, although Bicon-CE can be readily extended to a multi-class model by changing its output layers, exploiting the special properties of multi-class data could produce an even stronger model for the multi-class segmentation problem. For example, in multi-class data, there exists not only intra-class connectivity but also inter-class relationships. To utilize this property, a channel-wise attention module could potentially capture the inter-class information. Second, the number of individuals with BE in our dataset was small compared to the number of healthy subjects. Our method could be improved by increasing the amount of BE training data. Furthermore, although the BE-related changes in the thickness of the epithelium as measured on OCT have not been investigated in a large-scale randomized clinical trial, our observation provides a potential direction for future studies on the diagnosis and prognosis of BE using in vivo OCT imaging. As part of our future studies, we will utilize this technology for differentiating stages of esophageal diseases. Last, the processing speed can be improved with a hardware update. We envision that our deep learning method will reduce the workload of human grading and improve the accuracy of segmenting the epithelial layer in in vivo human esophageal OCT images.