Tissue self-attention network for the segmentation of optical coherence tomography images on the esophagus

Abstract: Automatic segmentation of layered tissue is key to esophageal optical coherence tomography (OCT) image processing. With the advent of deep learning, frameworks based on fully convolutional networks have proved effective in classifying image pixels. However, due to speckle noise and unfavorable imaging conditions, the esophageal tissue relevant to diagnosis is not always easy to identify. An effective way to address this problem is to extract more powerful feature maps, which have similar expressions for pixels in the same tissue and are discriminative for pixels from different tissues. In this study, we propose a novel framework, called the tissue self-attention network (TSA-Net), which introduces the self-attention mechanism to esophageal OCT image segmentation. The self-attention module in the network captures long-range context dependencies and analyzes the input image from a global view, which helps cluster pixels of the same tissue and reveal differences between layers, thus yielding more powerful feature maps for segmentation. Experiments visually illustrate the effectiveness of the self-attention map, and its advantages over other deep networks are also discussed.


Introduction
Pathological analysis using optical coherence tomography (OCT) is receiving increasing attention due to its high resolution and non-invasive nature [1][2][3]. The OCT technique was first proposed by Huang et al. in 1991 for lesion detection in ophthalmology [1]. On the basis of Huang's study, Tearney et al. designed an endoscopic OCT device by combining OCT with a flexible fiber-optic endoscope, which enables the equipment to enter the upper gastrointestinal tract [2]. Using the endoscopic OCT device, we can image the microstructure of esophageal tissues, which is of great significance in diagnosing diseases such as Barrett's esophagus (BE) [4][5][6], eosinophilic esophagitis (EoE) [7], and dysplasia [8]. However, OCT equipment generates a large number of images that require experts to read and analyze, which is laborious, and the diagnosis relies on the expert's experience and subjective judgment. Therefore, an automatic analysis system for esophageal OCT images is of great significance in clinical practice. Many esophageal diseases manifest as changes in tissue microstructure, such as changes in esophageal layer thickness or disruption of the layers. Accurate quantification of the esophageal layered structures from gastrointestinal endoscopic OCT images can therefore be very valuable for objective diagnosis and assessment of disease severity. As a result, the segmentation algorithm is the core of an intelligent OCT image analysis system, determining whether the system can extract informative characteristics from the image.
Representative studies on automatic esophageal tissue segmentation can be summarized as follows. Ughi et al. proposed a lumen segmentation method based on A-scan analysis [9], but their study did not include internal tissue segmentation. In 2017, Zhang et al. [10] proposed a multi-layer segmentation method based on graph-theory dynamic programming [11,12]. Meanwhile, the self-attention mechanism has proved effective in general vision tasks: embedding a self-attention structure in a generative adversarial network improved image generation quality [41]; Wang et al. proposed non-local self-attention to capture long-range dependencies in images and achieved higher video classification accuracy [42]; and Fu et al. proposed the dual-attention network to improve scene segmentation performance [43].
In this study, we propose a novel framework, called the tissue self-attention network (TSA-Net), for layer segmentation on esophageal OCT images. The TSA-Net introduces the self-attention mechanism to the segmentation network, which helps capture long-range context dependencies from the image. The entire network employs U-Net as the backbone, and a specifically designed TSA module is embedded to accomplish tissue attention. The TSA module is composed of two main parts, the position self-attention module and the channel self-attention module. The position self-attention module reveals feature similarities between any two positions in the image, thereby capturing spatial dependencies between pixels. The channel self-attention module behaves similarly, but captures dependence relationships in the channel dimension. By introducing the TSA module, the segmentation network analyzes the input image from a global view, which is beneficial for clustering pixels of the same tissue and revealing differences between layers, thereby achieving higher segmentation accuracy. Our main contributions can be summarized as follows:
• We propose a novel TSA-Net with a self-attention mechanism to extract more powerful feature maps for tissue segmentation on esophageal OCT images.
• We design the position and channel self-attention modules, whose effectiveness in capturing tissue structures is visually demonstrated.
• Accuracy improvements over several popular deep networks are experimentally observed.
The rest of this study is organized as follows. Section 2 describes the related theory and the detailed architecture of the proposed TSA-Net. Section 3 describes the experiments, which show visualizations of the attention feature maps and comparisons with other deep networks. Discussions and conclusions are given in Sections 4 and 5, respectively.

Problem statement
Given an esophageal OCT image, the task is to assign each pixel a label representing a certain tissue. A typical esophageal OCT image from a mouse is shown in Fig. 1(a). The target tissue layers marked in the image are the epithelium stratum corneum (SC), epithelium (EP), lamina propria and muscularis mucosae (LP & MM), and submucosa (SM), labeled "1" to "4", respectively. The remaining part of the image is labeled "0", as displayed in Fig. 1(b).

Overview of the TSA network
Typical convolutional networks only focus on local features due to the local receptive field of convolution kernels, which may cause pixels from the same tissue to be misclassified as different ones. In this study, we designed the TSA-Net to capture global contextual information from OCT images by introducing the self-attention mechanism into the segmentation network. The network is intended to have a better feature representation, which is beneficial for esophageal tissue segmentation.
The overall framework of the proposed TSA-Net is shown in Fig. 2. We use the U-Net as the backbone network since it has achieved great success in the field of medical image segmentation [15]. In Fig. 2, "ConvBL", "ResBL" and "TSA-BL" represent the convolutional block, residual block and the proposed TSA block, respectively, whose structures are shown in Fig. 3. Notations like "ConvBL 64" indicate that the block has an output with 64 channels. The "C" in the architecture indicates a concatenation connection. As shown in Figs. 3(a) and 3(c), the convolutional layers are followed by a batch normalization layer and a PReLU activation layer. Batch normalization compensates for covariate shift and helps achieve successful training. The PReLU activation is chosen because it introduces non-linearity and prevents the vanishing gradient problem; besides, PReLU converges faster than ReLU [25]. For the residual block, the residual layers are batch normalized and the addition is followed by a PReLU. A dropout layer with a 0.5 dropout rate is applied at the end of the encoder to prevent overfitting.
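As a small aside on the activation choice, the PReLU used in these blocks keeps a learnable slope on the negative side instead of zeroing it out, which is what preserves gradients for negative inputs. A minimal NumPy sketch (the scalar slope `a` here stands in for the per-channel learnable parameter):

```python
import numpy as np

def prelu(x, a=0.25):
    """PReLU: identity for positive inputs, slope `a` for negative ones.

    Unlike ReLU, the negative part keeps a nonzero gradient (a != 0),
    which helps against the vanishing gradient problem."""
    return np.where(x > 0, x, a * x)
```

In the actual network this slope is learned during training; Keras provides it as the `PReLU` layer.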
The TSA module is added after the first pooling layer. The reason is that we want the TSA module to retain more details of the original input, which means it should be set close to the input layer. However, the TSA module needs a large amount of memory when calculating the self-attention map, which limits its input size. As a result, we place the TSA module as shown in Fig. 2, a compromise between the desired input size and the memory of the computing device. The TSA module is composed of two sub-modules, the position self-attention module and the channel self-attention module, as shown in Fig. 3(b). The position self-attention module captures the global contextual features of the input in the spatial dimension, and the channel self-attention module explores the long-range dependence of the input feature map across channels. The final TSA feature map is obtained by aggregating the outputs of these two sub-modules with the input feature map, which generates better feature representations for esophageal tissue segmentation.

Details of the TSA module
The TSA module is designed based on the attention mechanism, which can be mathematically described by Eq. (1):

Attention(Q, K, V) = softmax(QK^T / √d_k) V, (1)

where Q, K and V denote the query, key and value matrices, respectively, and d_k is the dimension of the key vectors [36].
In self-attention theory, Eq. (1) is transformed to Eq. (2):

Y = softmax((XW_θ)(XW_φ)^T) (XW_g), (2)

where X is the input matrix, and W_θ, W_φ and W_g are weight matrices learned during training. In this study, the position and channel self-attention modules are designed based on Eq. (2).

Position self-attention module
The position self-attention module is constructed as shown in Fig. 4. In this figure, X ∈ R^(H×W×C) denotes the C-channel feature map of size H × W. As indicated by Eq. (2), the input matrix is multiplied with different weight matrices. Here, we utilize three 1 × 1 convolution kernels instead of fully connected layers to realize a similar transform with a smaller memory requirement. Then, each output of size H × W × C is reshaped to a matrix of size HW × C by flattening the feature map in each channel column-wise into a vector of length HW. The ⊗ in Fig. 4 denotes matrix multiplication and ⊕ indicates element-wise addition of matrices. M_s in Fig. 4 is the position self-attention map. This matrix has a clear physical meaning: it describes the spatial relationship between any two pixels of the features. The attention map is then multiplied with the value matrix V_s, and the result is added to the original features to generate the final representation X_s. X_s is intended to capture long-range contextual information from the image. Moreover, X_s has the same size as X, so the module can be conveniently embedded into existing frameworks.
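The steps above can be sketched in NumPy as follows. This is a simplified illustration rather than the exact implementation: the three 1 × 1 convolutions are modeled as plain matrix multiplications with given weight matrices, and a row-wise softmax is assumed for normalizing M_s, following Eq. (2):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def position_self_attention(X, W_theta, W_phi, W_g):
    """Position self-attention on a feature map X of shape (H, W, C).

    W_theta, W_phi: (C, C') weight matrices standing in for 1x1 convs.
    W_g: (C, C) value transform. Returns X_s with the same shape as X
    (the residual addition of Fig. 4 is included)."""
    H, W, C = X.shape
    F = X.reshape(H * W, C)            # flatten spatial dims: (HW, C)
    Q = F @ W_theta                    # query: (HW, C')
    K = F @ W_phi                      # key:   (HW, C')
    V = F @ W_g                        # value: (HW, C)
    M_s = softmax(Q @ K.T, axis=-1)    # position attention map: (HW, HW)
    out = M_s @ V                      # attended features: (HW, C)
    return X + out.reshape(H, W, C)    # residual addition gives X_s
```

Note how the HW × HW size of M_s is where the memory cost discussed later comes from.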

Channel self-attention module
The architecture of the channel self-attention module is shown in Fig. 5. Its structure is similar to that of the position module. The channel attention map M_c is calculated based on the input feature map X. This attention map represents the relationship between any two channels and captures long-range dependence in the channel dimension. The output X_c has the same size as the input feature map, so this module can be embedded into existing frameworks without additional changes to the original structure.
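A matching sketch for the channel branch, under the same simplifying assumptions (the C × C map M_c is computed from the reshaped input and normalized with a softmax; the exact normalization in the implementation may differ):

```python
import numpy as np

def channel_self_attention(X):
    """Channel self-attention on X of shape (H, W, C).

    The C x C map M_c weighs every channel pair; the output keeps the
    input shape, so the module drops into existing frameworks."""
    H, W, C = X.shape
    F = X.reshape(H * W, C)                     # (HW, C)
    energy = F.T @ F                            # channel affinities: (C, C)
    e = np.exp(energy - energy.max(axis=-1, keepdims=True))
    M_c = e / e.sum(axis=-1, keepdims=True)     # channel attention map
    out = F @ M_c.T                             # re-weighted channels: (HW, C)
    return X + out.reshape(H, W, C)             # residual gives X_c
```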

Loss function
The overall loss function of the TSA-Net can be expressed as Eq. (3):

L = L_CE + λ·L_Dice. (3)

In this equation, L_CE denotes the cross-entropy loss, a measurement of classification accuracy described by Eq. (4):

L_CE = -(1/N) Σ_x Σ_l g_l(x) log p_l(x), (4)

where N is the number of pixels, g_l(x) is the target probability that pixel x belongs to class l, with one for the true label and zero entries for the others, and p_l(x) is the corresponding predicted probability. L_Dice represents the Dice loss, which evaluates the spatial overlap between the predicted mask and the ground truth, as defined in Eq. (5):

L_Dice = 1 - 2 Σ_x Σ_l g_l(x) p_l(x) / (Σ_x Σ_l g_l(x) + Σ_x Σ_l p_l(x)), (5)

where the parameters are defined in the same way as in Eq. (4). The parameter λ is a weight balancing the two terms, which is set to 0.5 in this study.
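The combined loss of Eqs. (3) to (5) can be sketched as follows (a per-image NumPy version with one-hot targets; the small `eps` is added only for numerical stability and is an implementation detail, not part of the paper's formulation):

```python
import numpy as np

def combined_loss(p, g, lam=0.5, eps=1e-7):
    """p, g: arrays of shape (N, L), predicted probabilities and one-hot
    targets for N pixels and L classes. Returns L_CE + lam * L_Dice."""
    ce = -np.mean(np.sum(g * np.log(p + eps), axis=1))          # Eq. (4)
    dice = 1.0 - 2.0 * np.sum(g * p) / (np.sum(g) + np.sum(p))  # Eq. (5)
    return ce + lam * dice                                      # Eq. (3)
```

A perfect prediction drives both terms to (near) zero, while a confidently wrong one is penalized heavily by the cross-entropy term.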

Details about the training of the TSA-Net
The TSA-Net was trained end-to-end using the Adam [44] optimizer with Nesterov momentum of 0.9. An initial learning rate of 1 × 10^-4 is applied and decayed by a factor of 10 if the validation loss fails to improve over ten consecutive epochs. Training is performed in batches of 16 randomly chosen samples at each iteration. One pass through the entire training set constitutes an epoch, and 100 epochs are used to accomplish the training. Finally, the model with the lowest validation loss is retained and used to measure the segmentation performance of the network on the test dataset. During training, data augmentation is used to improve network robustness. The data augmentation techniques used in this study include random rotation (range = 10), random shear (range = 0.05), random shift (range = 0.05), random zoom (range = 0.05) and random horizontal flips.
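The learning-rate schedule described above can be written out as a small stateful helper; this mirrors the behavior of Keras's `ReduceLROnPlateau` callback (which is presumably what an implementation would use), shown here in plain Python for clarity:

```python
class PlateauLRDecay:
    """Divide the learning rate by `factor` when the validation loss
    fails to improve for `patience` consecutive epochs."""

    def __init__(self, lr=1e-4, factor=10.0, patience=10):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best, self.wait = float("inf"), 0

    def step(self, val_loss):
        # Called once per epoch with the current validation loss.
        if val_loss < self.best:
            self.best, self.wait = val_loss, 0  # improvement: reset counter
        else:
            self.wait += 1
            if self.wait >= self.patience:      # plateau: decay and reset
                self.lr /= self.factor
                self.wait = 0
        return self.lr
```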

Data
This study was approved by the animal science center of Suzhou Institute of Biomedical Engineering and Technology under protocol number 2020-A02 (from September 1st 2020 to January 31st 2021). OCT images (1840 B-scans) of the esophagus from eight C57BL mice were used to evaluate the proposed segmentation network. These images were collected in vivo from different subjects using an 800 nm ultrahigh-resolution (axial resolution ≤ 3 µm) endoscopic OCT system. The probe of the OCT device can enter the upper gastrointestinal tract and rotationally scan the esophagus noninvasively. The images are initially expressed in polar coordinates and then converted to Cartesian coordinates by the OCT system software. During the experiment, 1200 B-scans collected from six C57BL mice were used to establish the segmentation network, among which 800 were randomly selected for training and the remaining 400 used for validation. An independent test set consisting of 240 B-scans was collected from two other mice, ensuring no overlap between the data used for training and testing. All algorithms are evaluated on this independent test set.
The size of each B-scan in our dataset is 256 × 256. For the training and validation data, each B-scan is split width-wise into two non-overlapping slices of size 256 × 128 to reduce the GPU memory needed for training. Since our fully convolutional network can process images of arbitrary size, images in the test set can be segmented without slicing.
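The width-wise slicing is a one-line operation; a NumPy sketch:

```python
import numpy as np

def split_bscan(img):
    """Split a (256, 256) B-scan width-wise into two non-overlapping
    (256, 128) slices to reduce GPU memory during training."""
    h, w = img.shape
    return img[:, : w // 2], img[:, w // 2 :]
```

Concatenating the two slices along the width recovers the original B-scan, so no pixels are lost.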
The annotated labels were generated by an experienced grader using ITK-SNAP [45] and were used for network training and algorithm evaluation. During the process, the grader was asked to annotate each image twice, and the average of the two annotations was used as ground truth. The TSA-Net was implemented in Keras with TensorFlow as the backend. Training of the network was performed on an 11 GB Nvidia GeForce RTX 2080Ti GPU using CUDA 9.2 with cuDNN v7.

Ablation study for attention module
We employed the TSA module in the segmentation network to capture long-range dependencies for a better understanding of tissue structure. To verify its contribution, we conducted an ablation study comparing the segmentation performance of the network with and without the TSA module.
An intuitive visualization of the TSA effects is shown in Fig. 6. As marked by the white circles, the segmentation network with the TSA module generates more reasonable tissue boundaries, making the segmentation result closer to the ground truth. In the first row, the bottom of the "SM" layer is affected by adjacent tissues when segmenting without the TSA module. In the second row, the network without the TSA module incorrectly treats artifacts as tissue. In the last row, the TSA module helps generate continuous and smooth segmentation of the two bottom layers. These results confirm that the attention modules bring great benefits to OCT image segmentation.

Visualization of the attention feature maps
The input feature map of the TSA module has size H × W × C, which is 128 × 128 × 128 in this case. For the position attention module, the size of the self-attention map is HW × HW. As discussed in Section 2, this position attention map provides a position weight matrix (of size H × W) for each point of the original image. In Fig. 7, for each input image, we select three points (marked #1, #2, #3) and show their corresponding position weight matrices in columns 2 to 4. These three points represent different parts of the image: point #1 is selected from the background, point #2 from the high-reflective tissues, and point #3 from diagnosis-irrelevant low-reflective tissues. The position weight matrix captures clear semantic similarity in the image. For instance, the position weight matrix for point #1 (column 2) highlights the upper background from which the point was selected, while for other structures, such as the high-reflective tissues, the weight is small, indicating low relevance. The position self-attention map for point #2 (column 3) presents large values in high-reflective regions, with the smallest values in the upper background, indicating the least relevance. The map corresponding to point #3 (column 4) highlights the low-reflective tissues and has the lowest values in high-reflective regions, which are regarded by the TSA module as significantly different from the selected point. These results confirm that the position weight matrix captures meaningful positional relationships between pixels. For the channel attention module, the self-attention map has size C × C, which means there is a corresponding weight vector for each channel. In this case, it is difficult to provide an intuitive explanation by directly displaying the weight vectors. Instead, we show two selected channels from the feature map generated by the channel attention module.
In Fig. 8, we present the 20th and 29th channels in columns 2 and 3. For the four target tissues, both channels can distinguish the low-reflective tissue from the high-reflective one. The difference is that the 20th channel highlights the plastic sheath of the image while the 29th channel behaves in the opposite way. The other channels behave in a similar pattern to one of these two cases. The channel attention map is not as intuitive as the position attention map, but it still carries specific semantics that are helpful for tissue segmentation.

Evaluation metrics
The following metrics are employed to evaluate the deep networks: precision, recall, Dice coefficient, Hausdorff distance (HD) and average distance (AVD) [46,47]. Precision, recall and the Dice coefficient evaluate the overlap between the predicted mask and the ground truth, while HD and AVD measure the accuracy of the segmented tissue boundaries. The first three metrics are defined in Eqs. (6) to (8):

Precision = |S_R ∩ S_G| / |S_R|, (6)
Recall = |S_R ∩ S_G| / |S_G|, (7)
Dice = 2|S_R ∩ S_G| / (|S_R| + |S_G|), (8)

where S_R and S_G represent the binary segmentation and ground-truth areas, respectively. The HD is given by Eq. (9):

HD(A, B) = max{ max_{a∈A} min_{b∈B} d(a, b), max_{b∈B} min_{a∈A} d(a, b) }, (9)

where A and B are the boundary point sets of the segmentation and the ground truth. The AVD is less sensitive to outliers than the HD and is defined as Eq. (10):

AVD(A, B) = max{ d_a(A, B), d_a(B, A) }, (10)

where d_a(A, B) = (1/|A|) Σ_{a∈A} min_{b∈B} d(a, b) is the directed average Hausdorff distance.
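Under the definitions above, the five metrics can be sketched in NumPy as follows. The area metrics take binary masks; the distance metrics take boundary point sets as (n, 2) coordinate arrays and use a brute-force pairwise distance, which is adequate for the boundary sizes arising from 256 × 256 images:

```python
import numpy as np

def precision_recall_dice(seg, gt):
    """seg, gt: boolean masks. Returns (precision, recall, Dice), Eqs. (6)-(8)."""
    inter = np.logical_and(seg, gt).sum()
    return inter / seg.sum(), inter / gt.sum(), 2 * inter / (seg.sum() + gt.sum())

def _directed_min_dists(A, B):
    # For each point in A, the distance to its nearest point in B.
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return d.min(axis=1)

def hausdorff(A, B):
    """HD of Eq. (9): max over both directed worst-case distances."""
    return max(_directed_min_dists(A, B).max(), _directed_min_dists(B, A).max())

def average_distance(A, B):
    """AVD of Eq. (10): max of the two directed average distances."""
    return max(_directed_min_dists(A, B).mean(), _directed_min_dists(B, A).mean())
```

A single outlier boundary point dominates the HD but is averaged out in the AVD, which is the robustness property noted above.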

Comparing results
The following state-of-the-art deep networks were used for comparison with the proposed TSA-Net: Segnet [48], PSPNet [16], U-Net [15,49] and Pix2Pix [50]. The number of parameters and the training time for each network are listed in Table 1. All networks except Pix2Pix complete training within 1.5 hours. Pix2Pix takes much longer to train since it is composed of two deep networks, a generator and a discriminator, which leads to more trainable parameters; in addition, the adversarial training strategy makes the computing device spend more time on training. The proposed TSA-Net introduces a new self-attention architecture to the U-Net, which brings about 400,000 additional parameters. Compared with the total number of parameters, the increase is small, slowing training by only about 10 minutes relative to the U-Net. Benefiting from the parallel computing capability of the GPU, all the trained networks segment new input images quickly: they can process the test set of 240 B-scans within 2 seconds. The overall segmentation accuracies on the test set are listed in Table 2, where "U-Net+CSA" denotes U-Net with the proposed channel self-attention module and "U-Net+PSA" denotes U-Net with the position self-attention module. These attention modules improve the overall segmentation accuracy, and TSA-Net achieves the highest accuracy on the test set. However, the advantage of the attention modules in overall accuracy is not obvious. One reason is that accuracy cannot measure the topological quality of the result, and the tissue area is small compared with the whole image, so changes in tissue labeling do not cause an obvious difference in accuracy. Detailed per-tissue results are listed in Tables 3 to 6, with the best performance bolded. It can be found that Segnet and PSPNet have Dice coefficients similar to the other methods.
However, their HD and AVD are significantly larger than those of the other networks, indicating that their segmentation results may place label masks outside the target tissue region, leading to more topological errors. Pix2Pix performs better than Segnet and PSPNet in HD and AVD, but its overall accuracy and Dice coefficient are lower than those of the two networks. One possible reason is that Pix2Pix focuses more on generating "realistic" label masks with small topological errors than on accurately classifying pixels. The U-Net used in this study has the same structure as the proposed TSA-Net without the TSA module. Compared with the three deep networks mentioned above, it achieves more satisfactory segmentation results, indicating its advantages in segmenting esophageal OCT images. "U-Net+CSA" and "U-Net+PSA" achieve better segmentation results than the U-Net, confirming the effectiveness of the proposed attention modules. Besides, "U-Net+PSA" achieves smaller HD and AVD than "U-Net+CSA" in most cases, indicating that the position self-attention module is more effective in capturing structural information of the input, which is consistent with the feature-map behavior of the two modules in Figs. 7 and 8. The combination of the position and channel self-attention modules forms the TSA module, which adaptively captures the tissue structure in the OCT image and achieves the best performance in almost all cases. The higher Dice coefficient of TSA-Net indicates that the proposed network labels the target tissues more accurately, and the smaller HD and AVD values confirm that its segmentation results form continuous tissues with fewer topological errors.

Discussions
Automatic segmentation of clinically relevant esophageal tissues is always affected by speckle noise, disturbing structures and unfavorable image quality. An effective solution to this problem is extracting more powerful feature maps for the deep networks. In this study, we proposed the TSA-Net, which introduces the self-attention mechanism to capture long-range dependencies between pixels of the image. The core of the TSA-Net is the TSA module, which consists of a position self-attention module and a channel self-attention module. As observed in the experiments, the feature map generated by the position self-attention module describes the relationship between any two pixels of the image, while the channel self-attention feature map captures long-range dependencies of features in the channel dimension. In this way, pixels from different structures are clustered from a global view before segmentation, yielding more discriminative feature maps for tissue identification. Comparisons with other popular segmentation networks confirmed the advantages of the proposed TSA-Net, which achieved the best performance on most quantitative indicators. In this study, the loss function combines cross-entropy and the Dice coefficient, as shown in Eq. (3). To confirm the effectiveness of these two terms, we carried out an ablation study that trained the TSA-Net using only the cross-entropy loss or only the Dice loss. The resulting average accuracies on the test set are 0.9192 and 0.8951, respectively, both much lower than the combined result of 0.9681 (Table 2). One reason is that the esophageal tissue occupies a small area compared with the entire image, which makes it difficult for the algorithm to find the correct optimization direction based on cross-entropy or the Dice coefficient alone. The weight in Eq. (3) is set to 0.5, chosen based on experiments.
The model yields similar performance when the weight is set in the range from 0.4 to 0.6.
In the TSA-Net architecture shown in Fig. 2, a dropout layer with rate 0.5 is added between the encoder and decoder. The reason is that, given the tens of millions of parameters, the network is prone to overfitting. Adding the dropout layer in the latent space is most effective because the latent representation significantly affects the network output, so a single dropout layer at that location is sufficient to alleviate overfitting. To verify the effectiveness of this structure, we conducted an ablation study that removed the dropout layer: the segmentation accuracy on the test dataset dropped from 0.9681 to 0.9562.
The TSA module is designed to generate an output of the same size as its input, which makes it convenient to embed into existing frameworks. In this study, we use the U-Net as the backbone since it has achieved great success in medical image segmentation [15,25] and its performance on this task is also superior to the other tested deep networks, as shown in the experiments. The proposed TSA-Net has the potential to be further improved as more powerful segmentation networks are designed.
A limitation of the TSA module is that it requires a large amount of memory to compute the self-attention maps: a 256 × 256 input generates a 65536 × 65536 position self-attention map, which is a heavy burden for the computing device. To alleviate this problem, we designed the network in a fully convolutional pattern, so the network can be trained on image slices and then applied directly to images of the original size.
In this study, due to GPU memory limitations, we only added a TSA module after the first pooling layer of the U-Net. To evaluate embedding multiple TSA modules at different stages of the network, we added two TSA modules after the first two pooling layers and trained the new network on a 16 GB NVIDIA Tesla P100 GPU. The segmentation accuracy improved from 0.9681 to 0.9684. When we tried to add three TSA modules, the network could not be trained because of insufficient GPU memory. Adding more TSA modules thus has the potential to improve segmentation performance; however, considering the large memory requirements of the computing devices, it may not be worth the effort.
Evaluation of the TSA-Net is accomplished using esophageal OCT images from mice. This dataset intuitively illustrates the effectiveness of the self-attention mechanism and confirms the advantages of the proposed TSA-Net. To move the proposed network from the laboratory to the clinic, esophageal OCT images from humans with various health conditions will be collected.

Conclusions
In this study, we proposed the TSA-Net for esophageal tissue segmentation on OCT images. The TSA-Net introduces the self-attention mechanism to capture long-range feature dependencies from a global view. The core TSA module is composed of a position self-attention module and a channel self-attention module. The position self-attention module describes relations between any two pixels of the image, while the channel self-attention module reveals long-range dependencies among different channels. In this way, the TSA-Net generates attention feature maps that cluster pixels of the same tissue and discriminate pixels from different structures. The experiments visually demonstrated the effectiveness of the self-attention map and confirmed the advantage of TSA-Net over other popular deep networks in segmenting esophageal OCT images. The TSA-Net is amenable to further improvement, such as using a more powerful backbone than U-Net. Besides, as an end-to-end fully convolutional network, it can easily be applied to esophageal OCT images from humans. These characteristics make TSA-Net clinically attractive.
Disclosures. The authors declare that there are no conflicts of interest related to this article.