U-Net Plus: Deep Semantic Segmentation for Esophagus and Esophageal Cancer in Computed Tomography Images

The effective segmentation and 3-D rendering of the esophagus and esophageal cancer from computed tomography (CT) images can assist doctors in diagnosing esophageal cancer. The irregular and vague boundaries of the esophagus and esophageal cancer make their segmentation very difficult. In this paper, U-Net Plus is proposed to segment the esophagus and esophageal cancer from a 2-D CT slice. In the new network architecture, two blocks are employed to enhance the feature extraction performance of complex abstract information, which can effectively resolve irregular and vague boundaries. A block is a skip connection operation that is similar to convolution. The architecture is trained on a dataset of 1924 slices from 10 CT scans and tested on 295 slices from 6 CT scans. The training and test datasets are expanded tenfold to simulate the segmentation of the 3-D CT image. Using the new framework, we report a dice value of 0.79 ± 0.20 and a Hausdorff distance of 5.87 ± 9.91. A semi-automatic scheme is then designed for the 3-D segmentation of the esophagus or esophageal cancer. The 3-D rendering of the esophagus or esophageal cancer is implemented to assist in the diagnosis of esophageal cancer.


I. INTRODUCTION
Eastern Asia and Eastern and Southern Africa show a high rate of esophageal cancer [1]. Esophageal cancer ranks seventh in terms of incidence and sixth in overall mortality [2]. Computed tomography (CT) is a very important diagnostic technique for esophageal cancer because CT images of the chest, abdomen, and pelvis can be used to evaluate tumor metastasis to adjacent tissues or distant organs, such as the liver and lymph nodes; this information is important for the diagnosis and prognosis of esophageal cancer [3], [4]. Accurately extracting the region of the esophagus and esophageal cancer from CT images therefore has high clinical value.
The automatic segmentation of esophagus and esophageal cancer is very difficult. The wall of the esophagus from the lumen outward consists of mucosa, connective tissue, layers of muscle fibers between layers of fibrous tissue, and an outer layer of connective tissue. As shown in Fig. 1, the boundary of the esophagus or esophageal cancer is irregular and vague, and air holes occasionally exist in the esophagus. Even experienced medical experts cannot accurately delineate the boundary of esophagus or esophageal cancer through single-slice CT images.
The associate editor coordinating the review of this manuscript and approving it for publication was Hengyong Yu.
Automated contour extraction for the esophagus has been attempted. Rousson et al. [5] described a method that combines a spatial prior of the esophagus centerline with a histogram-based appearance model. The method is semiautomatic and requires two manually placed points on the centerline and a segmentation of the left atrium and the aorta as input. A Markov chain model, followed by a ''detect and connect'' approach, has been used to obtain the maximum a posteriori estimate of the approximate esophagus shape [6]. Yang et al. [7] developed an online atlas selection approach to select a subset of optimal atlases for multi-atlas segmentation to automatically delineate the esophagus. The presence of air bubbles is an obstacle in the atlas selection method. A skeleton-shaped model has also been used to guide 3-D esophagus segmentation. This method is sensitive to noise in the image, resulting in over-segmentation of the target object [8].
Deep learning has been recently used for the segmentation of normal esophagus or esophageal cancer. A convolutional neural network is trained to generate an esophagus probability map, and the active contour model and random walk are used to estimate the location of the esophagus [9]. Hao et al. [10] utilized fully convolutional network (FCN) to establish an esophagus-tumor classifier on the training dataset with expert-labeled tumor regions. Trullo et al. [11] changed the FCN architecture to improve the localization accuracy and performance for esophagus segmentation.
FCN is an end-to-end network that can effectively solve the problem of semantic image segmentation. A deeper architecture generally achieves improved performance. However, the training error rate in a deep plain network is high because the gradient vanishes easily in a deeper architecture [12]. LinkNet, SegNet, and U-Net have been developed on the basis of FCN. SegNet provides good performance with competitive inference time and provides the most efficient memory-wise inference [13]. SegNet is used in medical image segmentation, such as in the segmentation of lung field [14], cross-sectional brain MRI [15], and lumbar spinal stenosis [16]. LinkNet utilizes an information theoretical approach based on minimum description length to automatically adjust the number of regions and the temporal relationships among them [17]. U-Net adds skip connections between the encoder and the decoder. These connections combine the features from shallow and deep layers through multipath fusion, which reduces the spatial loss of the feature map and improves the accuracy of semantic segmentation. U-Net has been applied in semantic medical image segmentation [18]-[23]. However, it has failed to produce satisfactory results for the esophagus and esophageal cancer because of their vague and irregular borders. To address the inefficient feature extraction of the skip connection layer in U-Net, we design U-Net Plus in the present study as a repeated application of the U-Net feature extraction process.
In summary, the main contributions of this study are as follows.
First, a network architecture of two encoder-decoder blocks with skip connections is adopted to improve the capability of complex abstract information processing. This characteristic helps improve the segmentation of the irregular and vague boundary of the esophagus or esophageal cancer.
Second, a semi-automatic scheme is designed to segment the esophagus or esophageal cancer from 3-D CT images. In this scheme, only a few image parameters must be provided.
Third, the 3-D rendering of the esophagus or esophageal cancer is implemented to help doctors diagnose esophageal cancer.

II. PROPOSED METHOD
In this section, the proposed method is described in detail. As shown in Fig. 2, the new method involves two stages. The first stage is training. The images in the training set are preprocessed and then used as the input of the network. Data augmentation is implemented to enhance the robustness of the network. The network is trained several times to obtain the best parameters. The second stage is testing, in which the preprocessed test images are used to evaluate the performance of the network.

A. IMAGE PREPROCESSING
The window center and window width stored in the digital imaging and communications in medicine (DICOM) information are used to convert the CT image from DICOM format to bitmap format. As shown in Fig. 3, a rectangular area of 80×80 pixels, which includes the esophagus or esophageal cancer, is selected from each CT slice as the input to the network. The delineated region of the esophagus or esophageal cancer is considered as the desired output of the network.
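The preprocessing described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `apply_window` and `crop_patch`, the simple linear window mapping, and the rule that shifts the rectangle to keep it inside the image are all assumptions made for the example, and the slice is represented as a plain list of rows.

```python
def apply_window(ct_rows, center, width):
    """Map CT values to 8-bit gray levels with a linear window (assumed mapping)."""
    low = center - width / 2.0
    out = []
    for row in ct_rows:
        out_row = []
        for value in row:
            scaled = (value - low) / width        # 0..1 inside the window
            scaled = min(max(scaled, 0.0), 1.0)   # clip values outside the window
            out_row.append(int(round(scaled * 255)))
        out.append(out_row)
    return out


def crop_patch(image, center_row, center_col, size=80):
    """Cut a size x size rectangle around a point, shifted to stay inside the image."""
    height, width = len(image), len(image[0])
    top = min(max(center_row - size // 2, 0), height - size)
    left = min(max(center_col - size // 2, 0), width - size)
    return [row[left:left + size] for row in image[top:top + size]]
```

In practice the window center and width would be read from the DICOM header of each scan; the values passed in any call here are placeholders.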

B. ARCHITECTURE OF U-NET PLUS
The network architecture of U-Net Plus is illustrated in Fig. 4. The encoder-decoder architecture is the main structure of many semantic segmentation networks and performs well in pixel-level classification. U-Net Plus uses two blocks of encoder-decoder architecture to enhance the capability of feature extraction.
U-Net Plus follows the FCN design and is an end-to-end network architecture. Convolutions are applied to extract the features. All convolutions use a stride of one in every direction. All convolutions in all layers, except for the final layer, have 64 channels and are followed by a rectified linear unit [24]. In the last layer, the convolution kernel size is 1×1 and the number of output channels is two, which maps the component features to the required number of classes. The 1×1 convolution can be used to improve the robustness of the network and is used in deep networks [25], [26].
Down-sampling is carried out by max pooling. The size of the pooling filter is 2×2. The stride is two in the image plane and one along the channel dimension. After down-sampling, the receptive field of the convolution increases, which helps capture the global features of the image and express abstract information.
Up-sampling is carried out by deconvolution. The deconvolution kernel size is 3×3, with a stride of two in the image plane and one along the channel dimension. The deconvolution restores the down-sampled feature map to the size of the corresponding feature map at the current level. Up-sampling conveys information from a large field of view to a small field of view, allowing information in different fields to communicate with one another.
The skip connection links the corresponding down-sampling and up-sampling feature maps. This process compensates for the spatial information lost during down-sampling and lets the convolutions concentrate on feature extraction.
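To make the size bookkeeping concrete, the sketch below traces the spatial size of the feature map through one encoder-decoder block. It is only an illustration under stated assumptions: the number of down-sampling levels (`depth`) is not given in the text and is a free parameter here, the convolutions are assumed to pad the boundary so that the size is preserved, and the deconvolution is assumed to exactly double the size.

```python
def conv_same(size):
    # Stride-1 convolution with boundary padding keeps the spatial size.
    return size


def max_pool(size, kernel=2, stride=2):
    # 2x2 max pooling with stride 2 halves the spatial size.
    return (size - kernel) // stride + 1


def deconv(size, stride=2):
    # Assumption: padding is chosen so the output is exactly stride * input.
    return size * stride


def trace_block(size, depth):
    """Return spatial sizes along the encoder and then along the decoder."""
    encoder = [size]
    for _ in range(depth):
        size = max_pool(conv_same(size))
        encoder.append(size)
    decoder = []
    for level in range(depth - 1, -1, -1):
        size = deconv(size)
        if size != encoder[level]:
            # A skip connection can only join feature maps of equal size.
            raise ValueError("encoder and decoder sizes do not match")
        decoder.append(size)
    return encoder, decoder
```

With an 80×80 input, each level halves the size (80, 40, 20, 10, ...), and every decoder level returns to the size of the encoder level it is linked to, which is the condition the skip connection relies on.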
A block in the network architecture includes skip connections and other operations, such as convolution, pooling, and deconvolution. Although numerous feature extraction operations are performed in a block, feature extraction after a skip connection must build on the input information carried over by that skip connection. The skip connection and the operations it spans can therefore be regarded as a single skip connection operation. A skip connection operation extracts the features of its input, and the generated output must be combined with that input before it can be used as the input for the next operation.
The skip connection combines the feature information of the small field of view and the large field of view, which makes substantial use of the global features. However, the feature maps of the small field of view come from very early features. Even after many convolutions, the features of the large field of view still have strong correspondence with the features of the small field of view. This process is similar to a convolution operation, which only extracts features from its input information. A block is thus a skip connection operation that is similar to convolution, but its feature extraction capability is much stronger than that of convolution. Two blocks are used to enhance the feature extraction performance of complex abstract information.

C. TRAINING AND TESTING
The preprocessed esophageal cancer image is used as the direct input of the network, and the corresponding label image is used as the supervision of the network. The label image is a binary image in which the esophagus and esophageal cancer areas are the foreground and all other parts are the background.
The SoftMax function is used to obtain the probability distribution map, and the cross-entropy loss function [27] and the Adam optimizer [28] are used to train the network.
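As a small worked illustration of this step, the sketch below applies SoftMax to the two output channels of each pixel and averages the cross-entropy against the binary label image. The function names and the nested-list data layout are assumptions for the example; an actual implementation would use the equivalent operations of a deep learning framework together with the Adam optimizer.

```python
import math


def softmax(logits):
    """Convert the per-class scores of one pixel into probabilities."""
    peak = max(logits)                       # subtract the maximum for stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pixel_cross_entropy(logits, label):
    """Cross-entropy of one pixel; label is 0 (background) or 1 (foreground)."""
    return -math.log(softmax(logits)[label])


def mean_cross_entropy(logit_map, label_map):
    """Average the pixel-wise cross-entropy over the whole image."""
    losses = [
        pixel_cross_entropy(logits, label)
        for logit_row, label_row in zip(logit_map, label_map)
        for logits, label in zip(logit_row, label_row)
    ]
    return sum(losses) / len(losses)
```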
The input image and the probability map have the same size, and their pixels correspond one to one because the network pads the boundary when convolving.
The model fits rapidly on some data but slowly on others. If all data are used for each iteration in training, the model may overfit on some features. To reduce overfitting, we introduce a threshold on the loss value and perform training only when the loss value of a batch of data is greater than the threshold. The initial threshold is set to 0.04. The weight parameters of the trained model are saved for testing.

III. EXPERIMENTS
A. DATASET
CT scans from patients diagnosed with esophageal carcinoma are used for experimental esophagus segmentation. The imaging parameters are as follows: field of view = 376 mm × 376 mm, matrix = 512 × 512, slice thickness = 1 mm, and bits stored = 12. Two doctors with extensive experience in the imaging diagnosis of esophageal cancer delineate the location of esophageal cancer and esophagus in all CT images as the region of interest (ROI). A total of 2,219 slices are labeled manually.
Among the 16 CT scans, 10 are selected as the training set and 6 as the test set. The esophagus or esophageal cancer areas in the 1,924 slices of the training-set CT scans are delineated. A total of 295 slices in the test set are also delineated.
The data are augmented to improve the robustness of the segmentation algorithm. The region of esophagus or esophageal cancer does not always occur at the center of the selected rectangle but may be located anywhere in the rectangle. Therefore, 10 samples are used for one CT slice to improve the adaptability of the network. For one CT slice, all ROIs of 10 samples are located inside of the rectangle, and the centers of the ROI do not overlap with the center of the rectangle but randomly locate at different positions in the rectangle. Data augmentation is implemented on the training and testing data. Thus, the training and the test samples are expanded tenfold, that is, 19,240 samples in the training set and 2,950 samples in the test set.
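The sampling of ten rectangles per slice can be sketched as below. This is an illustrative assumption about how such positions might be drawn, not the authors' procedure: the function name `random_crop_origins`, the bounding-box representation of the ROI, and the fixed random seed are all hypothetical. The function only guarantees that the ROI lies fully inside each 80×80 rectangle and that each rectangle lies inside the image; it does not explicitly forbid the ROI from landing at the exact center.

```python
import random


def random_crop_origins(roi_box, image_size=512, patch=80, n_samples=10, seed=0):
    """Draw top-left corners of rectangles that fully contain the ROI.

    roi_box = (top, left, bottom, right) with inclusive pixel indices.
    """
    top, left, bottom, right = roi_box
    if bottom - top + 1 > patch or right - left + 1 > patch:
        raise ValueError("ROI does not fit in the rectangle")
    rng = random.Random(seed)
    # Valid origins keep the ROI inside the rectangle and the rectangle
    # inside the image.
    row_min = max(bottom - patch + 1, 0)
    row_max = min(top, image_size - patch)
    col_min = max(right - patch + 1, 0)
    col_max = min(left, image_size - patch)
    return [
        (rng.randint(row_min, row_max), rng.randint(col_min, col_max))
        for _ in range(n_samples)
    ]
```

Applying this to every labeled slice yields ten samples per slice, which matches the tenfold expansion to 19,240 training samples and 2,950 test samples.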

B. EVALUATION METRIC
The following two pixel-level measurements are employed to compare the segmentation performance of the proposed approach with those of other methods: dice value (DV) and Hausdorff distance (HD) [29].
DV is defined as:
$$\mathrm{DV} = \frac{2 A_{TS}}{A_T + A_S}$$
where $A_T$ represents the target area, which is delineated manually by the doctors; $A_S$ represents the automatically segmented area of the esophagus or esophageal cancer, which is the result of the model testing; and $A_{TS}$ represents the overlap between $A_T$ and $A_S$. DV measures the degree of overlap between the target area and the segmented area, and it always lies in [0, 1]. The larger the DV, the higher the consistency between manual and automated segmentation.
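A minimal sketch of the DV computation on binary masks follows, taking DV as the usual Dice overlap 2·A_TS / (A_T + A_S). The function name and the convention of returning 1.0 when both masks are empty are assumptions for the example.

```python
def dice_value(target, segmented):
    """Dice value of two binary masks given as equally sized lists of rows."""
    a_t = sum(sum(row) for row in target)        # manually delineated area
    a_s = sum(sum(row) for row in segmented)     # automatically segmented area
    a_ts = sum(                                  # overlap of the two areas
        t & s
        for row_t, row_s in zip(target, segmented)
        for t, s in zip(row_t, row_s)
    )
    if a_t + a_s == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * a_ts / (a_t + a_s)
```

For example, two masks of two pixels each that share one pixel give DV = 2·1 / (2 + 2) = 0.5.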
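Another evaluation metric is HD, which is expressed as:
$$\mathrm{HD}(S, T) = \max\left\{ \max_{s \in S} \min_{t \in T} d(s, t),\; \max_{t \in T} \min_{s \in S} d(s, t) \right\}$$
where $d(s, t)$ is the distance between a point $s$ on the segmented contour ($S$) and a point $t$ on the target contour ($T$). HD measures how far two regions of a metric space are from each other. HD is the maximum of the two directed values and always lies in [0, ∞). As HD increases, the performance degrades.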
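The sketch below computes HD between two contours given as lists of points, using the standard symmetric definition (the larger of the two directed distances) and the Euclidean distance. The function names are assumptions, and the brute-force search is chosen for clarity rather than speed.

```python
import math


def directed_distance(points_a, points_b):
    """Largest distance from a point of A to its nearest point of B."""
    return max(min(math.dist(a, b) for b in points_b) for a in points_a)


def hausdorff_distance(contour_s, contour_t):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(
        directed_distance(contour_s, contour_t),
        directed_distance(contour_t, contour_s),
    )
```

The result is in the units of the input coordinates. If the contours are given in pixels, a value in millimeters would require multiplying by the pixel spacing, which for a 376 mm field of view and a 512 × 512 matrix is roughly 0.73 mm per pixel.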

C. EXPERIMENTAL PROCESS
Numerous experiments are carried out to demonstrate the effectiveness and feasibility of U-Net Plus from three aspects. First, the performance of U-Net Plus is evaluated on our augmented dataset of 2,950 CT 2-D images for esophagus and esophageal cancer. The result of U-Net Plus is compared with that of U-Net, LinkNet, and SegNet.
Second, the performance of the proposed method is evaluated by comparing the results of segmentation for esophagus or esophageal cancer from existing literature.
Finally, a 3-D rendering experiment for esophagus and esophageal cancer is performed, and a semi-automatic segmentation scheme is designed.

IV. RESULTS AND ANALYSIS
A. SEGMENTATION RESULTS
Figs. 5 and 6 illustrate the segmentation results of six representative CT images of esophagus and six representative CT images of esophageal cancer for different methods, respectively. The delineated region and the segmented region are presented together to intuitively observe the segmentation effect. The red area represents the overlapped region between the delineated and segmented regions. The blue area represents the delineated region that is not located in the segmented region. The green area represents the region in the segmented region that is not located in the delineated region.
As shown in Figs. 5 and 6, the best segmentation effect is achieved by the proposed method and the worst by SegNet. U-Net Plus achieves satisfactory segmentation results both in the presence and in the absence of holes. The region of esophageal cancer with two holes can also be segmented by the proposed method but not by the other methods. The U-Net Plus results show the largest intersection between the segmented and manual regions and the highest shape consistency. The visualization results clearly illustrate that the proposed method can extract the region of esophagus or esophageal cancer from CT images. Table 1 shows the average segmentation results for 2,950 test samples. U-Net Plus achieves the best segmentation performance with the highest DV (0.79) and lowest HD (5.87 mm), followed by U-Net and LinkNet. SegNet obtains the lowest DV (0.61) and highest HD (14.45 mm). The segmentation result of U-Net Plus thus agrees most closely with the delineated region, and its boundary lies nearest to that of the delineated region.
The effectiveness and robustness of the proposed network are discussed from two aspects. First, the rectangular areas of esophagus or esophageal cancer in CT image are cut out as input features of the network, greatly reducing the effect of other tissues and organs of the human chest on target region segmentation. Second, the benefit from the optimized network architecture is discussed. In the new network architecture, we implement two U-Net feature extraction processes that improve the efficiency and capability of feature extraction. Consequently, the proposed network realizes accurate and robust segmentation for esophagus or esophageal cancer in thoracic CT images.

B. PERFORMANCE COMPARISON WITH EXISTING METHODS
Methods for segmenting the esophagus or esophageal cancer in CT images, and their results, have been previously reported. The performances of these segmentation methods are summarized in Table 2. The DV metric is employed to evaluate the performance of segmentation in all of these methods. Three groups also evaluated segmentation performance using HD. The proposed method achieves the best performance, with the highest DV and the lowest HD.

V. SEMI-AUTOMATIC 3-D SEGMENTATION
In the previous sections, we proposed a method to segment the esophagus or esophageal cancer in 2-D CT images. However, the structure of the esophagus or esophageal cancer is 3-D. The diagnosis of esophageal cancer must be supported by segmenting the esophagus or esophageal cancer from 3-D CT images. As shown in Fig. 7, a semi-automatic segmentation scheme is designed for 3-D esophagus or esophageal cancer in CT images. In the semi-automatic segmentation method, the following five parameters must be manually provided before image segmentation: begin and end slices of esophagus, begin and end slices of esophageal cancer, and start point for image segmentation. The start point is used for image segmentation, and the other parameters assist in the 3-D rendering of esophagus and esophageal cancer.
This method assumes that, in thoracic CT acquisition, the slice thickness is sufficiently small that the center of the esophagus in one slice falls inside the esophageal region of the next slice. This condition is met in most cases. Therefore, the feature image of each slice for segmentation can be obtained through the center of the esophagus in the preceding slice. As shown in Fig. 7, C_ROI_1 is the center of the segmented ROI in a slice, and the rectangular region can be defined by C_ROI_1. The main steps to segment the 3-D esophagus and esophageal cancer are as follows:
Step 1: A rectangular region of 80×80 pixels centered on the previously given start point is extracted.
Step 2: The cut CT image in the extracted rectangle region is used as the feature input of the trained U-Net Plus network. The output of U-Net Plus is the segmented region of esophagus or esophageal cancer in the given slice.
Step 3: If the slice is the end slice of esophagus, then the process of image segmentation is finished. Otherwise, the center of the segmented region is extracted and mapped to the next slice.
Step 4: In the new slice, another rectangular region centered on the center of the segmented region of the previous slice is defined. Step 2 is then performed again.
For a CT slice, the region and center of segmented esophagus or esophageal cancer are saved for 3-D rendering.
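The slice-by-slice propagation in Steps 1 to 4 can be sketched as a simple loop. This is a hedged illustration only: `segment_patch` stands in for the trained U-Net Plus network and is a hypothetical callable that maps an 80×80 patch to a binary mask, and the choice to keep the previous center when a slice yields an empty mask is an assumption not stated in the text.

```python
def centroid(mask, top, left):
    """Center of the segmented pixels in full-slice coordinates, or None."""
    rows = [r for r, row in enumerate(mask) for v in row if v]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:
        return None
    return (top + round(sum(rows) / len(rows)),
            left + round(sum(cols) / len(cols)))


def segment_volume(volume, start_point, begin_slice, end_slice,
                   segment_patch, size=80):
    """Segment slices begin_slice..end_slice, carrying the center forward."""
    results = {}
    center = start_point                      # Step 1: start from the given point
    for index in range(begin_slice, end_slice + 1):
        image = volume[index]
        top = min(max(center[0] - size // 2, 0), len(image) - size)
        left = min(max(center[1] - size // 2, 0), len(image[0]) - size)
        patch = [row[left:left + size] for row in image[top:top + size]]
        mask = segment_patch(patch)           # Step 2: network output
        new_center = centroid(mask, top, left)
        if new_center is not None:            # Steps 3 and 4: move to next slice
            center = new_center
        results[index] = (mask, center)       # saved for 3-D rendering
    return results
```

The returned dictionary holds, for each slice, the segmented mask and the center used to place the rectangle in the following slice.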

VI. 3-D RENDERING
The 2-D segmented results are stacked before the rendering operation is performed. All slices between the begin slice and the end slice are used for stacking. An empty space is first created for the 3-D model. Each pixel of a 2-D segmented result image is then transferred to the empty space. The X and Y coordinates of the pixel are adjusted according to the center of the segmented esophagus or esophageal cancer. The slice number, scaled by the distance between slices, is taken as the Z coordinate. The stacking process is repeated until all slices are completed. All voxels in the 3-D space are connected as in the 2-D slices. Fig. 8 shows the 3-D rendering for esophagus or esophageal cancer. The begin and end slices of the esophagus and esophageal cancer are defined by the doctor. As shown in Fig. 8A, normal esophagus is shown in blue and esophageal cancer in red. The 3-D images can be rotated arbitrarily. In addition, the esophagus or esophageal cancer can be cut and displayed according to the center information of the segmentation region (Fig. 8B). Doctors can clearly observe the internal condition of the esophagus or esophageal cancer and receive further assistance in esophageal cancer diagnosis.
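The stacking step can be illustrated with the short sketch below, which turns per-slice masks into a list of voxel coordinates. The input layout (slice number, patch origin, mask) and the function name are assumptions for the example; in particular, the X and Y coordinates are recovered here from the origin of each rectangle in the slice, which is one possible reading of how the coordinates are adjusted. The rendering itself is not shown.

```python
def stack_slices(slices, slice_thickness=1.0):
    """Convert segmented 2-D masks into (x, y, z) voxel coordinates.

    slices: list of (slice_number, origin_row, origin_col, mask), where the
    origin is the top-left corner of the rectangle in the full slice.
    """
    voxels = []
    for number, origin_row, origin_col, mask in slices:
        z = number * slice_thickness          # Z from slice number and spacing
        for r, row in enumerate(mask):
            for c, value in enumerate(row):
                if value:                     # keep only segmented pixels
                    voxels.append((origin_col + c, origin_row + r, z))
    return voxels
```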

VII. CONCLUSION
The effective segmentation and 3-D rendering of esophagus and esophageal cancer can assist doctors in diagnosing esophageal cancer. The irregular and vague boundary causes difficulty in the effective segmentation of esophagus and esophageal cancer. In this study, a U-Net Plus network, which is characterized by high feature extraction capability, is proposed for the segmentation of esophagus or esophageal cancer. Evaluated by DV and HD, the result of the proposed method is better than those of other deep learning methods for semantic segmentation and those reported in the existing literature. A semi-automatic scheme is designed for the 3-D segmentation of esophagus or esophageal cancer. The 3-D rendering of esophagus or esophageal cancer is implemented to assist in the diagnosis of esophageal cancer.
Future work must focus on three aspects. First, the architecture of U-Net Plus must be modified to improve the capability of feature extraction for objects with irregular and vague boundaries. Second, the training set and testing set must be expanded to further improve the performance of the network for esophagus or esophageal cancer segmentation. Finally, a method for the automatic recognition of the esophagus and esophageal cancer must be developed to reduce manual intervention and improve the automation of the assistant diagnosis system for esophageal cancer.