Comparative analysis of automatic segmentation of esophageal cancer using 3D Res-UNet on conventional and 40-keV virtual mono-energetic CT Images: a retrospective study

Objectives To assess the performance of 3D Res-UNet for fully automated segmentation of esophageal cancer (EC) and compare the segmentation accuracy between conventional images (CI) and 40-keV virtual mono-energetic images (VMI40 kev). Methods Patients underwent spectral CT scanning and diagnosed of EC by operation or gastroscope biopsy in our hospital from 2019 to 2020 were analyzed retrospectively. All artery spectral base images were transferred to the dedicated workstation to generate VMI40 kev and CI. The segmentation model of EC was constructed by 3D Res-UNet neural network in VMI40 kev and CI, respectively. After optimization training, the Dice similarity coefficient (DSC), overlap (IOU), average symmetrical surface distance (ASSD) and 95% Hausdorff distance (HD_95) of EC at pixel level were tested and calculated in the test set. The paired rank sum test was used to compare the results of VMI40 kev and CI. Results A total of 160 patients were included in the analysis and randomly divided into the training dataset (104 patients), validation dataset (26 patients) and test dataset (30 patients). VMI40 kevas input data in the training dataset resulted in higher model performance in the test dataset in comparison with using CI as input data (DSC:0.875 vs 0.859, IOU: 0.777 vs 0.755, ASSD:0.911 vs 0.981, HD_95: 4.41 vs 6.23, all p-value <0.05). Conclusion Fully automated segmentation of EC with 3D Res-UNet has high accuracy and clinically feasibility for both CI and VMI40 kev. Compared with CI, VMI40 kev indicated slightly higher accuracy in this test dataset.


INTRODUCTION
Esophageal cancer (EC) is the seventh most common cancer and sixth leading cause of cancer deaths worldwide in 2020, accounting for one in 18 cancer deaths (Sung et al., 2021). At present, surgical treatment remains the main choice for EC, unfortunately EC is frequently found at an advanced stage when surgery alone cannot achieve cure. Despite recent medical advances that have enabled a remarkable increase in overall cancer survival, the prognosis of EC remains poor, with a 5-year survival rate of 20% (Yan et al., 2019;Pennathur et al., 2013). Although concurrent chemoradiotherapy has been established as a standard treatment option for unresectable EC, radiotherapy is also one of the most important treatment strategies for EC (Garg et al., 2016;Babic, Fuchs & Bruns, 2020). Thorax CT images are usually used for EC radiotherapy planning. The critical step in the planning pipeline involves the extraction of the esophageal gross tumor volume (GTV) from CT data by manual segmentation. However, manual or even semi-automatic algorithms are usually operator-dependent, time-consuming and subject to high inter and intra-observer variability in clinical practice. Thus, developing an automatic and reliable esophageal GTV segmentation approach is desirable. However, automatic esophageal GTV segmentation of CT scans has been rarely implemented, and is known to be challenging due to various shapes, the poor contrast of the tumor with surrounding tissues, and the existence of foreign bodies in the esophageal lumen.
Currently, the main methods for automatic segmentation of esophageal cancer on CT images include the following: (1) Atlas-based method: by registering the normal atlas to the image to be segmented, the corresponding labels in the atlas are used to segment the lesion. This method is simple and easy to implement, but there are large registration errors and poor segmentation results (Yang et al., 2017;Wang et al., 2015).
(2) Feature-based method: use features such as morphology, gray level, and texture in the image to construct a classifier or cluster for lesion segmentation. Commonly used features include morphological features, texture features, and gray level histogram features. This method is easily affected by image quality and feature selection, and the final effect is difficult to guarantee (Brady et al., 2021). (3) Deep learning-based method: use convolutional neural networks for end-to-end learning and segmentation. This method can automatically learn image features for segmentation, and the effect is good, but it requires a large amount of training data. In recent years, with the development of deep learning technology, deep learning algorithms have been widely used in the field of medical image segmentation. U-Net has gradually become a hot spot in the field of image segmentation due to its good segmentation performance (Ronneberger, Fischer & Brox, 2015). Several studies have developed specific deep learning-based models relying on a U-Net convolutional architecture for automatic segmentation of the lung, heart and liver, yielding promising results, with a reported Dice Similarity Coefficient (DSC) between 0.92 and 0.95 (Zhu et al., 2019;Ammar, Bouattane & Youssfi, 2021;Jin et al., 2020b). Although there have been some studies on CT-based automatic segmentation of esophageal tumors, automatic segmentation of esophageal tumors in CT images is still a challenging task due to the low contrast between the tumor and surrounding tissues. This field is still in the exploratory stage and requires further research to improve segmentation accuracy and better apply it in clinical practice (Yousefi et al., 2021;Jin et al., 2022;Yue et al., 2022).
Dual-layer spectral-detector computed tomography (SDCT) can acquire temporal and spatial matched high-and low-energy photons from the same X-ray source, which also enables material decomposition in the projection domain. The anti-correlated noise suppression and iterative reconstruction algorithms are also applied, resulting in image noise reduction and image quality improvement (Ozguner et al., 2018;Lu et al., 2019;Kim et al., 2019). Virtual monoenergetic imaging (VMI) is a technique used in computed tomography (CT) imaging to generate images at a specific energy level, also known as a monoenergetic image. VMI images are derived from a spectral CT acquisition that captures the attenuation of X-ray photons at different energy levels. VMI are equivalent to single-energy radiographs, including 161 energy levels ranging from 40 to 200 keV. Certainly, the VMI 40 keV has a good soft tissue resolution and maintains low noise, which makes it suitable for routine lesion detection and observation of subtle differences in soft tissue. Bruns et al. (2020) showed that virtual mono-energetic images improve the accuracy of a fully convolutional network used for myocardial segmentation on cardiac CT angiography (CCTA) images, with the DSC increased from 0.846 to 0.901 compared with only conventional CCTA images. Dima et al. (2021) trained a U-Net based model to segment peripancreatic arteries on iodine material decomposition images, and the final DSC exceeded 0.95. These studies implied that spectral images obtained using SDCT can provide additional information to optimize the algorithms. To our knowledge, it has not been investigated whether the automatic segmentation model performance can be improved using VMI 40 keV compared to CI.
The main purpose of this study was to evaluate the performance of the 3D Res-UNet for fully automated segmentation of EC on SDCT scans acquired for radiotherapy treatment planning, and compare performance between CI and VMI 40 keV data.

MATERIALS AND METHODS
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional. Informed consent was waived by Institutional Review Board due to retrospective study characteristics.

Patient enrollment
This retrospective study was approved by the institutional review board of Zhongshan Hospital affiliated to Xiamen University, and informed consent was waived. The Ethical Approval number was XMZSYY-AF-SC-12-03. 259 Patients with histologically confirmed EC by biopsy or surgery, who underwent enhanced CT with SDCT from May 2019 to December 2020 were selected for this study. Exclusion criteria were: (a) chemotherapy or radiotherapy or other anticancer treatments before the IQon CT scans; (b) a history of other malignancies; and (c) incomplete medical records. After excluding 99 cases, 160 patients were finally enrolled in this study and randomized divided into the training dataset (104 patients), validation dataset (26 patients) and test dataset (30 patients). Figure 1 depicts the process used to select the images.

CT image acquisition
Chest CT was performed in both the arterial and venous phases on a SDCT (IQon Spectral CT, Philips Healthcare, Best, The Netherlands). All the patients were in the supine position, with both arms raised and breathing calmly; while holding the breath, scanning range was from the epiglottis to the costophrenic angle to ensure coverage of all esophageal and lung tissues. Before initiating contrast-enhanced scanning, 70 ml contrast medium (Ultravist injection at a concentration of 300 mg/ml; Bayer, Leverkusen, Germany) was injected through the elbow vein at a rate of 2.5 ml/s using a power injector. After the injection, 20-30 ml of normal saline was injected for flushing at a rate of 3 ml/s. Arterial and venous phase images were acquired at 25s and 60s. The following scanning parameters were used: detector collimation, 64 × 0.625 mm; tube voltage, 120 kV; tube current, 180-250 mA; rotation speed, 0.33s/rot; helical pitch, 0.671; matrix, 512 × 512.

Image reconstruction
Spectral base image (SBI) datasets were reconstructed, with a spectral level of three, a slice thickness of one mm, a slice increment of one mm, standard kernel (B) with a window width of 400 HU & a level of 40 HU, and sharp kernel (YB) with a window width of −1400 HU & a level of −600 HU. All arterial phase images were transferred to the dedicated workstation (IntelliSpace V9, Philips Healthcare) for further analysis. Both conventional images (CI) and virtual mono-energetic images at 40 keV (VMI 40 keV ) were derived from SBI.

Esophageal cancer annotation
An open-source software (3D slicer, version 4.11, https://www.slicer.org/) was used to delineate EC contours manually. All the tumor contours were drawn on each slice of VMI 40 kev by a radiologist with 8 years of experience and peer-reviewed by a senior radiologist with 15 years of experience. The adjacent airway, lymph nodes, heart, vascular structures and thoracic vertebrae were avoided during contour drawing. Dice similarity coefficient (DSC) was performed to assess consistency between the two observers for EC segmentation.

Image pre-processing
Image pre-processing was performed as follows.
(1) All voxel sizes were reset to 1 mm × 1 mm × 1 mm by linear interpolation, and matrices were defined as n ×150×180 (slices ×columns ×rows) by Simple ITK (version 2.1, https://simpleitk.org/) just for the inclusion of the EC and a small amount of adjacent tissues, to reduce the calculation burden and graphic card's memory storage.
(2) The ScaleIntensityRanged function in MONAI (version 0.6, https://monai.io/) was used to rescale the density values of our images and all density values in our images which ranged from −50 HU (minimum) to 400 HU (maximum), were normalized and artificially scaled to a range of 0 to 1 using this function. (3) Before sending the images into a convolutional neural network, all images in the training dataset were randomly cropped to size of 96 × 96 × 96 mm 3 and data augmentation was achieved in the training dataset by image flipping and rotation using the RandCropByPosNegLabeld and RandAffined functions in MONAI.

Automatic segmentation model and training
In this study, 3D Res-UNet, the classical U-net scheme combined with a residual connection unit (RCU), was used for the automatic segmentation task. Five residual units were added to the automatic segmentation model in this study, The model uses a kernel size of (3, 3, 3) and varying number of channels (16,32,64,128,256) to build the layers of the encoder and decoder. It uses a stride of two to downsample the spatial dimensions of the encoder and residual units with batch normalization to improve the model's robustness. The architecture of the automatic segmentation model is shown in Fig. 2 The Adam optimizer was used to train the automatic segmentation model with the Dice loss function as the cost function. The batch size was set to eight, the learning rate was set to 0.0001. The model was trained for 600 epochs. The curves depicting the training process of the networks are presented in both Fig. 3. The model was built using PyTorch (version 1.8, https://pytorch.org/) and MONAI (version 0.6, https://monai.io/) and trained on a Linux workstation (Ubuntu version 20.04) with one NVIDIA GeForce GTX3090 GPU with 24 GB memory (NVIDIA, Santa Clara, VA, USA).

Evaluation of segmentation performance in the test dataset
After training the model, we examined its performance in predicting the ROIs in the test dataset. Segmentation accuracy was evaluated using the DSC as follows. (1) A statistical measure of spatial overlap also defined as DSC was derived as 2TP/ (FP+2TP+FN

Statistical analysis
Statistical analyses were performed with R (version 4.0; R Core Team, 2020). Patients' age in the training plus validation datasets versus test dataset were compared with two-tailed Independent t -test. The sex, histopathological features, differentiation grade and T stage of EC were compared by the two-sided Fisher's exact test. The independent sample rank-sum test was used for comparing tumor size. DSC, IOU, ASSD and HD_95 in the test dataset with VMI 40 keV and CI as input data were compared by the paired Wilcoxon signed-rank test. P < 0.05 was considered statistically significant.

Patient characteristics
A total of 160 patients were included in the analysis. They were aged from 39 to 88 years (mean of 64 years).

Model performance
3D Res-UNet was applied to automatically segment the target volumes of EC on CT images. Segmentation results for a representative patient for VMI 40 keV and CI are shown in Fig. 4. The median results of DSC, IOU, ASSD and HD_95 statistics for the segmentation indexes VMI 40 keV and CI in the test dataset are shown in Table 2, while Fig. 5 shows the distribution of the results. We trained and evaluated 3D Res-UNet on VMI 40 keV and CI, separately. There were statistically significant differences in DSC, IOU, ASSD and HD_95 between the models with VMI 40 keV and CI. The results of VMI 40 keV were slightly better for the CI with the automatic segmentation model in DSC, IOU, ASSD and HD_95. According to the good agreement threshold of the DSC (DSC ≥0.7) (Zijdenbos et al., 1994;Bartko, 1991), automatic segmentation results for the EC of VMI 40 keV and CI were consistent with manual segmentation results.

DISCUSSION
In this study, the 3D Res-UNet convolutional neural network (CNN) algorithm was trained for EC segmentation using enhanced VMI 40 keV . After training, the CNN model could be used for the automatic segmentation of EC on both enhanced VMI 40 keV and CI with a median DSC greater than 85%. CNN architectures are supervised models that are trained end to end to learn a hierarchy of features representing different levels of abstraction in a data-driven manner; therefore, good data determine a good model. There is a lack of soft tissue contrast on CT images, some of which are unsatisfactory. Yousefi et al. (2018) presented a 3D end-to-end method based on a CNN for esophageal gross tumor volume segmentation, which achieved a DSC value of 0.73 ± 0.20. Jin et al. (2020a) introduced a progressive semantically-nested network segmentation model for gross tumor volume segmentation of esophageal cancer with a DSC value of 0.751. Huang et al. (2020) investigated a Channel-Attention U-net for semantic segmentation of the esophagus and EC, with a value for IOU in EC of only 0.625. Table 2 shows our model had a median IOU of 0.777 for VMI 40 keV and 0.755 for CI, indicating its superiority over the previously reported models. Our model achieved a median DSC of 0.874 for VMI 40 keV and 0.859 for CI, further improving the accuracy of automatic segmentation, which may be explained by that better data with best contrast to noise ratios (CNRs) were obtained in this study to train a better model drawing the tumor contours on VMI 40 keV arterial phase images (Kang et al., 2019;Mochizuki et al., 2022).
Although SDCT can provide both VMI and CI, the patients were scanned by conventional CT during daily work. In this research, we trained the CNN with same label on both VMI 40 keV and CI produced by SDCT, and compared their segmentation accuracies, achieving median DSCs of 0.874 for VMI 40 keV and 0.859 for CI. This means the model trained on CI also had a good segmentation performance, supporting the use of the trained CNN for automatic segmentation of organs or tumors on outside-sourced conventional CT images. According to this research, the difference in the segmentation performance of DSC was statistically significant (p < 0.05) between VMI 40 keV and CI (median DSC 0.874 vs. 0.859), and similarly as IOU, ASSD and HD_95. The possible reasons are that low-keV VMI significantly improved attenuation and signal-to-noise ratio for the primary tumor, and the CNR of the primary tumor vs. circumjacent anatomical structures (esophageal wall, diaphragm and liver parenchyma) was significantly higher in low-keV VMI, peaking at VMI 40 keV (Lee et al., 2018). Figure 6 shows the segmentation contour of EC between VMI 40 keV and CI, as VMI 40 kev highlighted the edges of the soft tissue, especially when the obstruction of esophageal cancer leads to the dilatation and effusion of the upper esophagus; thus, the DSC of EC was improved by 0.148. However, this may cause another problem: other tissues such as lymph nodes adjacent to the EC may be regarded as part of the EC. Using VMI 40 keV as input data improved the accuracy of segmentation.
The overall effect of the experiment was good both for VMI 40 keV and CI. According to Table 2, automatic segmentation had a more stable performance for VMI 40 keV compared with CI. There was a segmentation effect of our network on some small boundaries when the EC had a cavity, especially on conventional images. Figure 7 shows a patient in whom the upper half of the EC like the wall around was discarded. Our network achieved the worst segmentation performance with a DSC of 0.606 for CI, which was improved to 0.841 for VMI 40 keV . Unfortunately, the part of the EC adjacent to the thoracic vertebrae was too thin, and was abandoned both on VMI 40 keV and CI.
Besides its retrospective design, this study had some limitations. First, it had a singlecenter design and used a single CT scanner. As only data from our institution were applied, generalization to other scanners and sites requires the testing of the generalizability of our automatic segmentation model. Secondly, the sample size of this study was limited. A network is affected by the sample size, and the generalization ability of the model has not been fully verified. Some cases with poor segmentation were found in EC; because of the limited cases included in the current study, more cases are needed to improve the accuracy of automatic segmentation in these cases. In the future, we plan to apply our model to datasets obtained with various CT scanners in various institutions. Thirdly, metastatic lymph node was also required to delineate along with EC lesion for radiotherapy treatment planning. While our primary study only focused on the EC lesion segmentation.

CONCLUSION
In conclusion, automatic EC segmentation is clinically feasible with 3D-ResUNet, and using VMI 40 keV can increase its accuracy. Fully automated localization and segmentation of EC based on the present study could be valuable for detection, diagnosis and chemoradiation therapy planning in the future.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
The authors received no funding for this work.