Fully Automated Delineation of Gross Tumor Volume for Head and Neck Cancer on PET-CT Using Deep Learning: A Dual-Center Study

Purpose In this study, we proposed an automated deep learning (DL) method for head and neck cancer (HNC) gross tumor volume (GTV) contouring on positron emission tomography-computed tomography (PET-CT) images. Materials and Methods PET-CT images were collected from 22 newly diagnosed HNC patients, of whom 17 (Database 1) and 5 (Database 2) were from two centers, respectively. An oncologist and a radiologist decided the gold standard of GTV manually by consensus. We developed a deep convolutional neural network (DCNN) and trained the network based on the two-dimensional PET-CT images and the gold standard of GTV in the training dataset. We did two experiments: Experiment 1, with Database 1 only, and Experiment 2, with both Databases 1 and 2. In both Experiment 1 and Experiment 2, we evaluated the proposed method using a leave-one-out cross-validation strategy. We compared the median results in Experiment 2 (GTVa) with the performance of other methods in the literature and with the gold standard (GTVm). Results A tumor segmentation task for a patient on coregistered PET-CT images took less than one minute. The dice similarity coefficient (DSC) of the proposed method in Experiment 1 and Experiment 2 was 0.481∼0.872 and 0.482∼0.868, respectively. The DSC of GTVa was better than that in previous studies. A high correlation was found between GTVa and GTVm (R = 0.99, P < 0.001). The median volume difference (%) between GTVm and GTVa was 10.9%. The median values of DSC, sensitivity, and precision of GTVa were 0.785, 0.764, and 0.789, respectively. Conclusion A fully automatic GTV contouring method for HNC based on DCNN and PET-CT from dual centers has been successfully proposed with high accuracy and efficiency. Our proposed method is of help to the clinicians in HNC management.


Introduction
Head and neck cancer (HNC) is a type of cancer originating from the tissues and organs of the head and neck with high incidence in Southern China [1]. Radiation therapy (RT) is one of the most effective therapies, which relies heavily on the contouring of tumor volumes on medical images. However, it is time-consuming to delineate the tumor volumes manually. Besides, the manual delineation is subjective, and the accuracy depends on the experience of the treatment planner. Compared to manual delineation, automatic segmentation can be relatively objective. Nowadays, there have been studies reporting the automatic segmentation of tumor lesions on magnetic resonance images of HNC using different methods [2][3][4][5][6][7][8][9][10].
Positron emission tomography-computed tomography (PET-CT) has played an important role in the diagnosis and treatment of HNC, providing both anatomical and metabolic information about the tumor. e automatic or semiautomatic segmentation of tumor lesions on PET-CT or PET images of HNC has been reported, using machinelearning (ML) methods such as k-nearest neighbor (KNN) [11,12], Markov random fields (EM-MRFs) [13], adaptive random walker with k-means (AK-RW) [14], decision tree algorithm [15], and active surface modeling [16]. e segmentation of tumor lesions on the coregistered PET and CT images has shown better results than those on solely PET or CT images [17,18]. However, the application of PET-CT has increased the amount and the complexity (multimodality) of the image data. Also, to propose a robust and practical MLbased automatic segmentation method, it is often necessary to train and test the method with heterogeneous image data from multicenter [19], which makes the training and testing of ML systems more challenging.
Compared to the traditional ML methods, deep learning (DL) allows extracting the features automatically instead of subjective feature extraction and selection in conventional ML techniques, which may be more appropriate for automatic segmentation in multimodality data and multicenter data. DL can easily recognize the intrinsic features of the data [20]. DL techniques, such as stacked denoising autoencoder (SDAE) [21] and convolutional neural network (CNN) [22][23][24], have been used in tumor segmentation successfully with improved accuracy.
No studies have been reported to apply the deep convolutional neural network (DCNN) in the automatic GTV delineation for HNC patients on PET-CT images. In our study, we proposed an automatic method of GTV delineation for RT planning of HNC based on DL and dual-center PET-CT images, aiming to improve the efficiency and accuracy.

Materials and Methods
In brief, our methodology included the contouring of the gold standard, training and testing of the DL model, and evaluating the performance of our trained model. After reviewing the MRI, CT, and PET images, an oncologist and a radiologist decided the contouring of GTV by consensus which was treated as the gold standard in the following training and testing of our method. We developed a deep convolutional neural network (DCNN) for HNC tumor lesion segmentation, and then we trained the network based on the PET-CT images and the gold standard of GTV in the training dataset. In the testing step, we input the testing dataset to the network, and it automatically contoured the GTV. To test the accuracy of this automated method, we compared the results of our method with those of other methods in the literature and with the gold standard.

Structure of Our DCNN Model.
Inspired by the fully convolutional network [25] and U-net [26], we designed a DCNN model for GTV delineation. e structure of our proposed DCNN model is shown in Figure 1. is network consisted of two stages: feature representation phase and scores map reconstruction phase.

Feature Representation Phase.
e main purpose of the feature representation phase was to extract the feature information of PET images and CT images, by combining the low-level features to represent the high-level features with semantic information. e feature representation phase contained 5 downsampling blocks, 4 convolution (conv) layers, and 4 rectifier linear unit (ReLU) layers ( Figure 1). A downsampling block included a convolution layer, an ReLU layer, and a pooling (pool) layer. e first convolution layer was to extract the low-level features of PET images and CT images, respectively, by filters of 5 × 5 voxels and to fuse them together. We were able to fuse the features because the PETand CT images were input simultaneously with the same gold standard. We in the next 4 convolution layers applied the convolutions for the permutation and combination of the low-level features to obtain more high-level features with semantic information. In all the 5 downsampling blocks, the convolution layers were followed by a pooling layer. We applied pooling with 2 × 2 filters and 2 strides which decreased the length and width of the feature map by 50%.
us, it could reduce the number of connection parameters and the computational time and provided the position invariance and more global information. e use of unaltered filters on a smaller image may contribute to the larger local receptive fields, and these enlarged local receptive fields could extract more global features. After each convolution layer, we used an ReLU layer as an activation layer to increase the nonlinearity of our network and to accelerate the convergence. e length and width of the feature maps were reduced by 50% after a downsampling block. After the feature map size was reduced to 16 × 16, it was then connected with a convolution layer with 16 × 16 filters. It means that every neuron in the following layer was connected with all the neurons in the previous layer to imitate the fully connected layer in the traditional classification network. e size of the feature maps was 1 × 1 pixel after this convolution layer. en, we used 2 convolution layers with 1 × 1 filters for the permutation and combination of these features to obtain more abstract information. e finally acquired 1 × 1 scores maps were used as the input in the scores map reconstruction phase.

Scores Map Reconstruction Phase.
e main purpose of the scores map reconstruction phase was to reconstruct the scores map into the same size of input images by upsampling.
is reconstruction phase consisted of 5 upsampling blocks, a convolution layer, and an ReLU layer. An upsampling block was composed of a deconvolution (deconv) layer, a concatenation (concat) layer, a convolution layer, and an ReLU layer. e deconvolution layer was designed for upsampling. e first deconvolution layer reconstructed the 1 × 1 scores map to 32 × 32 by 32 × 32 filters. However, we found that deconvolution would cause the loss of the high-resolution information in images. To overcome this problem, we utilized the concatenation layer to fuse the feature maps in the previous pooling layers or convolution layers with the current feature maps in the deconvolution layer. We believed that these skip-layer designs could capture more multiscale contextual information and improve the accuracy of segmentation. To fuse the lowand high-resolution information pixel by pixel, we set the filters of all the following convolution layers at 1 × 1.
With all the upsampling blocks, we finally reconstructed the scores maps to an output image with a size of 512 × 512, the same as in the input PET or CT images. In order to optimize the network, we estimated the loss by calculating the Euclidean distance between the gold standard and the reconstructed tumor lesions [27,28]. en, the parameters of the network were iterated and renewed by backpropagation from the loss. In our experiment, we decided to use the Euclidean distance to estimate the loss because it had shown better performance than the cross entropy loss that was used in some other studies of Ronneberger et al. [26].  PET-CT scans in both centers were from the top of the skull to the shoulder. Acquisition time of PET for each bed position was 2.5 minutes. e patients from center 1 were scanned with Discovery STE (GE Healthcare, Milwaukee, USA); the spatial resolution and image matrix of most CT images were 0.49 × 0.49 × 2.5 mm 3 and 512 × 512 × 63, respectively, while the spatial resolution and image matrix of the PET images were 1.56 × 1.56 × 3.27 mm 3 and 256 × 256 × 47, respectively. e PET scan in center 1 was acquired in 3dimensional mode and reconstructed using the orderedsubset expectation maximization iterative algorithm. e patients from center 2 were scanned with Discovery 690 PET-CT scanners (GE Healthcare, Milwaukee, USA); the spatial resolution and image matrix of the CT images were 0.59 × 0.59 × 3.27 mm 3 and 512 × 512 × 47, respectively, while the spatial resolution and image matrix of PET images were 1.17 × 1.17 × 3.27 mm 3 and 256 × 256 × 47, respectively. In center 2, the PET scanning was acquired in 3-dimensional mode and reconstructed using the VPFXS reconstruction method.

Training of
To make use of the information of both the PET image and the CT image, we performed coregistration of PET to CT images by sampling the PET images using linear interpolation in SPM8 (Wellcome Department of Imaging Neuroscience, London, United Kingdom). Finally, we had 934 samples (one sample includes one slice of the CT image and one coregistered slice of the PET image, both with a matrix size of 512 × 512) for the 17 patients from center 1 as Database 1 and 200 samples for the 5 patients from center 2 as Database 2.
e primary GTVs were manually outlined by an experienced radiologist and double-checked by an experienced oncologist on the registered PET/CT with reference to MRI, PET, and CT images using the ITK-SNAP software (http:// www.itksnap.org) [29]. e resultant GTV contouring was used as the gold standard in training and testing of our proposed model and for the comparisons with automatic segmentation in terms of their volume and geometrical overlap. Specifically, we discarded the images in which the tumor size was smaller than 0.5 cm 2 (in the 2-dimensional images) by considering the partial-volume effect (PVE) in the PET image, suggested by the radiologist. PVE could affect the imaging accuracy of small tumor lesions whenever the tumor size is less than 3 times the full width at half maximum (FWHM) of the reconstructed image resolution [30].
We performed two experiments with our data. In Experiment 1, we evaluated the proposed method using only the data in Database 1. We evaluated the proposed method using a leave-one-out cross-validation (LOOCV) strategy, leaving the images of one patient for testing and the images of all other patients for training. To balance the positive and negative samples in the training dataset, we selected all the slices with tumor lesions as positive samples and randomly selected the same number of slices without tumor lesions as positive samples. To satisfy the need of huge training data in DL, we augmented the training dataset to nearly 15,000 samples by rotating the images, horizontal mirroring, changing the contrast, and image scaling. In Experiment 2, we used the two databases (1134 samples) and augmented the training dataset to nearly 18,000 samples and also evaluated the method by using the LOOCV strategy similarly. Before training and testing in both Experiments 1 and 2, all data were normalized by performing min-max normalization.

Network Training.
e training of the whole network was composed of three stages. At the first stage, we obtained an output image after the third upsampling block (Figure 1), and the size of the output image was 128 × 128. In the second stage, which was initialized by the network parameters in the first stage, a 256 × 256 scores map was obtained. Finally, we based on the network parameters in the second stage trained the whole network, and the scores maps were used to reconstruct an output image with a size of 512 × 512 (the same as the size of the input PET or CT images). e model was trained by using an Adam optimizer for 200,000 iterations with a fixed learning rate of 0.00001. We used a GPU NVIDIA GeForce GTX 1080 Ti equipped on an Intel Xeon E5-2650 2.30 GHz × 16 machine and the DL framework Keras for training [31]. e whole training procedure took about 24 hours.

Evaluation of Automatic GTV Delineation Performance.
After the successful training of the DCNN model, we used the testing dataset to evaluate the segmentation performance of our method by calculating the dice similarity coefficient (DSC) as follows: where true positive (TP) denotes the correctly identified tumor area, false positive (FP) denotes the normal tissue that is incorrectly identified as tumor, and false negative (FN) denotes the tumor area that is incorrectly predicted as normal tissue. DSC describes the overlap between the gold standard and the automatic segmentation result.

Comparison with Other Methods in the Literature.
To instigate the improvement of our method, we also compared our results with the previous studies. We tried to apply these previous methods on our database; however, the performance was all lower than the published results. Hence, we directly compared our results with those in these publications, in terms of DSC. Although they may not be reasonably comparable, these comparisons to some extent provide insights about how our method outperformed the similar studies. Note that for a fair comparison, we used the results of median performance in Experiment 2 for the comparison.

Comparison with the Gold Standard of GTV.
Although we repeated our experiments for several times, for a fair comparison, we used the results of median performance in Experiment 2 with dual-center data for the comparison, and the results were recorded as GTVa. e gold standard by manual contouring was recorded as GTVm. Pearson's correlation was performed between GTVa and GTVm. To further evaluate the accuracy of GTVa against GTVm, we calculated mean surface distance (MSD), sensitivity, and precision as follows: where X and Y denote the boundary of autosegmentation and the gold standard, respectively (y 1 i , i � 1, . . . , N ∈ X is the boundary points of X; y 2 j , j � 1, . . . , M ∈ Y is the boundary points of Y, respectively). MSD describes the mean Euclidean distance between GTVa and GTVm along their boundaries. Sensitivity describes how much the overlap of GTVa and GTVm was included in GTVm. Precision describes how much the overlap of GTVa and GTVm was included in GTVa. e absolute difference between GTVa and GTVm was also estimated by calculating GTVa − GTVm [32].

Automatic GTV Delineation
Performance. With our trained model, a tumor segmentation task for a sample (a coregistered PET image and a CT image, two-dimensional) took about 0.28 seconds; thus, for an HNC patient with around 50 slices of coregistered PET/CT images, our method took about 14 seconds for GTV segmentation. An example of segmentation with high accuracy is shown in Figure 2, in which the DSC was 0.943. Two typical examples of the poor results and their corresponding PET images are shown in Figures 3 and 4, in which the DSC was 0.610 and 0.408, respectively. As shown in Figure 5 e segmentation results in Experiment 2 were recorded as GTVa and used for the following comparisons.

Comparison with Other Methods in the Literature.
e results of previous studies about HNC segmentation based on PET-CT are shown in Table 1. e mean DSC of our method in Experiment 2 for 22 patients was 0.736. Stefano et al. [14] achieved a high DSC of 0.848; however, their method was on PET images only and was semiautomatic.

Discussion
We proposed an HNC automated GTV contouring method based on DL and PET-CT images, with encouraging segmentation results. Most of the studies on HNC delineation were based on PET images only [14][15][16][17], in which the anatomical information was insufficient due to the low spatial resolution compared to CT or MRI [13]. Yang et al. [13] achieved similar segmentation accuracy (DSC � 0.740); however, their method was based on three modalities (PET, CT, and MRI). e methods of Stefano et al. [14] and Song et al. [17] were all semiautomatic. Berthon et al. [15] reported a higher accuracy of 0.77; however, their gold standard for performance evaluation incorporated the information of automatic segmentation results. Compared to these studies [13][14][15][16][17] with the data from one center only, our proposed method shows stable performance on dual-center data. To summarize, our proposed method has shown relatively high accuracy and is fully automatic, making use of both the metabolic and anatomic information.
Either in Experiment 1 or in Experiment 2, the performance was high and stable for Database 1.
is may suggest that the proposed DCNN model was effective and robust. Note that in Experiment 2, the DSC was higher than that in Experiment 1.
is may be because with more samples, more features can be learned by our DCNN model, and thus, the segmentation accuracy could be improved. However, we also in Experiment 2 observed that the accuracy for Database 2 was lower than that for Database 1.
e reason may be that the features were somehow different between these two databases. e features learned from Databases 1 and 2, mainly from Database 1, were probably not suitable enough to be applied to Database 2. Note that with only 22 patients, we already achieved such good performance of automatic contouring. However, we may recruit more data to further verify the robustness of our model. e image features are critical in machine-learningbased segmentation tasks. We used the multimodality images, namely, PET and CT images, as the input of our DCNN model, and this may improve the segmentation than with PET or CT images alone. is finding echoed the results    reported in the study of Song et al. [17] or Bagci et al. [18]. As shown in Figure 2, since the metabolism is significantly different between tumor regions and normal tissues, the contrast of the tumor region to the adjacent tissues is high; thus, the location of tumor is easily detected in PET images. However, the spatial resolution of PET images is low; thus, the tumor boundary is unclear in PET images. In CT images with higher spatial resolution, the anatomical information is more sufficient for detecting the boundary of tumors. By using both PET and CT images, our method extracted and combined both metabolic and anatomical information as the efficient features for more accurate segmentation. e DL technique we used to extract the features has shown more advantages than traditional machine-learning    (Table 1). As shown in Figure 5, the region marked by the blue circle with high metabolism was actually an inflammation region, which looks very similar to the tumor lesions. e inexperienced clinicians may incorrectly consider this region as a tumor lesion, while our trained model was able to learn the difference between these inflammation regions and the tumor lesions and correctly recognized this as a nontumor region. Such an example showed that our DCNN method can extract the intrinsic features of tumor lesions and finally achieve better GTV contouring results. Besides, we used a skip-layer architecture for the fusion of the feature maps at the feature representation phase and scores map reconstruction phase, which can be another technical improvement in our method. As shown in Figure 7, although the semantic information of the features in the feature representation phase was worse than that of the features in the scores map reconstruction phase, it could help fix the problem of information loss in the reconstruction procedure. Compared to the feature map fusion method used by Long et al. [25], our method successfully incorporated the more useful features during the process of feature fusion. We believed that this fusion improves the accuracy of segmentation by using the skip-layer architecture.

Contrast Media & Molecular Imaging
e comparisons between GTVa and GTVm ( Figure 6 and Table 2) indicated that GTVa was similar and close to GTVm. However, there were still some shortcomings in our automatic method. Firstly, the GTVa was unsatisfactory in some tumors. As shown in Figure 3, the tumor in the PET image was large, but the boundary was unclear; thus, part of the tumor was incorrectly identified as normal tissue. As shown in Figure 4, the low metabolism region, which was within the region where tumor lesions were often seen in some other patients, was incorrectly detected as tumor lesions. As shown in Figure 6(a), two patients showed a large difference between GTVm and GTVa. e tumors of these two patients were large with lots of lymphatic metastasis. is kind of tumor was few in our database; thus, our method failed to learn the features of these kinds of tumors. Secondly, we discarded the images in which the tumor size was smaller than 0.5 cm 2 (in the 2-dimensional images) because such tumor lesions were difficult to detect in PET images by visual assessment. In addition, the imaging accuracy of small tumor lesions could be affected due to PVE. us, the performance of our method for such small tumors remains unclear.
Our results may be improved in future studies in the following aspects. Firstly, more data should be recruited for training a better model and to test-retest the performance. Especially, the data from different centers should be better balanced. Also, the MRI images may be employed as they provide better soft tissue contrast and may improve the performance. Secondly, in the training and testing, only the 2-dimensional images were used and the volumetric information was abandoned. We would carefully improve the network architecture and also adjust the training parameters for better segmentation results. Finally, for successful application of our method in the radiotherapy of HNC, the automatic contouring of organs at risk should also be incorporated, and the clinical target volume (CTV) and planning target volume (PTV) should also be drawn.

Conclusion
In this study, we successfully proposed and verified a robust automated GTV segmentation method for HNC based on DCNN and dual-center PET-CT images. With multimodality images, both anatomic and metabolic features are extracted automatically and objectively, which contribute to the increased accuracy.
e DL algorithm showed good potential in GTV segmentation. All these contributed to the high accuracy and efficiency of our method compared to manual contouring. Our method may be helpful in aiding the clinicians in radiotherapy of HNC; thus, it is of great potential in HNC patient management. Future studies may aim to improve further the segmentation accuracy with more training data and optimized network structure, to draw CTV/PTV, and to verify our method with data from multicenters.
Data Availability e authors do not have permission to share data.

Conflicts of Interest
e authors declare that they have no conflicts of interest.