The impact of training sample size on deep learning-based organ auto-segmentation for head-and-neck patients

To investigate the impact of training sample size on the performance of deep learning-based organ auto-segmentation for head-and-neck cancer patients, a total of 1160 patients with head-and-neck cancer who received radiotherapy were enrolled in this study. Patient planning CT images and region of interest (ROI) delineations, including the brainstem, spinal cord, eyes, lenses, optic nerves, temporal lobes, parotids, larynx and body, were collected. An evaluation dataset of 200 patients was randomly selected and used with the Dice similarity index to evaluate model performance. Eleven training datasets with different sample sizes were randomly selected from the remaining 960 patients to train auto-segmentation models. All models used the same data augmentation methods, network structures and training hyperparameters. A performance estimation model of the training sample size based on the inverse power law function was established. Different performance change patterns were found for different organs. Six organs had the best performance with 800 training samples, and the others achieved their best performance with 600 or 400 training samples. The benefit of increasing the size of the training dataset gradually decreased. Optic nerves and lenses reached 95% of their best performance with 200 training samples, and the other organs reached 95% of their best performance with 40. For the inverse power law fit, the root mean square errors of all ROIs were less than 0.03 (left eye: 0.024, others: <0.01), and the R square of all ROIs except for the body was greater than 0.5. The sample size has a significant impact on the performance of deep learning-based auto-segmentation. The relationship between sample size and performance depends on the inherent characteristics of the organ. In some cases, relatively small samples can achieve satisfactory performance.


Introduction
Segmentation of the target volumes (TVs) and regions of interest (ROIs) is an essential step in radiotherapy (Citrin 2017). Compared to 2D- or 3D-conformal radiotherapy, intensity-modulated radiotherapy (IMRT) can deliver a more conformal dose distribution to the TVs and notably spare ROIs with fewer radiation-related toxicities in head and neck cancer (HNC) (Kosmin et al 2019). The gradients of the dose distribution are steep outside the planning target volume, and structures can receive lower absorbed radiation doses if they are specifically delineated as ROIs in IMRT (2010). Accurate ROI delineation is therefore important, and tumor control and radiation toxicity have shown a high correlation with the accuracy of TV and ROI delineation (Mukesh et al 2012, Walker et al 2014). However, manual delineation performed by a physician on computed tomography (CT) and/or magnetic resonance (MR) images is challenging, time-consuming, and subjective, with potentially large inter- and intraobserver variability (Brouwer et al 2015). This variation in ROIs delineated by different physicians has been shown to be not only theoretical but also to have significant dosimetric effects on patients (Voet et al 2011). Moreover, adaptive radiotherapy requires that the treatment be adapted anatomically or biologically to changes in patients or tumors during the course of treatment (Witt et al 2020). This requires rapid recontouring of the TVs and ROIs to re-evaluate the dosimetry and decide whether to replan during a fraction of radiotherapy in real time.
The time required for delineating ROIs and TVs is an obstacle to adaptive radiotherapy, which hinders the development of radiotherapy to some extent. Rapid auto-segmentation can alleviate these problems and promote the development of adaptive radiotherapy (Kosmin et al 2019).
Head-and-neck ROI segmentation is a typical application scenario of auto-segmentation. The efficacy and safety of head-and-neck radiotherapy require accurate ROI segmentation. More recently, deep learning-based auto-segmentation techniques have been shown to provide significant improvements over many traditional approaches (Cardenas et al 2019). Many studies have investigated using deep learning techniques to segment these organs (Ibragimov and Xing 2017, Ren et al ...). Deep learning-based auto-segmentation relies on the availability of annotated datasets. It follows the general principle of deep learning: the more data there are, the better the results (Cho et al 2015). However, for medical images, high-quality annotated datasets are scarce, and producing them requires specialized medical knowledge, standardized protocols and considerable time and effort. The development of transfer learning (Torrey and Shavlik 2009) and data augmentation (Goodfellow et al 2016) may reduce the data size requirement. However, annotation is still one of the key issues in medical image auto-segmentation. In some cases, it may take more time and effort than the algorithm development itself (Tajbakhsh et al 2020).
Therefore, it is important to estimate the required training sample size before developing a deep learning-based auto-segmentation application. If the required sample size can be estimated from performance goals and preliminary experiments, it may help researchers and developers design better research and software. Moreover, the impact of the ROI characteristics on the relationship between model performance and sample size needs to be further explored.
Three studies (...). However, these two studies were mainly focused on algorithm development, and no further analysis was performed. Narayana et al systematically investigated the impact of training size on the segmentation of 4 brain tissues on MRI and indicated that the Dice similarity coefficient (DSC) of all tissues increased at first and then plateaued (Narayana et al 2020).
Compared to brain tissue auto-segmentation, head-and-neck ROI auto-segmentation is a common task in radiotherapy. It involves more than 10 organs with different structural characteristics, such as the spinal cord, which is a long organ extending across many slices, and the eyes, which have a small volume and appear on only a few slices. This task may therefore be more suitable for training sample size investigations.
In this study, a large head-and-neck dataset with 14 annotated ROIs was used to investigate the effect of the training sample size. Auto-segmentation models were established with different training sample sizes and evaluated on an independent test dataset. Furthermore, we attempted to establish a prediction model that can be used to estimate model performance or the training sample requirement from a few preliminary experiments.

Images and segmentation
From February 2009 to April 2016, a total of 1160 patients with HNC who received radiotherapy at the Fudan University Shanghai Cancer Center were consecutively enrolled in this study. The treatment planning CT images and ROI delineations were collected in DICOM format. None of the CT images were contrast-enhanced. All patients' scans were obtained with the same CT scanner using the same imaging protocol (350 mA tube current, 120 kVp tube voltage, 0.92×0.92 mm pixel size, 5 mm thickness, 512×512 matrix). No additional CT preprocessing was performed before physician delineation. Fourteen ROIs were segmented by dozens of physicians in clinical practice following the same consensus guidelines (Brouwer et al 2015), including the brainstem, spinal cord, eyes, lenses, optic nerves, temporal lobes, parotids, larynx and body. Paired ROIs were divided into left and right (figure 1). All images and their delineations were imported into MIM (MIM Software Inc., Cleveland, OH) and manually checked by a physicist to avoid significant data errors, such as incorrect ROI naming.

Deep learning algorithm for segmentation
The model used in this study was the U-net (Ronneberger et al 2015), which has been the 'baseline net' for organ and tumor segmentation in recent studies (Vrtovec et al 2020). The structure of the U-net is shown in figure 2, and its details are provided in our previous manuscript (Wang et al 2018). The input was 5 CT slices, and the output was 14 ROIs in 14 channels. Figure 3 shows the process of importing data into the model. First, the original CT is preprocessed, including resolution rescaling and density normalization; then all slices from the third slice to the third slice from the end are used to train the model slice by slice. When segmenting a CT slice, we input the 5 CT slices centered on it, which enables the model to make use of part of the spatial information around this slice. The 5 CT slices and the manual delineation corresponding to the middle slice are used together to train the model to segment the middle CT slice.
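The 5-slice input scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `make_slice_windows` and the placeholder slice labels are hypothetical, and each "slice" stands in for a preprocessed 512×512 image.

```python
from typing import List, Sequence, Tuple


def make_slice_windows(volume: Sequence, half_width: int = 2) -> List[Tuple[list, int]]:
    """Return (window, center_index) pairs for every slice that has
    `half_width` neighbours on both sides (5 slices when half_width is 2)."""
    samples = []
    # The first usable center is the third slice (index 2) and the last one
    # is the third slice from the end, as described in the text.
    for center in range(half_width, len(volume) - half_width):
        window = list(volume[center - half_width:center + half_width + 1])
        samples.append((window, center))
    return samples


# Hypothetical 8-slice volume; a real slice would be a 2D image array.
volume = [f"slice_{i}" for i in range(8)]
samples = make_slice_windows(volume)

for window, center in samples:
    print(center, window)
```

Each window would be paired with the manual delineation of its center slice, so the first two and last two slices of a scan serve only as context.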
All models used the same hyperparameters in all training processes. All models were trained for 50 epochs with 12 500 iterations per epoch. The sum of (1 − Dice index) over all ROIs was employed as the loss function. The learning rate was 1e−4, and RMSprop was used as the optimizer. All models converged with these settings. The model was implemented in Keras (Chollet 2015), and all calculations were performed with a GeForce GTX1080Ti GPU.
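The loss described above, the sum of (1 − Dice) over all ROI channels, can be sketched in plain Python as follows. This is an illustration only, not the Keras code used in the study. The soft (probability-weighted) Dice and the small smoothing constant `eps` are assumptions that the text does not specify.

```python
from typing import Sequence


def soft_dice(pred: Sequence[float], truth: Sequence[float], eps: float = 1e-6) -> float:
    """Soft Dice between predicted probabilities and a binary mask (flat sequences).
    `eps` is an assumed smoothing term that avoids division by zero."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return (2.0 * intersection + eps) / (total + eps)


def summed_dice_loss(pred_channels, truth_channels) -> float:
    """Sum of (1 - Dice) over all ROI channels."""
    return sum(1.0 - soft_dice(p, t) for p, t in zip(pred_channels, truth_channels))


# Two hypothetical ROI channels with four pixels each.
truth = [[1, 1, 0, 0], [0, 0, 1, 0]]
perfect = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
wrong = [[0.0, 0.0, 1.0, 1.0], [1.0, 1.0, 0.0, 1.0]]

print(summed_dice_loss(perfect, truth))  # close to 0
print(summed_dice_loss(wrong, truth))    # close to 2, the number of channels
```

Because every ROI contributes one term, the loss ranges from 0 to the number of channels (14 in the study), and small structures weigh as much as large ones.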

Dataset, data augmentation and model training
The total workflow is presented in figure 4. First, an independent test dataset was created with 200 patients randomly selected from the whole dataset for performance evaluation. Then, eleven different training datasets, which included 10, 20, 30, 40, 80, 120, 160, 200, 400, 600, and 800 patients, were generated by randomly selecting them from the remaining 960 patients with different random seeds.
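The sampling scheme above can be sketched as follows. The function name `split_patients`, the seed values and the integer patient identifiers are hypothetical. Drawing each training set independently (rather than nesting smaller sets inside larger ones) is an assumption, since the text only states that different random seeds were used.

```python
import random

TRAIN_SIZES = [10, 20, 30, 40, 80, 120, 160, 200, 400, 600, 800]


def split_patients(patient_ids, n_test=200, sizes=TRAIN_SIZES, base_seed=0):
    """Draw a held-out test set, then one training set per size from the rest."""
    rng = random.Random(base_seed)
    test_ids = rng.sample(patient_ids, n_test)
    held_out = set(test_ids)
    remaining = [p for p in patient_ids if p not in held_out]

    training_sets = {}
    for i, size in enumerate(sizes):
        # A different seed for every training set (assumed seeding scheme).
        size_rng = random.Random(base_seed + 1 + i)
        training_sets[size] = size_rng.sample(remaining, size)
    return test_ids, training_sets


patient_ids = list(range(1160))  # placeholder identifiers
test_ids, training_sets = split_patients(patient_ids)

print(len(test_ids), sorted(training_sets))
```

Repeating the call with four different `base_seed` values would mirror the four repetitions of the whole process, including test dataset sampling.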
Data augmentation (Goodfellow et al 2016) is a common strategy used to alleviate the training size requirements. We implemented two augmentation processes in this study: gray level disturbance and shape disturbance. For the gray level disturbance, the gray values of the CT images were multiplied by a number randomly selected from 0.9 to 1.1, and another random number from −0.1 to 0.1 was then added. Then, the CT images and binary contour images were deformed using an affine transform. The deformation algorithm used in this study was divided into two steps. First, we obtained the coordinates of three vertices (top left, top right, and bottom left); then each vertex was shifted randomly in the range of [−1, 1] * image length (512 in our study). All CT images and binary contours were transformed by the resulting affine transformation. Data augmentation was used only during network training.
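The two disturbances can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' code: the function names are hypothetical, pixel values are assumed to be already normalized, and only the affine coefficients are computed (resampling the image with them is omitted). The shift range is exposed as `max_shift_fraction`, where 1.0 corresponds to the [−1, 1] * image length range stated in the text.

```python
import random


def gray_level_disturbance(pixels, rng):
    """Multiply by a random factor in [0.9, 1.1], then add a random offset in [-0.1, 0.1]."""
    scale = rng.uniform(0.9, 1.1)
    offset = rng.uniform(-0.1, 0.1)
    return [scale * v + offset for v in pixels]


def random_affine(size, max_shift_fraction, rng):
    """Shift the top-left, top-right and bottom-left corners at random and return
    (a, b, c, d, tx, ty) such that x' = a*x + b*y + tx and y' = c*x + d*y + ty."""
    src = [(0.0, 0.0), (float(size), 0.0), (0.0, float(size))]
    limit = max_shift_fraction * size
    dst = [(x + rng.uniform(-limit, limit), y + rng.uniform(-limit, limit)) for x, y in src]
    (x0, y0), (x1, y1), (x2, y2) = dst
    # Three point pairs determine an affine map; with these source corners
    # the solution has a simple closed form.
    a, c = (x1 - x0) / size, (y1 - y0) / size
    b, d = (x2 - x0) / size, (y2 - y0) / size
    return a, b, c, d, x0, y0


def apply_affine(coeffs, point):
    """Map one (x, y) point with the affine coefficients."""
    a, b, c, d, tx, ty = coeffs
    x, y = point
    return a * x + b * y + tx, c * x + d * y + ty


rng = random.Random(42)  # arbitrary seed for the example
disturbed = gray_level_disturbance([0.0, 0.5, 1.0], rng)
coeffs = random_affine(512, 1.0, rng)
print(disturbed)
print(apply_affine(coeffs, (256.0, 256.0)))
```

The same coefficients would be applied to a CT slice and to its binary contour images so that image and labels stay aligned.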
Each training dataset was used to train a deep learning-based segmentation model with the same network structure and hyperparameters (such as epochs and iterations). Eleven auto-segmentation models (m10, m20, m30, m40, m80, m120, m160, m200, m400, m600, and m800) were developed corresponding to these training datasets. For example, m10 and m800 were established as follows: a patient was randomly selected from the training set of 10 or 800 samples, respectively, and, after data augmentation, input into the U-net for training; this process was repeated 12 500 times per epoch. All models were trained for a total of 50 epochs. Then, all of the models were applied to the test dataset. For robustness against varying conditions, the whole process, including test dataset sampling, was performed four times.

Performance evaluation
The performance was evaluated by computing the DSC as follows:
$$\mathrm{DSC} = \frac{2\,(A \cap B)}{A + B} \qquad (1)$$
where A is the volume of the ground-truth segmentation (manually delineated by the physician); B is the volume of the auto-segmentation contour; and A∩B is the volume that the manual segmentation and auto-segmentation contours have in common. The calculation of volume was based on pixels in our study. The DSC ranges between 0 and 1 (0=no overlap, 1=complete overlap). A higher DSC value indicates that the corresponding model works better. Meanwhile, to demonstrate the increasing trend, a normalized DSC was calculated by dividing each DSC by the best performance of the corresponding ROI.
Figure 3. The process of data preprocessing. The images were rescaled to a resolution of 512×512, and the image density was normalized to between 0 and 1.
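As an illustration of the pixel-based DSC used in this section, a minimal sketch on flat binary masks is given below. The function name `dice_similarity` and the example masks are hypothetical. Returning 1.0 when both masks are empty is an assumption, since the text does not cover that case.

```python
def dice_similarity(truth_mask, auto_mask):
    """DSC = 2 * |A ∩ B| / (|A| + |B|) for two equally sized binary masks,
    given as flat sequences of 0/1 pixel values."""
    if len(truth_mask) != len(auto_mask):
        raise ValueError("masks must have the same number of pixels")
    a = sum(1 for t in truth_mask if t)
    b = sum(1 for p in auto_mask if p)
    overlap = sum(1 for t, p in zip(truth_mask, auto_mask) if t and p)
    if a + b == 0:
        return 1.0  # assumed convention: two empty masks agree completely
    return 2.0 * overlap / (a + b)


# Hypothetical 6-pixel masks: 3 pixels each, 2 of them shared.
truth = [1, 1, 1, 0, 0, 0]
auto = [0, 1, 1, 1, 0, 0]
print(dice_similarity(truth, auto))  # 2 * 2 / (3 + 3)
```

For a 3D ROI, the masks of all slices would be flattened and concatenated before the call, so that the overlap is counted over the whole volume.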

Performance estimation model
An inverse power law function was used to establish the relationship between the mean DSC and the training sample size (equation (2)). To investigate the accuracy of the performance estimation model, we fitted the model with the performance results from up to 200 training samples (m10∼m200) and compared it to the model fitted with the full performance results (m10∼m800). The prediction power was evaluated by R square and root mean square error (RMSE). The estimation model fitting and evaluation were performed in R (version 3.0).

Results
Table 1 shows the DSCs of the 11 deep learning models for the 14 ROIs. Overall, the DSCs increased with increasing training sample size for all ROIs. Meanwhile, the change pattern varied between ROIs. The m800 model showed the best performance for six ROIs, including the body (DSC: 0.988), left eye ...
Figure 5 presents the relationship between the normalized DSC value and the training sample size. Lenses and optic nerves needed 200 samples to achieve 95% of the best performance. The other ROIs required 40 samples to achieve 95% of the best performance.
Figure 6 shows the results of the two performance estimation models. The red curves were fitted with m10∼m800, and the blue curves were fitted with m10∼m200. The fitted R square of all ROIs except the body was greater than 0.54 (body: 0.22), and the RMSE of all ROIs except the left eye was less than 0.01 (left eye: 2.17×10⁻²).
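The fitting procedure can be sketched as follows. Because equation (2) is not reproduced here, the three-parameter form DSC(n) = a − b·n^(−c) is an assumption (a common inverse power law learning curve), not necessarily the exact function used in the study. The study fitted its models in R; this sketch uses plain Python, scans the exponent c over a grid and solves for a and b by linear least squares. The data are synthetic values generated from the assumed form, not results from the study.

```python
import math


def fit_inverse_power_law(sizes, scores, c_grid=None):
    """Fit score ~ a - b * n**(-c). For a fixed c the model is linear in
    u = n**(-c), so a and b follow from simple linear regression."""
    if c_grid is None:
        c_grid = [0.01 * k for k in range(1, 301)]  # assumed search range for c
    m = len(sizes)
    mean_y = sum(scores) / m
    best = None
    for c in c_grid:
        u = [n ** (-c) for n in sizes]
        mean_u = sum(u) / m
        var_u = sum((ui - mean_u) ** 2 for ui in u)
        if var_u == 0.0:
            continue
        slope = sum((ui - mean_u) * (yi - mean_y) for ui, yi in zip(u, scores)) / var_u
        intercept = mean_y - slope * mean_u
        sse = sum((intercept + slope * ui - yi) ** 2 for ui, yi in zip(u, scores))
        if best is None or sse < best[0]:
            best = (sse, intercept, -slope, c)
    _, a, b, c = best
    return a, b, c


def predict(params, n):
    a, b, c = params
    return a - b * n ** (-c)


def rmse(params, sizes, scores):
    residuals = [predict(params, n) - y for n, y in zip(sizes, scores)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


def r_square(params, sizes, scores):
    mean_y = sum(scores) / len(scores)
    ss_res = sum((predict(params, n) - y) ** 2 for n, y in zip(sizes, scores))
    ss_tot = sum((y - mean_y) ** 2 for y in scores)
    return 1.0 - ss_res / ss_tot


SIZES = [10, 20, 30, 40, 80, 120, 160, 200, 400, 600, 800]
# Synthetic, noise-free scores from the assumed form (a=0.9, b=0.5, c=0.5).
synthetic = [0.9 - 0.5 * n ** -0.5 for n in SIZES]

full_params = fit_inverse_power_law(SIZES, synthetic)           # m10 to m800
small_params = fit_inverse_power_law(SIZES[:8], synthetic[:8])  # m10 to m200

print(full_params, rmse(full_params, SIZES, synthetic))
print(predict(small_params, 800), synthetic[-1])
```

Fitting on the eight smallest sizes and predicting at 400, 600 and 800 mirrors the comparison between the two estimation models. With real, noisy DSC values the two fits would generally differ.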

Discussion
In this study, we systematically investigated the impact of sample size on the automatic segmentation of HNC ROIs. Our results showed that sample size has a significant impact on the performance of deep learning-based auto-segmentation. Different organs may show different patterns of performance change with increasing sample size.
In figure 5, all ROIs except the external body first showed a trend of rapid growth and then slow growth. Lenses and optic nerves need 200 samples to achieve 95% of the best performance. The other ROIs require 40 samples to achieve 95% of the best performance. This may be because the volume of the optic nerve and lens is smaller than that of the other ROIs, so a larger sample size is needed for training. Although the volume of the eye is relatively small, it has a relatively fixed anatomical position compared with the optic nerve and lens, so it may be able to obtain a better segmentation effect by training with a small sample size.
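The 95% criterion used above can be made concrete with a short sketch. The function names and the DSC values are hypothetical placeholders, not results from this study.

```python
def normalized_dsc(dsc_by_size):
    """Divide each DSC by the best DSC achieved for the same ROI."""
    best = max(dsc_by_size.values())
    return {n: value / best for n, value in dsc_by_size.items()}


def smallest_size_reaching(dsc_by_size, fraction=0.95):
    """Smallest training sample size whose normalized DSC reaches `fraction`."""
    normalized = normalized_dsc(dsc_by_size)
    for n in sorted(normalized):
        if normalized[n] >= fraction:
            return n
    return None  # only possible when fraction is greater than 1


# Placeholder mean DSC values for one ROI, keyed by training sample size.
example_roi = {10: 0.60, 20: 0.70, 40: 0.78, 80: 0.80, 200: 0.81, 800: 0.82}
print(smallest_size_reaching(example_roi))  # 0.78 / 0.82 is about 0.951, so 40
```

Applying such a check to every ROI shows which structures are adequately served by small training sets and which need more data.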
In addition, the absolute performance increase may be relatively small for some ROIs; the difference between m40 and the best model is less than 0.02 for all ROIs except the lenses and optic nerves, and the difference between m200 and the best model is less than 0.03 for the lenses and optic nerves. Detailed statistics can be found in the appendix (figures A1 and A2).
Based on these data, we recommend using 40 samples for brainstem, spinal cord, eye, temporal lobe, parotid, larynx and body auto-segmentation model training and using more than 200 samples for lens and optic nerves auto-segmentation model training.
In figure 6, a good fitting effect was observed with the full data for all ROIs except the body. Compared to the other ROIs, the performance increase of the body was relatively small, which may have degraded the fit. The results of the performance estimation model fitted with up to 200 training samples deviate from those of the full data model, but the absolute deviation is small. The absolute deviation between the predicted value and the observed value in m400, m600 and m800 was 9.64×10⁻³ (range: 1.18×10⁻⁴ (minimum, body in m400) ∼ 3.09×10⁻² (maximum, left lens in m600)). We believe it is suitable for rough performance estimation. The dataset in our study was a clinical-level delineation dataset that was contoured by dozens of physicians over many years. There may be large interobserver variability. How this variability affects automatic segmentation results is unknown. In this study, we repeated the whole process 4 times to reduce the impact of random sampling.
In this study, we used a 2.5D U-net to segment organs. Vu et al (2020) referred to this kind of multi-slice input network as a pseudo-3D network, which employs a stack of adjacent slices as input and predicts the contours on the central slice. This approach enables the network to capture 3D spatial information around the slice at a lower computational cost. Vu et al (2020) found that the pseudo-3D approach greatly surpassed the fully 3D CNN in computational efficiency and was significantly better than a regular 2D CNN. However, the U-net in our research is relatively naive, with little parameter optimization, which may lead to model performance that is not as good as that of other studies with high data consistency and elaborate model optimization. All models were trained with the same hyperparameters, which were set according to experience with large training sample sizes. For example, the epoch number may be too large for a small sample. Overfitting was observed for small sample sizes (figure 7). Figure 7 also demonstrates that the decrease in training dataset performance is more pronounced than the increase in test set performance.
Not all ROIs achieved their best results at m800. The DSC of m800 was significantly lower than that of m600 for the brainstem, spinal cord, left lens and larynx. The DSC decrease is approximately 0.01. This may be caused by inconsistent annotation in our dataset. Meanwhile, the standard deviation did not decrease with increasing sample size, which may also be caused by inconsistent annotation. However, we cannot verify this hypothesis in this study. Further study is required to quantify the delineation consistency.
There are also some limitations in our study. First, the consistency of the samples is difficult to evaluate and control. The samples in our study were manually delineated by many physicians, which may mean that the inconsistency of the data is relatively high. In addition, our estimation of the relationship between sample size and model effect is an empirical assessment. Because deep learning is a black box, further theoretical research is needed on the relationship between sample size, sample quality and model effect.

Conclusions
The sample size has a significant impact on the performance of deep learning-based auto-segmentation. The relationship between sample size and performance depends on the inherent characteristics of the organ. In some cases, relatively small samples can achieve satisfactory performance.