A Survey on Automatic Delineation of Radiotherapy Target Volume based on Machine Learning

ABSTRACT Radiotherapy is one of the main treatment methods for cancer, and delineation of the radiotherapy target volume is the basis and prerequisite of precise treatment. Artificial intelligence techniques, represented by machine learning, have been studied extensively in this area and have improved the accuracy and efficiency of target delineation. Following the procedure by which doctors delineate the target volume, this article reviews the applications and research of machine learning in medical image registration, normal organ delineation, and treatment target delineation, and gives an outlook on development prospects.


INTRODUCTION
According to the estimates of the global cancer burden in GLOBOCAN 2020, which are based on cancer incidence and mortality data provided by the International Agency for Research on Cancer [1], there were an estimated 19.3 million new cancer cases worldwide in 2020 (18.1 million excluding non-melanoma skin cancer) and nearly 10 million cancer deaths (9.9 million excluding non-melanoma skin cancer). The number of new cases worldwide is expected to reach 28.4 million by 2040, a 47% increase from 2020. Malignant tumors are expected to surpass all other chronic diseases and become the "number one killer" threatening human life and health.
Radiation was already being used to treat tumor patients in the 1930s [10], and its use expanded in the 1960s with the widespread application of medical linear accelerators [11]. In this period, however, tumors were localized by X-ray simulation. The doctor determined the location of the tumor from the patient's fluoroscopic image, marked the irradiation range on the patient's body surface according to the localization image, and delivered treatment through the field projected on the body surface. Because the tumor and normal tissue could not be clearly distinguished and the radiation dose was poorly distributed, the tumor was easily missed or normal tissue was irradiated with a high dose, resulting in a lower cure rate and more complications. In 1959, Takahashi et al. [12] proposed the concept of three-dimensional conformal radiation therapy (3D-CRT). The prototype was based on the three-dimensional morphology of the tumor: lead blocks shielded part of the field in each of multiple beam directions, so that the shape of the irradiated region matched that of the tumor target while the dose received by the shielded region was reduced. In the 1970s, the widespread application of computer systems and the emergence of computed tomography (CT), magnetic resonance imaging (MRI), and other equipment extended radiotherapy into three-dimensional space and made 3D-CRT practical.
In recent years, three-dimensional digital precise radiotherapy has gradually replaced traditional two-dimensional radiotherapy and has become an important development direction of tumor radiotherapy in the 21st century. It focuses on precise positioning and precise treatment, and performs conformal or intensity-modulated radiotherapy at the three-dimensional level through dose segmentation, so that the dose inside the lesion in the target area is maximized, the dose to the surrounding normal tissue is minimized, and the dose is evenly distributed. It therefore has the advantages of high precision, high efficacy, and low damage [13]. In addition to 3D-CRT, the currently recognized precision radiotherapy techniques include stereotactic body radiotherapy (SBRT), intensity modulated radiotherapy (IMRT), and image guided radiation therapy (IGRT). The technical system of precise tumor radiotherapy is gradually being perfected, and treatment accuracy keeps improving.
At present, precise radiotherapy proceeds as follows. Anatomical images of the patient on the treatment couch are first obtained by simulated positioning; the doctor then manually delineates the target area and organs at risk; next, the radiation dose, number of fields, field angles, and other parameters are configured to generate a radiotherapy plan suited to the shape and dose of the tumor target. Finally, after the radiotherapy plan has been verified as correct, treatment is carried out. Among these steps, target delineation is the core work of radiotherapy physicians. Accurate target delineation is the prerequisite and a crucial step of precise tumor radiotherapy. The quality of delineation has a great impact on the treatment outcome and the occurrence of complications [14]. If the treatment target volume is too large, the radiation dose received by the surrounding organs increases, raising the probability of complications [15]. Conversely, if the tumor area is not completely covered, the dose will be insufficient to kill all cancer cells, greatly increasing the possibility of recurrence after treatment [16].
Currently, the therapeutic target volumes that radiologists need to delineate manually mainly include the gross tumor volume (GTV), which is visible on the image, and the clinical target volume (CTV), which is delineated based on knowledge of tumor pathology, tumor invasion range, and lymph node metastasis pathways. In addition, the organs at risk (OARs) within the irradiation range also need to be accurately delineated to avoid over-irradiating them and causing serious side effects and complications of radiotherapy [17]. The delineation quality of the therapeutic target volume and OARs depends entirely on the professional knowledge and experience of the doctor, so some errors are inevitable. Moreover, radiologists delineate these large structures manually, slice by slice, so the time cost is also very high. With the development of artificial intelligence technology, deep learning methods based on large collections of radiotherapy patient images can automatically delineate the therapeutic target area and OARs. Speed and accuracy are greatly improved, which helps to reduce the workload of doctors and the uncertainty of manual delineation, further improving the precision of radiotherapy [18,19].
As the main method in the field of artificial intelligence, machine learning can be divided into supervised learning, unsupervised learning, and semi-supervised learning, which combines the two [20][21][22]. In the field of radiotherapy, supervised learning is the approach mainly used [23]. Ensemble learning combines multiple simple machine learning models into a model with better performance, and the combination scheme can be designed for a specific machine learning problem to obtain a better solution [24]. Neural networks are a form of machine learning inspired by the way the brain works, modeled on the connection structure of neurons [25][26][27]. A neural network with many hidden layers is called a deep neural network. Deep learning methods use deep neural networks to solve various classification and prediction problems. Compared with traditional machine learning methods, deep learning methods can automatically learn features from data and avoid manual feature selection. The accumulation of large amounts of data and improvements in hardware computing power have led to increasingly wide use of deep learning methods in the medical field, where they have shown better performance than traditional machine learning methods [28][29][30][31].

MEDICAL IMAGE REGISTRATION BASED ON MACHINE LEARNING
The intensity values of CT images are approximately linearly related to the electron density of human tissue and can therefore be used directly to calculate the radiation dose, which has made CT the most commonly used imaging modality for radiotherapy positioning. CT shows bone and lung tissue well, whereas MRI shows soft tissue better, and PET images can indicate areas of high metabolic activity. Therefore, multimodal image registration is often used in the clinical assessment of disease. Medical image registration seeks the optimal spatial transformation between the source image and the target image so that all feature points, or at least all corresponding points with diagnostic significance, are matched on the two images, providing doctors with richer clinical information. Common registration methods include rigid registration and non-rigid registration.

Rigid Registration
A rigid transformation can be described by a few transformation parameters. In the field of radiotherapy, rigid registration is very common and widely accepted, and clinicians use it to fuse images of different modalities to obtain more information about areas of interest. The registration method aligns the two images by finding the rotation-translation transformation matrix between the fixed image and the moving image [32]. Because only translation and rotation are involved, the overall structure of the image and the parallelism of lines remain unchanged after the spatial transformation. The method also has the advantages of simple calculation and low time complexity, and is suitable for images with little deformation.
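As a minimal sketch of the rotation-translation transformation described above, the following Python example applies a 2D rigid transform to a few landmark points. The function names, the landmark coordinates, and the restriction to 2D are illustrative assumptions, not part of any method cited here; clinical registration works on 3D volumes and estimates the parameters by optimization.

```python
import math

def rigid_transform_2d(points, angle_deg, tx, ty):
    """Rotate points about the origin by angle_deg, then translate by (tx, ty).

    Hypothetical helper for illustration: a 2D rigid transform has only
    three parameters (one rotation angle and two translations).
    """
    theta = math.radians(angle_deg)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def distance(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Three made-up landmarks in the moving image (arbitrary units).
landmarks = [(0.0, 0.0), (10.0, 0.0), (0.0, 5.0)]
moved = rigid_transform_2d(landmarks, angle_deg=90.0, tx=2.0, ty=-1.0)

# Distances between landmarks are unchanged, which is what "rigid" means.
before = distance(landmarks[0], landmarks[1])
after = distance(moved[0], moved[1])
print(before, after)
```

Because distances between all point pairs are preserved, a rigid transform cannot model organ deformation, which is why it suits images with little deformation.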
Rigid registration not only provides a prerequisite for subsequent non-rigid registration and saves computation time in the optimization iterations, but also intuitively displays the anatomical differences between images of different modalities, assisting doctors in accurate delineation. Traditional registration methods include surface-based methods, point-based methods (usually based on anatomical markers), and voxel-based methods [33]. Among them, voxel-based methods have been widely used thanks to the rapid development of computer technology. Their goal is to obtain the geometric transformation parameters by computing the similarity between the two input images without extracting features in advance [34]. However, these traditional registration methods often require iterative calculation of similarity measures such as mean square error, mutual information, and normalized mutual information. Because these similarity measures are non-convex in parameter space, the registration process is relatively expensive and sometimes has poor robustness [35]. Other methods, such as intensity-based feature selection algorithms, perform image registration by extracting image features that correspond to intensity; however, the extracted features are difficult to match well anatomically [36].
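To make the mutual information similarity measure mentioned above concrete, the sketch below estimates it from a joint intensity histogram. The function name, the number of bins, and the toy intensity lists are assumptions made for illustration; a real voxel-based method would evaluate this measure repeatedly while searching over transformation parameters.

```python
import math
from collections import Counter

def mutual_information(fixed, moving, bins=8, max_value=256):
    """Mutual information (in nats) between two equally sized intensity lists.

    Intensities are assumed to lie in [0, max_value) and are grouped into
    `bins` equal-width bins before the joint histogram is built.
    """
    if len(fixed) != len(moving) or not fixed:
        raise ValueError("inputs must be non-empty and of equal length")
    width = max_value / bins
    f_bins = [min(int(v / width), bins - 1) for v in fixed]
    m_bins = [min(int(v / width), bins - 1) for v in moving]
    n = len(fixed)
    joint = Counter(zip(f_bins, m_bins))   # joint histogram
    p_f = Counter(f_bins)                  # marginal histogram, fixed image
    p_m = Counter(m_bins)                  # marginal histogram, moving image
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((p_f[a] / n) * (p_m[b] / n)))
    return mi

fixed = [0, 0, 255, 255]
inverted = [255, 255, 0, 0]    # same structure, inverted contrast
unrelated = [0, 255, 0, 255]   # statistically independent of `fixed`

print(mutual_information(fixed, inverted))   # equals ln 2, about 0.693
print(mutual_information(fixed, unrelated))  # 0.0
```

The inverted-contrast case scores as high as a perfect copy would, which illustrates why mutual information is often preferred over mean square error when the two images come from different modalities.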

Non-rigid Registration
Since medical images are affected by factors such as imaging time, imaging equipment, and patient posture, it is difficult to spatially register multimodal images. In addition, the internal tissue structure of the human body is complicated and changes over time. For example, the tissues and organs in lung scan images move with the patient's breathing. When the deformation between images differs greatly from one direction to another, rigid registration cannot meet the requirements. In this case, non-rigid registration is needed, in which a deformation field maps the same anatomical parts of different images onto each other. The registration process also introduces registration errors of varying degree, depending on the chosen optimization method.
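The role of the deformation field can be illustrated with a deliberately simplified one-dimensional sketch: each output position samples the moving signal at its own position plus a local displacement. The function name, the edge clamping, and the sample values are assumptions for illustration only; practical methods use dense 3D displacement fields and estimate them by optimization or learning.

```python
def warp_1d(moving, displacement):
    """Resample `moving` so that warped[i] = moving(i + displacement[i]).

    Uses linear interpolation between neighbouring samples and clamps
    sampling positions to the valid index range (an assumed boundary rule).
    """
    if len(moving) != len(displacement) or not moving:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(moving)
    warped = []
    for i, d in enumerate(displacement):
        x = min(max(i + d, 0.0), n - 1)   # clamp to the signal extent
        lo = int(x)
        hi = min(lo + 1, n - 1)
        w = x - lo                        # interpolation weight
        warped.append((1 - w) * moving[lo] + w * moving[hi])
    return warped

moving = [0, 10, 20, 30, 40]
# Unlike a rigid shift, the displacement varies from sample to sample.
displacement = [0, 0.5, 1, 0, -1]
print(warp_1d(moving, displacement))  # [0.0, 15.0, 30.0, 30.0, 30.0]
```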
Non-rigid transformations include translation, rotation, scaling, affine transformations based on an affine matrix, and other linear and nonlinear forms. Compared with rigid transformation, non-rigid transformation models deformation more accurately, but it is slower to compute. Gu et al. [37] proposed a B-spline affine transformation registration method, which uses an affine transformation in place of the traditional displacement of each B-spline control point and a two-way distance cost function in place of the traditional one-way distance cost function to achieve bidirectional registration of two images. Pradhan et al. [38] used a P-spline function, which adds a penalty to the B-spline, for brain image registration. Methods based on physical models regard the deformation of the moving (floating) image as a physical change caused by an external force: they take the original image as input and use the physical model to calculate the image that results from the external force under physical laws. The physical models used are mainly viscous fluid models and optical flow field models. Wodzinski et al. [39] applied an optical flow field algorithm to breast cancer tumor localization, compared it with the B-spline method, and obtained a better registration result.
With the development of deep learning technology, significant progress has been made in the field of image processing, mainly through the use of unsupervised or self-supervised deep learning to calculate deformation parameters and similarity measures. For example, Hessam et al. [40] trained a network on a large number of artificially generated displacement vector fields to integrate image content at multiple scales, thereby estimating the displacement vector field directly from the input images. Hongming et al. [41] proposed a non-rigid image registration algorithm based on a fully convolutional network, which learns and optimizes the spatial transformation between images through a self-supervised learning framework. However, non-rigid registration algorithms are still less mature than rigid registration algorithms and are not yet as widely accepted [42].

Atlas Based Automatic Contouring
After multimodal image registration, clinicians delineate contours on the planning CT. The delineated structures mainly include therapeutic targets and OARs. The shape of OARs is relatively well defined, and their location generally does not change much. For automatically delineating OARs, the most widely used clinical technique is automatic segmentation based on an atlas library [43]. An atlas consists of medical images and their corresponding binary delineation results. The approach works because, even among different groups of people, the relative spatial positions and shapes of normal organs in the body are similar and the image textures share the same characteristics. The delineation principle is to establish one or several sets of OAR templates in advance and let machine learning methods automatically match the appropriate template [44].
Atlas-based delineation methods fall into two categories: methods based on a single atlas and methods based on multiple atlases [45]. Single-atlas delineation can be regarded as a deformable registration problem. First, the atlas is registered to the image to be delineated, yielding the transformation matrix and deformation field. All the delineated organs in the atlas are then deformed and mapped with the same transformation parameters, and the result of the mapping is the delineation result. However, with a single atlas the input patient image may differ greatly from the average atlas, leading to unsatisfactory delineation results.
The accuracy of single-atlas methods depends heavily on the accuracy of image registration. When the atlas is very different from the image to be delineated, it is difficult for the registration algorithm to achieve good results, and delineation accuracy drops significantly. To address this problem, Aljabar et al. [46] proposed a multi-atlas method, which registers multiple reference atlases to the image to be delineated, obtains multiple candidate delineations, and uses a fusion algorithm to combine the candidates into the final delineation. Multi-atlas methods are often more stable than single-atlas methods, because poor mapping results from some atlases are corrected by other, better-performing atlases, so that each part of the result is relatively reasonable. While multi-atlas methods improve the robustness of delineation compared with single-atlas methods, they are prone to topological errors because voxel voting does not necessarily produce closed surfaces. Such topological errors have a great impact on radiation therapy planning and are also difficult to detect, requiring time-consuming review and manual editing by clinicians [47].
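The voxel voting step mentioned above can be sketched as a simple majority vote over the labels propagated from each atlas. This is one common fusion rule chosen here for illustration, not necessarily the rule used in [46]; the function name and the flattened toy label maps are assumptions.

```python
from collections import Counter

def majority_vote(label_maps):
    """Fuse propagated atlas labels voxel by voxel using a majority vote.

    `label_maps` is a list of equally long label lists, one per registered
    atlas, each flattened to 1D. Ties go to the label that appears first
    among the atlases, so an odd number of atlases is advisable for
    binary labels.
    """
    if not label_maps:
        raise ValueError("at least one atlas label map is required")
    n = len(label_maps[0])
    if any(len(m) != n for m in label_maps):
        raise ValueError("all label maps must have the same length")
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*label_maps)]

# Labels from three atlases after registration (1 = organ, 0 = background).
atlas_a = [0, 1, 1, 0, 1]
atlas_b = [0, 1, 0, 0, 1]
atlas_c = [1, 1, 1, 0, 0]

print(majority_vote([atlas_a, atlas_b, atlas_c]))  # [0, 1, 1, 0, 1]
```

Each voxel is decided independently of its neighbours, which is why a vote of this kind can correct isolated errors of individual atlases yet still yield surfaces that are not closed.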

Deep Learning Based Automatic Contouring
Atlas-based delineation essentially registers the target image to the template image through morphological features, that is, it searches the atlas library for the most similar shape. However, if the OARs in the template image differ too much in shape or are too small, or if an inappropriate deformation algorithm is chosen for automatic delineation, the registration accuracy suffers [48]. A multi-atlas library can improve delineation accuracy, but the amount of computation and the time required both increase, so accuracy and speed must be balanced.
Automatic delineation based on deep learning does not require the above trade-offs. A key advantage of deep learning is that it automatically learns generalizable features from labelled training samples and uses them to recognize new cases, so the more training samples are provided, the more accurate the learned features become [49]. Dolz et al. [50] used the support vector machine (SVM) algorithm to automatically segment the brainstem on MRI images of brain tumor patients, and then used a deep learning algorithm to automatically segment small organs such as the optic nerve, optic chiasm, pituitary, and pituitary stalk, with the similarity coefficient reaching 76-83% [51]. They also combined hand-extracted features with unsupervised stacked denoising autoencoders for brainstem segmentation; classification was about 70 times faster than with SVM-based methods, reducing segmentation time [52]. Liang et al. [53] performed automatic segmentation on CT images based on deep learning, with a sensitivity of 0.997 to 1 for most organs, which can effectively improve nasopharyngeal cancer radiotherapy planning.
Currently, deep learning networks, especially convolutional neural networks (CNN), have become a common method for medical image analysis [54]. A CNN can process multi-dimensional and multi-channel data and capture complex nonlinear mappings between input and output, which gives it advantages for image processing and classification. A Stanford University study used a CNN model to automatically segment head and neck OARs for the first time. For organs such as bone, pharynx, larynx, eyeball, and optic nerve, its automatic segmentation was better than or equivalent to the best existing techniques, but for organs such as the parotid gland, submandibular gland, and optic chiasm, whose boundaries are not easy to identify on CT images, the delineation results were not satisfactory [55]. Lu et al. [56] used a 3D CNN to automatically segment the liver and combined it with a graph cut algorithm to refine the segmentation. The advantage is that no manual initialization is required, so the segmentation can be performed by non-professionals. Also using a 3D CNN for liver segmentation, Hu et al. [57] combined deep learning with global and local shape prior information; evaluated on the same dataset, all error indicators were significantly reduced. In a follow-up study, the target was extended to abdominal multi-organ segmentation, using a 3D CNN to perform pixel-to-pixel dense prediction with higher accuracy and shorter segmentation time [58]. Delineating OARs is therefore a complex task, and it is often difficult for a single set of models to achieve the expected accuracy for different parts of the body or different modalities. In practice, deep neural networks need to be adapted to the specific factors of each situation.

GTV Automatic Delineation
As with normal tissue delineation, deep learning-assisted tumor target delineation helps improve efficiency. However, since it is often difficult to distinguish the boundary between the tumor and the surrounding tissue, the patient's clinical information, pathological sections, and images all serve as reference data for GTV delineation, and various techniques are used to aid identification. In the Multimodal Brain Tumor Image Segmentation Challenge (BraTS) in 2013, Pereira et al. [59] used a CNN to automatically segment brain tumor MRI images, which improved the network accuracy and ranked first. Since then, Kamnitsas et al. [60] proposed a dual-channel 3D CNN for brain lesion segmentation (including traumatic brain injury, brain tumor, and ischemic stroke) and were the first to use a fully connected conditional random field on medical data. Both of the above studies used neural networks with small convolution kernels to make the network structure deeper without increasing the computational cost. Men et al. [61] used big data to train a deep dilated residual network (DD-ResNet) for breast tumor segmentation, and the results were better than those of deep dilated convolutional neural networks (DDCNN) and distributed deep neural networks (DDNN). The Dice similarity coefficient (DSC) was 91%, which was higher than the result hand-drawn by experts [62].
In addition, for the above-mentioned basic network types, studies have shown that the improved network in [63] can raise segmentation accuracy and is more robust. Lin et al. [64] trained a 3D CNN to delineate the GTV of nasopharyngeal carcinoma on MRI images; the similarity with the GTV delineated by experts was high, with the DSC reaching 79%. With the help of machine learning, doctors reduced their delineation time by 39.4% and improved their accuracy. A 3D CNN uses not only the information within each CT slice, as a traditional CNN does, but also the information between slices, so its information utilization is high and accuracy improves to a certain extent. Qi et al. [65] used convolutional neural networks to delineate the target volume of nasopharyngeal carcinoma based on multimodal imaging (CT and MRI), and the results showed that the target area was delineated with high precision. Li et al. [66] used the U-Net to automatically delineate the target volume of nasopharyngeal carcinoma on CT images, and the results showed high segmentation accuracy for the automatically delineated target volume. Li et al. [67] used transfer learning on four-dimensional computed tomography data of patients with non-small cell lung cancer to automatically delineate the tumor area, which improved accuracy and shortened the retraining time of the network. When the breathing range was 5-10 mm, the matching index improved by 36.1% on average compared with the comprehensive elastic deformation registration technique. In a recent study [68], the authors used fuzzy c-means clustering (FCM), artificial neural network (ANN), and SVM algorithms to automatically segment the GTV of solid, ground-glass, and mixed lung cancer lesions, respectively. They concluded that the results of the FCM model are more accurate and efficient and can be reliably applied to SBRT.
Delineating GTV based on deep learning can improve the work efficiency of clinicians, but this method cannot completely replace manual delineation. On the basis of automatic delineation, manual correction is still required to achieve accurate delineation effects [69].

CTV Automatic Delineation
The CTV covers the subclinical foci formed by infiltration around the primary tumor and the pathways of regional lymph node metastasis, which should receive a certain radiation dose according to the requirements of radiobiology and the factors governing tumor occurrence and metastasis. It is the basis on which regional radiotherapy controls recurrence and metastasis. Delineation must take into account the specific pathological conditions and the possible invasion or metastasis range of the diseased tissue, and the delineation results differ completely between tumor types and stages.
Specifically, Men et al. [70] used a DDCNN model to attempt automatic segmentation of the CTV and OARs in 218 rectal cancer patients, and the results were accurate and efficient. The DSC of the CTV reached 87.7% and the DSC of the bladder and bilateral femoral heads exceeded 90%, whereas the delineation of the small intestine and colon was not accurate enough, with DSCs of 65.3% and 61.8%, respectively. This is possibly because both are air-containing hollow organs. Shi et al. [71] segmented the CTV from cervical cancer CT images with RA-CTVNet, a deep learning method that uses an area-aware reweight strategy and a recursive refinement strategy. Their experimental results show that RA-CTVNet improves the DSC compared with different network architectures. Compared with three clinical experts, RA-CTVNet performed better than two of the experts and comparably to the third. Shen et al. [72] modified the U-net model by incorporating the contours of the gross tumor volume of lymph nodes (GTVnd) and designed the DiUnet model for the automatic delineation of lung cancer CTV. The results showed that the DSC of most lymph node regions was up to 70%, which was not significantly different from manual delineation.
In addition, our team [73] collected CT images of 53 cervical cancer patients. By modifying the U-net model and the training process according to the task, automatic segmentation of the cervical cancer CTV region and normal tissue was realized. By testing the prediction accuracy of the model and the number of required dialogue rounds, the recall rate, accuracy rate, DSC, Intersection over Union (IoU), and other measures of the results were evaluated. The results show that the proposed model performs well on all indicators for target area delineation. Compared with commonly used deep learning models such as the mask region-based convolutional neural network (Mask R-CNN), the segmentation adversarial network (SegAN), and U-net, the segmentation boundary of the proposed model is clearer and smoother, and its recall rate is clearly better than that of the other models. Moreover, because the model is very lightweight, it can be applied when the dataset size is limited.
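For readers unfamiliar with the overlap measures used throughout this survey, the following sketch computes DSC, IoU, recall, and precision for two binary masks. It is a generic illustration, not the evaluation code of any cited study; the function name, the flattened toy masks, and the convention of returning 0.0 when a denominator is zero are assumptions. Precision is included on the assumption that it may be what some studies report as an accuracy rate.

```python
def overlap_metrics(pred, truth):
    """Compute DSC, IoU, recall, and precision for two flattened binary masks.

    `pred` is the automatic delineation and `truth` the manual reference.
    A metric whose denominator is zero is reported as 0.0 (assumed convention).
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    tp = sum(1 for p, t in zip(pred, truth) if p and t)        # overlap
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)    # over-contoured
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)    # missed

    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "dsc": ratio(2 * tp, 2 * tp + fp + fn),
        "iou": ratio(tp, tp + fp + fn),
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
    }

auto_mask = [1, 1, 1, 0, 0, 0]
manual_mask = [0, 1, 1, 1, 0, 0]
metrics = overlap_metrics(auto_mask, manual_mask)
print(metrics)  # dsc = 2/3, iou = 0.5, recall = 2/3, precision = 2/3
```

The DSC and IoU are related by DSC = 2 IoU / (1 + IoU), so the DSC is never smaller than the IoU for the same pair of masks.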
Because subclinical lesions and lymph node drainage areas are involved, automatic CTV delineation is more difficult, and the performance of deep learning delineation is still far from that of experts [74][75][76]. In the future, disease-specific big data platforms that integrate multimodal radiotherapy data, imaging, genetic and other multi-omics data, and the experience of senior radiotherapy physicians, physicists, and technicians are expected to be useful in predicting efficacy and complication risk. Guided by these predictions, individualized decisions on the CTV range could then be provided.

CONCLUSION
Research on machine learning methods in the field of radiotherapy is now well under way and has achieved initial results, and the automatic delineation of normal tissues and tumor target areas has always been a research hotspot [77][78][79]. Most existing deep learning models were developed for natural images, and models dedicated to medical images, especially those related to radiation oncology, are lacking. Medical images differ from natural images in that they are grayscale images and generally have continuity [80,81]. In image segmentation, not only the regional structure of an image but also the spatial structure of 3D data must be considered [82]. In addition, local and global prior information needs to be taken into account before it can further contribute to the segmentation of OARs and therapeutic target volumes [83]. Moreover, multimodal image registration is often required to further identify the extent of tumor invasion [84,85].
Besides, radiotherapy is only one link in tumor treatment. Determining the appropriate radiotherapy target range and irradiation dose is a complex issue that requires an integrated view: disease characteristics, the overall treatment mode, cross-scale issues from molecules and cells to tissues and organs, the spatio-temporal relationships of biomolecules, and other factors all need to be analyzed comprehensively, so that the resulting radiotherapy plan is more in line with the principle of precise, individualized treatment. The integration of automatic radiotherapy target delineation with artificial intelligence knowledge maps and causal analysis may play an important role in the formulation of clinical radiotherapy targets [86].
At present, most applications are still in the preclinical research stage, and some problems remain for clinical application. First, high-quality clinical data is the basis on which artificial intelligence learns and makes judgments, but the medical data relevant to automatic target delineation is currently poorly standardized. The quality of labeling is uneven, and major medical centers lack a mechanism for jointly building and sharing data. These data barriers seriously hinder the effective use of data and product development. Second, it is still difficult to define the treatment target area accurately. With current CT, MRI, PET-CT, and other modalities, the GTV is generally not difficult to determine, but some findings remain hard to identify, such as soft tissue invasion and the degree and extent of bone destruction. The dose to the CTV differs according to the risk of recurrence and metastasis, yet there is no relevant research on how to determine high-, medium-, and low-risk CTVs. In addition, the clinical application of artificial intelligence is directly related to life and health and faces many ethical and legal challenges. Nevertheless, automatic delineation of the radiotherapy target volume based on machine learning will be an important development direction of artificial intelligence in the medical field.