Learning-based local-to-global landmark annotation for automatic 3D cephalometry

The annotation of three-dimensional (3D) cephalometric landmarks in 3D computerized tomography (CT) has become an essential part of cephalometric analysis, which is used for diagnosis, surgical planning, and treatment evaluation. The automation of 3D landmarking with high precision remains challenging due to the limited availability of training data and the high computational burden. This paper addresses these challenges by proposing a hierarchical deep-learning method consisting of four stages: 1) a basic landmark annotator for 3D skull pose normalization, 2) a deep-learning-based coarse-to-fine landmark annotator on the midsagittal plane, 3) a low-dimensional representation of the total set of landmarks using a variational autoencoder (VAE), and 4) a local-to-global landmark annotator. The implementation of the VAE allows two-dimensional-image-based 3D morphological feature learning and similarity/dissimilarity representation learning of the concatenated vectors of cephalometric landmarks. The proposed method achieves an average 3D point-to-point error of 3.63 mm for 93 cephalometric landmarks using a small number of training CT datasets. Notably, the VAE captures variations of craniofacial structural characteristics.


Introduction
Cephalometric analysis facilitates the development of morphometrical guidelines for the diagnosis, planning, and treatment of craniofacial disease, as well as for evaluations in anthropological research. Recent advances in image processing techniques have allowed the annotation of three-dimensional (3D) cephalometric landmarks in 3D computerized tomography to become an essential clinical task. Cephalometric landmarking is usually performed via manual identification of points that represent craniofacial morphological characteristics. This requires a high level of expertise, time, and labor, even for experts. Therefore, there is an increasing demand for a computer-aided automatic landmarking system that can reduce the labor-intensiveness of this task and improve workflow.
Over the past 40 years, several approaches have been proposed for automatic landmark identification based on image processing and pattern recognition (Levy-Mandel et al 1986, Cardillo and Sid-Ahmed 1994, Chakrabartty et al 2003, Giordano et al 2005, Hutton et al 2000, Innes et al 2002, Parthasarathy et al 1989, Rudolph et al 1998, Vučinić et al 2010, Codari et al 2017, Gupta et al 2015, Makram and Kamel 2014, Montufar et al 2018). These conventional image processing approaches generally encounter difficulties in achieving robust and accurate landmarking owing to limitations in the simultaneous viewing of local and global geometric cues. Most prior works involved two-dimensional (2D) cephalometry using plain radiographs or CT-derived scan images. However, recent advances in imaging technologies and computer-assisted surgery have facilitated a shift from 2D to 3D cephalometry, which has several advantages over 2D techniques, including accurate identification of anatomical structures, avoidance of geometric distortion in images, and the ability to evaluate complex cranial structures (Lee et al 2014). Most previous 3D approaches were based on reference models (Codari et al 2017, Makram and Kamel 2014, Montufar et al 2018), and their performance was limited by the unique structural variations of different individuals. This indicates that there are still limitations in dealing with complex 3D craniofacial models and formulating them into well-defined mathematical formulae. Recent developments in deep learning for medical imaging applications have led to several attempts to build automatic cephalometric landmarking systems. The main advantage of these deep learning approaches over conventional image processing is that the experience of experts can be reflected in the algorithm through learning from training datasets.
The architecture for learning datasets labeled by experts has allowed deep-learning-based methods to locate landmarks in 2D cephalograms with impressive performance (Arik et al 2017). However, there continue to be difficulties in applying these methods to 3D cephalometry because of the required number of datasets. According to Barron's observation (Barron 1994) regarding approximations with shallow neural networks, the number of training datasets needed for learning grows tremendously as the input dimension increases. Due to the high input dimensionality of cranial CT (typically 512 × 512 × 400 matrix size), many more training datasets are required than are currently available, especially given the legal and ethical restrictions associated with medical data. Even if a sufficient amount of data is collected, the learning process can be difficult due to the curse of dimensionality in processing high-dimensional images. These issues represent a significant hindrance to the development of high-precision, automatic, 3D landmarking systems.
This paper reports the development and evaluation of an automatic annotation system for 3D cephalometry using limited training datasets. To address the dimensionality issue and limited data availability, a hierarchical deep learning method is developed which consists of four stages: 1) a basic landmark annotator for 3D skull pose normalization, 2) a deep-learning-based coarse-to-fine landmark annotator on the midsagittal plane, 3) a low-dimensional representation of the total set of landmarks using a variational autoencoder (VAE), and 4) a local-to-global landmark annotator. The first stage employs a shadowed 2D-image-based annotation method (Lee et al 2019) to detect seven basic landmarks, which are then used for 3D skull pose normalization via rigid transformation and scaling. In the second stage, we generate a partially integrated 2D image on the midsagittal plane. By applying a convolutional neural network-based landmark annotator, the system then roughly estimates the landmarks on the 2D image. A patch-based landmark annotator then provides more accurate detection of the landmarks. Stage 3 applies the VAE to the affine-normalized training datasets (obtained in stage 1) to extract a disentangled low-dimensional representation. This low-dimensional representation is then used for mapping from fractional information about the landmarks (obtained in stages 1 and 2) to the total information of the landmarks. An additional benefit of using the VAE is that it can learn variations of craniofacial structural characteristics.
In this paper, the proposed method was evaluated by comparing the positional discrepancy between the results obtained and those of the experts. Using a small number of training CT datasets, the proposed method achieved an average 3D point-to-point error of 3.63 mm for 93 cephalometric landmarks. Therefore, the proposed method has an acceptable point-to-point error for assisting medical practice.

Method
Let x represent a three-dimensional CT image defined on the voxel grid Ω = {v = (v_1, v_2, v_3) : 1 ≤ v_j ≤ d_j, j = 1, 2, 3}, with d_j being the number of voxels in direction v_j. In our case, the CT image size is approximately 512 × 512 × 400. The value x(v) at the voxel position v can be viewed as the attenuation coefficient. The goal is to find a map f from the 3D CT image x to a group of 93 landmarks P (see table A1), where P = (p_1, · · · , p_93) and each p_j = (p_j,1, p_j,2, p_j,3) denotes the 3D position of the jth landmark. P can be expressed as follows: P = ((p_1,1, p_1,2, p_1,3), (p_2,1, p_2,2, p_2,3), · · · , (p_93,1, p_93,2, p_93,3)). (1) The group of landmarks P can be viewed as a geometric feature vector that describes craniofacial skeletal morphology. Since direct automatic detection of P is quite challenging, it is desirable to infer P from minimal information that is more convenient for automatic detection. By exploiting the similarity of facial morphology, it is reasonable to seek a low-dimensional latent representation of the feature vectors. This is achievable provided that the geometric feature vector can be expressed in a low-dimensional representation that retains the crucial morphological features describing dissimilarities between the data. Each step of the proposed method is illustrated in the following subsections (see also figure 1). Figure 1. Schematic diagram of the proposed method for the 3D landmark annotation system. The input 3D image x is standardized/normalized using five reference landmarks such that the reference landmarks of the corresponding normalized image x♮ are fixed (to some extent) regardless of the input image. The total group of landmarks P is then detected by combining the two deep learning methods described in sections 2.2 and 2.3.
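As a concrete illustration of equation (1), the 93 landmarks can be concatenated into a single feature vector in R^279 and recovered again. This is a minimal sketch (not the authors' code), with function names chosen for illustration:

```python
import numpy as np

def landmarks_to_vector(landmarks):
    """Flatten a (93, 3) array of 3D landmark positions into R^279,
    in the order (p_1,1, p_1,2, p_1,3, p_2,1, ...) as in equation (1)."""
    landmarks = np.asarray(landmarks, dtype=float)
    assert landmarks.shape == (93, 3)
    return landmarks.reshape(-1)

def vector_to_landmarks(vec):
    """Inverse map: recover the (93, 3) landmark array from R^279."""
    vec = np.asarray(vec, dtype=float)
    assert vec.shape == (279,)
    return vec.reshape(93, 3)
```

This concatenated vector is the quantity that the VAE in stage 3 compresses into 25 latent variables.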

Stage 1: Choice of a reference coordinate frame and anisotropic scaling for skull normalization
The first step determines a reference coordinate frame and normalizes the data for effective feature learning. As shown in figure 2, the hexahedron made by the five landmarks (bregma, center of foramen magnum (CFM), nasion, and left/right porion (L/R Po)) is normalized using a suitable anisotropic scaling. The normalization is based on facial width (the distance between the x-coordinates of L Po and R Po), facial depth (the distance between the y-coordinates of L Po and nasion), and facial height (the distance between the z-coordinates of CFM and bregma). We normalize the data by setting the width, depth, and height to fixed values so that each reference hexahedron has a (to some extent) fixed shape and size. The hexahedron determined by this geometrical transformation is not exactly the same size across subjects, since the landmark positions vary between individuals. These reference landmarks can be obtained automatically using the existing approach of multiple shadowed 2D-image-based landmarking (Lee et al 2019), which utilizes multiple shadowed 2D images with various lighting and view directions to capture 3D geometric cues. The five reference landmarks are important components for the normalization of the skull; they are chosen because their apparent positional features enable easy and robust detection even with a small number of data. The reference coordinate frame is selected so that the CFM is positioned at the origin and the bregma lies on the z-axis. The midsagittal plane is the yz-plane (x = 0) and is determined by three landmarks (CFM, bregma, and nasion). We denote by x♮ the CT data in the new reference Cartesian coordinates. This normalization focuses on facial deformities (e.g. prognathic/retrognathic jaw deformities) by minimizing scale and pose dependencies, and it enables efficient feature learning of similarity/dissimilarity in the third stage when applying the VAE.

Stage 2: Detecting landmarks near the midsagittal plane
This step detects 8 landmarks near the midsagittal plane (see figure 3) that are used to estimate the total set of landmarks of the skull. Given that the method only considers landmarks that are on or near the midsagittal plane, not the entire skull, this stage uses a digitally reconstructed 2D image obtained by incorporating cross-sectional CT images taken near the midsagittal plane (i.e. via integration of truncated binary 3D skull data). The resulting 2D midsagittal image has less blurring and less irrelevant information (caused by overlapping contralateral structures) than a 2D cephalogram generated from whole-volume data. Given that landmarks are determined from skeletal morphology, image enhancement is used to emphasize bone, as shown in figure 4. This emphasis on relevant locations helps the machine learning focus on key information more efficiently and should improve feature learning despite the limited training datasets. Enhancement is performed via binarization of the brightness and contrast (setting bone as 1 and other areas as 0), which allows the machine to discriminate between necessary and unnecessary image features.
The 3D CT data are first binarized by thresholding. The truncated volume is then integrated along the normal direction of the midsagittal plane. The image generated by this method, although 2D, contains 3D features near the midsagittal plane. Let x_s be the 3D binary data with the v_1-direction as the normal direction of the midsagittal plane. The v_1-directional interval [a, b] determines the truncated volume, as shown in figure 4.
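A minimal sketch of this truncated-projection step, assuming a simple bone threshold (the actual threshold used is not specified in the text):

```python
import numpy as np

def midsagittal_projection(volume, a, b, bone_threshold=300.0):
    """Generate the partially integrated 2D midsagittal image.
    volume: (d1, d2, d3) CT array with v1 as the midsagittal normal direction.
    [a, b): v1-index interval defining the truncated slab around the plane.
    bone_threshold: illustrative Hounsfield-like cutoff for binarization."""
    binary = (volume >= bone_threshold).astype(float)  # bone -> 1, other -> 0
    slab = binary[a:b, :, :]                           # truncate to [a, b)
    return slab.sum(axis=0)                            # integrate along v1
```

The output is 2D but encodes how much bone lies near the midsagittal plane at each (v_2, v_3) position, which is the "2D image with 3D features" used by the CNN annotator.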
Using the training datasets {(x_s^(i), P_loc^(i)) : i = 1, ..., N}, we train a network that detects the 8 landmarks on a 2D image using a convolutional neural network (CNN). However, accurate detection of the landmarks directly from the image is limited by the small number of data available. To address this problem, we use a coarse-to-fine detection approach with an entire-image-based CNN and a patch-based CNN; the architectures of these CNNs are explained in the results section. The entire-image-based CNN roughly detects P_loc by capturing global information. This coarse detection output is used to generate local patches, which are the input of the patch-based CNN. The patch-based CNN then provides P_loc with improved accuracy (see figure 5).
Figure 6. Architecture of the VAE-based low-dimensional representation. The VAE aims to generate a highly reduced encoded vector h ∈ R^25, which is decoded to a vector as close as possible to the input vector P ∈ R^279. Normalized landmarks are used to generate a semantically useful latent space.
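The patch generation that links the coarse and fine detectors can be sketched as below; the patch size and the clamping behavior at the image border are illustrative assumptions, not the authors' exact choices:

```python
import numpy as np

def extract_patch(image, center, patch_size=64):
    """Crop a (patch_size, patch_size) patch centered on the coarse CNN
    estimate, clamped so the patch stays fully inside the image.
    Returns the patch and its top-left corner (row, col) so refined
    coordinates can be mapped back to the full image."""
    h, w = image.shape
    half = patch_size // 2
    cy = int(np.clip(round(center[0]), half, h - half))
    cx = int(np.clip(round(center[1]), half, w - half))
    patch = image[cy - half:cy + half, cx - half:cx + half]
    return patch, (cy - half, cx - half)
```

The patch-based CNN then predicts the landmark position within the patch, and adding the returned corner offset recovers the refined position in the original 2D image.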

Stage 3: Learning a low-dimensional latent representation of landmarks
In this stage, a low-dimensional latent representation of the total landmark vector P is obtained by applying the VAE (Kingma et al 2013) to the normalized data from stage 1. The change of coordinates and data normalization in stage 1 are expected to minimize the scale and pose dependency, allowing more efficient identification of factors on the midsagittal plane related to facial deformity. This facilitates the extraction of exploitable morphological factors. In mathematical terms, the VAE is a deep learning technique that finds a non-linear expression of the concatenated landmark vector P ∈ R^k in terms of variables h ∈ R^d (d ≪ k) in a low-dimensional latent space. In our experiments, we use k = 279 and d = 25. The VAE uses the training datasets {P^(i) : i = 1, · · · , N} to learn two functions, the encoder Φ : P → h and the decoder Ψ : h → P, via the following loss minimization over the training data: (Φ, Ψ) = argmin (1/N) Σ_{i=1}^{N} [ ∥P^(i) − Ψ(Φ(P^(i)))∥² + D_KL(N(μ^(i), Σ^(i)) ∥ N(0, I)) ], where the minimization is over the class of functions expressible by the deep learning network described in figure 6. To be precise, the encoder Φ is of the following nondeterministic form: Φ(P) = μ + σ ⊙ h_noise, where μ = (μ(1), · · · , μ(d)) ∈ R^d represents a mean vector; σ = (σ(1), · · · , σ(d)) ∈ R^d is a standard deviation vector; h_noise is an auxiliary noise variable sampled from the standard normal distribution N(0, I); and ⊙ is the element-wise (Hadamard) product. Hence, the covariance is Σ = diag(σ(1)², · · · , σ(d)²). Note that the covariance Σ and D_KL(N(μ, Σ) ∥ N(0, I)) are used for smooth interpolation and compact encoding. The decoder Ψ : h → P provides a low-dimensional disentangled representation, so that each latent variable is sensitive to changes in an individual morphological factor while being relatively insensitive to other changes. These changes are visualized in the discussion section.
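The two VAE ingredients above, the reparameterized encoder output h = μ + σ ⊙ h_noise and the KL penalty, can be written compactly. The following is a minimal numpy sketch of these quantities (not the trained network):

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Sample h = mu + sigma ⊙ h_noise with h_noise ~ N(0, I)
    (the reparameterization trick, so the sampling is differentiable)."""
    h_noise = rng.standard_normal(mu.shape)
    return mu + sigma * h_noise  # element-wise (Hadamard) product

def kl_to_standard_normal(mu, sigma):
    """Closed-form D_KL(N(mu, diag(sigma^2)) || N(0, I)) for a
    diagonal Gaussian: 0.5 * sum(sigma^2 + mu^2 - 1 - 2 log sigma)."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))
```

The KL term vanishes exactly when μ = 0 and σ = 1, i.e. when the encoder's distribution matches the standard normal prior, which is what keeps the latent space compact and smoothly interpolable.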

Stage 4: Local-to-global landmark annotation for automatic 3D cephalometry
In this final step, we detect the total landmark vector P from the fractional information P♯, where P♯ is the vector composed of P_loc obtained in stage 2 and the reference landmarks from stage 1. In stage 3, the VAE finds a low-dimensional latent representation of the total landmarks, i.e. Ψ(h) = P. Stage 2 detects a portion of the landmarks, P_loc, near the midsagittal plane. Using the encoder map h^(i) = Φ(P^(i)) from the result of stage 3, the training data {(h^(i), P♯^(i)) : i = 1, 2, ..., N} can be generated. These training data are then used to learn a non-linear map Γ : P♯ → h that connects the latent variables h and the fractional data P♯. The non-linear regression map Γ is obtained by minimizing the loss (1/N) Σ_{i=1}^{N} ∥Γ(P♯^(i)) − h^(i)∥². The local-to-global landmark annotation is then obtained from P = Ψ(Γ(P♯)). This is represented in figure 7.
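The stage-4 data flow P = Ψ(Γ(P♯)) can be sketched as follows; the random linear maps stand in for the trained regressor and decoder purely to show the dimensions (R^45 → R^25 → R^279, i.e. 15 local landmarks to 25 latent variables to 93 landmarks):

```python
import numpy as np

rng = np.random.default_rng(0)
W_gamma = rng.standard_normal((25, 45)) * 0.1   # stand-in for trained Gamma
W_psi = rng.standard_normal((279, 25)) * 0.1    # stand-in for trained Psi (decoder)

def local_to_global(p_sharp):
    """Map the fractional landmark vector P# (15 landmarks x 3 = R^45)
    to the full landmark vector P in R^279 via the latent space."""
    h = W_gamma @ p_sharp   # Gamma: fractional landmarks -> latent code h
    return W_psi @ h        # Psi: latent code h -> all 93 landmarks
```

In the actual system both maps are trained networks (the MLP regressor and the VAE decoder); the composition is what lets 15 detected landmarks determine all 93.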

Dataset and experimental setting
We used two datasets, provided by one of the authors. The first dataset contains 26 anonymized CT datasets with cephalometric landmarks that were produced for a previous study of 3D cephalometric landmarking (Lee et al 2014). The second dataset, containing the 3D positions of landmarks from 229 anonymized subjects with dentofacial deformity and malocclusion, was also used; these were acquired in Excel format as the 3D coordinates of landmarks from the original data source. The labeling of the landmarks for both datasets was performed by one author (LSH) with more than 20 years of experience in 3D cephalometry. When training the CNNs, we used 22 subjects from the first dataset as training data and four as test data. Translation was applied for data augmentation. For the VAE and the non-linear regression, we used both the first dataset (26 subjects) and the second dataset (229 subjects), for a total of 255 subjects, of which 230 were used as training data and 25 as test data. When training the CNNs, we set the learning rate to 0.0001 and the batch size to 8, and trained for 3000 iterations. For the VAE and the non-linear regression, we set the learning rate to 0.0001 and the batch size to 50, and trained for 30 000 iterations. Adam (Kingma et al 2014), an adaptive gradient method, was chosen as the optimization algorithm.
Figure 8. Architectures of (a) the entire-image-based CNN and (b) the patch-based CNN. The entire-image-based CNN roughly detects P_loc; the patch-based CNN then provides P_loc with improved accuracy using local patches generated from the entire-image-based CNN output.

Experimental results
In our proposed method, we aimed to locate the 3D coordinates of 93 landmarks from fractional information consisting of P_loc obtained in stage 2 and the reference landmarks from stage 1. We normalized the data by changing the coordinates and rescaling the size of the data. To set the new coordinate system, seven landmarks (CFM, bregma, nasion, left/right porion, and left/right orbitale) were used. Then, using these landmarks, we applied anisotropic scaling, fixing the height at 145 mm, the width at 110 mm, and the depth at 80 mm. These values represent the average height, width, and depth, respectively, of the sample. The normalization of the skull was empirically justified by tables 3 and 4, which show the variances of landmark positions for three types of scaling methods. Anisotropic scaling yields the smallest variance for landmarks on the neurocranium, compared to the other scaling methods.
For the detection of the 8 landmarks (see table A2) on the midsagittal plane using the truncated 2D image, the interval for truncation of the 3D data was set at 3 cm (dist(a, b) = 3 cm), i.e. ±1.5 cm from the midsagittal plane in the v_1-direction. The overall architecture of the entire-image-based CNN is shown in figure 8(a). With input data of size 512 × 512, the first three layers were convolutional layers with a kernel size of 3 × 3 pixels, stride 1, and 8 channels. In the fourth layer, we used a 2 × 2 convolution with stride 2 for spatial downsampling. After the convolutional and pooling layers, the last four layers were fully connected layers with 1024-512-256-16 neurons, respectively. Rectified linear unit (ReLU) activation was applied after each pooling layer to mitigate the vanishing gradient problem, a dropout rate of 0.75 was chosen to alleviate overfitting (Srivastava et al 2014), and the Adam optimizer was used for learning. Additionally, we extracted local patches using the output of the entire-image-based CNN. For fine detection of the 8 landmarks on the midsagittal plane, a second CNN architecture was designed for patch-based detection; it is similar to that of the entire-image-based CNN, as shown in figure 8(b). Note that the size of the patch (the input for the patch-based CNN) was chosen according to the characteristics of the morphological structures in the vicinity of the landmarks. With the small amount of data at hand, it was necessary to apply landmark detection additionally on the small patches for better feature learning for each landmark. The effectiveness of this additional detection on a small ROI is shown in figure 9, and table 1 shows the prediction errors of the landmarks. Detection on the small patch captures more accurate features for each landmark, so that the patch-based CNN output is closer to the ground truth than the output of the entire-image-based CNN.
Figure 9. Results of coarse-to-fine landmark detection on a 2D image. The yellow dot is the output of the entire-image-based CNN, which determines each patch. The green dot is the output of the patch-based CNN. The red dot is the ground truth.
As a key aspect of this research, we used the VAE to find a latent representation of the landmark feature vector. The objective was to find a low-dimensional representation of high-dimensional landmark feature vectors. The dimension of the latent space was empirically chosen to be 25, which indicates that the landmark feature vectors (∈ R^279) can be nonlinearly expressed using 25 variables. Next, we connected the landmarks detected in the previous steps to the trained representation via non-linear regression with a multilayer perceptron, whose structure was set to 45-30-25 neurons per layer. After training the VAE and the non-linear regression, the reconstructed landmark vectors were given by (Ψ ∘ Γ)(P♯). Let q_j denote the denormalized vector of the jth component (jth landmark) of (Ψ ∘ Γ)(P♯). Figure 10 shows the 3D distance error (mm) for each landmark, calculated as (1/N_test) Σ_i ∥p_j^(i) − q_j^(i)∥, where ∥p_j^(i) − q_j^(i)∥ indicates the error for the ith patient. The localization errors of most of the landmarks were within 4 mm. We achieved an average point-to-point error of 3.63 mm over the 93 cephalometric landmarks, calculated by averaging the per-landmark errors, i.e. (1/93) Σ_{j=1}^{93} (error of the jth landmark). The standard deviation of the error over the 93 landmarks was 1.41 mm. The midpoint of the superior pterygoid point (mid-Pts) exhibited the highest accuracy, with a 3D distance error of 1.41 mm, and the right coronoid point (R COR) exhibited the lowest accuracy, with a 3D distance error of 7.47 mm. For all points, the error was within 8 mm, and 60 landmarks were within 4 mm. Figure 10 shows the test error for each landmark.

Table 1. Mean 2D distance error (mm) for the detection of the 8 landmarks using the coarse-to-fine annotator on test data. The columns indicate the errors of the coarse detection and the fine detection, respectively (landmark; coarse detection error (mm); fine detection error (mm)).
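Assuming predicted and ground-truth landmark arrays, the point-to-point error metric described above can be sketched as:

```python
import numpy as np

def point_to_point_errors(pred, gt):
    """pred, gt: (n_subjects, 93, 3) arrays of landmark positions in mm.
    Returns the per-landmark mean 3D distance error over subjects, shape (93,)."""
    dists = np.linalg.norm(pred - gt, axis=2)  # (n_subjects, 93) 3D distances
    return dists.mean(axis=0)                  # average over test subjects

def mean_error(pred, gt):
    """Average the 93 per-landmark errors into a single figure
    (the quantity reported as 3.63 mm in this study)."""
    return point_to_point_errors(pred, gt).mean()
```

The per-landmark values are what figure 10 plots, and their mean and standard deviation give the summary statistics quoted in the text.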

About usage of CT data
3D cephalometry serves as a powerful tool in craniofacial analysis compared to 2D cephalometry (Adams et al 2004, Nalçaci et al 2010). It is based on 3D CT images obtained from cone-beam CT (CBCT) or multi-slice CT (MSCT). The effective dose for CBCT in craniofacial imaging is generally lower than that of MSCT (Ludlow et al 2013). Since our currently available CBCT machines do not provide the full field of view needed for complete 3D cephalometric analysis, such as Delaire or Sassouni analysis, we applied our experiments to previously acquired MSCT images that contain the cranium and vertebrae as well as the maxillomandibular facial structures. A low-dose, radiation-protection protocol for MSCT was applied to reduce the radiation dose during the study.

About the number of cephalometric landmarks
For general 2D cephalometric analysis, 93 landmarks may be considered too many. However, this number of landmarks is needed for the realization and clinical application of 3D cephalometry. Among the 93 landmarks, 75 points consisted of bilateral reference points (29 landmarks each for left and right) and their midpoints (17 midpoints of left and right). These therefore correspond to 47 landmarks (18 midline and 29 bilateral points) in 2D cephalometry. Due to the characteristics of 3D analysis, the midpoints are needed to construct their related planes. Without the midpoints, complications can be expected; for example, reference planes created only with bilateral landmarks may not have a vertical relationship with the midsagittal plane, which causes inconsistent cephalometric measurements. Therefore, compared to 2D analysis, 3D cephalometrics requires nearly three times more landmarks for cephalometric measurement, since bilateral landmarks are considered unilateral in 2D cephalometric analysis.

Table 2. Comparison of errors of the VAE-based expression for P by varying the latent dimension. In this case, the error is the difference between the input and output, given by (1/N) Σ_{i=1}^{N} ∥P^(i) − Ψ(Φ(P^(i)))∥.

About data normalization
In the initial step of our proposed method, we set a new coordinate system (rotation and rescaling) for data normalization. Table 2 shows the average 3D distance error over all 93 landmarks as the latent dimension and normalization scheme are varied. In this case, the error is the difference between the input and output, given by (1/N) Σ_{i=1}^{N} ∥P^(i) − Ψ(Φ(P^(i)))∥. Based on a comparison of the errors of the VAE-based expression for P across latent dimensions, we empirically chose a latent dimension of 25 for the normalized data, which gives the lowest mean and maximum errors on the test dataset.
To evaluate the scaling methods, we measured the variance of each landmark position between subjects, with the CFM fixed at the origin. Tables 3 and 4 show the variances of landmark positions for three types of scaling methods (no scaling, isotropic scaling, and anisotropic scaling); the tables refer to the landmarks of the neurocranium and the mandible, respectively.

Figure 12. Interpolation between two points h^(i) and h^(j) in the latent space. Given two feature vectors Ψ(h^(i)) and Ψ(h^(j)), the VAE allows generation of the interpolated feature vector Ψ((1 − t)h^(i) + th^(j)) for 0 < t < 1.

About choice of local landmarks
In stage 4, we detect the 93 global landmarks from 15 local landmarks, some of which are on the surface of the skull while the others lie on the midsagittal plane. Table 5 shows the average 3D point-to-point error for the 93 landmarks for different selections of local landmarks. Higher accuracy in cephalometric annotation is expected if more local landmarks are available. To detect further local landmarks, such as those on the surface of the skull (e.g. neurocranium landmarks), one could use the existing method based on multiple shadowed 2D-image-based landmarking (Lee et al 2019). The proposed method could therefore be improved by detecting more local landmarks.

About variations of craniofacial structural characteristics
The results of the experiments using the VAE show that the geometric feature vector describing facial skeletal morphology lies on a low-dimensional latent space. To examine what each latent variable represents, we visualized how varying each variable alters the landmark positions. Among the 25 latent variables, figure 11 shows a visualization of two factors via the changed positions of the reconstructed landmarks. They appear to capture prognathic/retrognathic jaw deformity (figure 11(b)) and long/short face (i.e. changed facial vertical dimension) (figure 11(c)). Since these deformity shifts can be regarded as an illustration of jaw deformity types based on the shape, size, and position of the mandible and maxilla, it is interesting that the VAE captures variations of craniofacial structure. In this work, we described only two morphological factors; we expect that further facial deformities are expressed in the latent variables, and further research through factor analysis using the VAE is necessary. Moreover, to verify one of the properties of the VAE, i.e. that the latent space is dense and smooth, figure 12 shows the interpolation between two randomly chosen data points in the latent space. We interpolated two encoded data points (say h^(i) and h^(j)) and decoded the interpolated samples back to the original (landmark) space. Let P^(i), P^(j) be data from the landmark data space and h^(i) = Φ(P^(i)), h^(j) = Φ(P^(j)) be the encoded data in the latent space. We linearly interpolated the two latent vectors and fed them to the decoder. Figure 12 visualizes the decoded data; each of the generated configurations contains a human-like cephalometric landmark structure.
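The latent interpolation shown in figure 12 can be sketched as below, with an identity placeholder standing in for the trained decoder Ψ:

```python
import numpy as np

def interpolate_latent(h_i, h_j, ts, decoder=lambda h: h):
    """Decode samples Psi((1 - t) h_i + t h_j) for each t in ts.
    The default decoder is an identity placeholder; in the actual system
    it would be the trained VAE decoder Psi mapping R^25 -> R^279."""
    return [decoder((1.0 - t) * np.asarray(h_i) + t * np.asarray(h_j))
            for t in ts]
```

Because the KL regularization keeps the latent space dense and smooth, the decoded intermediate points remain plausible skull configurations rather than degenerate blends.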

Conclusion
In this paper, a multi-stage deep learning framework is proposed for automatic 3D cephalometric landmark annotation. The proposed method initially detects a few landmarks (7 out of 93) that can be robustly and accurately estimated from the skull image. Knowledge of these 7 landmarks allows the midsagittal plane, on which 8 important landmarks lie, to be determined. This midsagittal plane is used to accurately estimate the 8 landmarks with the coarse-to-fine CNN in stage 2. The remainder of the landmarks (78 = 93 − 15) are estimated from the knowledge of the 15 landmarks and the VAE-based representation of morphological similarity/dissimilarity of the normalized skull. This mimics the detection procedure of experts in that it first estimates easily detectable landmarks and then detects the remainder. Its novel contribution is the use of a VAE for 2D-image-based 3D feature learning, representing high-dimensional 3D landmark feature vectors using much lower-dimensional latent variables. This low-dimensional latent representation is achieved with the help of cranial normalization and fixed reference Cartesian coordinates. It allows all 3D landmarks to be annotated from partial information based on landmarking on a cross-sectional CT image of the midsagittal plane. The experiments confirmed the capability of the proposed method, even when a limited number of training datasets were used. Manual landmarking of nearly a hundred landmarks is time-consuming and labor-intensive. Compared to manual operation, automatic landmark detection with additional fine tuning is expected to improve the work efficiency of experts. Therefore, the proposed method has the potential to alleviate experts' time-consuming workflow by dramatically reducing the time required for landmarking while preserving high accuracy.
Our hierarchical method exhibited a much higher level of performance than the previous 3D deep learning method (Kang et al 2019). Using the same dataset, the proposed method yielded an average 3D distance error of 3.63 mm for 93 cephalometric landmarks, while the 3D deep learning model (Kang et al 2019) resulted in an average 3D distance error of 7.61 mm for only 12 cephalometric landmarks.
The proposed method exhibited relatively high performance, but the error level does not yet meet the requirement for immediate clinical application (such as an error level of less than 2 mm). However, this approach has considerable room for improvement, and the errors can be significantly reduced by improving deep learning performance with an increased number of training data. Although our protocol cannot determine the exact landmark locations to expert human standards, it can immediately guide the operator to the approximate position and image setting for 3D cephalometry. It can also reduce the burden of moving the 3D skull object and scrolling through multi-planar image settings during landmark pointing tasks. Finally, it can be applied prior to data processing for segmentation, thus assisting in the orientation of the head to the calibrated posture.
The proposed multi-stage learning framework is designed to deal with the challenge of a small amount of data when learning 3D features from 3D CT data. Although hospitals generate many CT datasets, few datasets can be used for research for legal and ethical reasons.
This automatic 3D cephalometric annotation system is in an early stage of development, and there is potential for further improvement. The proposed method can provide an excellent initial landmark estimation (requiring only small adjustments) that can be used to develop an accurate and consistent landmark detection system without a large amount of data. More precisely, given the initial landmark estimation, the landmark detection problem can be reduced to adjusting the landmark position within each small region extracted around the initial landmarks. We believe the desired level of accuracy can be achieved if more data become available; since the dimension of the input (an extracted small region) is small, much more training data would not be needed for a better result. Also, it would be desirable to integrate the multi-stage hierarchical learning framework used in this study into a unified learning framework, because the errors in each step can affect the successive steps of the hierarchical structure.