Automatic 3D cephalometric annotation system using shadowed 2D image-based machine learning

This paper presents a new approach to automatic three-dimensional (3D) cephalometric annotation for diagnosis, surgical planning, and treatment evaluation. There has long been considerable demand for automated cephalometric landmarking, since manual landmarking requires considerable time and experience, as well as objectivity and scrupulous error avoidance. Due to the inherent limitations of two-dimensional (2D) cephalometry and the 3D nature of surgical simulation, there is a trend away from current 2D toward 3D cephalometry. Deep learning approaches to cephalometric landmarking seem highly promising, but there exist serious difficulties in handling high-dimensional 3D CT data, dimension here referring to the number of voxels. To address this issue of dimensionality, this paper proposes a shadowed 2D image-based machine learning method which uses multiple shadowed 2D images with various lighting and view directions to capture 3D geometric cues. The proposed method, using VGG-net, was trained and tested on 2700 shadowed 2D images and the corresponding manual landmarkings. Evaluation on the test data shows that our method achieved an average point-to-point error of 1.5 mm for the seven major landmarks.

where (Net) is the number of neurons in the neural network, (input) is the number of voxels of the input x, and (labels) is the number of labeled training data. For reasonable accuracy, the learnability of the landmark-detection map f depends on the ratio between the amount of training data and the dimension of x, so a great amount of training data is needed for a high-dimensional input x. The dimension of the 3D data x appears to be too high for f to be learned from the currently available training data.
To resolve this issue, we propose a new detection strategy that takes advantage of multiple 2D images capturing the 3D geometric morphology of the skull surface in x. The use of lighting, shadowing, and subject orientation augments the 2D images with 3D geometric features: physical interactions among the light source, the skull surface, and its orientation generate shadows that provide geometric cues for landmark detection. Because 2D data requires far fewer weight parameters in a deep learning network than 3D data does, the curse of dimensionality can be avoided. That is, representing the 3D image by 2D ones not only allows learning from relatively few training samples, but also avoids the out-of-memory problem.
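The scale of this dimensionality gap can be made concrete with a back-of-envelope count of first-layer weights. The sketch below uses the image sizes quoted in this paper, but the dense-layer width of 1024 is a purely hypothetical choice for illustration:

```python
# Back-of-envelope comparison of first-layer parameter counts for a
# fully-connected layer fed raw 3D CT versus a single 2D projection.
# Sizes follow the paper (512 x 512 x 400 CT, 512 x 512 images); the
# hidden width of 1024 is an arbitrary illustrative choice.

voxels_3d = 512 * 512 * 400      # ~10^8 input dimensions for 3D CT
pixels_2d = 512 * 512            # ~2.6 x 10^5 for one shadowed 2D image
hidden = 1024                    # hypothetical width of one dense layer

params_3d = voxels_3d * hidden   # weights into the hidden layer, 3D input
params_2d = pixels_2d * hidden   # same layer fed by a 2D image

print(f"3D input weights: {params_3d:.2e}")              # 1.07e+11
print(f"2D input weights: {params_2d:.2e}")              # 2.68e+08
print(f"reduction factor: {params_3d // params_2d}x")    # 400x
```

The 400-fold reduction in first-layer weights is what makes training from a few dozen labeled CT scans plausible at all.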
The proposed landmark detection method was trained and tested using 2700 shadowed 2D images in each view direction, generated from 27 anonymized CT datasets labeled with manual landmarkings (as ground-truth references) made by two experienced professionals. This paper focuses on seven cephalometric landmarks (nasion, bregma, center of the foramen magnum (CFM), and the left/right orbitale and left/right porion), which are essential for constructing the reference planes in cephalometric analysis. The VGG-net architecture (Simonyan and Zisserman 2015) with random initialization was used for training. Evaluated on the test data, the proposed method yielded an average point-to-point error of 1.5 mm.

Method
Let x represent a three-dimensional CT image defined on the voxel grid Ω := {p = (p_1, p_2, p_3) : p_j = 1, …, N_j, j = 1, 2, 3}. Here, N_j is the number of voxels in direction p_j, as shown in figure 1. In our case, the CT image size is N_1 × N_2 × N_3 ≈ 512 × 512 × 400. The value x(p) at voxel position p can be viewed as the attenuation coefficient.
The goal is to develop an automated 3D cephalometric landmark detection algorithm that determines a group of landmarks R = (r_1, …, r_n), where r_1 is the position of the nasion, r_2 the position of the bregma, and so on. We aim to use a deep learning technique to learn a map f : x → R from labeled training data {(x_j, R_j) : j = 1, …, M} made by experts. The landmark detection map f can be learned by solving

f = argmin_Net Σ_{j=1}^{M} ‖Net(x_j) − R_j‖²,    (1)

where Net is a deep convolutional neural network and ‖·‖ denotes the Euclidean norm. Unfortunately, we experienced serious difficulty in handling the high-dimensional CT data x (N_1 × N_2 × N_3 voxels) in 3D space when applying supervised learning. The proposed method is designed to overcome this dimensionality problem, as explained in the following sections; indeed, it mimics the detection procedure of experts.

Figure 1. Illustration of the 3D CT image and skull image. Our landmark detection method is based on the use of a 3D skull image segmented from the 3D CT data.
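The least-squares training problem described above can be sketched in a few lines. The toy example below uses a linear map as a stand-in for the deep network Net (an assumption for brevity; the paper's actual model is VGG-net) and checks that one gradient step reduces the cost:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for M labeled pairs (x_j, R_j): x_j is a flattened image,
# R_j holds 7 landmarks x 3 coordinates. All sizes are illustrative only.
M, dim_x, n_landmarks = 20, 64, 7
X = rng.normal(size=(M, dim_x))            # training inputs x_j
R = rng.normal(size=(M, n_landmarks * 3))  # 3D landmark groups R_j

def net(x, W):
    """A linear placeholder for the deep network Net."""
    return x @ W

def objective(W):
    # sum_j || Net(x_j) - R_j ||^2  -- the least-squares training cost
    residual = net(X, W) - R
    return np.sum(residual ** 2)

# One plain gradient-descent step should reduce the cost, as training does.
W = np.zeros((dim_x, n_landmarks * 3))
grad = 2 * X.T @ (net(X, W) - R)
W_new = W - 1e-3 * grad
assert objective(W_new) < objective(W)
```

Replacing the linear map with a CNN changes the optimizer's work, not the shape of the objective.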

Generating the 2D image containing 3D geometric cues
Our landmark detection strategy is based on the use of a 3D skull image that can be segmented from the 3D CT data, as shown in figure 1. Given a 3D CT image x, let the skull region occupy a three-dimensional domain denoted by the set S_x. Roughly speaking, S_x can be extracted by a soft threshold-based weighting method (Kyriakou et al 2010). If voxels with CT values above a threshold h are assumed to represent the attenuation coefficient of bone, then S_x ≈ {p ∈ Ω : x(p) > h}. The skull geometry S_x is used to generate multiple 2D images containing 3D geometric features produced by lighting and shadowing (Lengyel 2012), as shown in figure 2.
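A hard-threshold approximation of S_x ≈ {p ∈ Ω : x(p) > h} can be sketched as follows; the synthetic volume and the threshold value h are illustrative stand-ins, not the soft-threshold weighting of Kyriakou et al (2010):

```python
import numpy as np

def skull_mask(volume: np.ndarray, h: float) -> np.ndarray:
    """Binary mask of voxels whose attenuation value exceeds h."""
    return volume > h

# Synthetic 'CT': a bright spherical shell (bone-like) in a dim background.
n = 64
idx = np.indices((n, n, n)) - n // 2
r = np.sqrt((idx ** 2).sum(axis=0))
volume = np.where((r > 20) & (r < 24), 1500.0, 50.0)  # shell vs soft tissue

mask = skull_mask(volume, h=700.0)     # h is an arbitrary illustrative cut
assert mask.dtype == bool
assert mask.sum() == ((r > 20) & (r < 24)).sum()  # only the shell survives
```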
We select a part of a plane Π_d lying outside the skull, with normal vector d. Let Π_d be expressed as Π_d = {r(s_1, s_2) : 1 ≤ s_1, s_2 ≤ 512}. If we place the light source at a point q ∉ Ω, then the intensity of light x_{q,d} with view direction d can be expressed as

x_{q,d}(s) = n(p_d(s)) · (q − p_d(s)) / ‖q − p_d(s)‖,    (2)

where n(p) is the unit normal vector to the skull surface at p, and p_d(s) is the point on the skull surface given by

p_d(s) = argmin_{p ∈ ℓ_{s,d} ∩ ∂S_x} ‖p − r(s)‖.    (3)

Here, ℓ_{s,d} is the line passing through r(s) ∈ Π_d and parallel to the direction d. Figure 3 shows three different lighting positions on a 3D skull model. We confirmed that the geometric features of landmarks may or may not be clearly visible depending on the lighting position; we therefore chose three different lighting positions to detect the landmarks robustly. Figure 4 shows the 2D images x_{q,d} from different views and multiple light sources. These 2D images provide 3D geometric cues, because formula (2) is associated with the distribution of the normal vector n over the skull surface.
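As a rough illustration of how such shadowed images can be rendered, the sketch below casts orthographic rays along one axis of a binary volume and applies Lambertian shading from a point light. The geometry, light position, and gradient-based normal estimation are simplified assumptions, not the paper's exact renderer:

```python
import numpy as np

def shade(mask, light_q):
    """Orthographic rays along z hit the first occupied voxel p_d(s);
    pixel value is the Lambertian term n(p) . (q - p)/|q - p|."""
    hit = mask.any(axis=2)
    depth = np.where(hit, mask.argmax(axis=2), -1)  # first surface voxel

    # gradient of occupancy points inward; outward normal is its negative
    g = np.stack(np.gradient(mask.astype(float)), axis=0)

    img = np.zeros(mask.shape[:2])
    for i, j in zip(*np.nonzero(hit)):
        k = depth[i, j]
        n_vec = -g[:, i, j, k]                 # outward normal estimate
        norm = np.linalg.norm(n_vec)
        if norm == 0:
            continue
        n_vec /= norm
        l_vec = light_q - np.array([i, j, k], float)
        l_vec /= np.linalg.norm(l_vec)
        img[i, j] = max(0.0, n_vec @ l_vec)    # Lambertian shading
    return img

# Synthetic skull stand-in: a solid ball inside a 48^3 grid.
n = 48
idx = np.indices((n, n, n)) - n // 2
ball = (idx ** 2).sum(axis=0) < 15 ** 2

img = shade(ball, light_q=np.array([n / 2, n / 2, -60.0]))
assert img.shape == (n, n)
assert 0.0 <= img.min() and img.max() <= 1.0 + 1e-9
assert img[n // 2, n // 2] > 0.5   # center pixel faces the light head-on
```

Varying `light_q` and the casting axis reproduces, in miniature, the multiple lightings and view directions used to build the training images.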

3D landmark detection method
This section explains how to detect the landmarks R using the multiple 2D images x_{q,d}. We generate a new training dataset in which each shadowed 2D image x_{q,d_k} is paired with R_{d_k} = (s_{d_k}(1), …, s_{d_k}(K_{d_k})), a group of 2D landmarks, where K_{d_k} is the number of landmarks that can be observed in the d_k direction. The 2D landmarks s_{d_k}(i) correspond to the original 3D landmarks r_i, as shown in figure 2.
With this generated training data (its dimensionality significantly reduced without losing the 3D features), we can use a deep learning method that aims to learn a map f_{d_k} : x_{q,d_k} → R_{d_k}. The VGG-net architecture (Simonyan and Zisserman 2015) with random initialization was used to learn f_{d_k}, owing to its good performance in image localization. VGG-net is known to have a simpler network structure than GoogLeNet (Szegedy et al 2015), which performs similarly in image classification tasks. The architecture of the network will be explained in the next subsection. Once f_{d_k} is obtained through deep learning, the 3D landmarks can be computed from formula (3). Note that this projection method cannot be applied directly to the CFM (see figure 5), because that landmark does not lie on the skull surface. This issue is resolved by generating a 2D binary image in which the foramen magnum is zero, as shown in figure 5. The binary image is generated by truncating the upper half of the skull; from the inferior view, the foramen magnum area of the truncated skull is an empty hole. With this binary image, it is easy to obtain the boundary of the foramen magnum, and the CFM landmark is obtained as the average of these boundary points.
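The CFM step described above, averaging the boundary points of the foramen magnum hole in an inferior-view binary image, can be sketched as follows, with a synthetic binary image standing in for the rasterized truncated skull:

```python
import numpy as np

def hole_boundary_centroid(binary: np.ndarray) -> tuple:
    """Average position of zero pixels that touch a nonzero 4-neighbour,
    i.e. the centroid of the hole's boundary ring."""
    padded = np.pad(binary, 1, constant_values=0)
    neigh = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
             padded[1:-1, :-2] + padded[1:-1, 2:])     # 4-neighbour sum
    boundary = (binary == 0) & (neigh > 0)
    ys, xs = np.nonzero(boundary)
    return ys.mean(), xs.mean()

# Bone everywhere except a circular 'foramen magnum' centred at (40, 60).
img = np.ones((100, 100), dtype=int)
yy, xx = np.indices(img.shape)
img[(yy - 40) ** 2 + (xx - 60) ** 2 < 12 ** 2] = 0

cy, cx = hole_boundary_centroid(img)
assert abs(cy - 40) < 0.5 and abs(cx - 60) < 0.5  # centroid near hole centre
```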

Deep learning network
To learn the map f_{d_k} for landmark detection, we use VGG-19 (Simonyan and Zisserman 2015) with random initialization. The network contains 16 convolutional layers, each followed by a rectified linear unit (ReLU), and five 2 × 2 max-pooling layers with stride 2 for down-sampling. The last pooling layer is followed by four fully-connected layers. All convolutional layers have 3 × 3 filters. The number of channels is given in figure 6.
To be precise, let x(s_1, s_2) denote the pixel value of x at position (s_1, s_2). At the first convolutional layer, the input image x is convolved with a set of 64 filters w_1 = {w_{1,k} ∈ R^{3×3} : k = 1, …, 64}. This convolution provides a set of 64 feature maps h_1 = {h_{1,1}, …, h_{1,64}}, where the kth feature map at pixel position (m, n) is given by

h_{1,k}(m, n) = ReLU((w_{1,k} * x)(m, n) + b_{1,k}).

Similarly, at the second convolutional layer, we compute a set of 64 feature maps h_2 = {h_{2,1}, …, h_{2,64}} using a set of weights w_2 = {w_{2,k,k′} ∈ R^{3×3} : k, k′ = 1, …, 64}, where h_{2,k} is given by

h_{2,k}(m, n) = ReLU(Σ_{k′=1}^{64} (w_{2,k,k′} * h_{1,k′})(m, n) + b_{2,k}).

Next, we apply a 2 × 2 max-pooling operation with stride 2 for down-sampling. By applying either convolution or max pooling at each layer, we obtain h_16, a set of 512 feature maps. The output feature map h_17 is given by

h_17 = ReLU(w_17 h̄_16 + b_17),

where h̄_16 represents the max pooling of h_16. There are two additional fully-connected layers. Note that the size of the weight w_17 differs from that of the original VGG-net, due to the different size of the input data. In this process, we compute the output to obtain the group of 2D landmarks R_{d_k}, as shown in figure 6.
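The first-layer computation can be transcribed almost literally into numpy. The filters and biases below are random stand-ins for trained weights, and the image is kept small so the explicit loops stay fast:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def conv2d_same(x, w):
    """2D correlation with a 3x3 filter and zero padding, so the output
    size matches x (as in CNN 'convolution')."""
    xp = np.pad(x, 1)
    out = np.zeros_like(x, dtype=float)
    for dm in range(3):
        for dn in range(3):
            out += w[dm, dn] * xp[dm:dm + x.shape[0], dn:dn + x.shape[1]]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 32))        # small stand-in for a 512 x 512 image
w1 = rng.normal(size=(64, 3, 3))     # 64 random filters w_{1,k}
b1 = rng.normal(size=64)             # biases b_{1,k}

# h_{1,k} = ReLU((w_{1,k} * x) + b_{1,k}) for each of the 64 channels
h1 = np.stack([relu(conv2d_same(x, w1[k]) + b1[k]) for k in range(64)])
assert h1.shape == (64, 32, 32)
assert (h1 >= 0).all()               # ReLU output is non-negative

# 2x2 max pooling with stride 2, as applied between VGG blocks
pooled = h1.reshape(64, 16, 2, 16, 2).max(axis=(2, 4))
assert pooled.shape == (64, 16, 16)
```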

Experiment
This section demonstrates the feasibility of the proposed method through experiment. The anonymized CT data from our previous study (Lee et al 2014) was used for this experiment. 27 normal Korean adults with skeletal class I occlusion (11 males and 16 females; 24.22 ± 2.91 years) volunteered for the study. The original work was approved by the local ethics committee of the Dental College Hospital, Yonsei University, Seoul, Korea (IRB number: 2-2009-0026), and informed consent was obtained from each subject. The subjects were divided into a learning data group (n = 20) and a test data group (n = 7). To locate the seven landmark points for 3D cephalometric analysis, SimPlant software was used by two experts who have each been working on 3D cephalometry for more than ten years in a university hospital setting. The SimPlant coordinate values for each landmark point were converted to DICOM voxel coordinates to construct the labeled datasets for deep learning. The cephalometric landmarks for this study were the nasion, bregma, CFM, orbitale, and porion; their definitions and details are given in table 1. These points were selected mainly based on the orientation of the craniofacial structure for 3D cephalometric analysis. The proper positioning of the skull structure in 3D space is referred to as the model orientation, and is essential at the initial step of 3D cephalometry. These landmarks are all necessary to construct the midsagittal, Frankfort horizontal, and coronal reference planes, which in turn guide skull positioning and proper orientation in 3D space. For training, we first generated 2D shadowed images of size 512 × 512 by varying the lighting on 3D skull surface data that had been extracted from the 3D volume data.

Figure 7. The output of the VGG-net for multiple 2D head skull images. The blue dot is the output of the network and the cyan dot is a labeled landmark point.
For additional data, we used standard data augmentation techniques: 9 rotations and 4 translations. To be precise, the 9 rotation angles were −20°, −15°, −10°, …, 20°, and the 4 translations were (−30, 0), (30, 0), (0, −30), and (0, 30), where (a, b) denotes the number of translated pixels along the x and y axes. Using this data augmentation, we produced 2700 datasets for training.
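This augmentation recipe can be sketched with scipy.ndimage; note that in a real pipeline the same rotation and translation would also have to be applied to the 2D landmark coordinates:

```python
import numpy as np
from scipy import ndimage

# Sketch of the augmentation above: 9 in-plane rotations (-20 to 20
# degrees in 5-degree steps) and 4 translations of +/-30 pixels, applied
# to one (toy) shadowed 2D image.

def augment(image: np.ndarray) -> list:
    angles = range(-20, 21, 5)                       # 9 rotation angles
    shifts = [(-30, 0), (30, 0), (0, -30), (0, 30)]  # 4 translations
    out = [ndimage.rotate(image, a, reshape=False, order=1) for a in angles]
    out += [ndimage.shift(image, s, order=1) for s in shifts]
    return out

image = np.zeros((128, 128))
image[40:80, 50:90] = 1.0                            # toy shadowed image

variants = augment(image)
assert len(variants) == 13                           # 9 rotations + 4 shifts
assert all(v.shape == image.shape for v in variants)
```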
The cost function (4) was minimized using the RMSPropOptimizer (Tieleman and Hinton 2012) with a learning rate of 0.00001, a decay of 0.9, and a mini-batch size of 8 at each epoch. We used 100 epochs to train each network. Training was implemented in TensorFlow (Google 2015) on a CPU (Intel(R) Core(TM) i7-6850K, 3.60 GHz) and GPU (NVIDIA GTX-1080, 8 GB) system. Each network took approximately 10 h to train.

Figure 7 shows the test results for multiple 2D head skull images. We tested four views (anterior, superior, left, and right) of each subject to detect the six landmarks besides the CFM: the anterior view was used to detect the nasion and orbitale, the superior view to detect the bregma, and the left/right views to detect the porion. The blue dot is the average of the test results for the three images generated by the three types of lighting, and the cyan dot is a ground-truth reference landmark point from the experts. The output 2D landmarks closely matched the reference landmarks.

Figure 8 shows the outcome of the proposed method for 3D head skull images, and table 2 shows the mean error, with standard deviation, of the distances between our results and the reference landmarks. The distribution of error for each landmark is represented by a box plot in figure 9. The distance between the reference and predicted coordinate values was measured for each landmark point along the x-, y-, and z-axes as well as in 3D distance. The measurements were performed for eight subjects of the learning data group. To obtain the reference coordinate values, two observers among the authors performed the landmark pointing twice, at a one-week interval, for each subject. Their averaged value was used as the reference, and their discrepancy was used to calculate intraclass correlation coefficients (ICC) between and within observers, with 95% confidence intervals.
The landmark point for the nasion showed the highest level of accuracy, with 0.60 mm of error in 3D distance, while the point for the CFM showed the lowest, with 4.60 mm in 3D. Of the total points, 90.5% yielded errors within 2 mm in 2D, and 85.7% within 1 mm in 3D.
The ground-truth coordinate values (used as reference) were produced from the average of the measurements made by the two observers. The mean discrepancy over all landmark points was 0.49, 1.02, and 1.40 mm along the x-, y-, and z-axes, respectively, and 1.80 mm in 3D distance. The intra-observer ICC was 0.95 (as Cronbach's alpha) and the inter-observer ICC was 0.92.

Discussion and conclusion
This paper proposes, for the first time, a shadowed 2D image-based machine learning method for automatic 3D cephalometric annotation. The key idea is to generate various shadowed 2D images by changing the view and lighting directions, making it possible to probe 3D geometric features efficiently from multiple 2D images through machine learning. Since this strategy significantly reduces the input dimension, cephalometric landmarking could be learned from relatively few training samples. From the 27 labeled 3D CT datasets, we generated training and test data for shadowed 2D image-based machine learning (VGG-net). The performance evaluation of the proposed method on the test data demonstrated its feasibility.
The first automatic cephalometric annotation system was 2D and used a knowledge-based line extraction technique (Levy-Mandel et al 1986). Various algorithms were introduced thereafter for automatic 2D cephalometry, including knowledge-based, model-based, soft computing-based, and hybrid approaches (Levy-Mandel et al 1986, Parthasarathy et al 1989, Cardillo and Sid-Ahmed 1994, Rudolph et al 1998, Hutton et al 2000, Innes et al 2002, Chakrabartty et al 2003, Giordano et al 2005, Vucinic et al 2010). Early works focused on knowledge-based approaches using edge detection and image-processing techniques (Levy-Mandel et al 1986, Parthasarathy et al 1989, Giordano et al 2005, Vucinic et al 2010), while later works implemented model-based approaches (Cardillo and Sid-Ahmed 1994, Rudolph et al 1998, Hutton et al 2000). The popularity of 3D cephalometry was accompanied by the development of automatic 3D cephalometric annotation systems. Most previous 3D studies employed model-based approaches, which depend on a feature-extracted reference model system (Makram and Kamel 2014, Gupta et al 2015, Codari et al 2017, Montúfar et al 2018). The geometrical shape and structure of each dataset are unique due to craniofacial structural variations, making it difficult to detect the precise location of landmarks with a model-based approach. Recent research in computer vision and biomedical applications has focused on machine learning due to its outstanding performance on a range of problems. Recently, machine learning has been used for 2D cephalometry (Pei et al 2013, Linder et al 2016, Arik et al 2017), mainly because it employs a hierarchical structure to propagate information on salient features to subsequent layers while exploiting the spatially local correlation between them. There has also been some automated anatomical landmark detection in 3D CT using machine learning methods (Dabbah et al 2014, Ghesu et al 2017).
However, difficulties arise in using machine learning algorithms for 3D cephalometry owing to the dimensionality problem mentioned in section 2. Previous 3D cephalometric annotation studies have all reported excellent results in terms of error ranges. Gupta et al (2015, 2016) searched for 20 cephalometric landmarks in cone-beam CT images using knowledge-based approaches and achieved an average accuracy of 2.01 mm, with 64.67% of the landmarks falling within a range of 0-2 mm and the highest mean error in the linear measurements being 2.63 mm. Shahidi et al (2013) evaluated 14 cephalometric landmarks in cone-beam CT using model-based rigid registration, reporting that 63.57% of landmarks had an average error of less than 3 mm. Makram and Kamel (2014) reported a localization system for 20 landmarks using model-based approaches with rigid and elastic registration, with 90% of their landmarks having a localization error of less than 2 mm. Codari et al (2017) presented a localization method using automatic segmentation and template-based non-rigid holistic registration, which yielded an average localization error of 1.99 mm on 21 points. Finally, Montúfar et al (2018) used cone-beam CT scans and their orthogonal coronal and sagittal projections for active shape model-based registration of 18 landmarks, obtaining a 3.64 mm mean error, with the highest localization errors found in the porion and sella regions.
All these recent automatic 3D cephalometry studies showed error levels of less than 2-4 mm; our results are similar, with some points of differentiation worth considering. Our 2D errors for most landmarks were less than 1 mm, except for two points: the CFM (in the z-axis) and the left porion (in the x-axis). The 3D errors were similar in that errors exceeding 1 mm were confined to two landmark points, the right porion and the CFM. The reason for the higher error level of these points is not empirically clear, but compared with the others they seem unique in that they are all tube-like foramen structures. Even their manual landmarkings, regarded as the gold standard in this kind of study, can be erroneous due to the anatomical characteristics of these points. Further work should reduce the error level for landmarks with such anatomical features.
For the reader's convenience, we briefly address a technical issue in using formula (3). To compute p_d(s) in (3), we extract the triangulated surface ∪_j T_{x,j} representing the skull S_x using a surface tessellation algorithm such as Marching Cubes (Lorensen and Cline 1987), where T_{x,j} is a triangle element of the surface tessellation. Then, p_d(s) is determined by choosing one of the few triangles intersected by the line ℓ_{s,d}.
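Choosing among the triangles touched by ℓ_{s,d} amounts to a ray-triangle intersection test followed by a nearest-hit selection. A minimal sketch using the Möller-Trumbore test is given below; the two triangles are illustrative, not Marching Cubes output:

```python
import numpy as np

def ray_triangle(origin, d, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore: distance t along d to the triangle, or None."""
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(d, e2)
    det = e1 @ p
    if abs(det) < eps:
        return None                       # ray parallel to triangle plane
    s = origin - v0
    u = (s @ p) / det
    if u < 0 or u > 1:
        return None
    q = np.cross(s, e1)
    v = (d @ q) / det
    if v < 0 or u + v > 1:
        return None
    t = (e2 @ q) / det
    return t if t > eps else None         # only hits in front of r(s)

def first_hit(origin, d, triangles):
    """Nearest intersection point of the ray with a triangle list."""
    hits = [t for tri in triangles
            if (t := ray_triangle(origin, d, *tri)) is not None]
    return origin + min(hits) * d if hits else None

origin = np.array([0.25, 0.25, -1.0])     # r(s) on the plane Pi_d
d = np.array([0.0, 0.0, 1.0])             # view direction

tri_near = [np.array([0., 0., 2.]), np.array([1., 0., 2.]), np.array([0., 1., 2.])]
tri_far = [np.array([0., 0., 5.]), np.array([1., 0., 5.]), np.array([0., 1., 5.])]

p = first_hit(origin, d, [tri_far, tri_near])
assert p is not None and np.allclose(p, [0.25, 0.25, 2.0])  # nearest wins
```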
The proposed approach using shadowed 2D images in machine learning to determine 3D relations is just in the initial stage, with considerable potential for improvement. The feasibility of the proposed method was tested for 7 main landmarks at the beginning of the study. This new strategy may open a new avenue of research regarding automated 3D cephalometric analysis by virtue of addressing the curse of dimensionality in training high-dimensional data.