A benchmark for comparison of dental radiography analysis algorithms

Dental radiography plays an important role in clinical diagnosis, treatment and surgery. In recent years, efforts have been made to develop computerized dental X-ray image analysis systems for clinical use. A novel framework for objective evaluation of automatic dental radiography analysis algorithms has been established under the auspices of the IEEE International Symposium on Biomedical Imaging 2015 Bitewing Radiography Caries Detection Challenge and Cephalometric X-ray Image Analysis Challenge. In this article


Introduction
Dental radiography analysis plays an important role in clinical diagnosis, treatment and surgery, as radiographs can be used to find hidden dental structures, malignant or benign masses, bone loss and cavities. During diagnosis and treatment procedures such as root canal treatment, caries diagnosis, and diagnosis and treatment planning of orthodontic patients, dental radiography analysis is [...] (Fig. 1) is commonly conducted during treatment planning. This procedure is time consuming and subjective. Automated landmark detection in cephalometric radiographs could alleviate these issues in orthodontic diagnosis and treatment. However, automated landmark detection with high precision and a high success rate is challenging. In recent years, efforts have been made to develop computerized dental X-ray image analysis systems for clinical use, such as in anatomical landmark identification (Nikneshan, 2015; Zhou and Abdel-Mottaleb, 2005), image segmentation (Lai and Lin, 2008; Rad, 2013), and diagnosis and treatment (López-López, 2012; Nakamoto, 2008; Wriedt, 2012). In 2014, we held an automatic cephalometric X-ray landmark detection challenge at IEEE ISBI 2014 with 300 cephalometric X-ray images, and the best overall detection rate for 19 anatomical landmarks was 71.48% at an accuracy of within 2 mm. The 2014 challenge outcomes indicate that automatic cephalometric X-ray landmark detection is still an unsolved problem.

Fig. 1 (caption, partial). FHI = $|L_1 L_{10}| / |L_2 L_8|$; FHA = $\angle L_1 L_2 L_{10} L_9$; MW = $|L_{12} L_{11}|$ where $x(L_{12}) > x(L_{11})$, otherwise MW = $-|L_{12} L_{11}|$; ODI = $\angle L_5 L_6 L_8 L_{10} + \angle L_{17} L_{18} L_4 L_3$, in this example the ODI is $76^\circ + (-3^\circ) = 73^\circ$, in the normal range with a slight tendency to an openbite; APDI = $\angle L_3 L_4 L_2 L_7 + \angle L_2 L_7 L_5 L_6 + \angle L_4 L_3 L_{17} L_{18}$, in this example the APDI is $88^\circ + (-6^\circ) + (-3^\circ) = 79^\circ$, which falls within the normal range.
Hence, the first part of this study is to investigate suitable automated methods in cephalometric X-ray landmark detection. In this study, a larger clinical database was built using data from 400 patients.
Furthermore, apart from anatomical landmark detection in cephalometric images, a new classification task for the clinical diagnosis of anatomical abnormalities using these landmarks was added in this study. In clinical practice, angles and linear measurements derived from the landmarks are more informative than point positions alone. Many classification methods have been proposed for cephalometric analysis, such as Ricketts analysis (Ricketts, 1982), Downs analysis (Downs, 1948), Tweed analysis (Tweed, 1954), Sassouni analysis (Sassouni, 1955) and Steiner analysis (Steiner, 1953). Therefore, the second part of this study was to automatically classify patients into different anatomical types to infer a clinical diagnosis.
Apart from cephalometric analysis, caries detection and dental anatomy analysis are important in clinical diagnosis and treatment. Dental caries is a transmissible bacterial disease that destroys the structure of the teeth, and dentists diagnose and treat dental caries based mostly on radiographs. While dental caries is a disease process, the term is routinely used to describe radiographic radiolucencies.
Radiographic examination can improve the detection and diagnosis of dental caries. In clinical practice, caries lesions have traditionally been diagnosed by visual inspection in combination with radiography. Therefore, automated caries detection systems with high reproducibility and accuracy would be welcomed in clinicians' search for more objective caries diagnostic methods (Wenzel, 2001, 2002). Several research studies have focused on pattern recognition or segmentation of dental structures, such as in caries detection (Huh, 2015; Oliveira and Proença, 2011), root canal edge extraction (Gayathri and Menon, 2014), identity matching (Jain and Chen, 2004; Zhou and Abdel-Mottaleb, 2005) and teeth classification (Lin, 2010). Automated caries lesion detection technologies provide potential diagnostic data for dental practitioners and assist in identifying signs of various diseases. However, accurate and objective methods for radiographic caries diagnosis are poorly explored. Therefore, the third part of this study was to investigate possible automated methods both for detection of caries and for dental anatomy analysis in bitewing radiographs.
This paper presents the evaluation and comparison of a representative selection of current methods presented during the Grand Challenges in Dental X-ray Image Analysis, held in conjunction with and with the support of IEEE ISBI 2015. There are two main challenges, the Automated Detection and Analysis for Diagnosis in Cephalometric X-ray Image and the Computer-Automated Detection of Caries in Bitewing Radiography. The first challenge contains two tasks: (i) to identify anatomical landmarks on lateral cephalograms, and (ii) to classify anatomical types based on the anatomical landmarks. Only the first task of the first challenge is similar to a related challenge held at IEEE ISBI 2014. The second challenge (Computer-Automated Detection of Caries in Bitewing Radiography) and the second task of Challenge 1 (classifying anatomical types based on the anatomical landmarks) are both completely new. In addition, for the first challenge, the dataset was enlarged to include 400 patients. In comparison to the challenge held at IEEE ISBI 2014, this study therefore includes a new challenge, new data and a new challenge task (see Table 1). The paper is organized as follows. In Section 2, the challenge aims, participants, image datasets and evaluation approaches are described. The methodologies and detailed quantitative evaluation results of Challenge 1 and Challenge 2 are presented in Sections 3 and 4, respectively. Finally, conclusions are given in Section 5.

Organization
The goals of this grand challenge are to investigate automatic methods for Challenge 1-1: identifying anatomical landmarks on lateral cephalograms, Challenge 1-2: classifying anatomical types based on the anatomical landmarks, and Challenge 2: segmenting seven tooth structures on bitewing radiographs. The 19 anatomical landmarks to be detected on lateral cephalograms are (1) the sella, (2) the nasion, (3) the orbitale, (4) the porion, (5) the subspinale (A point), (6) the supramentale (B point), (7) the pogonion, (8) the menton, (9) the gnathion, (10) the gonion, (11) the lower incisal incision, (12) the upper incisal incision, (13) the upper lip, (14) the lower lip, (15) the subnasale, (16) the soft tissue pogonion, (17) the posterior nasal spine, (18) the anterior nasal spine and (19) the articulare, as shown in Fig. 1 (a). For the classification of anatomical types based on the obtained anatomical landmarks, eight standard clinical measurement methods (Downs, 1948; Kim, 1974; Kim and Vietas, 1978; McNamara, 1984; Nanda and Nanda, 1969; Steiner, 1953; Tweed, 1946) were included, as shown in Table 2 and illustrated in Fig. 1 (b)-(d). For the analysis of the dental anatomy of bitewing radiographs, seven tooth structures were included: caries, enamel, dentin, pulp, crown, restoration, and root canal treatment (see Fig. 2).

Fig. 2. Bitewing radiographs: (a) a raw image with (b) seven dental structures highlighted, including (1) caries in blue, (2) enamel in green, (3) dentin in yellow, (4) pulp in red, (5) crown in skin color, (6) restoration in orange and (7) root canal treatment in cyan. The images are captured using the SOREDEX system (SOREDEX, Finland), which is equipped with an optional image plate identification system (IDOT) for quality control; 'C3' on the image indicates the active/frontal side in the IDOT system.
There were two stages in both challenges. In stage 1, a training dataset and a first test dataset were released for method development. In stage 2, an on-site competition was organized for which a second test dataset was used. The results of all individual methods were compared to the ground truth data, and extensive quantitative evaluation was performed to assess the performance of all methods.

Participants
A total of 18 teams (from 12 countries) registered for the 2015 IEEE ISBI grand challenge, and the four teams listed below were accepted in stage 1 and invited to the on-site competition in stage 2. The four approaches are described in Sections 3.1 and 4.1, respectively. For landmark detection in cephalometric radiographs, we also compare five methods submitted to the 2014 ISBI challenge; details of these five methods can be found in (Wang, 2015).

Datasets
400 cephalometric radiographs were collected from 400 patients aged six to 60 years. The cephalograms were acquired in TIFF format with a Soredex CRANEX® Excel Ceph machine (Tuusula, Finland) and Soredex SorCom software (3.1.5, version 2.0), and the image resolution was 1935 × 2400 pixels. For evaluation, 19 landmarks were manually marked in each image and reviewed by two experienced medical doctors; the ground truth is the average of the markups by both doctors. For the classification of anatomical types, eight clinical measurement methods were used (see illustrations in Fig. 1 and classifications in Table 2):

1. ANB = $\angle L_5 L_2 L_6$, the angle between landmarks 5, 2 and 6.
2. SNB = $\angle L_1 L_2 L_6$.
3. SNA = $\angle L_1 L_2 L_5$.
4. ODI = $\angle L_5 L_6 L_8 L_{10} + \angle L_{17} L_{18} L_4 L_3$, the arithmetic sum of the angle between the AB plane ($L_5 L_6$) and the Mandibular Plane (MP, $L_8 L_{10}$) and the angle of the Palatal Plane (PP, $L_{17} L_{18}$) to the Frankfort Horizontal plane (FH, $L_4 L_3$).
5. APDI = $\angle L_3 L_4 L_2 L_7 + \angle L_2 L_7 L_5 L_6 + \angle L_3 L_4 L_{17} L_{18}$.
6. FHI = $|L_1 L_{10}| / |L_2 L_8|$, the ratio of the Posterior Face Height (PFH, the distance from $L_1$ to $L_{10}$) to the Anterior Face Height (AFH, the distance from $L_2$ to $L_8$).
7. FHA = $\angle L_1 L_2 L_{10} L_9$.
8. MW = $|L_{12} L_{11}|$ where $x(L_{12}) > x(L_{11})$; otherwise, MW = $-|L_{12} L_{11}|$.

For the bitewing radiography analysis, 120 images were collected from 120 patients, acquired in TIFF format with a Sirona HELIODENT DS SIDEXIS machine (Salzburg, Austria) and EBM Viewer software (version 4.2c). For evaluation, seven structure types were manually marked in each image and reviewed by two experienced medical doctors.
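The three-point angles, the face-height ratio and the signed distance among the measurements above can be computed directly from landmark coordinates. The following is a minimal sketch under stated assumptions: the function names and the dictionary layout (landmark number mapped to an (x, y) pair) are hypothetical, angles are returned unsigned in degrees, and the four-landmark measurements (ODI, APDI, FHA) are omitted because their sign conventions are not given in the text.

```python
import math

def angle_at_vertex(a, vertex, b):
    """Unsigned angle in degrees at `vertex` between rays vertex->a and vertex->b."""
    v1 = (a[0] - vertex[0], a[1] - vertex[1])
    v2 = (b[0] - vertex[0], b[1] - vertex[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    return math.degrees(math.atan2(abs(cross), dot))

def distance(p, q):
    """Euclidean distance between two points, in the units of the input."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def cephalometric_measurements(lm):
    """Compute ANB, SNB, SNA, FHI and MW from a dict {landmark number: (x, y)}.

    Assumption: `lm` uses the landmark numbering L1..L19 of the text.
    """
    mw = distance(lm[12], lm[11])
    if not lm[12][0] > lm[11][0]:
        mw = -mw  # MW is negative unless x(L12) > x(L11)
    return {
        "ANB": angle_at_vertex(lm[5], lm[2], lm[6]),
        "SNB": angle_at_vertex(lm[1], lm[2], lm[6]),
        "SNA": angle_at_vertex(lm[1], lm[2], lm[5]),
        "FHI": distance(lm[1], lm[10]) / distance(lm[2], lm[8]),
        "MW": mw,
    }
```

Angles and the FHI ratio do not depend on the pixel spacing, whereas MW is returned in the units of the input coordinates and would need scaling to millimetres before comparison with clinical thresholds.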
Both datasets were randomly divided into three subsets, Training data, Test1 data and Test2 data, for two-stage testing (see Table 3). Ethical approval to conduct the study (IRB Number 1-102-05-017) was obtained from the research ethics committee of the Tri-Service General Hospital in Taipei, Taiwan. The datasets and the evaluation software will be made available to the research community to encourage future developments in this field (http://www-o.ntust.edu.tw/~cweiwang/ISBI2015/).

Evaluation approaches
In cephalometric radiography analysis, three main criteria are used to evaluate the performance of the submitted methods.

• Mean radial error
The radial error $R$ is formulated as $R = \sqrt{\Delta x^2 + \Delta y^2}$, where $\Delta x$ is the absolute distance in the x-direction between the obtained landmark and the referenced landmark, and $\Delta y$ is the absolute distance in the y-direction between the obtained landmark and the referenced landmark. The mean radial error (MRE) and the associated standard deviation (SD) are defined as

$$\mathrm{MRE} = \frac{\sum_{i=1}^{n} R_i}{n}, \qquad \mathrm{SD} = \sqrt{\frac{\sum_{i=1}^{n} (R_i - \mathrm{MRE})^2}{n - 1}},$$

where $R_i$ is the radial error of the $i$-th detection and $n$ is the number of detections.

• Success detection rate
For each landmark, medical doctors mark the location of a single pixel instead of an area as the referenced landmark location. If the absolute difference between the detected landmark and the referenced landmark is no greater than $z$ mm, the detection of this landmark is considered a successful detection; otherwise, it is considered a misdetection. The success detection rate $p_z$ at a precision of $z$ mm is formulated as

$$p_z = \frac{\#\{j : \| L_d(j) - L_r(j) \| \le z\}}{\#\Omega} \times 100\%,$$

where $L_d$ and $L_r$ represent the locations of the detected landmark and the referenced landmark, respectively; $z$ denotes the four precision ranges used in the evaluation, namely 2 mm, 2.5 mm, 3 mm and 4 mm; $j \in \Omega$, and $\#\Omega$ represents the number of detections made.

• Confusion matrix and success classification rate
In the confusion matrix, each column of the matrix represents the instances of a predicted class, while each row represents the instances of the ground truth class. The averaged diagonal of a confusion matrix represents the success classification rate.
Confusion matrices also provide valuable information on where misclassifications occur.
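The three cephalometric criteria above can be illustrated with a few lines of code. This is a sketch under stated assumptions rather than the challenge's evaluation software: all function names and the `mm_per_pixel` parameter are hypothetical, the standard deviation is taken as the sample standard deviation (denominator n − 1), a detection counts as successful when its radial error is no greater than z, and the success classification rate is taken as the mean of the diagonal of the row-normalized confusion matrix.

```python
import math

def radial_errors(detected, reference, mm_per_pixel=1.0):
    """Radial error R for each (detected, reference) landmark pair."""
    return [
        math.hypot(dx - rx, dy - ry) * mm_per_pixel
        for (dx, dy), (rx, ry) in zip(detected, reference)
    ]

def mre_and_sd(errors):
    """Mean radial error and its standard deviation."""
    n = len(errors)
    mre = sum(errors) / n
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sd = math.sqrt(sum((r - mre) ** 2 for r in errors) / (n - 1)) if n > 1 else 0.0
    return mre, sd

def success_detection_rate(errors, z_mm):
    """Percentage of detections whose radial error is no greater than z_mm."""
    hits = sum(1 for r in errors if r <= z_mm)
    return 100.0 * hits / len(errors)

def success_classification_rate(confusion):
    """Mean of the diagonal of the row-normalized confusion matrix, in percent.

    Rows are ground-truth classes, columns are predicted classes.
    Assumption: classes with no ground-truth samples are skipped.
    """
    rates = [row[i] / sum(row) for i, row in enumerate(confusion) if sum(row) > 0]
    return 100.0 * sum(rates) / len(rates)
```

In the evaluation described here, `success_detection_rate` would be called once for each of the four precision ranges (2, 2.5, 3 and 4 mm).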
In bitewing radiography analysis, three main criteria are used to evaluate the performance of the submitted methods:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}, \qquad \text{F-score} = \frac{2\,TP}{2\,TP + FP + FN},$$

where TP, TN, FP and FN represent true positives, true negatives, false positives and false negatives, respectively.
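As an illustration of these three criteria, the sketch below computes them for one structure class from two flattened binary masks. The function name and the mask representation (sequences of 0/1 values, one per pixel) are assumptions for the example; returning 0.0 when a denominator is zero is also a choice made here, not something the text specifies.

```python
def segmentation_scores(predicted, truth):
    """Sensitivity, specificity and F-score for two equally long 0/1 masks."""
    tp = fp = tn = fn = 0
    for p, t in zip(predicted, truth):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif not p and t:
            fn += 1
        else:
            tn += 1
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f_score = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return sensitivity, specificity, f_score
```

For a seven-class segmentation, the function would be applied once per class, with each class mask obtained by comparing the label map against that class index.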

Methods
(1) Ibragimov et al.
Ibragimov et al. present a novel framework for landmark detection and skull morphology classification from cephalometric X-ray images. The appearance of landmarks is modeled by a random forest-based classifier with Haar-like appearance features (Ibragimov, 2015) computed from original-scale and downscaled images, so that the local and global intensity appearance, respectively, are analyzed. To find optimal landmark positions in the target image, the statistical properties of the most representative spatial relationships among landmarks, defined by Gaussian kernel estimation and optimal assignment-based shape representation (Ibragimov, 2012), are computed. The agreement between the appearance and shape models corresponds to optimal landmark positions in the target image, and is found by applying a game-theoretic optimization framework (Ibragimov, 2014). Additionally, each landmark is repositioned using random forest-based shape models considering the positions of the most reliable or the remaining landmarks in the system.

(2) Lindner and Cootes
Recent work has shown that one of the most effective approaches to detect a set of landmark positions on an object of interest is to train Random Forests (RFs) to vote for the likely position of each landmark, and then to find the shape model parameters which optimize the total votes over all landmark positions. Lindner and Cootes apply Random Forest regression-voting in the Constrained Local Model framework (RFRV-CLM) (Lindner, 2015) as part of a fully automatic landmark detection system (Lindner, 2013) to detect the 19 landmarks on new unseen images. In the RFRV-CLM approach, an RF is trained for each landmark to learn to predict the likely position of that landmark. During detection, a statistical shape model (Cootes, 1995) is matched to the predictions over all landmark positions to ensure consistency across the set. A coarse-to-fine approach is used, and at each stage, the region around the current landmark position is mapped into a reference frame using a similarity transformation. A separate RF is trained for each of the $N$ landmarks, which predicts the position of the landmark relative to an image patch. Each tree in the RF is trained on patches sampled at random displacements from the known position in the training set, and at each node a left/right split decision is made based on Haar-like features (Viola and Jones, 2001) from the patch. On a new image, the RF is scanned over a region around the current landmark position, and each tree in the RF votes for the likely new position. Votes are accumulated in a voting image $V_l(\cdot)$ for landmark $l$ (see Fig. 4). Lindner and Cootes then seek the model shape and pose parameters $\{b, \theta\}$ that maximize the total votes

$$Q(\{b, \theta\}) = \sum_{l=1}^{N} V_l\big(T_\theta(\bar{x}_l + P_l b + r_l)\big),$$

where $\bar{x}_l$ is the mean position of the landmark in a suitable reference frame, $P_l$ is a set of modes of variation, $b$ are the shape model parameters, $r_l$ allows small deviations from the model, and $T_\theta$ applies a global transformation (e.g. similarity) with parameters $\theta$.
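The vote accumulation step described above can be sketched as follows. This is a simplified illustration, not the RFRV-CLM implementation: the trained regression forests are replaced by hard-coded per-tree offsets, the function names are hypothetical, and the final position is taken as the peak of the voting image, whereas the method described in the text fits a statistical shape model to the voting images of all landmarks jointly.

```python
def accumulate_votes(shape, patch_positions, predicted_offsets):
    """Build a voting image for one landmark.

    shape             -- (height, width) of the voting image
    patch_positions   -- (row, col) of each sampled patch
    predicted_offsets -- for each patch, a list of (d_row, d_col) offsets,
                         one per tree (assumed output of a regression forest)
    """
    height, width = shape
    votes = [[0] * width for _ in range(height)]
    for (row, col), offsets in zip(patch_positions, predicted_offsets):
        for d_row, d_col in offsets:
            r, c = row + d_row, col + d_col
            # Votes that fall outside the image are discarded.
            if 0 <= r < height and 0 <= c < width:
                votes[r][c] += 1
    return votes

def peak_position(votes):
    """Return the (row, col) with the most votes (first one found on ties)."""
    cells = [(r, c) for r in range(len(votes)) for c in range(len(votes[0]))]
    return max(cells, key=lambda rc: votes[rc[0]][rc[1]])
```

Because each landmark has its own voting image, adding a landmark only requires one more forest and one more call to `accumulate_votes`, which is the independence property discussed in the evaluation.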

Quantitative evaluation and analysis
For Challenge 1, all proposed methods are evaluated against the ground truth on 250 cephalometric X-ray images, including 150 Test1 images and 100 Test2 images.
[...] SDRs have increased by about 4.4% on average (6.6% for 2 mm, 5.2% for 2.5 mm, 3.9% for 3 mm, and 2% for 4 mm) in 2015. However, the experimental results show that this is still an unsolved problem that needs further investigation, as the highest SDR within the 2 mm precision range is only 74.84%. Furthermore, to analyze the capabilities of the methods in detecting individual landmarks, Fig. 9 compares the MRE values of the five 2014 methods and two 2015 methods on individual landmarks using the same 100 images. It is observed that the method of Lindner and Cootes generally performs best. Compared with the previous methods in 2014 (Wang, 2015), landmarks 1 to 3, 5 to 13 and 15 to 18 are successfully detected with relatively low MREs by Lindner and Cootes' method. Table 9 presents the ranks of each landmark for the seven submitted methods in the 2014 and 2015 landmark detection challenges. Lindner and Cootes' method achieves the best detection result on 17 of the 19 landmarks. Table 10 presents the success classification rates for the five 2014 methods and two 2015 methods. It is observed that some anatomical types are difficult to classify, e.g. ANB and FHI.
The best success classification rates of ANB and FHI are lower than 70%. Some anatomical types are difficult to classify because their measurements rely on landmarks that are themselves difficult to detect; e.g., landmark 5 is used in the ANB classification, and landmark 10 is used in the FHI classification. Overall, the two 2015 methods (Lindner and Cootes and Ibragimov et al.) perform better than the five 2014 approaches. Most methods are based on Random Forest (RF), which is an ensemble learning method that uses a combination of randomized decision trees to calculate a response. During training, the decision trees split the feature space to obtain a better representation of the data. Compared with the methods submitted to the ISBI 2014 challenge, the averaged accuracy and runtime of detecting landmarks are significantly improved by Lindner and Cootes' method in the ISBI 2015 challenge (MRE: 1.656 mm and runtime per image: < 5 s, without the requirement for high-performance hardware). Furthermore, all their detectors are trained independently, which facilitates the inclusion of additional landmarks. On the other hand, this also means that their RF-voting for the best landmark position does not take inter-landmark relationships into account. However, their method utilizes statistical shape models (Cootes, 1995) to regularize the output of the individual predictions for each landmark. This, combined with using Random Forests for regression rather than classification, leads to significantly improved results. It is worth pointing out that, even though their system achieves high performance in the given challenges, the accuracy of this system relies on the shape and appearance of the object of interest exhibited in the training data. Hence, when training a landmark detection system based on their proposed RF-based approach, the training data needs to be representative of the unseen data to which the system is going to be applied. Furthermore, all presented landmark detection methods are supervised learning methods and hence require a sufficient amount of manually annotated training data. Future developments to further improve the performance of automatic cephalometric landmark detection may include algorithms that are less reliant on the shape and appearance exhibited in the training data, and that require significantly less (or no) annotated training data.

Table (method summary, fragment). A row ending: all method parameters are tuned via 10-fold cross-validation. Ibragimov et al. (2015): random forest; pairwise spatial relationships among landmarks through the optimal assignment-based shape representation; multi-landmark spatial relationships through the random forest-based representation; 2 (1.851).

Table 9
The ranking of each landmark for the seven accepted methods in the 2014 and 2015 automated landmark detection challenges.

Methods
(1) Ronneberger et al.
Ronneberger et al. present a pure machine learning approach using a u-shaped deep convolutional neural network ("u-net") for the fully automated segmentation of dental X-ray images. The architecture of the u-net consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. Such a network can be trained end-to-end from very few images. The network learns the desired robustness to deformations by augmenting the training data with randomly deformed images. One important modification in Ronneberger et al.'s architecture is that the upsampling part also has a large number of feature maps, which allows the network to propagate context information to higher resolution layers. As a consequence, the expansive path is more or less symmetric to the contracting path, and yields a u-shaped architecture.
The network does not have any fully connected layers and only uses the valid part of each convolution, i.e., the segmentation map only contains the pixels for which the full context is available in the input image. This allows the seamless segmentation of arbitrarily large images by an overlap-tile strategy (see Fig. 10). To predict the pixels in the border region of the image, the missing context is extrapolated by zero padding. This tiling strategy is important for applying the network to large images, since otherwise the resolution would be limited by the GPU memory. The network architecture is illustrated in Fig. 11. It consists of a contracting path (left side) and an expansive path (right side). The contracting path follows the typical architecture of a convolutional network. It consists of the repeated application of two 3 × 3 convolutions (only using the valid part), each followed by a rectified linear unit (ReLU), and a 2 × 2 max pooling operation with stride 2 for downsampling. At each downsampling step the number of feature maps is doubled. Every step in the expansive path consists of a spatial upsampling of the feature maps by a factor of 2 followed by a 4 × 4 convolution that halves the number of feature maps, a concatenation with the correspondingly cropped feature maps from the contracting path, and one or two 3 × 3 convolutions, each followed by a ReLU. The cropping is necessary due to the loss of border pixels in every convolution. At the final layer a 1 × 1 convolution is used to map each 64-dimensional feature vector to the desired number of classes (here 7). In total the network has 23 convolutional layers. Further details are available in Ronneberger (2015), and Fig. 12 presents the results of Ronneberger et al.'s method.
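Because only the valid part of each convolution is used, every 3 × 3 convolution trims one pixel from each border, which is why the feature maps shrink along the contracting path and why skip connections need cropping. The sketch below traces the spatial size through the contracting path only. The function name is hypothetical, and the input size of 572 and the depth of four pooling steps used in the usage note are assumptions taken for illustration; the text itself only states that the lowest resolution in Fig. 11 is 32 × 32.

```python
def contracting_path_sizes(input_size, depth):
    """Trace the spatial size through `depth` contracting blocks.

    Each block applies two valid 3x3 convolutions (each removes 2 pixels
    from the side length) followed by a 2x2 max pooling with stride 2.
    Returns the size after the convolutions of each block, and the size
    that enters the lowest-resolution level.
    """
    after_convs = []
    size = input_size
    for _ in range(depth):
        size -= 4  # two valid 3x3 convolutions: 2 pixels lost per convolution
        if size <= 0 or size % 2 != 0:
            raise ValueError("input size is not compatible with this depth")
        after_convs.append(size)
        size //= 2  # 2x2 max pooling with stride 2
    return after_convs, size
```

Under these assumptions, `contracting_path_sizes(572, 4)` yields feature maps of side 568, 280, 136 and 64 before each pooling step and a 32 × 32 map at the lowest resolution. The `ValueError` makes explicit that not every tile size is admissible, which matters when choosing tile sizes for the overlap-tile strategy.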
(2) Lee et al.
Lee et al. built a random forest-based dental segmentation system, which consists of a random forest machine learning system and a post-processing model that refines the prediction output PD based on the probability maps $PB_i$ generated by the machine learning system. For training, 275 image features categorized into 24 types are extracted. A random forest (Breiman, 2001) with 50 trees is trained on these features. The prediction output PD is generated by the following equation,

[...]

where $i$ = 1 to 7, $x$ is the X-coordinate and $y$ is the Y-coordinate. The second part is a post-processing model. In order to refine the prediction outputs, two filters and morphological operations are applied to the combined probability map. First, two filters are applied: a 3 × 3 filter for removing isolated single-position classes and a 5 × 5 filter for removing isolated 3 × 3 blocks. In the 3 × 3 four-neighbor rule, a position has only four neighboring positions that share a side with it. If the four neighbors all have the same class and the class of the current position differs from it, the class of the current position is changed to that of the four neighbors. In the 5 × 5 neighbor rule, if an isolated 3 × 3 block is surrounded by neighbors of the same class, all positions of the block are changed to the class of the neighbors. Fig. 13 shows the effect of the two filters on the probability map: Fig. 13 (a) is the original combined probability map and Fig. 13 (b) is the combined probability map after applying the 3 × 3 and 5 × 5 filters. Second, the combined PB can be separated into seven binary prediction maps PDs by the following equations.

[...]

Fig. 11. U-net architecture (example for 32 × 32 pixels in the lowest resolution). Each blue box corresponds to a stack of feature maps. The number of features is denoted on top of the box. The x-y-size is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations.
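The 3 × 3 four-neighbor rule described above can be written down in a few lines. This is a sketch under assumptions, not Lee et al.'s implementation: the function name is hypothetical, the label map is a list of lists of class indices, border positions (which lack four neighbors) are left unchanged, and all decisions are made on the input map so that the result does not depend on scan order.

```python
def four_neighbor_filter(labels):
    """Relabel isolated single positions in a 2-D class-label map.

    A position is relabeled when its four side-sharing neighbors all have
    the same class and that class differs from its own.
    """
    height, width = len(labels), len(labels[0])
    out = [row[:] for row in labels]  # write to a copy, read from the input
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            neighbors = {
                labels[r - 1][c],
                labels[r + 1][c],
                labels[r][c - 1],
                labels[r][c + 1],
            }
            if len(neighbors) == 1:
                (cls,) = neighbors
                if cls != labels[r][c]:
                    out[r][c] = cls
    return out
```

The 5 × 5 rule for isolated 3 × 3 blocks follows the same pattern with a larger neighborhood; it is omitted here because the text does not specify which surrounding positions count as neighbors of the block.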

Quantitative evaluation and analysis
For Challenge 2, all proposed methods are evaluated against the ground truth on 80 bitewing X-ray images, including 40 Test1 images and 40 Test2 images.
(1) Quantitative evaluation
Segmentation of dental structures in bitewing radiographs is difficult as the data variation is high and teeth are sometimes labeled as background (see Fig. 14), which makes the learning task difficult. Nine teams registered for this challenge, but only two teams successfully submitted test results. The averaged F-scores of the two teams (Ronneberger et al. and Lee et al.) are 0.560 and 0.268, respectively. The u-shaped deep convolutional network by Ronneberger et al. performs significantly better and achieves F-scores greater than 0.7 for the three fundamental dental structures (enamel, dentin and pulp). The main advantage of the u-net architecture for this task is its ability to automatically learn the hierarchical structure within the images. During segmentation it uses the extracted context at all detail levels for the decision at each pixel. A critical part of Ronneberger et al.'s approach is data augmentation. As there is limited data available, Ronneberger et al. apply elastic deformations to produce a large database of 20,000 training image tiles, which is essential for machine learning methods to learn invariance and produce robust models. The value of data augmentation for learning invariance has also been shown by Dosovitskiy et al. (Dosovitskiy, 2014) in the scope of unsupervised feature learning. In the experiments, it is observed that the data augmentation technique helps to create reasonable additional training instances for enamel, dentin and pulp, but the other classes (caries, crown, restoration and root canal treatment) appear quite different depending on their relative location, so the augmentation is less successful for them.

Conclusion
Computerized automatic dental radiography analysis systems for clinical use save time and manual costs and avoid problems caused by intra- and inter-observer variations, e.g. due to fatigue, stress or different levels of experience. In this article, we have presented benchmarks for a number of challenging tasks in dental X-ray image analysis, including algorithms for (i) anatomical landmark detection on lateral cephalometric radiographs, (ii) anatomical abnormality classification on lateral cephalometric radiographs, and (iii) dental structure segmentation on bitewing radiographs. The presented results will allow the objective comparison of existing and new developments in the field. All methods were evaluated using a common lateral cephalometric radiography dataset repository, a common bitewing radiography dataset repository, ground truth data, and unified measurements for assessment of the detection, classification and segmentation accuracy. Based on the presented results, we can conclude that recent methods achieved significantly improved performance on these challenging tasks. However, the presented results also demonstrate that accurately analyzing dental radiographs remains a challenging problem which is still far from being solved. It is expected that this benchmark will help algorithmic developments, and that more advanced approaches will be built and tested using the provided data repositories and benchmarks.