Interpretation of Magnetic Resonance Images of Temporomandibular Joint Disorders by Using Deep Learning

In recent years, Machine Learning (ML), and especially Deep Learning (DL), has attracted great attention in the medical field. In this study, we propose a deep learning-based approach to automatically diagnose Temporomandibular Disorder (TMD) on Magnetic Resonance (MR) images. A total of 2576 MR images of 200 patients diagnosed with and without TMD were collected and classified into 8 groups. First, a basic Convolutional Neural Network (CNN) was applied to the problem. Then, 6 different fine-tuned pre-trained convolutional neural network models (Xception, ResNet-101, MobileNetV2, InceptionV3, DenseNet-121 and ConvNeXt) were applied to the data set. Finally, the performance of the Vision Transformer (ViT) on the task was also examined. The approaches were evaluated with accuracy rate, precision, sensitivity, F1-score, Negative Predictive Value (NPV), specificity, Area Under Curve (AUC) and kappa coefficient. Grad-CAM results of the best architectures were obtained for diagnostic examination. The Intraclass Correlation Coefficient (ICC) was computed to assess the correlation between the models. According to the test results, the deep learning-based architectures assessed were found to be successful in the diagnosis of TMD.


I. INTRODUCTION
The Temporomandibular Joint (TMJ), located between the mandibular condyle and the temporal bone, is one of the most complex joints of the human body. It consists of the condyle, articular tubercle, articular disc, glenoid fossa, joint capsule, retrodiscal tissue and synovial membrane. Clinical problems of the temporomandibular joint and joint structures are called temporomandibular joint disorders [1].
The associate editor coordinating the review of this manuscript and approving it for publication was Marco Giannelli.
Temporomandibular joint disorders affect not only the TMJ but also the masticatory muscles and other components of the stomatognathic system [1]. Pain symptoms frequently arise in patients with TMJ Internal Derangement (ID). The articular disc of the TMJ is a biconcave fibrocartilaginous structure. Internal derangement, manifested as disc displacement, is a common form of TMD. TMJ ID is defined as an abnormal positional relationship of the articular disc relative to the mandibular condyle and the articular eminence. Anterior disc displacement is found more often than medial, posterior and lateral displacements [2]. TMJ effusion, an inflammatory response, is the accumulation of excess fluid in and around the temporomandibular joint. This reaction arises as a consequence of internal derangement, arthritis, trauma, and inflammatory changes associated with rheumatoid diseases. Degenerative bony changes are more frequent in the mandibular condyle than in the mandibular fossa or the articular eminence and are characterized by the development of pathological bony changes (erosion, osteophytes and deformity) and adaptive bony changes (marginal proliferation, flattening, concavity, sclerosis and subchondral cysts) [3], [4]. These abnormalities are considered to be radiological signs of Osteoarthritis (OA) and are frequently observed in joints with long-standing anterior disc displacement without reduction [5]. TMD, which involves clinical symptoms such as TMJ pain, joint sounds and limited jaw function, affects 28% of the world's population [6]. These disorders include those associated with articular disc structure and position, as well as alterations in synovial fluid and soft tissue [7].
Diagnosis of TMJ disorders is made by clinical examination of the patient, along with patient history and, when necessary, assessment of diagnostic images including Magnetic Resonance Imaging (MRI). MRI allows excellent depiction of the TMJ anatomy and abnormalities because of its inherent tissue contrast and high resolution. It is considered the reference standard for diagnostic imaging of some TMDs, including those associated with articular disc structure and position, as well as alterations in synovial fluid and soft tissue [8]. However, this procedure is considered costly and time consuming. Therefore, it is worth introducing an automatic inference system to help physicians diagnose the disease. Artificial Intelligence (AI) and machine learning, especially deep learning-based models, are highly effective approaches for such tasks. The idea of artificial intelligence was first raised at a computer science conference held at Dartmouth College, Hanover, New Hampshire in 1956 [9], [10]. Artificial intelligence is described as ''A system's ability to interpret external data correctly, to learn from such data, and to use that information to achieve specific goals and tasks through flexible adaptation.'' [11]. Artificial intelligence, which is essentially modeled on human thinking, has several subfields, and machine learning is one of them. In machine learning, a mathematical model is built from training data in order to make predictions or decisions [12]. Machine learning algorithms have received much attention in recent years and have been used in many applications. However, these approaches have not always produced the desired results, especially on image and sound data. Krizhevsky et al. achieved 86.3% success with deep learning in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) in 2012 [13], and studies based on deep learning were carried out by other researchers in the following years. In recent years, deep learning-based models have performed better than conventional machine learning-based methods in most computer vision and medical image processing problems. Accordingly, they have been utilized in several diverse fields, such as classification, disease diagnosis, face recognition, and image enhancement [14], [15], [16], [17], [18].
In this study, our aim was to interpret TMJ disorders displayed on MRI by using deep learning approaches and to assess their effectiveness. A basic CNN architecture and 6 different fine-tuned pre-trained models, namely Xception [29], Residual Neural Network (ResNet)-101 [30], MobileNetV2 [31], InceptionV3 [32], Dense Convolutional Network (DenseNet)-121 [33] and ConvNeXt [34], were applied to magnetic resonance images of temporomandibular joint patients to predict whether or not they have disc displacement, effusion or condylar bone degeneration. In addition, the performance of ViT, which has recently become very popular, was examined.

II. MATERIALS AND METHODS
Sample images were taken from Ankara University, Faculty of Dentistry and a data set was created. Ethical approval for the data set was also obtained from Ankara University, Faculty of Dentistry, Ethics Committee (36290600/34/2021). Informed consent was obtained from the patients.

A. MR IMAGES DATA SET
The present study included MR images of 200 patients who were referred to Ankara University Faculty of Dentistry, Dentomaxillofacial Radiology and Oral Surgery Clinics for TMJ disorders between 2015 and 2021. Patients who had undergone surgery involving the chin and face and patients with facial syndrome were excluded. MR images were obtained with a 1.5 Tesla machine in both closed and open mouth positions. T1-weighted (T1 W), T2-weighted (T2 W), Multiple Echo Recombined Gradient Echo (MERGE) and Proton Density (PD) series images of bilateral TMJs (3 mm thick) were obtained with sagittal reconstructions. In order to increase the number of MR images, T1 and PD images were used for articular disc position, T2 and MERGE for effusion, and T1 and PD series for condylar degeneration assessment. Screenshots were taken as shown in Figure 1. The MR images used in the present study were collected by a dentomaxillofacial radiology specialist experienced in MRI assessment. In cases where the specialist could not reach a decision, a consensus was achieved by consulting a more experienced dentomaxillofacial specialist (15 years of experience in reading MR images). 2576 images were obtained from diagnostic MR images of 200 patients. At least 3 images of the right and left regions
VOLUME 11, 2023
7) There is no mandibular condyle degeneration
8) There is mandibular condyle degeneration
Table 1 shows the distribution of the number and percentage of images for all groups from a total of 2576 images.
In Table 1, for the T1 and Proton Density sequences, ''Positive'' corresponds to ''disc in anterior or disc anteriorly displaced'' for the ''closed mouth disc position'' and ''open mouth disc position'' groups. For the ''joint cavity effusion'' and ''mandibular condyle degeneration'' groups, Classes 5, 6, 7 and 8 were determined according to the following criteria:
• For 5: No increase in intensity is observed in the TMJ space on the T2 and MERGE sequences,
• For 6: There is hyperintensity in the TMJ space on the T2 and MERGE sequences,
• For 7: A smooth and rounded line appears on the mandibular condyle surface on the T1 and Proton Density sequences,
• For 8: Irregularity, flattening and osteophytes (bird beak appearance) appear on the mandibular condyle surface on the T1 and Proton Density sequences.

B. TRANSFER LEARNING
In broad terms, Transfer Learning (TL) can be defined as applying the knowledge obtained while solving one problem to a different but related task. It is a powerful approach for problems in which there are not enough training examples to train a model from scratch. Instead of starting the learning from scratch, transfer learning makes it possible to build accurate models in a time-saving way, starting from patterns learned while solving another task [35], [36]. In this study, transfer learning was employed because we did not have enough sample images to train an architecture from scratch. When a pre-trained model is fine-tuned for a new task, the entire model can be trained, some layers can be trained while others are left frozen, or the convolutional structure can be left in its original state.
In deep learning, a CNN is a type of artificial neural network built around the convolution operation, and it is widely used on image data. The primary layers of a CNN architecture are shown in Figure 2 and can be described as follows. The convolution layer is the first layer of a CNN; in this layer, features are extracted from the image using filters. The pooling layer, which can use different pooling functions, is added between convolutional layers in order to reduce the number of parameters and calculations in the network. The Fully Connected (FC) layer is one of the last layers of a convolutional neural network; the neurons in this layer are connected to all neurons before and after them. Finally, the output layer is the layer where the result is produced. In the present study, the last layer of each pre-trained model, namely Xception [29], ResNet-101 [30], MobileNetV2 [31], InceptionV3 [32], DenseNet-121 [33] and ConvNeXt [34], was fine-tuned. The details of each DL-based architecture are as follows:
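The convolution and pooling steps described above can be illustrated with a minimal pure-Python sketch. It is not the network used in this study; the toy image, the filter values and the function names `conv2d_valid` and `max_pool2d` are assumptions chosen only for illustration.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map.

    As in most CNN libraries, the kernel is not flipped (cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j] for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy 5x5 "image" with a vertical edge between columns 1 and 2 (hypothetical data).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical 2x2 filter that responds to left-to-right intensity increases.
kernel = [[-1, 1], [-1, 1]]

features = conv2d_valid(image, kernel)  # 4x4 feature map, strongest at the edge
pooled = max_pool2d(features, size=2)   # 2x2 map after pooling
```

The pooled map keeps the strong edge response while holding a quarter of the values, which is the reduction in computation that the pooling layer is meant to provide.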

1) XCEPTION
Xception is a popular CNN architecture proposed by Francois Chollet [29]. It is based on depthwise separable convolution layers with residual connections. The author proposes a hypothesis that is a stronger version of the hypothesis underlying the Inception model: the mapping of cross-channel correlations and spatial correlations in the feature maps of CNNs can be entirely decoupled. The Xception model has 36 convolution layers structured into 14 modules. Except for the first and last modules, all of these modules have linear residual connections around them. The data passes through the entry flow, middle flow, and exit flow, in that order [29].
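To make the benefit of decoupling spatial and cross-channel mappings concrete, the following sketch compares the weight counts of a standard convolution and a depthwise separable convolution. The layer sizes (3 × 3 kernel, 128 input channels, 256 output channels) are assumed for illustration and are not taken from the Xception configuration; bias terms are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights of a depthwise k x k convolution followed by a 1x1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k spatial filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 128 input channels, 256 output channels.
standard = standard_conv_params(3, 128, 256)
separable = depthwise_separable_params(3, 128, 256)
reduction = standard / separable  # roughly 8.7 times fewer weights
```

For this assumed layer the separable form needs 33,920 weights instead of 294,912.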

2) ResNet-101
The ResNet architecture is one of the most popular Deep Neural Networks (DNN) and is available in many varieties with different numbers of layers. ResNet was introduced by He et al. [30], who presented a residual learning framework to ease the training of deep networks and reformulated the layers as learning residual functions. Deep networks are difficult to train because of the vanishing gradient problem, so increasing the depth of a network by simply stacking more layers may not be effective. As the depth of the network increases, accuracy saturates and then degrades rapidly [30]. To address this degradation problem, He et al. introduced ''identity shortcut connections'' that skip one or more layers with identity mapping [30].
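The identity shortcut can be sketched in a few lines. The example below is a toy on plain Python lists, not the ResNet-101 implementation; `residual_block`, `zero_residual` and `half_residual` are hypothetical names, and the residual function stands in for the stacked layers that a real block would learn.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x, where `residual_fn` plays the role of the stacked layers F."""
    fx = residual_fn(x)
    return [a + b for a, b in zip(fx, x)]


def zero_residual(x):
    """A residual branch that has learned nothing: F(x) = 0."""
    return [0.0 for _ in x]


def half_residual(x):
    """A hypothetical learned residual: F(x) = 0.5 * x."""
    return [0.5 * v for v in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_residual)  # the block reduces to the identity
adjusted_out = residual_block(x, half_residual)  # the input plus a learned correction
```

When the residual branch outputs zero, the block passes its input through unchanged, which illustrates why adding such blocks should not make a deeper network harder to fit than a shallower one.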

3) MobileNetV2
MobileNetV2 [31] is a convolutional neural network architecture specifically designed for mobile devices and environments with constrained resources. It builds on ideas from MobileNetV1 [37] but is a significant improvement over it. Depthwise separable convolutions are used as efficient building blocks in MobileNetV2, which also adds two new features: linear bottlenecks between layers and shortcut connections [38]. The MobileNetV2 model starts with a full convolution layer with 32 filters, followed by 19 bottleneck layers. ReLU is used as the non-linearity, the kernel size is 3 × 3, and dropout and batch normalization are used during training. A constant expansion rate is used throughout the network, except for the first layer [31].

4) InceptionV3
InceptionV3 [32], which originated from GoogLeNet, is a CNN for image analysis and object detection [39]. It is the third version of the Inception family and was first introduced during the ImageNet Recognition Challenge. InceptionV3 includes convolutions, average pooling, max pooling, concatenations, dropouts, and fully connected layers [32]. The model aims to keep the number of parameters from growing too much as the network gets deeper, which lowers its computational cost.

5) DenseNet-121
DenseNet connects each layer to every other layer in a feedforward fashion. Each layer takes the feature maps of all preceding layers as input and passes its own feature maps to all subsequent layers. DenseNets offer significant advantages: they strengthen feature propagation, encourage feature reuse, alleviate the vanishing-gradient problem, and substantially reduce the number of parameters [33]. The model starts with a convolution and pooling block, continues with dense blocks and transition layers, and ends with global average pooling and a fully connected block.
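Because each layer receives the concatenated features of all previous layers, the channel count grows linearly inside a dense block. The sketch below traces that growth. The configuration used (64 initial channels, growth rate 32, blocks of 6, 12, 24 and 16 layers, transitions that halve the channels) is the commonly cited DenseNet-121 setup and is an assumption here, since the text above does not state it.

```python
def dense_block_channels(c_in, num_layers, growth_rate):
    """Each layer concatenates `growth_rate` new feature maps onto its input."""
    return c_in + num_layers * growth_rate


def densenet_channel_trace(c0, block_layers, growth_rate, compression=0.5):
    """Return the channel count at the output of each dense block.

    A transition layer after every block except the last scales the channels
    by `compression`.
    """
    trace = []
    channels = c0
    for index, num_layers in enumerate(block_layers):
        channels = dense_block_channels(channels, num_layers, growth_rate)
        trace.append(channels)
        if index < len(block_layers) - 1:
            channels = int(channels * compression)
    return trace


# Assumed DenseNet-121-style configuration (not stated in the source text).
trace = densenet_channel_trace(c0=64, block_layers=[6, 12, 24, 16], growth_rate=32)
```

Under these assumptions the four dense blocks end with 256, 512, 1024 and 1024 channels, respectively.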

6) ConvNeXt
ConvNeXt [34], inspired by vision transformers, is a convolutional model proposed by Liu et al. They gradually ''modernized'' a standard ResNet towards the design of a vision transformer and identified several key components that contribute to the performance difference. Depthwise convolution, a special case of grouped convolution in which the number of groups equals the number of channels, is used in ConvNeXt. Depthwise convolution performs an operation similar to the weighted sum in self-attention: it works on a per-channel basis, in other words it only mixes spatial information.

7) ViT
ViT is an architecture built on the transformer. The transformer was first used in Natural Language Processing (NLP) tasks [40]. At its core, a transformer contains a self-attention mechanism, which weights the importance of each part of the input data so that the model can focus on the significant parts of the data. This approach attracted great attention due to its success in NLP applications and subsequently started to find a place in image applications [41].
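The weighting performed by self-attention can be sketched with scaled dot-product attention on two toy tokens. This is a simplified illustration, not the ViT used in the experiments: the learned query, key and value projections and the multi-head structure of a real transformer are omitted, and the token values are assumed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention: each output is a weighted sum of `values`."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)  # importance of every token for this query
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


# Two hypothetical 2-dimensional tokens; in ViT these would be image-patch embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0]]
attended = self_attention(tokens, tokens, tokens)
```

Each output row is a mixture of all tokens, with more weight on the tokens most similar to the query.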
In Equations 1, 2, 3, 4, 5 and 6, TP corresponds to true positives, FP to false positives, TN to true negatives and FN to false negatives. TP means that the test sample has the disease and is predicted to have the disease. FP is a sample that is healthy but classified as diseased. TN is a sample that is healthy and classified as healthy. FN is a sample that has the disease but is classified as healthy.
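As a hedged illustration of how these four counts are combined, the sketch below computes accuracy, precision, sensitivity, F1-score, NPV and specificity using their standard definitions, which are assumed to match the equations referred to above. The function name and the confusion-matrix counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    No guard against zero denominators is included in this sketch.
    """
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "npv": tn / (tn + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts for illustration only.
example = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

With these assumed counts, accuracy is 0.85, precision 0.80 and NPV 0.90.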
The kappa score is a measure of the reliability of a classification model; that is, it expresses the agreement between the model's predictions and the real classes. It is interpreted as follows: kappa values from 0.0 to 0.20 indicate slight agreement, 0.21 to 0.40 fair agreement, 0.41 to 0.60 moderate agreement, 0.61 to 0.80 substantial agreement, and 0.81 to 1.0 almost perfect or perfect agreement.
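A minimal sketch of the kappa computation and the interpretation bands listed above is given below. It assumes Cohen's kappa for the binary case, since the exact variant is not stated here; the function names and counts are hypothetical.

```python
def cohen_kappa(tp, fp, tn, fn):
    """Cohen's kappa for a binary confusion matrix."""
    n = tp + fp + tn + fn
    observed = (tp + tn) / n
    # Agreement expected by chance from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (observed - expected) / (1 - expected)


def interpret_kappa(kappa):
    """Map a kappa value to the agreement bands given in the text."""
    if kappa < 0.0:
        return "below listed range"  # negative values are not covered by the bands
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical counts for illustration only.
kappa = cohen_kappa(tp=40, fp=10, tn=45, fn=5)
label = interpret_kappa(kappa)
```

For these assumed counts the observed agreement is 0.85 and the chance agreement is 0.50, which gives a kappa of about 0.70 (substantial agreement).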
In addition, Gradient-weighted Class Activation Mapping (Grad-CAM) [42] images of the best architectures were examined. Grad-CAM produces a heat map over the tested image that shows which regions the architecture based its inference on. Red areas in the heat map are those that contributed most strongly to the model's prediction. In this way, we observed whether the models had learned from the parts of the image related to the disease.
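The core Grad-CAM computation can be sketched without a deep learning framework: each feature map of the last convolutional layer is weighted by the average gradient of the class score with respect to that map, the weighted maps are summed, and negative values are clipped. The activations and gradients below are toy values assumed for illustration; in practice they come from a trained network, and the final scaling to [0, 1] is a common visualization step rather than part of the definition.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heat map from K feature maps and their gradients.

    `activations` and `gradients` are lists of K maps, each an H x W list of lists.
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight: global average of the class-score gradient over the map.
    weights = [sum(sum(row) for row in grad) / (height * width) for grad in gradients]
    heatmap = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(weights, activations):
        for r in range(height):
            for c in range(width):
                heatmap[r][c] += weight * fmap[r][c]
    # ReLU keeps only regions that support the predicted class.
    heatmap = [[max(0.0, value) for value in row] for row in heatmap]
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[value / peak for value in row] for row in heatmap]
    return heatmap


# Toy example: channel 0 supports the class, channel 1 opposes it (assumed values).
activations = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(activations, gradients)
```

Only the top-left location remains highlighted, because it is the one activated by the channel whose gradient is positive for the class.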
The correlation between architectures was also examined statistically by using the ICC, computed on the metric results of the models. The relation was measured separately for each success metric. Intraclass correlation coefficients were computed for each class and interpreted according to the following criteria: < 0.40 = poor agreement; 0.40-0.59 = fair agreement; 0.60-0.74 = good agreement; 0.75-1.0 = excellent agreement. In this way, the agreement of the architectures with each other in correct diagnosis was assessed.

III. RESULTS
A series of experiments was carried out to evaluate the performance of a CNN structure, different TL-based models (Xception, ResNet-101, MobileNetV2, InceptionV3, DenseNet-121 and ConvNeXt) and ViT in inferring TMJ disorders automatically. All experiments were run in Google Colaboratory (Colab) [43]. Colab is a product developed by Google Research in which Python code can be written and executed online. It provides access to GPUs and makes many libraries available without manual installation, so it is very practical for machine learning studies.
The images were divided into 4 groups and 8 different classes, and the training and testing processes were carried out separately for each group. This allowed the performance of the deep learning approaches to be observed and evaluated more efficiently. Considering that the jaw structure can differ in shape and size depending on gender and age, the data set comprised images of patients from different gender and age groups. First, the data set was divided into training, validation, and test sets to support a sound validation process. For each category, 80% of the data was used for training and 20% for testing; in addition, 20% of the training data was set aside for validation. Moreover, data augmentation techniques were applied to increase the amount of data. Data augmentation is the artificial expansion of the available data by making random (but realistic) changes to the existing images. These changes are minor ones such as rotation, brightness increase, zooming, etc. Each operation produces as many new images as it is applied to. For example, if three different augmentations are applied, the number of data will triple. In this experimental study, we used 3 different augmentation methods: contrast, flip and rotation. In order to obtain more reliable results, the disc region was cropped manually from the MR images. The images were cropped because a huge amount of data would be needed for the architectures to learn from the full MR images (as in Figure 1), which include parts unrelated to the TMJ, such as the brain, eye, mouth, etc. These parts caused the networks to make inferences from the wrong regions. Figure 3 shows examples of the cropped images.
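The 80/20 split with a further 20% of the training data held out for validation can be sketched as follows. The function name, the fixed seed and the group size of 100 images are assumptions for illustration; the actual per-group counts are those reported in Table 1.

```python
import random


def split_dataset(items, test_fraction=0.2, val_fraction=0.2, seed=0):
    """Shuffle `items`, hold out a test set, then hold out a validation set
    from the remaining training data."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_test = round(len(items) * test_fraction)
    test_set, train_val = items[:n_test], items[n_test:]
    n_val = round(len(train_val) * val_fraction)
    val_set, train_set = train_val[:n_val], train_val[n_val:]
    return train_set, val_set, test_set


# Hypothetical group of 100 image indices.
train_set, val_set, test_set = split_dataset(range(100))
```

For 100 images this yields 64 training, 16 validation and 20 test images, with no image shared between the three sets.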
First, classification was performed with a basic CNN consisting of convolution, pooling and dense layers. After that, the pre-trained models were applied separately for each class on the data set during the training process. Finally, experiments with ViT were carried out.
In order to obtain a fair comparison between the DL architectures, the parameter values were set the same for all models: the number of epochs was 100, the loss function was binary cross-entropy, the optimizer was ADAM [44] and the learning rate was 1 × 10⁻⁶. However, in some training runs, the loss value increased after the 50th epoch, so the training of these models was stopped at epoch 50. After training was completed, the models were evaluated on the test data. Accuracy rate, precision, sensitivity, F1-score, NPV, specificity, AUC and kappa score were calculated. Based on these results, the architectures for which Grad-CAM images would be produced were selected. Finally, the correlation between the metric results of the architectures was examined by using the ICC. Figure 4 illustrates the flow chart of the steps followed to solve the problem. Tables 2, 3, 4 and 5 present the results for ''closed mouth disc position'', ''open mouth disc position'', ''joint cavity effusion'' and ''mandibular condyle degeneration'', respectively. According to our findings, MobileNetV2 was the best architecture for the ''closed mouth disc position'' group and Xception for the ''open mouth disc position'' group, whereas ResNet-101 was the most effective model for the ''joint cavity effusion'' group and MobileNetV2 for the ''mandibular condyle degeneration'' group. The best architectures were determined by considering all metrics. Training and validation accuracy curves of the successful architectures are presented in Figure 5.
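The binary cross-entropy loss named in the training settings above can be written out in a few lines. This is a minimal sketch of the standard definition, not the framework implementation used in the experiments; the clipping constant `eps` and the example predictions are assumptions.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between labels (0 or 1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical predictions for one diseased (1) and one healthy (0) sample.
confident_correct = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
```

Confident correct predictions give a small loss, while confident wrong predictions are penalized heavily, which is why a rising loss during training is a signal to stop.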
As mentioned, experiments were also carried out with ViT. However, the results were 50% and below, which is much lower than expected and not acceptable. Therefore, ViT was not included in the tables and was not compared with the other models.
Grad-CAM results of the best architectures were also produced and are presented in Figure 6. In Figure 6, the first row shows the diseased areas marked by the dentomaxillofacial radiologist. The MR images used in this study were images of patients diagnosed by consensus of dentomaxillofacial specialists, which allows an accurate assessment of the success of the architectures. As mentioned earlier, the red parts are the areas on which the architectures focused most heavily. A precise boundary was not expected from these images; the main purpose was to observe whether inferences were made from the right regions. As can be seen from the Grad-CAM images, the models mostly highlighted the correct regions. This means that the networks were quite successful in classification. Finally, the ICC values for the metrics are presented in Table 6.
The ICC values of models were 0.58 for accuracy, 0.57 for precision, 0.63 for sensitivity, 0.64 for F1-score, 0.28 for NPV, 0.79 for specificity, 0.74 for AUC and 0.67 for kappa score. While the correlation between models for the NPV value was low, the results for other metrics were found to be acceptable.
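The reported ICC values can be mapped onto the agreement bands defined in the Materials and Methods section. The sketch below does only that mapping; it does not recompute the ICC, and the function name is hypothetical.

```python
def interpret_icc(icc):
    """Map an ICC value to the agreement bands used in this paper."""
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"


# ICC values as reported in the text for each metric.
reported_icc = {
    "accuracy": 0.58,
    "precision": 0.57,
    "sensitivity": 0.63,
    "f1": 0.64,
    "npv": 0.28,
    "specificity": 0.79,
    "auc": 0.74,
    "kappa": 0.67,
}
bands = {metric: interpret_icc(value) for metric, value in reported_icc.items()}
```

Under these bands, specificity falls in the excellent range, sensitivity, F1-score, AUC and kappa in the good range, accuracy and precision in the fair range, and NPV in the poor range.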

IV. DISCUSSION
The TMJ is one of the most complex joints of the human body. The temporomandibular joint and its related structures play an important role in distributing the stresses produced by frequently performed tasks like speaking, chewing and swallowing, as well as in directing jaw movement [45]. Temporomandibular joint disorders correspond to the clinical problems of the temporomandibular joint and joint structures as well as the masticatory muscles [1]. TMD can cause psychological conditions such as depression and inferiority complex or, conversely, anxiety and depression may cause TMJ complaints [46]. According to epidemiological studies, the rate of adults showing at least one TMD symptom during their examination may rise to 75% [47]. The MR images of the TMJ are interpreted and reported by dentomaxillofacial radiologists. Therefore, observer performance and experience are important factors in the diagnosis of TMJ disease.
Translational research, which has developed in recent years, aims to integrate new methods into medical applications and accelerate the transition to clinical practice. Various studies have been carried out in this context [48], [49]. The aim of this study was to report TMJ MR images automatically in a short time and to prevent differences in interpretation that depend on observer performance and experience. In addition, this method can be used for evaluation and follow-up after treatment. To our knowledge, our study is one of the leading works to diagnose TMD by using different architectures and to interpret the success of the models against images diagnosed by dentomaxillofacial radiologists. In our opinion, use of this methodology will enable higher accuracy in diagnosis while consuming less time, and we therefore believe that it will make a significant contribution to the current TMJ-related literature.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
The work most comparable to our study is that of Kao et al. [28]. They applied DL-based models (InceptionResNetV2, InceptionV3, DenseNet169, and VGG16) to a total of 300 images of 32 healthy and 52 TMD patients. As a first step, they detected the articular space between the temporal bone and the mandibular condyle using the U-Net architecture from 100 sagittal MRI images of the TMJ. Then, they performed classification using the specified architectures. Recall, precision, accuracy, and F1 score values were 1.0, 0.81, 0.85 and 0.9, respectively, for InceptionV3, and 0.92, 0.86, 0.85 and 0.89 for DenseNet169.
Previous studies other than [28] generally focused on different diseases and inferences. For instance, Kim et al. [25] applied random forest and MultiLayer Perceptron (MLP) methods to MR images for the detection of TMJ disc perforation. They studied 299 joints belonging to 289 patients and divided these joints into two groups, perforated and non-perforated, according to the presence of disc perforation detected during surgery. They compared the performance of the models by using AUC. MLP performed best with an AUC of 0.940, followed by random forest with an AUC of 0.918, while disc shape alone resulted in an AUC of 0.791. There are also some published studies on TMJ segmentation in the literature. The authors of a previous study [26] proposed a fully automatic articular disc detection and segmentation system based on DL-based semantic segmentation, with the aim of supporting the diagnosis of TMD in MRI. Two hundred and seventeen (217) MR images of patients with displaced or normal articular discs were used. Three DL-based semantic segmentation approaches were evaluated. The first, an encoder-decoder CNN model named 3DiscNet (Detection for Displaced articular DISC using convolutional neural NETwork), was proposed as a new approach within the study and was compared with the U-Net and SegNet-Basic architectures. Another interesting study on temporomandibular joint segmentation was carried out by Liu et al. [27]. They introduced an automated segmentation algorithm based on deep learning followed by a post-processing stage. First, a U-Net model was applied to separate images into 3 categories (glenoid fossa, condyles, and background). In the post-processing stage, the internal force constraint of a snake model was used to restore the integrity of the fracture boundary for structural fractures in these segmented images. The initial boundary of the snake was obtained based on the tracking concept. A total of 206 low-dose Computed Tomography (CT) cases were used, and the compatibility between the experimental results and the gold standard was evaluated with indicators such as the Dice Coefficient (DC) and the Mean Surface Distance (MSD).
In the current study, a basic CNN and 6 different pre-trained, fine-tuned models (Xception, ResNet-101, MobileNetV2, InceptionV3, DenseNet-121 and ConvNeXt) were applied to the MR images (2576 MR images of 200 patients) prepared together with Ankara University, Faculty of Dentistry. In addition, the performance of ViT was examined. It is important to note that the data used in our study were not obtained from a public data set but were collected and prepared by the researchers, who also pre-processed the images. Before the models were applied, the region related to the disc was cropped from the images, and the experiments were carried out on these cropped images. When the experimental results were examined, we observed that the TL-based architectures provided effective results. Tables 2, 3, 4 and 5 show that the best accuracy rates, F1 scores and AUC values were promising, with values above 0.75. Considering all metrics in Table 2, MobileNetV2 provided the highest values: a 97% accuracy rate, a 0.97 F1-score and a 0.95 AUC value. On the ''open mouth disc position'' MR images, Xception yielded the best accuracy with 81%. The Xception architecture was also the most successful model for precision, sensitivity and NPV values. The F1-score of this architecture was very effective at 0.79, but here, ResNet-101 presented the highest precision value with 0.80. In addition, the AUC value of ResNet-101 was 0.75, which was higher than that of Xception. According to the results in Table 4, ResNet-101 provided the best results in terms of all metrics, hence it can be accepted as the best DL architecture for this group. In Table 5, MobileNetV2 was the most effective architecture with respect to all findings.
Since deep networks are black boxes, it is not clear how they learn, and why different architectures are successful for each disease is unknown. However, we can note that MobileNet is a compact model designed for constrained resources, and this feature may have helped the architecture to be the most successful one in two groups and to produce effective results in the others.
Nevertheless, high success rates do not guarantee that architectures diagnose diseases from the right regions. It is possible for models to make predictions from the wrong areas. Therefore, we also examined Grad-CAM images. The assessment was based on the regions marked by the dentomaxillofacial radiology specialist; in other words, the images colored by Grad-CAM were compared with the versions of the same images marked by the specialist. In Figure 6, the red parts in the second rows show the regions from which the models infer. When the Grad-CAM images of the architectures judged best for each group were examined, we observed that the models mostly highlighted the correct region. We would also like to state our interpretation of ConvNeXt. ConvNeXt had decent metric results, but when its Grad-CAM images were examined, it was seen that the architecture made inferences from the wrong points; that is, the network focused on the wrong area.
When the ICC values are analyzed, most metrics showed fair to excellent agreement, with the exception of NPV, suggesting acceptable correlation among the models assessed.
We would like to emphasize again that the process of creating a data set is costly and time consuming. The experimental results were encouraging, and the evaluated models were successful in the assessment of diagnostic images of TMD patients. However, we believe that future studies should include a higher number of images for all groups, incorporating images of patients with more than one disease. This is very important for the results to be more reliable and robust, because results obtained with limited data are instructive but not sufficient for full practical application. In this regard, we plan to increase the amount of data and to examine and apply different methods commonly used in medical diagnosis.

V. CONCLUSION
In this study, a basic CNN and 6 different deep learning architectures were applied to MR images for TMD diagnosis as a contribution to the dentistry literature. In addition, the effectiveness of ViT, which has recently become very popular, was investigated on MR images. With Grad-CAM images, we observed whether learning was carried out on the parts of the image related to the disease. ICC was also used to analyze the correlation between the models; the ICC value shows the agreement between models in correct diagnosis. Considering the results, the reliability between architectures was interpreted as satisfactory. Within the limitations of the present research, the architectures were found to be successful in terms of metrics such as accuracy rate, precision, sensitivity, F1-score, specificity, AUC and kappa score, as well as in the examination of the Grad-CAM images and the ICC calculation. Therefore, this study has the potential to make a significant contribution to the literature and to encourage researchers to apply different DL-based approaches to the diagnostic problem.