Deep Convolution Neural Network Based Artificial Intelligence Improves Diagnosis of Thyroid Scintigraphy for Thyrotoxicosis: a Dual Center Study

Background: 99mTc-pertechnetate thyroid scintigraphy is a valid means of distinguishing the causes of thyrotoxicosis in the clinic, but the interpretation of thyroid scintigraphic images is subject to significant inter-observer variation. We aimed to develop an artificial intelligence (AI) system to improve the diagnosis of thyrotoxicosis. Materials and methods: We constructed an AI model based on a deep neural network with 2468 thyroid scintigraphic images collected from West China Hospital, and evaluated its diagnostic accuracy in classifying four patterns of thyrotoxicosis: ‘diffusely increased,’ ‘diffusely decreased,’ ‘focal increased,’ and ‘heterogeneous uptake.’ We then compared the diagnostic performance of the AI model with that of five physicians on two testing cohorts (200 cases in total) from two centers. Results: Of four candidate models built on well-established pre-trained networks, we selected the one with the best performance in internal validation. This AI model achieved satisfactory performance in classifying the four patterns of thyrotoxicosis, with an overall accuracy of 91.92% in internal and 86.75% in external validation. In the subsequent comparative study, the AI model showed higher diagnostic accuracy and consistency than the five physicians in interpreting data from West China Hospital (88% vs. 66% to 73%) and Panzhihua Central Hospital (83% vs. 53% to 79%). Conclusion: The deep convolutional neural network based AI model showed considerable performance in classifying four patterns of thyroid scintigraphic images; this may help physicians diagnose the causes of thyrotoxicosis and reduce their error rate.

Background
Artificial intelligence (AI) now touches many aspects of modern healthcare. Its strengths in big-data retrieval, explicit feature extraction, and consistency are strongly beneficial to medical image analysis (1)(2)(3). Initially, AI was mostly applied to monotonous, repetitive, high-volume tasks, such as evaluating screening mammography and chest X-rays (4). At present, with the optimization of computer algorithms and the development of deep convolutional neural networks (DCNN), AI has been applied to more advanced radiodiagnostic tasks, such as automatic nodule detection for lung cancer in CT images and thyroid cancer identification in sonographic images (5,6). However, few efforts have been made in nuclear image interpretation, such as thyroid scintigraphy, which likewise involves highly repetitive reading and requires proper extraction of clinical features for diagnosis (7). Thyrotoxicosis is a very common endocrine condition with multiple etiologies, and misdiagnosis may lead to improper treatment. Although blood biochemical tests and ultrasonography are widely used, they are still insufficient for a precise diagnosis in some situations. Thyroid scintigraphy with 99mTc-pertechnetate is a valid means of identifying the causes of thyrotoxicosis, especially for distinguishing Graves' disease (GD) from toxic multinodular goiter (TMG) when thyrotropin receptor antibody is negative, or for differentiating GD from thyroiditis (8). However, thyroid scintigraphy images are mostly simple planar images with limited resolution, which makes their interpretation time-consuming, experience-dependent, and subjective, with significant inter-observer variation (9).
Thus, an AI model capable of feature extraction, with high diagnostic accuracy and consistency in thyroid image interpretation, is probably an effective strategy for improving the clinical diagnosis of thyrotoxicosis.
In this study, we collected 2468 thyroid scintigraphic images to train a deep neural network and constructed an automatic AI model for the classification of thyrotoxicosis. We then evaluated the diagnostic performance of the AI model and compared it with that of human physicians on datasets from two centers.

Collection, Inclusion, and Exclusion of Patients
This retrospective study was approved by the Institutional Ethics Committees of West China Hospital of Sichuan University and Panzhihua Central Hospital. We retrospectively collected cases that were diagnosed with thyrotoxicosis and underwent 99mTc-pertechnetate thyroid scintigraphy from January 1, 2016 to December 31, 2018 at West China Hospital of Sichuan University (Center 1) and Panzhihua Central Hospital (Center 2). Patients who had received antithyroid drugs, radioactive iodine therapy, or partial or total thyroidectomy were excluded, as were images of poor quality or with lateral acquisition. Thyroid scintigraphy at both hospitals was performed following clinical guidelines and manufacturer-recommended parameters. Briefly, patients were intravenously injected with 185 MBq of 99mTcO4-, and images were then acquired for 100-300×10^3 counts over about 5-10 min using gamma cameras (GE Discovery NM/CT 670), both equipped with low-energy, high-resolution, parallel-hole collimators. The energy peak was centered at 140 keV with a 15% to 20% window. All images were exported in DICOM format for further analysis.

Diagnostic Criteria
Thyroid scintigraphic images were classified into four patterns according to published criteria (10-13). Images with homogeneously increased uptake exceeding that of the salivary glands, together with an enlarged thyroid, were defined as 'diffusely increased' (type I); images with diminished or absent uptake were defined as 'diffusely decreased' (type II); images with focal nodular uptake and suppressed uptake in the surrounding and contralateral thyroid tissue were defined as 'local increased' (type III); and images with multiple areas of focally increased and suppressed uptake were defined as 'heterogeneous uptake' (type IV).
The characteristic appearance of these four patterns is shown in Fig. 1. For this study, all thyroid scintigraphy images from the two centers were independently and blindly classified by three senior nuclear medicine physicians with more than 10 years of experience in reading thyroid scintigraphic images. Disagreements were resolved by consultation until consensus was reached.
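The reference-labeling procedure described above can be sketched as a small helper. This is only an illustration, not the authors' tooling: the function name `reference_label` and the representation of a read as a plain string are assumptions, and disagreements are merely flagged for discussion rather than resolved automatically, since the text specifies consensus by consultation.

```python
# Hypothetical sketch of the three-reader labeling step; names are illustrative.
PATTERNS = (
    "diffusely increased",   # type I
    "diffusely decreased",   # type II
    "local increased",       # type III
    "heterogeneous uptake",  # type IV
)


def reference_label(reads):
    """Return (label, needs_consensus) for three independent reads of one image.

    Unanimous reads are accepted directly. Any disagreement is flagged so the
    readers can settle the label by discussion; no automatic tie-breaking
    (such as majority vote) is assumed here.
    """
    if len(reads) != 3:
        raise ValueError("expected exactly three reads")
    for read in reads:
        if read not in PATTERNS:
            raise ValueError(f"unknown pattern: {read!r}")
    if len(set(reads)) == 1:
        return reads[0], False
    return None, True
```

An image with three identical reads gets its label immediately; any other image is returned with `needs_consensus=True` and no label until the readers agree.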

Construction of AI Model
The images collected from center 1 were defined as the internal dataset for AI construction and internal validation, while the images from center 2 were defined as the external dataset and used for validation only. The architecture of the AI model is illustrated in Fig. 2. The training process has three main steps: data augmentation, feature extraction, and classification. In the first step, random flipping, rotation, and mixup are applied to the original images to increase the diversity of the data and improve the robustness of the model (14). Then, a feature extraction neural network is employed to extract high-level features from the input image. In this study, we explored four candidate AI models based on different well-established pre-trained networks: ResNet50 (15), DenseNet169 (16), InceptionV3 (17), and InceptionResNetV2 (18). The last fully connected layer of each network was removed, and the remainder was used as the feature extraction network. In the final step, a three-layer neural network classifies the high-level features into four classes. All models were trained using Adam (19) as the optimizer, with a weight decay rate of 0.0001 and a learning rate of 0.001, for 300 epochs. The mini-batch size was fixed at 12. To reduce overfitting, dropout (20) was applied to the last fully connected layer, with a drop probability of 0.8.
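As a rough illustration of the augmentation step, the sketch below implements random horizontal flipping and mixup on images stored as nested lists of floats. It is a minimal sketch under stated assumptions, not the authors' pipeline: the mixup parameter `alpha=0.2`, the 50% flip probability, and all function names are hypothetical (the paper does not report them), and rotation is omitted for brevity.

```python
import random

NUM_CLASSES = 4  # the four uptake patterns (types I-IV)


def one_hot(class_index, num_classes=NUM_CLASSES):
    """Encode a class index as a one-hot list of floats."""
    return [1.0 if i == class_index else 0.0 for i in range(num_classes)]


def horizontal_flip(image):
    """Mirror a 2-D image (a list of rows) left to right."""
    return [row[::-1] for row in image]


def mixup(image_a, label_a, image_b, label_b, lam):
    """Blend two same-sized images and their one-hot labels with weight lam."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    mixed_image = [
        [lam * a + (1.0 - lam) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]
    mixed_label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed_image, mixed_label


def augment_pair(image_a, class_a, image_b, class_b, alpha=0.2, rng=random):
    """Randomly flip each image, then mix the pair.

    alpha (the Beta distribution parameter) and the 0.5 flip probability are
    assumed values chosen for illustration only.
    """
    if rng.random() < 0.5:
        image_a = horizontal_flip(image_a)
    if rng.random() < 0.5:
        image_b = horizontal_flip(image_b)
    lam = rng.betavariate(alpha, alpha)
    return mixup(image_a, one_hot(class_a), image_b, one_hot(class_b), lam)
```

The mixed label is a soft target: the two class entries carry weights `lam` and `1 - lam`, so the classifier is trained against a blend of the two patterns and not a single hard class.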

Evaluation of Model Performance
The diagnostic accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) of the four candidate AI models were individually evaluated to select the one with the best performance in internal validation. The performance of the selected model on the external dataset was then additionally evaluated by the area under the receiver operating characteristic (ROC) curve (AUC). To further investigate the diagnostic efficiency and accuracy of the AI model, we randomly chose two testing cohorts, one from each center and each containing 100 cases, and compared the classification performance of our AI model with that of five nuclear medicine physicians from the two centers. All physicians were unaware of the patients' clinical details. The overall accuracy of the AI model and of the human physicians was recorded, and the differences between the diagnostic labels and the true labels across the four thyroid patterns were presented as 4 × 4 confusion matrices.
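The metrics named above can all be derived from a 4 × 4 confusion matrix by treating each pattern in turn as the positive class (one-vs-rest). The sketch below shows one way to do this. The function names are hypothetical, the row/column convention (rows are true labels, columns are predicted labels) is an assumption, and the matrix in the usage note is made-up toy data, not a result from this study.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def one_vs_rest_metrics(cm):
    """Compute overall accuracy and per-class metrics from a confusion matrix.

    cm[i][j] is the number of images whose true class is i and whose
    predicted class is j (an assumed convention).
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    per_class = []
    for k in range(n):
        tp = cm[k][k]                                # class k, predicted k
        fn = sum(cm[k]) - tp                         # class k, predicted other
        fp = sum(cm[i][k] for i in range(n)) - tp    # other, predicted k
        tn = total - tp - fn - fp                    # other, predicted other
        per_class.append({
            "sensitivity": _ratio(tp, tp + fn),
            "specificity": _ratio(tn, tn + fp),
            "ppv": _ratio(tp, tp + fp),
            "npv": _ratio(tn, tn + fn),
        })
    return _ratio(correct, total), per_class
```

For example, with the toy matrix `[[8, 1, 0, 1], [0, 9, 1, 0], [1, 0, 7, 2], [0, 1, 2, 7]]`, 31 of the 40 images lie on the diagonal, and the first class has 8 true positives, 2 false negatives, and 1 false positive.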

Statistical analysis
In this study, we calculated the sensitivity, specificity, accuracy, PPV, and NPV of the AI model for classifying four patterns of thyrotoxicosis on thyroid scintigraphy. We also compared the overall accuracy of the five physicians with that of the AI model. All analyses were performed using the statistical software SPSS 21.0 (SPSS Inc, Chicago, IL, USA).

Patient Characteristics
The distribution of thyroid images used for AI construction is shown in Table 1.

Selection of AI model
The individual performances of the AI models built on the four DCNN architectures in internal validation are shown in Table 2. InceptionV3 achieved the highest overall accuracy, 92.73%, in classifying the four patterns of thyroid images and was selected for further use. As shown in Fig. 3

Contrastive Study of AI model and Human
Two interns with three years of experience (physicians 1 and 3, from center 1) and three staff physicians with 5 to 6 years of experience (physician 2 from center 1, physicians 4 and 5 from center 2) participated in the comparative experiment. The AI model took 0.54 seconds to complete the interpretation, while the physicians took an average of 32.50 ± 13.29 minutes (range, 21-62 minutes) to finish the same work. As shown in Fig. 4, the AI model achieved the highest overall diagnostic accuracy in classifying images in both datasets (88% and 83%, respectively), while the best human performances were 73% and 79% in this test. The confusion matrices revealed the characteristic classification behavior of the AI model and of each physician.
Notably, in reading the data from center 1 (Fig. 5), the disparity in diagnostic accuracy was mainly caused by the identification of 'heterogeneous uptake': the AI model correctly identified 18 of 25 cases, whereas the human physicians identified only 5 to 15. Although the AI model showed some decline in identifying 'heterogeneous uptake' in the dataset from center 2 (Fig. 6), it still showed better overall diagnostic accuracy and consistency than the human physicians.
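Expressed as per-class sensitivity (cases correctly identified divided by cases of that true pattern), and assuming the 25 refers to the true 'heterogeneous uptake' cases in the center 1 testing cohort, the counts above work out as:

```latex
\[
\text{Sensitivity}_{\text{AI}} = \frac{18}{25} = 72\%,
\qquad
\text{Sensitivity}_{\text{physicians}} \in \left[\frac{5}{25},\ \frac{15}{25}\right] = [20\%,\ 60\%]
\]
```

On this pattern the AI model therefore exceeded the best-performing physician by 12 percentage points.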

Discussion
In clinical practice, thyrotoxicosis can be classified into two general situations: thyrotoxicosis with or without hyperthyroidism. Hyperthyroidism, as in Graves' disease or TMG, involves accelerated biosynthesis and secretion of thyroid hormone by the thyroid gland itself and may require radioactive iodine ablation or antithyroid drugs, whereas thyrotoxicosis without hyperthyroidism, caused by subacute or autoimmune thyroiditis, may resolve spontaneously (13,21,22). Thus, distinguishing the real cause of thyrotoxicosis is essential for therapeutic decisions. Thyroid scintigraphy offers an effective means of depicting the status of the thyroid and distinguishing the causes of thyrotoxicosis through four-pattern classification. Generally, the 'diffusely increased' pattern suggests GD, the 'diffusely decreased' pattern suggests thyroiditis, the 'local increased' pattern is suggestive of toxic adenoma, and the 'heterogeneous uptake' pattern is a clue to TMG (9,12). Although there are several guidelines for clinical diagnosis, the accumulation of subtle differences between devices, injected doses, and imaging parameters, together with the subjective variation of physicians, may lead to different diagnostic conclusions. Thus, we developed an AI model in the hope of helping clinical physicians remedy this unsatisfactory situation. In this study, the images from West China Hospital were acquired for 100×10^3 counts and are typical "low-abundant" images, whereas the images from Panzhihua Central Hospital were acquired for 300×10^3 counts and are "high-abundant" examples. Our AI model was constructed from the West China Hospital data and still achieved satisfactory diagnostic accuracy in identifying the thyroid images from Panzhihua Central Hospital.
Beyond saving time, this AI model helps classify the four common patterns of thyrotoxicosis on thyroid scintigraphy images and, in both centers, outperformed human physicians particularly in identifying 'heterogeneous uptake'. This may reduce physicians' subjective variation in interpreting thyroid images and support more appropriate patient management.
Our current AI model is intended not to "replace" physicians but to "assist" them in improving diagnostic accuracy and efficiency, and to provide a useful clue for treatment selection. However, this study has several limitations. First, a significant decrease in diagnostic accuracy for 'heterogeneous uptake' was found when the AI model was applied to the new dataset of "high-abundant" images, which indicates that image feature extraction needs further optimization beyond the DCNN. Second, although four-pattern classification is valid in general, the actual status of thyroid function may vary with the disease and its course. For example, patients with Hashitoxicosis may present any of these four patterns at different stages (23,24). Thus, the final diagnosis of thyrotoxicosis must be correlated with additional information, such as medical history, physical findings, and blood tests. To address this, a new multi-parameter AI model combining images and hematological indices for the diagnosis of thyrotoxicosis is under development. Finally, further validation at more centers is still required to optimize this AI model.

Conclusion
We have successfully constructed an AI model for classifying four common patterns of thyrotoxicosis on thyroid images and achieved considerable diagnostic accuracy in dual centers. With further assessment and validation, this model might be promising in the clinical diagnosis of thyrotoxicosis.