Machine Learning Analysis of Image Data Based on Detailed MR Image Reports for Nasopharyngeal Carcinoma Prognosis

We aimed to assess the use of automatic machine learning (AutoML) algorithm based on magnetic resonance (MR) image data to assign prediction scores to patients with nasopharyngeal carcinoma (NPC). We also aimed to develop a 4-group classification system for NPC, superior to the current clinical staging system. Between January 2010 and January 2013, 792 patients with recent diagnosis of NPC, who had MR image data, were enrolled in the study. The AutoML algorithm was used and all statistical analyses were based on the 10-fold test. Primary endpoints included the probabilities of overall survival (OS), distant metastasis-free survival (DMFS), and local-region relapse-free survival (LRFS), and their sum was recorded as the final voting score, representative of progression-free survival (PFS) for each patient. The area under the receiver operating characteristic (ROC) curve generated from the MR image data-based model compared with the tumor, node, and metastasis (TNM) system-based model was 0.796 (P=0.008) for OS, 0.752 (P=0.053) for DMFS, and 0.721 (P=0.025) for LRFS. The Kaplan-Meier (KM) test values for II/I, III/II, IV/III groups in our new machine learning-based scoring system were 0.011, 0.010, and <0.001, respectively, whereas those for II/I, III/II, IV/III groups in the TNM/American Joint Committee on Cancer (AJCC) system were 0.118, 0.121, and <0.001, respectively. Significant differences were observed in the new machine learning-based scoring system analysis of each curve (P < 0.05), whereas the P values of curves obtained from the TNM/AJCC system, between II/I and III/II, were 0.118 and 0.121, respectively, without a significant difference. In conclusion, the AutoML algorithm demonstrated better prognostic performance than the TNM/AJCC system for NPC. The algorithm showed a good potential for clinical application and may aid in improving counseling and facilitate the personalized management of patients with NPC. The clinical application of our new scoring and staging system may significantly improve precision medicine.


Introduction
Nasopharyngeal carcinoma (NPC), which is a malignant cancer arising in the epithelium of the nasopharynx, is a prevalent form of cancer in a number of populations, including those in South China, Southeast Asia, the Arctic, the Middle East, and North Africa [1][2][3]. It was estimated that 60,600 new cases of NPC were diagnosed in mainland China in 2015, accounting for 40% of NPC cases worldwide [4].
Radiotherapy is the primary treatment for NPC. In recent years, intensity-modulated radiotherapy (IMRT) has been extensively used by virtue of its lower normal tissue doses and more uniform target doses when compared to conformal radiotherapy [5][6][7][8]. is has led to improved disease outcomes due to a higher local tumor control rate. In fact, distant metastasis is now the predominant reason for treatment failure in patients with NPC [9]. Advances in diagnostic and therapeutic techniques have improved the management and treatment of NPC [10][11][12]. Due to its high spatial resolution for examination of soft tissues, magnetic resonance imaging (MRI) has been extensively used as the optimal imaging modality for the assessment of local, regional, and intracranial infiltration of NPC in clinical practice [13]. e usefulness of MRI for stage assignment and disease prognosis has also been reported [12]. e TNM staging system of the American Joint Committee on Cancer (AJCC) [14] is generally accepted as the most widely used tool for cancer staging and plays a key role in guiding treatment and determining the prognosis of NPC in clinical practice. According to recent research, the current 8 th edition of the AJCC staging system for NPC is still not completely satisfactory, though it enables a more accurate prediction of treatment outcomes than the 7 th edition [15]. e demarcation line between the early and late stages (stages II and III) is still not well defined [16]. is leads to confusion in the choice of treatment by physicians. Moreover, patients in the same stage may require different treatments based on their prognoses. Since the TNM/AJCC system is based on factors manually selected from conventional radiographic data, a vast amount of anatomic structural data may be neglected. Because the significance of such data is unknown, neglecting these data may be a limitation of the current TNM system. erefore, detailed MR image reports containing larger amounts of potentially significant features for prognosis as well as new statistical tools with the ability to analyze large amounts of data with good predictive performance are both urgently needed.
Machine learning (ML) is the best solution to the problem discussed above. ML is able to create reasonable generalizations, classify previously unseen data, discover patterns, or predicts new directions based on observed data using multifarious artificial intelligence and statistical models [17]. Many ML methods have been reported in studies of conditions such as lung cancer, breast cancer, and Alzheimer's disease. However, the state-of-the-art ML technique is "automatic machine learning" (AutoML), which has the ability to automatically select the ML classifier with the best performance to suit the data. ere are several excellent open-source AutoML algorithms, including XGBoost, Deep learning (DL), and Light GBM. AutoML also includes common models such as Lasso and Ridge Regression, Random Forest, and Naivebayes. In summary, AutoML is easy to program and its use is feasible in clinical applications.
Based on the above premise, we utilized image data from detailed MR image reports and used ML to make prognosis predictions. We also compared the prognostic value of the predictions made using ML to that of the traditional TNM/ AJCC system. To the best of our knowledge, almost no medical professionals have performed similar research. erefore, our study would be the first step in the exploration of this issue.

Patient.
e Sun Yat-sen University Cancer Center Institutional Review Board approved this retrospective study and waived the requirement for informed consent from the patients. We enrolled 3,814 patients with newly diagnosed NPC in the Sun Yat-sen University Cancer Center between January 2010 and January 2013. Of these patients, 2,973 were excluded due to incomplete medical records. Of the remaining 841 patients, 24 with distant metastasis and 5 without neck MRI and 20 cases combined with other tumors were excluded. Finally, 792 patients were included in our study.
e eligibility criteria were as follows: (1) pathologically diagnosed NPC; (2) no evidence of distant metastasis; (3) receipt of standard IMRT treatment, the details of which were determined by radiation therapists, physicians, and highly qualified physicists (platinum-based chemotherapy was used as routine treatment for NPC); (4) no primary tumor in other parts of the body; and (5) complete imaging and clinical data. e exclusion criteria were as follows: (1) presence of primary tumors in other parts of the body and (2) failing to complete radiotherapy for physical reasons during the course of treatment. All patients completed a pretreatment examination including a complete medical history, physical examination, hematology and biochemistry profiles, chest X-ray, abdominal computed tomography, and MRI of the neck and nasopharynx.

MR Imaging and Image
Analysis. MRI studies were performed using a 1.5 T (Signa CV/i General Electric Healthcare, Chalfont St. Giles, United Kingdom) or 3.0 T system (Siemens Magnetom Tim Trio, Erlangen, Germany) with standard scan procedures. Scanning range included the saddle pool to the lower edge of the sternal collarbone with combined head/neck coils. Non-contrast-enhanced axial, coronal, and sagittal plane T1-weighted images (WI) and axial T2-WI were procured. e following scanning parameters were used: T1WI in the axial, coronal, and sagittal planes (fast-spin-echo (FSE), TR/TE � 500-600/ 10-20 ms, field of view (FOV) � 22 cm, frequency matrix � 256 × 512). T2-WI in the axial plane (FSE, TR � 4000-6000 ms, TE � 95-110 ms, FOV � 22 cm, frequency matrix � 256 × 512). e contrast agent gadolinium diethylenetriaminepentaacetic acid was injected at a dose of 0.1 mmol/kg. T1-weighted axial and sagittal sequences and fat-suppressed coronal imaging were performed sequentially using the same parameters before contrast injection. e scanning section thickness was 5 mm, and the section gaps were 1 mm.
Two radiologists specializing in head-and-neck cancers for over 10 years separately evaluated all MRI materials and gave detailed MR image reports with the average time of 15-20 minutes, and the stage for each NPC case was determined according to the 8 th edition of the AJCC staging system [14]. A review of invasion of other anatomical structures based on lymph node information was also performed. Disagreements were resolved by consensus.

Follow-Up.
After treatment, patients were followed up every three months for the first two years and every six months thereafter for 5 years. e terminal outcomes were distant metastasis-free survival (DMFS), local-regional relapse-free survival (LRFS), progression-free survival (PFS), and 5-year overall survival (OS).

Machine Learning
Step 1. Tenfold cross-validation (10 CV): Patients' data were randomly divided 10-fold without duplication to simulate the 10 times of external validations.
Step 2. Feature reduction is crucial for reducing time consumption by different classifiers and for more clinical feasibility. First, using the RMRe feature of the R Package with default setting, the top 20 most important features were chosen and ranked separately for OS/DMFS/LRFS. Second, in addition to the generalized linear model (GLM) (including Lasso and Ridge) algorithm with grid hyperparameter search in R's H2O package, which has an extremely fast run (<120 s), we progressively added features 1 to 20 separately; the most important features corresponded with the biggest average area under curve (AUC) of 10 CVs (attached code sample: gusting_proper_number_of_variables.r). e detailed description is shown in Supplementary Material 1.
Step 3. Using the AutoML algorithm of the H2O package with the above important variable, the total learning time was set to 7200 s to ensure that sufficient models were trained under different algorithms. e algorithm and model with the highest average AUC of 10 CVs were eventually chosen. Similarly, the TNM system, as the control group, used only two features with the T-and N-stage (attached code sample: automl.r). e demo R code is demonstrated in Supplementary Material 2 (http://docs.h2o.ai/h2o/lateststable/h2o-docs/automl.html for AutoML). Figure 1 depicts the flow chart of our feature selection. e detailed hyperparameter setting and grid search are shown in Supplementary Material 1.

Statistical Analysis.
We compared data between the image data group and the TNM system-control group. Prognostic statistics in the above 10 CV experiments including 10 AUCs, test error, and specificity, along with their average and standard deviations (SDs) for predicting OS, DMFS, and LRFS, were calculated separately for the image data and TNM system-control groups. Paired t-test was used to compare the two groups because they shared the same group information declared in the 10 CV section. P < 0.05 was considered statistically significant [18].
Probabilities with a value of 1 to 100 distribution generated by the final model were considered for scoring OS, DMFS, and LRFS separately. e sum of the three scores (PFS � OS + DMFS + LRFS) was considered the standard voting point of PFS, because in clinical practice, PFS events are defined as any event pertaining to OS, DMFS, and LRFS. en this voting point for PFS was further cut off using the receiver operating characteristic (   curves, we required three cutoffs, and the most appropriate method for obtaining the cutoff value is ROC curve analysis. e Youden-index was calculated using the ROC curve. For the first time, a total of 792 cases from the entire data were used to calculate the first ROC cutoff value (0.986), and those with a score higher than this were classified as stage IV; others with a cutoff value lower than this value were used to calculate the second ROC cutoff value (0.643) to determine stage III cases. Similarly, the third ROC cutoff value (0.270) was used to define stage I/II. e Kaplan-Meier survival analysis was used to compare the OS, DMFS, LRFS, and PFS staging curves between our new machine learning-based scoring system and the 8 th TNM staging system.
Statistical analyses were performed using R 3.1.2 (http:// www.R-project.org), with the main R package including mRMRe, H 2 O, survival, Hmisc, and stats.

Prognostic Performance.
e feature selection was performed by AutoML, shown in Figure 2. All in all, important imaging findings including OS with 13 variables, DMFS with 12 variables, and LRFS with 11 variables are listed in Table 2 and ranked according to their importance. During AutoML learning, each algorithm corresponds to multiple models since grid search automatically generates hyperparameter, and the best AUC is finally selected. e final algorithm of AutoML included OS, ridge regression with parameter (lambda � 0.004, nlambda � 30); DMFS, GBM (Trees � 49, min depth � 3, max depth � 7, min leaves � 4, max leaves � 18), and LRFS, GLM ridge (lambda � 0.002, nlambda � 30). e mean AUC of image data-and TNM system-based ML model is illustrated in Table 3. e average AUC of 10 CVs for OS, DMFS, and LRFS results obtained from the image data-based model was 0.796, 0.752, and 0.721, respectively, which was higher than 0.712, 0.693, and 0.617, respectively, obtained from the TNM system-based model.
e P values were 0.008, 0.053, and 0.025 for OS, DMFS, and LRFS, respectively. e average test errors of 10 CVs for OS, DMFS, and LRFS results from image data-based model were lower than those obtained from the TNM system-based model, and the P values were 0.006, 0.011, and 0.006, respectively. e average specificity of 10 CVs for OS, DMFS, and LRFS results from the image data-based model was greater than those from the TNM system-based model, and the P values were 0.006, 0.011, and 0.006, respectively. Each patient received three voting points representing OS, DMFS, and LRFS. e sum of these voting points represented the PFS voting points and its ROC curve (Figure 3). Using the above PFS voting point system, we created a ROC curve and established three cutoff points, and the best cutoff value obtained was 0.986 (highest Youdenindex). e second cutoff value was greater than 0.643, which is in accordance with staging III. e third cutoff value was greater than 0.270, in accordance with staging I and II.
Further, the survival curves estimated by Kaplan-Meier survival analysis of PFS are shown in Figure 4, and the main statistical data are represented in Table 4. Significant differences were observed in the new machine learning-based scoring system for each ROC curve (P < 0.05). On the contrary, in the TNM/AJCC system, the P value of curves between II/I and III/II was 0.118 and 0.121, respectively, which was not significantly different. Moreover, the patient number distributed as per the new scoring system was more balanced compared to the TNM/ AJCC system.

Discussion
e image data-based model had higher AUC and specificity and lower test error than the TNM system-based model for prediction of OS, DMFS, and LRFS. We thus conclude that image data from detailed MR image reports outperform the TNM system in prognosis prediction for NPC. More interestingly, we used the OS, DMFS, and LRFS prediction probabilities to assign a PFS probability. Using this measure, we established a new scoring staging system, which was proven to be superior to the current TNM/AJCC system. e TNM/AJCC system is an important tool for the prediction of prognosis and for guiding treatment strategies for patients in different risk groups. It is used to evaluate the disease based on the range of local invasion, regional lymphatic spread, and distant metastasis, which are regarded the major prognostic factors for NPC. Fundamentally, the TNM/AJCC system is based on several important anatomic structures that are the most significant and wellaccepted features of the disease on MR images. e system thus considers multiple types of prognosis (OS, DMFS, LRFS, and PFS). Nevertheless, it is difficult to meet all independent statistical conditions for the same factor at the same time without neglecting others. According to recent studies, the current 8 th edition of the TNM/AJCC is still not completely satisfactory in risk segregation and survival prediction [16,19,20]. Recently, an increasing number of studies have confirmed the inadequacy of the current TNM/AJCC system using "nomograms." Such studies highlight the urgent need for a better prognostic determination system for risk categorization of individual patients [21,22].
We created a scoring system for NPC based on detailed MR image reports using ML. We used the scoring system to predict the probabilities of OS, DMFS, and LRFS for each patient and then added the above three probabilities to calculate a PFS probability, which was used to create a new scoring staging system. We divided the patients into four groups using our new scoring system and found that our new machine learning-based scoring system was better than the original AJCC system. Although our study was the first to design such a scoring system, our findings confirm its feasibility. Our scoring system even outperforms the current AJCC system in certain respects. For instance, the new scoring staging system tends to assign patients in the same stage with worse prognoses to the advanced stage.
is improves the accuracy of the individualized treatment. Furthermore, the new scoring system may be used to assign prognoses for patients in the same AJCC stage. is would enable physicians to prescribe more precise individualized treatments for patients in the same stage. Our scoring system suggested that physicians should use higher-dose therapy for patients in the same stage who had worse prognoses. In contrast, the scoring system suggested that patients with better prognoses should be treated based on the old guideline.
e new machine learning-based scoring system reported here was based on detailed MR image reports from a world-class nasopharyngeal cancer research center. As mentioned earlier, we adopted a completely unique report analysis method that has never been reported previously. e image data derived from the detailed MR image reports provide more precise and abundant information than the current 8 th edition stage NPC. ese data contribute to the prognostic system, but only under the condition that the influential factors can be recognized and selected from such a large amount of data.
Current statistical methods used for staging rely mainly on Cox regression. To optimize its predictive performance, several studies have added more factors to the TNM system to build a scoring system using a nomogram based on Cox regression. However, Cox regression cannot be used to effectively select important features in a stepwise manner when applied to a large number of variables and large amounts of data.
ML, which is a booming field at the intersection of computer science, statistics, data mining, and optimization [23], appears to be the preferred option for the resolution of the challenges described above. Due to its strong foundation in statistical theory, ML has become increasingly popular and acceptable when compared to conventional clinical algorithms [24,25]. ML is mathematically based on multifold systems, such as the 10-fold system, as well as crossvalidation to simulate different sets of data. AutoML, which is the ML method we chose for feature selection, has been demonstrated to be robust and to outperform traditional approaches in a previous report [26]. e use of AutoML has become increasingly prevalent in business, science, and other disciplines. Unlike other ML methods, AutoML can be used to automatically suit the data and select the machine learning classifier with the best performance. Interestingly, AutoML can be used to find the optimal combination of prediction algorithms using a process called Stacked Ensembles. Due to the lower time demanded, higher prognostic performance, and easy application without the need for expert programming knowledge, AutoML has the potential to be used by individuals who are not ML experts. It was thus easy to establish the new scoring system without the use of complicated steps. Our approach was shown to be accurate and stable in a clinical application and has the potential for use in precision medicine. Our new scoring system had advantages when compared to the existing scoring systems based on nomogram. For example, the nomogram scoring is fixed when the variables are determined, while the new scoring system can be used to assign separate prognostic scores for OS, DMFS, and LRFS and to calculate a comprehensive prognostic score. erefore, it is convenient for doctors to assess and stage patients based on the different prognostic indicators in the new system in the clinic. For patients with the same PFS  probability, doctors can shorten the review time of patients with high LRFS probability and enhance the treatment of patients with high DMFS probability by comparing their OS, DMFS, and LRFS probability. is study has some limitations. First, the new method ignores the time to event as compared with the existing survival analysis. Second, we do not have a testing set currently due to small data volume and unbalanced data; a large external data volume can be used as an external testing set in future studies. ird, our new scoring system was established using only image data from detailed MR image reports. Considering that a number of other prognostic predictors of NPC have been reported, including the presence of Epstein-Barr virus DNA and the C-reactive protein/albumin ratio (CRP/Alb), the prognostic performance of our scoring system can be improved substantially by including the above-mentioned variables.

Conclusion
Image data from detailed MR image reports enabled better prognostic performance in predicting treatment outcomes for NPC than that achieved using the TNM/AJCC system. We created a completely new scoring system, which is a step toward the enhancement of image data-based clinical predictions and precision oncology.
Data Availability e data of the MRI and other measurements are stored in our RDD. ese data used to support the findings of this