A Fully Automated Visual Grading System for White Matter Hyperintensities of T2-Fluid Attenuated Inversion Recovery Magnetic Resonance Imaging

Background: The Fazekas scale is one of the most commonly used visual grading systems for white matter hyperintensity (WMH) in brain disorders such as dementia, assessed on T2-fluid attenuated inversion recovery magnetic resonance (MR) images (T2-FLAIRs). However, visual grading with the Fazekas scale suffers from low intra- and inter-rater reliability and is labor-intensive. Therefore, we developed a fully automated visual grading system using quantifiable measurements. Methods: Our approach involves four stages: (1) deep learning-based segmentation of ventricles and WMH lesions, (2) categorization into periventricular white matter hyperintensity (PWMH) and deep white matter hyperintensity (DWMH), (3) WMH diameter measurement, and (4) automated scoring, following the quantifiable method modified for Fazekas grading. We compared the performance of our method with that of the modified Fazekas scale graded by three neuroradiologists for 404 subjects with T2-FLAIR obtained from a clinical site in Korea. Results: Krippendorff’s alpha across our method and the raters (A) versus that among the radiologists only (R) was comparable, showing substantial (0.694 vs. 0.732; 0.658 vs. 0.671) and moderate (0.579 vs. 0.586) levels of agreement for the modified Fazekas, the DWMH, and the PWMH scales, respectively. Also, the average area under the receiver operating characteristic curve between the radiologists (0.80 ± 0.09) and between the radiologists and our approach (0.80 ± 0.03) was comparable. Conclusions: Our fully automated visual grading system for WMH demonstrated performance comparable to that of the radiologists, and we believe it has the potential to assist radiologists in clinical practice with unbiased and consistent scoring.


Introduction
T2-weighted fluid-attenuated inversion recovery magnetic resonance imaging (T2-FLAIR) is used to assess in vivo the severity of white matter lesions that appear as hyperintensities (WMHs). WMH provides important information about brain health, aging, and possible disease burden [1][2][3][4]. WMH has been recognized as an important biomarker for small-vessel cerebrovascular diseases and Alzheimer's disease [5,6].
The Fazekas scale provides a conventional visual grading approach that quantifies WMH severity in four grades and is widely used by radiologists and in clinics worldwide [7]. The Fazekas scale classifies the severity of WMHs present in the T2-FLAIR using the combination of the periventricular hyperintensity (PWMH) scale and the deep white matter hyperintensity (DWMH) scale [7]. Both PWMHs and DWMHs are graded from zero to three (Table 1) [7].
However, the use of the Fazekas scale in clinical practice or research is often limited by its labor-intensive process, as with all forms of visual grading [8], and by low inter- and intra-rater reliability due to its ambiguous criteria [9]. Over time, the age-related white matter changes (ARWMC) scale was introduced to overcome the ambiguity of the subjectively measured Fazekas scale by providing quantifiable measurements [10]. Yet, the ARWMC scale is also limited because it does not provide a detailed separation of DWMH and PWMH lesions. Hence, we needed a more advanced method that is computationally feasible to implement and that satisfies the original Fazekas scale. Several groups have suggested a quantifiable method that uses the maximum diameter distance to divide DWMH and PWMH. The DWMH and PWMH scales are defined from the measured distance, and the resulting scale is called the modified Fazekas scale (Table 1) [11].
This study aims to provide an automated approach to the modified Fazekas scale that is efficient and easily applicable with reliable results in general clinical research and practice to assist doctors by reducing their labor-intensive

Overview of the Proposed Method
The proposed approach consists of four stages (Fig. 1). First, the ventricle and WMH are segmented from the input 2D T2-FLAIR using a deep learning algorithm [12]. Second, the segmented WMHs are categorized into DWMHs and PWMHs following the rule suggested in the previous study [13]. Third, the maximum diameter is measured for both DWMHs and PWMHs according to the modified Fazekas scale. Finally, the modified Fazekas scale is calculated using the obtained maximum diameter of DWMH and PWMH. For validation, we compared the agreements of our proposed method against those of three certified radiologists.

Institutional Review Board Statement
The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board of Eunpyeong St. Mary's Hospital, College of Medicine, The Catholic University of Korea (IRB No. PC20EISI0094 on 02 July 2020).

Study Population Demographics
Two-dimensional (2D) T2-FLAIR scans from the Catholic University of Korea Eunpyeong St. Mary's Hospital were used in this study. The inclusion criterion was magnetic resonance imaging (MRI) containing WMH in patients diagnosed with dementia. The exclusion criteria were WMHs with multiple pathologies, such as stroke or other disorders that may cause different components (e.g., cerebrospinal fluid, microbleeds) within the WMHs. The average age of the 404 participants was 68.7 ± 12.7 years.

Comparison between Human Raters and Our Proposed Method
The modified Fazekas scale is quantitative: it is based on measuring the maximum diameter (mm) of DWMH and PWMH (Table 1). Theoretically, our computationally implemented measuring method would be more accurate than the human raters. Nevertheless, we compared our automated results with those of the human raters to demonstrate their similarity, since the main goal of developing this method is to reduce the intensive labor required of human raters. For the human ratings, each T2-FLAIR image was assessed by three certified radiologists with a subspecialty in neuroradiology. All patient information was blinded to avoid bias in the ratings, and no information was shared between the raters. The images were visually graded independently by the raters following the criteria of the modified Fazekas scale. The raters manually used an MRI measuring tool to measure the diameter (mm) of the longest axis of the PWMH and DWMH. Measurements were made on the raw MRI without any provided annotations. The radiologists then assigned the modified Fazekas scale on the basis of the measurements [11]. For our proposed method, we ran the automated pipeline shown in the overview of the proposed method (Fig. 1) and then assigned the modified Fazekas scale.

Automated Classification for the Modified Fazekas Scale

T2-FLAIR Segmentation between Ventricle and WMH
We used our previously reported in-house method for simultaneous ventricle and WMH segmentation (Fig. 1a) [12]. That publication introduced two individual deep learning-based segmentation methods for T2-FLAIR, with the aim of producing brain tissue and WMH segmentations from T2-FLAIR without its paired T1-weighted MRI (T1). A semi-supervised learning method was used to construct a deep learning-based segmentation model that learns FreeSurfer-generated brain tissue labels, including the ventricle, transferred from T1 to T2-FLAIR [14,15]. Then, the WMH model was trained with a U-Net-based architecture using WMH labels manually annotated and clinically confirmed by radiologists, utilizing PyTorch (version 1.7.1, Python Software Foundation, Wilmington, DE, USA) [16,17]. The datasets of the previous research are unrelated to our automated approach. The in-house segmentations demonstrated promising results for further clinical relevance and application.
All processed segmentation labels from the models used for this study were set to right-anterior-superior (RAS) orientation and resampled to 1 × 1 mm³ spacing for the axial plane. Then, the ventricle and WMH segmentation results were merged for further measurement.

WMH Separation into DWMH and PWMH
We further categorized the segmented WMH region into DWMH and PWMH regions (Fig. 2). The separation was based on the calculated distance between the WMHs and the boundaries of the segmented ventricle regions. For the X and Y axes, we separated PWMHs and DWMHs slice by slice in the 2D axial planes where ventricle segmentation exists: WMHs within 13 mm (≤13 mm) of the margin of the ventricles were labeled PWMHs, and WMHs farther than 13 mm (>13 mm) from it were labeled DWMHs [13]. For the Z-axis, we defined PWMHs and DWMHs based on the range of the ventricles along the Z-axis: WMHs from the lowest slice to one slice above the ventricle were labeled PWMHs, and all others were labeled DWMHs [11].
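The in-plane part of this rule can be illustrated with a minimal sketch. It assumes binary 2D masks at 1 mm spacing, applies the 13 mm cut-off voxel by voxel, and uses a brute-force nearest-distance search for clarity. The function name `split_wmh_slice` and its parameters are hypothetical and are not part of the published pipeline; whether the original implementation applies the cut-off per voxel or per lesion cluster is not stated here.

```python
import numpy as np

def split_wmh_slice(wmh, ventricle, cutoff_mm=13.0, spacing_mm=1.0):
    """Split a 2D WMH mask into (pwmh, dwmh) by distance to the ventricle.

    Assumption: the cut-off is applied to each WMH pixel independently.
    """
    wmh = np.asarray(wmh, dtype=bool)
    ventricle = np.asarray(ventricle, dtype=bool)
    pwmh = np.zeros_like(wmh)
    dwmh = np.zeros_like(wmh)

    wmh_idx = np.argwhere(wmh)          # (n_wmh, 2) pixel coordinates
    vent_idx = np.argwhere(ventricle)   # (n_vent, 2) pixel coordinates
    if len(wmh_idx) == 0:
        return pwmh, dwmh
    if len(vent_idx) == 0:
        # The in-plane rule only covers slices that contain ventricle.
        raise ValueError("no ventricle on this slice; the Z-axis rule applies")

    # Distance from every WMH pixel to its nearest ventricle pixel (brute force;
    # a distance transform would scale better on full-size images).
    diff = wmh_idx[:, None, :] - vent_idx[None, :, :]
    nearest_mm = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1) * spacing_mm

    near = nearest_mm <= cutoff_mm
    pwmh[tuple(wmh_idx[near].T)] = True    # within 13 mm of the ventricle
    dwmh[tuple(wmh_idx[~near].T)] = True   # farther than 13 mm
    return pwmh, dwmh
```

With a ventricle edge at column 4, a WMH pixel at column 17 of the same row (13 mm away) falls into the PWMH mask, and one at column 18 (14 mm away) falls into the DWMH mask.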

Diameter of DWMH and PWMH
We measured the diameters of the separated DWMH and PWMH (Fig. 3). The vertical distance was used for DWMHs, and the horizontal distance was used for PWMHs, as suggested in the modified Fazekas scale [11]. Principal Component Analysis (PCA) based on the Euclidean distance was performed on DWMHs in all 2D axial planes to measure the vertical diameter [18]. Taking the irregularly shaped DWMH as an input, the PCA-based measurement generates an approximated ellipse around the DWMH (Fig. 4d), from which the major and minor axes of the ellipse are obtained. Since the DWMH scale is measured from the maximum diameter, we utilized the length of the major axis [18]. The PWMH is measured as the horizontal diameter between the ventricle and the PWMH. Since the horizontal diameter varies with the starting point on the ventricle, we created a 2D Danielsson distance map for all 2D axial slices containing PWMHs and ventricles (Fig. 5) [19]. We extracted the ventricle contour from the distance map. From each pixel coordinate of the ventricle contour, we created perpendicular rays with a length of 13 mm, representing the cut-off distance between PWMH and DWMH [13]. For each cluster of PWMH, we measured the mean distance of every ray that intersected the PWMH.
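The PCA-based DWMH measurement can be sketched as follows. This is an illustration under stated assumptions rather than the published implementation: it takes one binary 2D lesion mask, finds the first principal axis of the pixel coordinates, and reports the center-to-center extent of the lesion along that axis instead of fitting an explicit ellipse. The name `pca_major_axis_mm` and the `spacing_mm` parameter are hypothetical.

```python
import numpy as np

def pca_major_axis_mm(mask, spacing_mm=1.0):
    """Approximate the longest diameter of a 2D lesion mask along its first principal axis."""
    coords = np.argwhere(np.asarray(mask, dtype=bool)).astype(float) * spacing_mm
    if len(coords) < 2:
        return 0.0  # an empty or single-pixel lesion has no measurable extent here

    centered = coords - coords.mean(axis=0)
    cov = np.cov(centered, rowvar=False)        # 2 x 2 covariance of (row, col)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    major = eigvecs[:, np.argmax(eigvals)]      # direction of largest variance

    # Extent of the lesion along the major axis, between pixel centers.
    proj = centered @ major
    return float(proj.max() - proj.min())
```

For a straight lesion spanning 11 pixels at 1 mm spacing, the sketch returns 10 mm, the distance between the two end-pixel centers; an ellipse-based axis length, as described in the text, would generally differ slightly.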

Classification of the Modified Fazekas Scale
At this stage (Fig. 1d), we finalized the automation process by classifying the modified Fazekas scale. Using the measured maximum diameters of the DWMHs and PWMHs, we assigned scales ranging from 1 to 3 (Table 1) as suggested by the modified Fazekas scale [11]. For the PWMHs, 1 represented maximum diameters <5 mm, 2 represented maximum diameters ≥5 mm and <10 mm, and 3 represented maximum diameters ≥10 mm. For the DWMHs, 1 represented maximum diameters <10 mm, 2 represented maximum diameters ≥10 mm and <25 mm, and 3 represented maximum diameters ≥25 mm. Finally, we classified the modified Fazekas scale using the WMH visual rating system (Table 1).
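The diameter thresholds stated above translate directly into a small lookup, sketched here for illustration. The function names are hypothetical. The sketch covers only the PWMH and DWMH grades 1 to 3; how an absent lesion is scored and how the two grades are combined into the final scale depend on Table 1 and are deliberately left out.

```python
def _grade(max_diameter_mm, low_mm, high_mm):
    """Return 1, 2, or 3 for a positive maximum diameter given two cut-offs."""
    if max_diameter_mm <= 0:
        # Assumption: absent lesions are handled elsewhere, not by this sketch.
        raise ValueError("expected a positive diameter")
    if max_diameter_mm < low_mm:
        return 1
    if max_diameter_mm < high_mm:
        return 2
    return 3

def pwmh_grade(max_diameter_mm):
    """PWMH: <5 mm -> 1, 5 to <10 mm -> 2, >=10 mm -> 3."""
    return _grade(max_diameter_mm, 5.0, 10.0)

def dwmh_grade(max_diameter_mm):
    """DWMH: <10 mm -> 1, 10 to <25 mm -> 2, >=25 mm -> 3."""
    return _grade(max_diameter_mm, 10.0, 25.0)
```

For example, a 7 mm PWMH maps to grade 2, whereas a 7 mm DWMH maps to grade 1.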

Performance Evaluation
We investigated the agreement on the modified Fazekas scale between our proposed method and experts with different years of experience. The multiple-rater agreement was assessed using Krippendorff's alpha [20], which provides the level of agreement between the visual gradings performed by the radiologists and our proposed method. The inter-rater agreement was assessed using the areas under the receiver operating characteristic curves (AUROCs) [21] for the proposed method and each radiologist's assessment, to present the correspondence between our proposed method and the radiologists. The AUROC summarizes the classification performance of two raters across decision thresholds in terms of the true-positive rate (TPR) and false-positive rate (FPR), and ranges from 0 to 1. Higher AUROCs indicate higher performance relative to the gold standard [21]. All performance evaluations were conducted using either R software version 3.4.3 (The R Foundation for Statistical Computing, Vienna, Austria) or Python version 3.7 (Python Software Foundation) with the scikit-learn library [22][23][24].

Multiple-Rater Agreement
To investigate the level of agreement between the different ratings, we assessed the multiple-rater agreement using Krippendorff's alpha (α) [25]. The multiple-rater agreements (α) with and without our proposed method for the DWMH scale, PWMH scale, and the modified Fazekas scale are shown in Table 2. The agreement on the modified Fazekas scale among the radiologists (R) and among the ratings including our proposed method (A) was substantial in both cases, with α = 0.732 and 0.694, respectively, according to the interpretation of Krippendorff's alpha [20]. The multi-rater agreement (α) was also substantial (R, 0.671; A, 0.658) for DWMH and moderate (R, 0.586; A, 0.579) for PWMH [20]. Note that the multi-rater agreement (α) among the radiologists' ratings only (R) was consistently higher (the modified Fazekas scale, +0.038; DWMH scale, +0.013; PWMH scale, +0.007) than the agreement among the radiologists' ratings together with our proposed method (A).
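For readers unfamiliar with the statistic, the following sketch computes Krippendorff's alpha from a units-by-raters table through the coincidence matrix. It assumes complete data (every rater grades every subject) and the nominal difference function; the text does not state which difference function (nominal or ordinal) was used in the study, so the values in Table 2 should not be expected to match this variant exactly. The function name is hypothetical.

```python
import numpy as np

def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha for a (n_units, n_raters) array without missing values."""
    ratings = np.asarray(ratings)
    n_units, n_raters = ratings.shape
    if n_raters < 2:
        raise ValueError("at least two raters are required")

    categories, codes = np.unique(ratings, return_inverse=True)
    codes = codes.reshape(ratings.shape)

    # counts[u, c]: number of raters who assigned category c to unit u
    counts = np.stack([(codes == c).sum(axis=1) for c in range(len(categories))],
                      axis=1).astype(float)

    # Coincidence matrix: ordered pairs of values within each unit, weighted by 1/(m - 1)
    coincidence = (counts.T @ counts - np.diag(counts.sum(axis=0))) / (n_raters - 1)

    n_c = coincidence.sum(axis=1)     # marginal totals per category
    n = n_c.sum()                     # total number of pairable values
    disagree_obs = n - np.trace(coincidence)
    disagree_exp = (n ** 2 - (n_c ** 2).sum()) / (n - 1)
    if disagree_exp == 0:
        return float("nan")           # only one category was ever used
    return float(1.0 - disagree_obs / disagree_exp)
```

Called on the 404 × 3 table of radiologist grades and on the 404 × 4 table that adds the automated grades, a function of this kind would yield the R and A columns, respectively.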

Inter-Rater Agreement
We determined the performance agreement using AUROCs. The agreements on the modified Fazekas scales determined by the radiologists and the proposed method are summarized in Table 3: G shows the evaluations between the radiologists (R1 vs. R2, R1 vs. R3, and R2 vs. R3), and M shows the evaluations between the raters and the proposed method (R1 vs. P, R2 vs. P, R3 vs. P). The interpretations of the area under the curve (AUROC) coefficients are as follows: 0.5, no discrimination; 0.6 to 0.7, poor discrimination; 0.7 to 0.8, acceptable discrimination; 0.8 to 0.9, excellent discrimination; 0.9 to 1.0, outstanding discrimination [26]. The average AUROC scores showed excellent discrimination (G 0.87 ± 0.06; M 0.83 ± 0.05) for the modified Fazekas scale 1, excellent (G 0.83 ± 0.08) and acceptable (M 0.77 ± 0.05) discrimination for the modified Fazekas scale 2, and acceptable discrimination (G 0.70 ± 0.10; M 0.79 ± 0.09) for the modified Fazekas scale 3. The average AUROC score for the agreement between the radiologists (G) was higher than that involving our proposed method (M) for the modified Fazekas scale 1 (+0.04) and the modified Fazekas scale 2 (+0.06). In contrast, M showed a higher score than G for the modified Fazekas scale 3 (+0.09).
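One way to obtain a per-grade AUROC between two raters is sketched below. It is an assumption, not the documented procedure: one rater is treated as the reference, each grade is binarized one-vs-rest, and the other rater's binarized grade is used as the score. With a binary score, the AUROC reduces to the mean of sensitivity and specificity. The function names are hypothetical.

```python
import numpy as np

def auroc(labels, scores):
    """Rank-based AUROC: P(score_pos > score_neg) + 0.5 * P(tie)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("both classes must be present")
    diff = pos[:, None] - neg[None, :]          # all positive/negative pairs
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def grade_auroc(reference, other, grade):
    """One-vs-rest AUROC of `other` against `reference` for a single grade."""
    reference = np.asarray(reference)
    other = np.asarray(other)
    return auroc(reference == grade, other == grade)
```

Averaging such values over the rater pairs R1 vs. R2, R1 vs. R3, and R2 vs. R3 would correspond to G, and over R1 vs. P, R2 vs. P, and R3 vs. P to M.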

Discussion
In this study, we demonstrated a fully automated visual grading system for WMH using the modified Fazekas scale on T2-FLAIRs. Our approach aimed to automate the visual grading of the modified Fazekas scale by utilizing deep learning and rule-based algorithms with quantifiable imaging-driven measurements obtained from T2-FLAIR exclusively. To our knowledge, this study is the first attempt to automate WMH visual grading using the modified Fazekas scale [11]. Theoretically, since our proposed method is a computational implementation, it should be more accurate than the manual measurements of the human raters when it comes to measuring the diameter of WMHs. Nevertheless, performance was evaluated by comparing our results with the radiologists' assessments, for two main reasons. First, the main goal of this method is to help doctors reduce labor time and cost on a daily basis. Second, since ours is the first software to implement the modified Fazekas scale, comparison with other software was impossible. Hence, we compared our proposed method with human raters using multiple-rater and inter-rater agreements, which showed a high correspondence. Further investigation of the intraclass correlation coefficient (ICC) between software tools would be preferable [27].
The multiple-rater agreement investigation (rating agreements with and without our proposed method) suggested that the level of agreement of our approach was comparable to that among the radiologists. We used Krippendorff's alpha (α), which indicates the reliability of multiple raters for multiple categories [28]. Our results indicated that the agreement among the radiologists was similar to the agreement among the radiologists and our proposed method (Table 2). We noticed that '(A) with the proposed method R1, R2, R3, and P' had slightly lower agreement than '(R) without proposed method R1, R2, and R3' for all scales. The lower agreement with our proposed tool is due to the nature of Krippendorff's alpha, as its formula contains weights on the number of raters in the denominator [20].
The inter-rater agreement between the radiologists and our proposed method also demonstrated equivalent performance in terms of AUROC, which indicates the classification performance on the modified Fazekas scale between two raters. The average AUROC showed minimal differences in the comparisons within radiologists (G) and be- The average AUROC coefficient being higher for the lower modified Fazekas grades means that the radiologists performed better than our proposed method for small WMH burdens. In contrast, for grade 3 of the modified Fazekas scale, our proposed method performed better than each individual radiologist and also in terms of the average AUROC coefficient. This indicates that our method may be clinically useful for objective disease severity evaluation in large WMH burdens. Regardless, the combined AUROC of the modified Fazekas scales demonstrated that the performance value between G and P was comparable (G 0.80 ± 0.09 vs. P 0.80 ± 0.03), suggesting that our proposed method is clinically useful as an objective indicator for WMH evaluation.
Our study has a few limitations. The modified Fazekas scale that we implemented may not be as widely used as the original version. However, since the original Fazekas scale is not quantifiable and is based on qualitative and subjective grading, we had to implement a scale that is applicable to automatic analysis. Additionally, our proposed system is still under development and has mostly been tested using 2D T2-FLAIRs. While this approach can be extended to any T2-FLAIR protocol, its performance may vary depending on the protocol. Future validation studies are needed to generalize our approach. Another limitation is the lack of ground-truth data, that is, data collected on a large scale, for the modified Fazekas scale. We validated our approach against the three radiologists, whose results were used as the standard for comparison. As we observed in our results, the three radiologists did not agree perfectly, and a ground truth for the modified Fazekas scale has not been established at this point. To overcome the lack of ground truth, further studies involving more experienced experts are needed to establish the gold standard for the modified Fazekas scale.
This study presented an automated modified Fazekas scoring approach using objective measurements derived from T2-FLAIR and showed its performance against certified neuroradiologists. More work is needed to show our approach's applicability to research and clinical settings in the near future. Even so, we believe the present work could contribute to both the scientific community and clinical environments by offering automated analysis for modified Fazekas scoring, especially for large-scale or multi-site WMH research.

Conclusions
We introduced a fully automated visual grading system for WMH on T2-FLAIRs based on deep learning and rule-based algorithms utilizing the modified Fazekas scale. As we aimed, the results of our method were comparable to those of the three certified radiologists who used the visual grading method. We believe that our proposed method may assist clinical work and radiologists' reading with its fully automated, quantifiable, and consistently measured Fazekas scale.

Availability of Data and Materials
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.