Machine Learning Application With Quantitative Digital Subtraction Angiography for Detection of Hemorrhagic Brain Arteriovenous Malformations

Clinical features are the primary measures used for risk assessment of cerebrovascular diseases. However, clinical features, especially angioarchitecture, in digital subtraction angiography require further interpretation by specialized radiologists. This approach for risk assessment requires multivariable analysis and is, therefore, challenging when completed manually. In this study, we employed three machine learning models, namely the random forest, naïve Bayes classifier, and support vector machine, for the detection of hemorrhagic brain arteriovenous malformations using digital subtraction angiography. Quantitative measurements from digital subtraction angiography were used as features, and the chi-squared test, minimum redundancy maximum relevance, ReliefF, and two-sample $t$ tests were used for feature selection. Bayesian optimization was conducted to optimize the hyperparameters of the three models. The random forest model outperformed the other two models. As a human control, three radiologists diagnosed an independent testing data set. The random forest model had a computation time of less than a second for the whole data set for classification. Accuracy and the area under the receiver operating characteristic curve were 92.7% and 0.98 for the training data set and 85.7% and 0.97 for the independent testing data set, respectively. Compared with the mean diagnosis time of approximately half a minute per patient and the highest accuracy of 76.2% for the three radiologists, the random forest model was faster and more accurate for our data set. These results suggest that the machine learning model based on hemodynamic features from quantitative digital subtraction angiography is a promising tool for detecting hemorrhagic brain arteriovenous malformations.


I. INTRODUCTION
In digital subtraction angiography (DSA), continuous X-ray acquisition with contrast agent injection in the target vessels and subtraction of images without contrast are employed to eliminate all background structures, except for the enhanced The associate editor coordinating the review of this manuscript and approving it for publication was Hongjun Su.
vasculature. Because it has the highest spatial and temporal resolution among all clinical imaging modalities, it is considered the gold standard for the diagnosis of cerebrovascular diseases, such as brain arteriovenous malformations (BAVMs). A BAVM is an abnormality of the vasculature occurring when an artery directly connects to a vein without a buffer capillary. This abnormality leads to a high-flow shunt and eventually results in the rupture of the vessel, VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ causing intracranial hemorrhage with a high risk of death. This disease is rare but fatal, occurring in 1.42 people (95% confidence interval 1.3-1.6) and being fatal in 0.70 people (95% confidence interval 0.6-0.8) per 100,000 personyears [1]. The precise diagnosis of hemorrhagic BAVMs is crucial for treatment planning. Many treatment strategies are available for BAVMs, including medication, embolization, microsurgery, and stereotactic radiosurgery. A randomized trial revealed that medication is superior to any other intervention therapy for an unruptured BAVM [2]. Therefore, the level of hemorrhagic risk is a critical factor in treatment decisions.
Relevant clinical features, including clinical presentation and angioarchitecture [3], are associated with rupture incidence. Although such clinical features are straightforward, their accurate assessment requires complex computed tomography (CT) or magnetic resonance imaging (MRI) protocols, long scanning time, and experienced radiologists. Besides, intranidal aneurysms and complexed venous outlets are occasionally beyond the scope of CT and MRI. Furthermore, the overall clinical features of individual BAVMs are too complex, with approximately 30 variables, for clinical radiologists to fully comprehend. Additionally, inconsistent results have been obtained regarding the relationship between clinical features and rupture risk. For example, one study revealed the size of the AVM nidus to be associated with intracerebral hemorrhage [4], whereas another indicated that it is not a risk factor [5]. The lack of consistency in previous study results might be caused by differences in the population of the patient cohort, subjective clinical confounders, or radiologists with different levels of expertise. Although angioarchitectural analysis is conventionally used in hemorrhagic risk analysis, it is occasionally criticized for being too subjective, and many years of training are required to master the necessary interpretive skills.
Quantitative DSA (QDSA) is a different approach for assessing risk levels. QDSA defines the salient features from the time-density curve (TDC) of a selected region of interest (ROI) as an objective method for quantifying the hemodynamics of the cerebrovasculature. QDSA provides a surrogate hemodynamic marker and has shown promising results for evaluating the severity of several cerebrovascular diseases, such as dural arteriovenous fistula, carotid stenosis, and BAVMs [6]- [8]. A ruptured BAVM leads to hemorrhage and results in the change of intracranial pressure as well as the hemodynamic conditions. QDSA features can reflect the functional discrepancy between unruptured and ruptured BAVMs. For example, in our previous study [8], we found that the higher the stasis index (SI; a QDSA feature determined by the wash-in slope [WI] divided by the absolute value of the wash-out slope [WO] of the TDC), the higher the level of stagnant outflow. Compared with clinical features, the quantification of temporal information is less intuitive; thus, it has not been commonly used in clinical practice. Nonetheless, QDSA is clinically easier to assess than clinical features; thus, it has more potential for clinical application.
In the clinical setting, it is challenging for radiologists to assess multiple clinical features and hemodynamic QDSA features simultaneously. Machine learning is a powerful approach for managing massive data with high-dimensional features; it has been used to predict the outcomes of several cerebrovascular diseases [9], [10]. To our knowledge, no researcher has used the temporal features of DSA to build machine learning models for the assisted diagnosis of hemorrhagic BAVMs and compared its performance with that of clinical radiologists. In this study, we aimed to detect the ruptured BAVMs by using QDSA features and machine learning algorithms. Furthermore, we compared model performance to the diagnosis by clinical radiologists.

II. STUDY COHORTS AND IMAGING PROTOCOL
This study was approved by the local institutional review board. Cases of 171 patients with BAVMs between 2011 and 2019 were collected retrospectively and randomly separated into the training and testing data sets for the analysis. A total of 150 cases were used in the data set used to train the models, and a separate data set comprising 21 cases was used for testing. No significant differences were found in age, gender ratio, and hemorrhage BAVM percentage between the training and testing data sets (see Table 1). All BAVM diagnoses were confirmed using DSA acquisition. The intracranial hemorrhages caused by the BAVMs were further validated using imaging modalities such as CT and MRI as the gold standard. Images for all cases were collected using the same scanner (Artis zee, Siemens Healthcare) following a standardized routine protocol. The anterior-posterior and lateral projections were simultaneously acquired because of the biplane design of the scanner. The power injector (Angiomat ILLUMENA, Liebel-Flarsheim) with a 4-French size catheter was preset to 1000 psi and located at the C4-6 vertebral body level for common carotid and vertebral artery injection. In order for standardization, we choose those placed at the C4 level with good image quality and without complications after the examination. At 0.2 s after image acquisition commencement, diluted contrast agent (Omnipaque-350 with the total volume of 12-14 mL) was injected for 1.5 s to form a bolus. Two frame rates were used for acquiring DSA images. One was constant at a rate of 6 frames per second (FPS) to the end. Another was dynamic at 7.5 FPS for the initial 5 s, followed by 4 FPS for 3 s, then 3 FPS for 2 s, and finally 2 FPS to the end. The radiographer determined the duration of acquisition based on adequate visualization of the venous structure. Overall acquisition times ranged from 5 to 15 s. Fig. 1 displays the key features of QDSA analysis. Clinical radiologists delineated the ROI at the nidus of the BAVM for the time-density analysis (Fig. 1A, B). The TDC denotes the average grayscale density within the ROI at each time point (Fig. 1C). It was fitted by the gamma-variate function [11] with the trust-region-reflective least square algorithm [12] to reduce random noise and recirculation caused by overlapping vessels. The form of the fitted function is given as follows:

III. QUANTITATIVE DSA
where D is the function of grayscale density with respect to time t, t 0 is the appearance time, K is the scaling constant, and α and β are the arbitrary parameters. D = {d 1 , d 2 , . . . , d N } corresponds to time T = {t 1 , t 2 , . . . , t N }, where N is the total number of temporal points.
Eight hemodynamic parameters ( Fig. 1D) were quantified from the fitted TDC: peak density (PD), bolus arrival time (BAT), time to peak (TTP), WI, WO, SI, full width at half maximum (FWHM), and area under the curve (AUC TDC ). They were defined as follows: PD is the maximum value of the density. BAT is the time point at which the density reaches D(t a ), which is the arrival density and is set at 20 arbitrary units. The argmax operator refers to the argument of the maximum in a function, and TTP is the argument, which is the time point at which the maximum value of the fitted function occurs. WI and WO are the increasing and decreasing slopes of the fitted TDC, respectively. FWHM is the duration between the time points at the densities D(t h1 ) and D(t h2 ), which are the densities with values that are half of the PD value. AUC TDC is the integral of the fitted function, which is the area under the TDC. SI is calculated by the division of WI by the absolute value of WO [8]. There were two projections with 8 parameters each; thus, the total number of QDSA features was 16 for each participant in the study.
As described in the imaging protocol section, two frame rate types were used: one was constant and the other was dynamic. The original TDC conducted using a constant rate was resampled using a dynamic one; then, curve fitting was applied to both curves. The quantitative parameters derived from the fitted curves using constant and dynamic rates were the same.

IV. MACHINE LEARNING APPROACHES
Feature selection is a prerequisite to building machine learning models, and redundant features used in training may degrade model performance for prediction. Here, the filter type method [13] was used, and features were all normalized to zero mean and unit variance. Three commonly used machine learning algorithms, namely random forest, naïve Bayes, and support vector machine (SVM), were used for training and testing. The machine learning models have different hyperparameters that are external to the model and must be determined before model training. Table 2 lists the hyperparameters that were optimized in our study, and the following section describes the algorithms and models used.

A. FEATURE SELECTION
Four types of algorithm were used to rank the features: the chi-squared test, minimum redundancy maximum relevance (MRMR) algorithm [32], ReliefF [33], and two-sample t test.
Each feature ranking algorithm is described in brief as follows: 1) Chi-squared test: This test was used to examine the individual dependency between the QDSA feature and VOLUME 8, 2020 the target variable (i.e., BAVM hemorrhage). If dependency was found between the feature and target variable, the p value would be small. The ranking was based on the value of − log (p). 2) MRMR: This minimizes the redundancy of a set of features and maximizes the relevance of the feature set corresponding to the target variable based on mutual information so that feature importance is ranked. 3) ReliefF: This algorithm is used to perform scoring based on the relationship between a random observation and its nearest neighbors. It iteratively updates the score by a positive term if a feature is different from its neighbors with different target variables and by a negative term if a feature is different from neighbors with the same target variable. 4) Two-sample t test: This is a statistical test of the difference in feature means between the target variables.
If the mean of a QDSA feature in one target variable was different from the other, the p value would be small.
− log (p) was applied for feature ranking.

B. RANDOM FOREST
In a random forest [17] the concept of a majority vote is applied. A random forest combines many decision trees, and the final classification depends on the class that has the most votes where a tree gives each vote. Each decision tree is built using a bootstrap data set, which comprises randomly sampled data with replacement from the training data set, and is evaluated using out-of-bag data. Two hyperparameters are to be optimized in the random forest, namely the number of trees and minimum leaf size. Generally, the more trees there are in a random forest, the higher is the performance, but performance decreases beyond a certain point. The minimum leaf size is the smallest number of observations in a decision tree, and the model tends to overfit when the leaf size is too large. The random forest is a classic ensemble learning method with proven applicability in many settings [19], [30], [31].

C. NAÏVE BAYES
The naïve Bayes classifier [18] is based on Bayes' theorem, and it assumes that features are independent and Gaussian distributed. The classification results for a new data set are determined by its likelihood of being posteriorly distributed, which is proportional to the multiplication of the prior and likelihood in the training data set. The prior of the data is the distribution of each class, and the likelihood is the distribution of each predictor. Because the assumption of a Gaussian distribution may not hold for features, the use of a kernel estimator in naïve Bayes has been suggested to provide higher performance than the Gaussian naïve Bayes [19].

D. SVM
The SVM [20] is an algorithm that finds a hyperplane to separate each class with the maximal margin. It is easy to find the boundary in linearly separable data distribution; however, the data is not always ideally distributed. Generally, a kernel method is applied to map the data to a higher dimension and make it separable, and the type of the kernel is a hyperparameter. The distance between the data and the hyperplane is the margin. A hard margin can make a complete separation but is sensitive to the outlier and the overlapping of classes. On the other hand, a soft margin allows for misclassifications within the margin and is more robust in real-world applications. The box constraint is a penalty term that decides the cost of the misclassification, and the higher it is, the harder the margin becomes.

E. BAYESIAN OPTIMIZATION
A fundamental method for adjusting the hyperparameters involves testing every possible combination to find the one with the highest performance, but this is time consuming. The aim of Bayesian optimization [14] is to optimize the loss function composited using different hyperparameters. It is based on the Gaussian process and assumes a multivariate normal distribution of the loss function at every point of the hyperparameters. Standard deviation, which represents the uncertainty of the loss value at a certain point, decreases once that point is known. An acquisition function is applied to decide the next point of the hyperparameters by accounting for the mean and standard deviation of the estimated loss function, and the estimated loss eventually approaches the true loss. The estimated minimum loss was obtained through an acceptable number of iterations, that is, 100 loops in our experiment. Compared with the trial of every possible combination of the hyperparameters, Bayesian optimization was an efficient approach for the optimization of the hyperparameters [14]. Additionally, we applied 5-fold cross-validation for the training data set to prevent the model from overfitting. The whole training data set was trained again using the optimal hyperparameters to achieve the final model for the classification of the training and testing data sets.

V. EXPERIMENTAL SETUP AND WORKFLOW
The experiment was performed on a personal computer with intel CPU Core-i7 9700 and 64 gigabytes of RAM.
The computer programming language MATLAB v. R2019b (MathWorks, Inc.) was used to perform time-density analysis and the model learning process. The Statistics and Machine Learning Toolbox and Optimization Toolbox in MATLAB were utilized to fit the TDC, rank the features, build the machine learning models, and optimize the model hyperparameters.   [8]. Second, the four feature selection algorithms were performed on QDSA features to rank the features for machine learning. Machine learning models were trained using the ranked features, and the hyperparameters of each model were optimized using Bayesian optimization through 5-fold cross-validation. The classification of both the training and testing data sets were generated for evaluation. Accuracy, sensitivity, specificity, area under the receiver operating characteristic curve (AUC ROC ), and computation time were obtained to evaluate model performance. Because the optimal model was assumed to have the ability to obtain the highest value and lowest deviation between the training and testing data sets, we chose the models based on the following criteria: the highest average performance of the training and testing data sets subtracted by the difference between them. To provide a comparison, three experienced radiologists diagnosed the same DSA testing images based on their visual inspection and empirical knowledge, as they do in clinical practice.

A. STATISTICS AND FEATURE RANKING
The mean and standard deviation of QDSA values corresponding to the BAVM hemorrhage and the feature ranking are presented in Table 3. Of 16 hemodynamic parameters, 9 and 13 parameters exhibited significant differences in the chi-squared test and two-sample t test, respectively. The MRMR and ReliefF are not statistical tests; thus, no significant p value is available.
The mean PD was higher in the nonhemorrhage group, and the difference was statistically significant for both anterior-posterior and lateral projections in both statistical tests. The mean BAT was earlier in the nonhemorrhage group, and this was statistically significant for both projections VOLUME 8, 2020 in the t test. An earlier mean TTP was observed for the anterior-posterior projection, but the difference was not statistically significant. The mean inflow and outflow gradients, WI and WO, respectively, were significantly steeper in the nonhemorrhage group. A higher mean SI (the ratio of WI and the absolute value of WO) was observed in the hemorrhage group, and the difference was statistically significant for both projections in both statistical tests. The mean FWHM was narrower in the nonhemorrhage group, but the difference was only significant for the lateral projection in the t test. A smaller mean AUC TDC was observed in the hemorrhage group, and the difference was statistically significant for both projections in both statistical tests.

B. MODEL PERFORMANCE
The models' output was the probability of rupture and set at 0.5 to calculate accuracy, sensitivity, and specificity. The highest value and lowest deviation between the training and testing data sets were the criteria for selecting the optimal model. Moreover, the models with the least features were preferred if their performance was the same. The hyperparameter values were different for every model with different numbers of features. Table 4 lists the hyperparameters of the chosen machine learning models, and Table 5 presents their performance before and after feature selection. All models had higher performance after feature selection than before it. Before feature selection (all 16 features involved in training), accuracy of the training data set was the highest 81.3% for the random forest and the lowest 74.7% for the SVM. The accuracy of the testing data set was the highest 85.7% for the random forest, and the lowest 76.2% for the naïve Bayes classifier and SVM. The sensitivity of the training data set was the highest 76.5% for the random forest and the lowest 64.7% for the SVM. For the testing data set, sensitivity was the highest 88.9% for the random forest and the lowest 55.6% for the naïve Bayes classifier. The specificity of the training data set was the highest 85.4% for the random forest and the lowest 82.9% for the naïve Bayes and SVM models. For the testing data set, the specificity was the highest 91.7% for the naïve Bayes classifier and the lowest 83.3% for the random forest and SVM. For the training data set, the highest AUC ROC was 0.93 for the random forest and the lowest one was 0.84 for the SVM. For the testing data set, the highest AUC ROC was 0.89 for the random forest, and the lowest one was 0.84 for the naïve Bayes classifier.
After feature selection, the highest performance of the random forest and naïve Bayes models was obtained using the top six features and the top eight features ranked by ReliefF, respectively. In the SVM, the highest performance was obtained using the top seven features ranked by the t test. The optimal model among the three machine learning algorithms was the random forest, and its accuracy, sensitivity, specificity, and AUC ROC were 92.7%, 91.2%, 93.9%, and 0.98 for the training data set, respectively. For the testing data set, they were 85.7%, 100%, 75.0%, and 0.97, respectively. For the training data set, the accuracy, sensitivity, specificity, and AUC ROC of the naïve Bayes model were 81.3%, 76.5%, 85.4%, and 0.91, respectively. For the testing data set, they were 81.0%, 66.7%, 91.7%, and 0.86, respectively. The SVM model had accuracy, sensitivity, specificity, and AUC ROC of 83.3%, 83.8%, 82.9%, and 0.91, respectively. For the testing data set, they were 81.0%, 88.9%, 75.0%, and 0.96, respectively. After feature selection, all the models had fewer features, and their performance was improved. The computation time of the machine learning models for the whole testing data set was less than 1 s.

C. DIAGNOSIS BY RADIOLOGISTS
Three radiologists (one neuroradiologist with 15 years of experience, one neuroradiology fellow, and one resident with 4 years of training) diagnosed all cases in the testing data set based on DSA images. Table 6 describes their performance in terms of accuracy, sensitivity, specificity, and AUC ROC . The average accuracy was 66.7%, with a minimum value of 52.4% and a maximum value of 76.2%. The fellow had the highest performance, with sensitivity of 66.7% and specificity of 91.7%. Their mean diagnosis times ranged from 17 to 33 s, with an overall mean of 28 s. An AUC ROC of 0.78 was obtained from the average of the three diagnoses.

A. SUPERIOR PERFORMANCE BY THE PROPOSED MODEL
In this study, we used QDSA, which provides temporal hemodynamic information, to train machine learning models to detect hemorrhagic BAVMs. Of our models, the random forest model had the highest performance, with accurate detections of 92.7% and 85.7% and AUC ROC values of 0.98 and 0.97 for the training and testing data sets, respectively. The sensitivity and specificity of the random forest model were 91.2% and 93.9% for the training data set and 100% and 75.0% for the testing data set. Fig. 3 displays the receiver operating characteristic curves for the random forest model for the training and testing data sets as well as the mean diagnostic results by radiologists for the testing data set, with AUC ROC values of 0.98, 0.97, and 0.78, respectively. The results showed that the random forest model trained with hemodynamic features from QDSA analysis was superior to the diagnoses of radiologists and could help clinical radiologists with complex risk assessment.

B. RISK ASSESSMENT WITH CLINICAL FEATURES
Clinical features are currently the primary measures used for hemorrhagic risk assessment, and such features have been well investigated. Guidelines concerning the clinical features of BAVMs are available and provide a standard for researchers and trials [3]. The Spetzler-Martin grade, based on the eloquent area, nidus size, and deep location, has been used for the risk assessment of BAVM-related brain surgery [21]. Nevertheless, this grading system is not intended for hemorrhagic risk assessment. One study revealed that input flow pressure and venous outflow restriction are meaningful predictors of hemorrhagic BAVMs [22]. A meta-analysis showed that the deep location of the BAVM, a small number of drainage veins, flow-related aneurysms, and venous ectasia are all independent factors related to hemorrhage risk [5]. Many researchers have obtained the same results in terms of the relationship between venous angioarchitecture and the risk of BAVM hemorrhage [4], [5], [23]. Nonetheless, despite the proven relationship between clinical features and hemorrhagic risk, a precise evaluation of the angioarchitecture requires a clinical radiologist with years of experience, and such an evaluation takes considerable interpretation time. Other modalities with 3-dimensional spatial information are also required under certain circumstances. In a clinical emergency, this is not an ideal approach for hemorrhagic risk assessment.

C. STUDIES REGARDING QUANTITATIVE DSA
In contrast to clinical features, QDSA can be conducted immediately by defining vascular ROIs directly on DSA images. The quantitative parameters of TDC objectively reflect the hemodynamics of the cerebrovasculature. The difference in hemodynamics is important, and a study demonstrated the difference within various brain vessels through an independent component analysis to separate the artery, capillary, and vein using TDC [24]. Similarly, normal and abnormal brain vessels have different hemodynamic properties. A relative TTP was prolonged in a patient with the stenoocclusive arterial disease before stent placement compared with a normal control [25]. BAT, TTP, WO, FWHM, and AUC TDC were reduced in patients with carotid stenosis after stenting [7], indicating increased flow through the carotid artery after treatment. Additionally, TTP has been used to estimate the cerebral circulation time for the detection of vasospasms in patients with subarachnoid hemorrhages [26]. A double peak sign of the TDC was observed in dural arteriovenous fistula and is used for the evaluation of venous drainage function [6].
Regarding BAVMs, a previous study found that the TTP of the feeding artery was lower in patients with hemorrhages than in patients with seizures, and TTP was used to assess hemorrhagic risk in BAVMs by QDSA [27]. Another study revealed that the mean transit time and the ratio of the mean transit time of the draining to the feeding vessels showed statistical differences between hemorrhagic and nonhemorrhagic BAVMs [34]. A recent study also found a prolonged mean transit time in restricted venous drainage and a prolonged cerebral circulation time in BAVMs with hemorrhage [35]. Additionally, in a previous study, we introduced the SI and found that it was associated with supratentorial BAVM rupture at the most dominant drainage vein [8].
In that research, we also demonstrated that BAT, TTP, SI, and FWHM at the cavernous internal carotid artery as well as PD, BAT, SI, and FWHM at the drainage vein all differed significantly between patients with and without hemorrhage. In the present study, instead of measuring the feeding artery and drainage vein, we directly assessed the nidus of the BAVM. We found that almost all QDSA features (except for TTP for both anterior-posterior and lateral projections as well as FWHM for the anterior-posterior projection) were significantly different between patients with and without hemorrhage. On average, PD, WI, and AUC TDC were higher in nonhemorrhagic patients, and BAT, WO, SI, and FWHM were lower in nonhemorrhagic patients. This phenomenon was related to more stagnant outflow in the hemorrhagic BAVM [8], [35]. All of the aforementioned findings thus support the efficacy of QDSA for assessing cerebrovascular diseases.

D. PREVIOUS MODELS
Most relevant studies have focused on finding predictors of BAVM hemorrhage from clinical features [4], [5], [8], [23], [36], [37]. Feghali et al. developed a scoring model termed R 2 eD [36] based on multivariable logistic regression to detect hemorrhagic BAVMs based on race (nonwhite vs. white), exclusive deep location, BAVM size (small vs. large), venous drainage (exclusive deep vs. other), and monoarterial feeding (1 vs. >1 feeding artery). They used the hold-out method, in which half of the data was split as the validation data set to observe if the model was overfitting; the AUC ROC was 0.685. Although clinical features are more intuitive, variation in hemodynamics can directly alert clinicians to the risk of a hemorrhagic BAVM. Our previous study introduced the SI [8], which reflects hemodynamic differences between hemorrhagic and nonhemorrhagic BAVMs, to build a predictive model using logistic regression, and the AUC ROC was 0.75. Huang et al. combined clinical features with TTP to build a statistical model using multivariable logistic regression, and the AUC ROC was 0.838 [37]. In that study, the combination of clinical features with QDSA improved the predictive model. Table 7 of this article summarizes the predictors and performance of the predictors from the aforementioned studies. In this study, we used QDSA from BAVM nidus through machine learning algorithms with proper feature selection and hyperparameter optimization and found AUC ROC values of 0.98 and 0.97 for the training and testing data sets, respectively. Our study verified the feasibility and application of machine learning models for detecting hemorrhagic BAVMs using QDSA.

E. REDUNDANCE OF CLINICAL FEATURES
The proposed machine learning method can efficiently detect hemorrhagic BAVMs using QDSA features. Compared with clinical radiologists, it performed faster and more accurately. It can assist in diagnosis in clinical settings. Because machine learning algorithms can effectively manage high-dimensional features, it is reasonable to hypothesize that the use of QDSA combined with clinical features may improve detection accuracy. However, we found that the inclusion of clinical features, namely seizure, headache, focal deficit, supratentorial location, dilated feeder artery, sprouting angiogenesis, nonsprouting angiogenesis, single drainage, multiple drainage, number of drainage veins, venous rerouting, venous ectasia, exclusive deep veins, and venous sacs, did not result in a clear improvement in results. The accuracy, sensitivity, specificity, and AUC ROC of the model with clinical features were 71.4%, 55.6%, 83.3%, and 0.91, respectively, for the testing data set. Given the direct availability of QDSA, it could be more useful than clinical features.

F. SUPERIORITY OF THE RANDOM FOREST MODEL
Our experimental result revealed that the random forest model outperformed the naïve Bayes and support vector 204580 VOLUME 8, 2020 models. The main limitation of the naïve Bayes model is the assumption of independence among the predictor features. Because QDSA features are quantified from the same TDC, they do not meet the independence assumption. In terms of SVM, performance is limited when features between classes have overlapping distributions. Despite the presence of statistically significant differences in QDSA features between the nonhemorrhagic and hemorrhagic groups, an overlap was observed from their mean and standard deviation values.
A random forest model involves the implementation of the universal concept that two heads are better than one, and it does not require any assumptions. By combining multiple decision trees, the random forest model can balance the bias-variance tradeoff associated with most machine learning methods, and this enhances its performance. Additionally, because resampling bootstrap data sets are used, the random forest model can robustly manage data sets with limited sample sizes [30]. Earlier studies have demonstrated its feasibility and efficacy in practical application [19], [31]. Studies have also shown the superiority of the random forest model to other single-learner-based machine learning models. A study compared the performance of different classification algorithms applied to asphalt pavement deterioration data and reported that the random forest model outperformed the naïve Bayes model [19]. Another study that assessed groundwater potential demonstrated that the random forest model was superior to kernel-mapped SVM models [31]. In the present study, the random forest model was trained using QDSA features and effectively detected hemorrhagic BAVMs on DSA, thereby improving the rupture assessment.

G. FEATURE SELECTION AND HYPERPARAMETER OPTIMIZATION
The purpose of feature selection is to eliminate redundant or irrelevant features such that the model exhibits higher performance and becomes simpler to train and easier to interpret. A study showed higher model performance than other models in the detection of Parkinson disease from voice signals, and the related researchers used the chi-squared model to rank features for the removal of noisy features [38]. Another study developed a two-stage model to diagnose heart failure by using a statistical model based on mutual information at the first stage and a neural network at the second stage, and the performance of this model was higher than the other 28 prediction models [39]. A feature ranking algorithm named ReliefF [33] was used to predict compressive strength and concrete composition, and the result revealed higher correlation and fewer errors than the model without feature selection using ReliefF [40]. In addition, a ReliefFbased artificial neural network had 100% cross-validation accuracy for fault diagnosis in ball bearings [41]. In the present study, we used the aforementioned algorithms to rank features and perform selection to find the optimal subset. Higher performance was found after feature selection. Moreover, the optimal model was the random forest trained using the top six features ranked using the ReliefF algorithm. Fig. 4 presents the violin plot of the top two and last two features ranked using ReliefF. The more distinct distribution between nonhemorrhage and hemorrhage was observed for the top two features than for the last two features. Hyperparameter optimization is another important concern in the development of an efficient machine learning model. The common scheme for finding appropriate hyperparameters involves empirical manual tuning. With this method, the result highly depends on the operator's experience, and finding the optimal solution cannot be guaranteed every time. Grid search is another method for testing every combination of hyperparameters and for finding the globally optimal hyperparameters; however, it is time consuming and impractical for the current application. Alternatively, Bayesian optimization [14] and the genetic algorithm [43] are two efficient methods for hyperparameter optimization. The genetic algorithm is an efficient approach inspired by Darwin's theory of natural evolution. It evaluates and selects parent candidates with favorable performance from a given population. Then, the offspring take the properties from the parent through crossover, and some properties are altered through mutation to prevent the model from becoming stuck in a local optimum. The genetic algorithm has been used in machine learning and achieves satisfactory performance [45], [46]. In this study, we attempted to use both Bayesian optimization and the genetic algorithm with the same computational costs (100 models) to optimize the hyperparameters of the random forest with all QDSA features. The accuracies were 81.3% and 85.7% for the training and testing data sets, respectively, for Bayesian VOLUME 8, 2020 optimization and 78.7% and 81.0%, respectively, for the genetic algorithm. Because the Bayesian optimization had performance superior to that of the genetic algorithm for our data sets, we adopted it in this study. Nevertheless, a previous study demonstrated that the genetic algorithm could be more effective than Bayesian optimization [47]. Therefore, depending on the data set, either the Bayesian optimization or genetic algorithm can be considered for hyperparameter optimization when machine learning models are being developed.

H. STUDY LIMITATIONS
Despite the favorable outcomes, this study has some limitations. First, we did not examine the uncertainties caused by different protocols and modalities. A previous study reported the complete mixing of the contrast agent with blood flow and that it did not separate once mixed [28]. This study proved that the TDC could represent the real hemodynamics of blood flow, and it did not show variability under typical imaging protocols. Second, the selection of ROIs was not the same every time, and we did not account for intraobserver and interobserver reliability. Previous studies, however, have demonstrated the consistency of QDSA by showing its robust and clinical reproducibility [6]- [8], [24]- [26]. Third, even though two projections of 2D-DSA were acquired simultaneously, it was not possible to fully compensate for the overlapping of vasculatures. A 4D-DSA technique has been developed recently, offering both three-dimensional structure and temporal cerebrovasculature information [29]. This is a solution to the problem of overlapping vessels and has the potential to achieve better performance than 2D-DSA in the application of QDSA. Future studies are warranted on the external validation and reliability of ROI selection. Finally, although the proposed model trained by QDSA features has superior performance, modern angiography machines can perform ''flat-panel CT'' to confirm intraoperative bleeding during angiography in one or two minutes. Our proposed model may assist radiologists in determining the hemorrhage risk before additional radiological examinations such as flat-panel CT.

VIII. CONCLUSION
In clinical settings, the evaluation of angioarchitecture needs to be performed by specialized radiologists. This is time consuming and tends to be subjective. Conversely, QDSA objectively quantifies the temporal information of the cerebrovasculature by delineating the nidus of the BAVM on DSA images without further assessment by clinical radiologists. In this study, the accuracy of detecting hemorrhages through machine learning models based on QDSA features exceeded that of diagnosis by trained radiologists in the detection of hemorrhagic BAVM. QDSA combined with the random forest model is thus a promising approach for the evaluation of hemorrhagic BAVMs and has the potential to be used for detecting other cerebrovascular diseases. He has been certified as a Medical Radiation Technologist by the Taiwan's Ministry of Health and Welfare. His main research interests include medical image processing, machine learning, and interventional imaging. VOLUME 8, 2020