Artificial Neural Network-Assisted Classification of Hearing Prognosis of Sudden Sensorineural Hearing Loss With Vertigo

This study aimed to determine the impact on hearing prognosis of the coherent frequency with high magnitude-squared wavelet coherence (MSWC) in video head impulse test (vHIT) among patients with sudden sensorineural hearing loss with vertigo (SSNHLV) undergoing high-dose steroid treatment. This study was a retrospective cohort study. SSNHLV patients treated at our referral center from December 2016 to December 2020 were examined. The cohort comprised 64 patients with SSNHLV undergoing high-dose steroid treatment. MSWC was measured by calculating the wavelet coherence analysis (WCA) at various frequencies from a vHIT. The hearing prognosis were analyzed using a multivariable Cox regression model and convolution neural network (CNN) of WCA. There were 64 patients with a male-to-female ratio of 1:1.67. The greater highest coherent frequency of the posterior semicircular canal (SCC) was associated with the complete recovery (CR) of hearing. After adjustment for other factors, the result remained robust (hazard ratio [HR] 2.11, 95% confidence interval [CI] 1.86-2.35). In the feature extraction with Resnet-50 and proceeding SVM in the horizontal image cropping style, the classification accuracy [STD] for (CR vs. partial + no recovery [PR + NR]), (over-sampling of CR vs. PR + NR), (extensive data extraction of CR vs. PR + NR), and (interpolation of time series of CR vs. PR + NR) were 83.6% [7.4], 92.1% [6.8], 88.9% [7.5], and 91.6% [6.4], respectively. The high coherent frequency of the posterior SCC was a significantly independent factor that was associated with good hearing prognosis in the patients who have SSNHLV. WCA may be provided with comprehensive ability in vestibulo-ocular reflex (VOR) evaluation. CNN could be utilized to classify WCA, predict treatment outcomes, and facilitate vHIT interpretation. Feature extraction in CNN with proceeding SVM and horizontal cropping style of wavelet coherence plot performed better accuracy and offered more stable model for hearing outcomes in patients with SSNHLV than pure CNN classification. Clinical and Translational Impact Statement—High coherent frequency in vHIT results in good hearing outcomes in SSNHLV and facilitates AI classification.


I. INTRODUCTION
Sudden sensorineural hearing loss (SSNHL) is defined as sensorineural hearing loss of 30 dB or greater over at least three consecutive frequencies occurring within 72 hours [1]. Estimated incidence of SSNHL ranged from 11 to 77 per 100,000 people per year [2]. Throughout recent half-century, complete recovery (CR) of hearing still remained only 15.7-26% after steroid treatment [3], [4]. In SSNHL, 35.5% patients suffering from simultaneous vertigo with injured vestibular function, which was significantly associated with poor hearing outcome [5]. To objectively evaluate the vestibular function in patients suffering from vertigo, the video head impulse test (vHIT) was a new instrument to evaluate vestibular function with vestibulo-ocular reflex (VOR) function, where the ability to maintain images on the retina and in the center of visual field during highly-accelerated head movement by driving the eye toward to the contralateral side of the head movement was tested [6]. The vHIT defined the vestibulopathy as decreased VOR gain by time series velocity data of paired head and eye motions [7]. In addition to the VOR gain, saccades during and after the head impulse were also associated with the extent of vestibular damage and compensation [8], [9]. Research about the vestibular parameter significantly determining treatment response in patients of SSNHL with vertigo (SSNHLV) was still scarce [10], but some promising studies revealed the abnormal VOR gain of the posterior semicircular canal (SCC) was significantly associated with the poor hearing prognosis in patients with SSNHL [11], [12], [13]. Recently, the recovery process of human VOR function after vestibular injury called as VOR adaptation has been proved to be frequencyselective [14], [15]. The vHIT is currently the updated tool of vestibular exam in clinical application, but the study utilizing the time-frequency analysis of vHIT is still lacking. Therefore, in light of the time-frequency analysis between neural signals and causal response of body muscles, the coherence was considered to be the appropriate approach to measure the overall synchrony of the VOR and may provide the greater power of prognosis prediction in patients with SSNHLV than the current time series analysis [16].
In recent years, artificial intelligence has exerted its potential to reform health care delivery in otolaryngology field [17], [18]. For clinical physicians in vertigo clinics, the judgement of various vestibular exams was complicated and time-consuming. The artificial neural network (ANN) has been utilized to detect time-frequency plots in the pulse audiogram, the electroencephalography, and the gait analysis [19], [20], [21]. In this study, we took advantage of the ANN-assisted image classification of coherence plots to predict the hearing prognosis in patients with SSNHLV, facilitate clinical judgement, and develop the foundation of smart healthcare application in otoneurology.

A. ETHICAL CONSIDERATIONS
This study was approved by the Institutional Review Board of Kaohsiung Veterans General Hospital, Taiwan(IRB: KSVGH21-CT4-04). The requirement for informed consent was waived because all identifying information was removed from the dataset before analysis.

B. PATIENTS AND STUDY DESIGN
We retrospectively reviewed the electronic medical records of patients with SSNHLV as their primary treatment at a tertiary medical center in Taiwan from December 2016 to December 2020. We excluded patients with hearing loss without treatment within 14 days, pre-existing hearing loss in the contra-lesioned ear over 25 dB HL, retrocochlear pathology, other otologic diseases involving hearing loss, or anterior inferior cerebellar artery infarction on magnetic resonance imaging. After the exclusions, 64 patients were enrolled in the study.
Patients with SSNHL were treated with our standard protocol. Oral prednisolone (1mg/kg/day with maximal dose 60mg/day) or intravenous dexamethasone (5mg/ml b.i.d.) was given for 14 days including 3-4 days course of tapering, and salvage intratympanic dexamethasone injection (5mg/ml) was performed twice per week during 2 weeks for those whose hearing recovery did not reach complete recovery.

C. DATA COLLECTION
Pure tone audiometry and word recognition score were performed at the first visit at our clinic and followed weekly during treatment course and 2 months after treatment. The pure tone audiometry, the highly accurate exam with simulated single-test root-mean-square error about 2.7 dB HL, was the gold standard evaluation to show the hearing threshold of the patient by dB HL [22]. The word recognition score presented the proportion of correctly repeated words by the patients.
The hearing recovery was defined by American Academy of Otolaryngology-Head and Neck Surgery in 2019, as CR (return to within 10 dB HL of the unaffected ear in pure tone audiometry and recovery of word recognition scores to within 5% to 10% of the unaffected ear), no recovery (NR, anything less than a 10-dB HL improvement in pure tone audiometry), and partial recovery (PR, recovery levels between CR and NR) [1].
Vestibular laboratory data were also collected within 2 weeks, including vHIT, cervical and ocular vestibular myogenic evoked potentials (c-and o-VEMP), caloric test, and posturography.

D. vHIT
The vHIT was performed with ICS impulse ® (Otometrics A/S, Taastrup, Denmark) according to the protocol proposed by Halmagyi et al. [7] The patient's head was passively rotated by examiner suddenly and unpredictably in the plane of each SCC pairs within about 5 • in about 100ms, and the corresponding ratio of the eye velocities to the head peak velocities were counted as VOR gain. The ICS impulse ® automatically reported the head and eye velocities for 0.7s, and the onset of head rotation was always relocated at 0.1s in the timeline to make sure the complete record of the VOR impulse (0.1-0.2s in the timeline) and the following saccades (0.2-0.7s in the timeline). At least 10 times exams were repeated for each six SCCs, and the mean VOR gain and the gain asymmetry value within paired SCCs were recorded and regarded as abnormal compared with the mean and standard deviation of the group of normal healthy subjects in our hospital. The asymmetry values were counted specifically according to the planes tested as the right horizontal to the left horizontal, the left anterior to the right posterior, or the right anterior to the left posterior in the way of normalized relative gain asymmetry [23], [24].

E. WAVELET COHERENCE PLOT AND MAGNITUDE-SQUARED WAVELET COHERENCE
The analytic Morlet wavelet transform was utilized to analyze the time series data comprising nonstationary power at diverse frequencies [25], [26]. Cross wavelet power of paired head and eye velocities during vHIT was defined as wavelet transformation of the autocorrelation function and showed area with high coherent power in time-frequency domain. The wavelet coherence analysis (WCA) of two time series data was denoted and adapted from Grinsted et al. [27] as where S acted as the smoothing operator as The wavelet scale s indicated frequency as per wavelet notation. For head and eye velocities annotated as X and Y , the W X n (s) and W Y n (s) represented their wavelet transformation, and the W XY n (s) was the cross-wavelet spectrum equal to W X n (s) W Y * n (s). S scale and S time indicated smoothing process along the wavelet scale and time axes, and the Rn 2 (s) was the magnitude-squared wavelet coherence (MSWC), representing the estimated cross-spectral high common power. The WCA was applied to every repeated exam for each SCCs, and the MSWC of every frequency was calculated. The coherent frequency in our study was defined as the highest frequency comprising MSWC 0.9 over entire exam (0-0.7s in the timeline) evaluated from the lowest frequency in WCA. Since the velocity of head rotation was fixed in vHIT, the coherent frequency of WCA was solely affected by the responsive eye velocities of VOR impulse and following saccades.

F. DATA AUGMENTATION AND PREPROCESSING
The algorithm of data preprocessing was made by MATLAB R2019b (MathWorks, Inc., Natick, MA, USA) and showed in Figure 1.
For further ANN image classification, only one power spectrum wavelet coherence plot comprising the median of the coherent frequencies among repeated exams for each SCCs was chosen. For each subject, there were three specific wavelet coherence plots representative for the horizontal, the anterior, and the posterior SCCs after the selection.
The methods of augmentation across imbalanced class of CR (n = 20) to achieve equal sample numbers as in class of PR + NR (n = 44) included simple over-sampling (OS), extensive data extraction (EDE), and interpolation of time series (ITS). In the method of OS, the time series data of paired head and eye velocities of the representative wavelet coherence plots previously chosen for each SCC were copied randomly in class of CR. In the method of EDE, among repeated vHITs in each SCC, the time series data lines with the closest coherent frequency to the highest coherent frequency of representative wavelet coherence plots previously chosen were extra extracted. In the method of ITS, the time series data lines of the representative wavelet coherence plots previously chosen for each SCC were interpolated with random weights between different subjects in class of CR to create augmented samples [28]. The final samples of augmented CR and PR + NR were n = 44 equally after OS, EDE, or ITS. WCA for all augmented time series data were performed for further ANN classification.
In order to fit the formal protocol of pixel in convolutional neural network (CNN), the time and frequency windows were processed to 0.1-0.35s, 0.1-0.2s, and 0.1-0.3s started from the exam and 4-8Hz, 0-12Hz, and 0-125Hz for horizontal, vertical, and subtotal styles for image cropping of wavelet coherence plots. The period of passive head motion and median frequency of the highest coherent frequency in each SCCs were contained in these windows to make sure adequate principal component analysis of CNN classification in all three styles of image cropping. Finally, one image per patient including all three SCCs of the ipsilesional side was tiled from the left to the right or the superior to the inferior in the sequence of the horizontal, the anterior, and the posterior SCCs according to the cropping styles.

G. CONVOLUTIONAL NEURAL NETWORK AND K-FOLD CROSS-VALIDATION
Transfer learning in AlexNet, ResNet-18, and VGG-19 and feature extraction in AlexNet, ResNet-50, and VGG-19 with proceeding support vector machine (SVM) were employed by using the MATLAB R2019b Deep Learning Toolbox TM . The CNN input was the tiled image containing time-frequency power spectrogram of the wavelet coherence between eye and head velocities in vHIT.
In this study, data were stratified before partition into fivefolds to establish excellent representative of the entire sample in each fold. The repeated five-fold cross-validation method was applied for iterations of training and testing.

H. STATISTICAL ANALYSIS
All statistical analyses were performed using SPSS ver. 22 (SPSS, Inc., Chicago, IL, USA). Continuous variables were analyzed using one-way analysis of variance (ANOVA), and categorical variables were compared using Pearson's chisquared test or Fisher's exact test. The independent variables were determined through multivariate analysis using the forward stepwise method. A two-sided p < 0.05 was regarded as statistically significant. 0.88 in the horizontal, the anterior, and the posterior SCCs. The mean highest coherent frequency was 5.55Hz, 6.52Hz, and 5.59Hz in the horizontal, the anterior, and the posterior SCCs. The significant factors associated with the CR of VOLUME 11, 2023 FIGURE 2. Repeated ten times five-fold cross-validation of hearing outcomes for wavelet coherence plots in horizontal cropping style.

Ultimately
hearing in the univariate analysis were the normal caloric test (p = 0.001), the normal result in vHIT of the posterior SCC (p = 0.006), the greater highest coherent frequency of the horizontal SCC (p = 0.037) and the posterior SCC (p < 0.001). After adjustment for the other factors using the forward stepwise method, the results remained robust in CR for the greater highest coherent frequency of the posterior SCC (HR = 2.11, 95% CI 1.86-2.35, Table 2 ). In Pearson correlation analysis in Table 3, the greater highest coherent frequency of the posterior SCC was statistically significantly correlated to the higher VOR gain (0.553), the lower overt saccade percentage (-0.314), and the lower total saccade percentage (-0.257).
To explore the utility in ANN classification of hearing outcomes ( Table 4, 5, and 6), CNN analyses with repeated five-fold cross-validation were done to determine the mean accuracy for (CR vs. PR + NR), (OS of CR vs. PR + NR), (EDE of CR vs. PR + NR), and (ITS of CR vs. PR + NR). In each dataset and cropping style, the feature extraction in CNN with proceeding SVM demonstrated greatly higher accuracy and much lower variability in standard deviation than the pure CNN analysis.  Besides, Alexnet showed higher accuracy than Resnet and VGG19 with similar standard deviation in pure CNN analysis, and Resnet + SVM revealed better accuracy than Alexnet + SVM and VGG19 + SVM with similar standard deviation in CNN feature extraction with proceeding SVM classification. In each dataset and ANN models, the horizontal cropping style suggested higher accuracy than vertical and subtotal cropping styles with similar standard deviation.
In each ANN model, the augmented datasets had much higher classification accuracy than original datasets in all cropping styles. However, the standard deviations of accuracy in datasets after augmentation showed similar results with original dataset in the horizontal cropping style but generally slightly higher values than original dataset in the vertical and subtotal cropping styles. The best performance of ANN went to the method of feature extraction with Resnet-50 and proceeding SVM in the horizontal image cropping style, the classification accuracy [   stable models to predict hearing outcomes over purely transfer learning of AlexNet, ResNet-50, and VGG-19.

A. SYNOPSIS OF KEY FINDINGS
We identified that the great highest coherent frequency was the independent factor associated with CR of hearing in patients with SSNHLV undergoing high-dose steroid treatment. The coherent frequency of vHIT was associated with the responsive eye motion, including VOR gain and the following saccades, to the head impulse with fixed rotation velocity. The comprehensive power spectrum wavelet coherence plot applied for synchrony evaluations in vHIT data were statistically more significant than the time series variables in prognostic predictions. CNN could be utilized to classify WCA, predict treatment outcomes, and facilitate vHIT interpretation. Feature extraction in CNN with proceeding SVM and horizontal cropping style of wavelet coherence plot performed better accuracy and offered more stable model for hearing outcomes in patients with SSNHLV than pure CNN classification.

B. STRENGTHS OF THE STUDY
The main strength of our study was to develop the novel method of the WCA for vHIT, which was more effective in the prediction of hearing outcome due to the precise representation of the overall synchrony of VOR. We recruited the pure cohort of only cases with SSNHLV receiving standard high-dose steroid treatment completing full exams of inner ear battery. With the application of ANN, it appears that this is the first study to apply ANN-assisted coherence analysis to evaluate the hearing outcomes in patients with SSNHLV.

C. COMPARISONS WITH OTHER STUDIES
Some previous studies based on the time series analysis of vHIT showed the abnormal vHIT result in the posterior SCC was much higher than in other SCCs and was the significantly negative prognostic factor for incomplete hearing recovery in SSNHLV [11], [12], [13], [29]. In the univariate analysis of our study, the abnormal Caloric test (p = 0.001), the abnormal vHIT of the posterior SCC (p = 0.006), and the low highest coherent frequencies in the horizontal SCC (p = 0.037) and the posterior SCC (p < 0.001) were statistically significantly associated with incomplete hearing recovery in SSNHLV. Moreover, in this first study utilizing the novel application of the WCA in vHIT, our study demonstrated that patients with the greater highest coherent frequency in the posterior SCC had 2.11-times probability of CR of hearing in multivariate analysis, and the coherent frequency of the posterior SCC derived from time-frequency analysis was the only independent factor of CR of hearing. Time series parameters including the abnormal vHIT result in the posterior SCC and the abnormal caloric test were not the significant independent VOLUME 11, 2023 factor for hearing prognosis compared with the parameters resulted from time-frequency analysis. For vHIT, the WCA exerted its great resolution and sensitivity in the interpretation of non-stationary signals.
The highest to the lowest rates of the abnormal vHIT results were 34.4% in the posterior SCC, 17.2% in the horizontal SCC, and 14.1% in the anterior SCC in our cohort. For the patients with incomplete hearing recovery, the VOR gain was much lower in the posterior SCC (0.83 ±0.30) than in the anterior SCC (0.95 ±0.20) and horizontal SCC (0.96 ±0.28), but the total and overt saccades percentages were much higher in the horizontal SCC (51.30 ±38.44 and 46.66 ±38.48) than in the posterior SCC (30.43 ±35.68 and 21.80 ±33.49) and the anterior SCC (11.36 ±22.93 and 2.48 ±12.47). In the vHIT, the vestibulopathy was usually defined by the abnormally low VOR gain and the asymmetry of gains within paired SCCs. However, the significant association between the vestibulopathy and the high percentages of saccades was also noted [30]. Thus, our results showed the higher rates of vestibular end organs damage over the posterior and the hor- izontal SCCs than over the anterior SCC, and these could be associated with the vascular events of the common cochlear artery and the vestibulocochlear artery perfusion territories [31], [32]. The relationship between the poor VOR function of the posterior SCC and the inferior hearing recovery after steroid treatment could be explained by the most distal blood supply of the posterior SCC by the posterior vestibular artery and its poor collateral perfusions [12]. Furthermore, in the patients with incomplete hearing recovery, the highest coherent frequency in the posterior SCC (4.94 ±2.28) and the horizontal SCC (5.21 ±2.03) were much lower than in the anterior SCC (6.32 ±1.57). The results of WCA were comparable to the abnormal VOR gain of the posterior SCC and the high total percentages of saccades over the horizontal SCC. Based on frequency-dynamic character of VOR, WCA may be more delicately to detect the damage of VOR function of the posterior SCC rather than time series analysis [33].
The Spearman correlation analysis with parameters in time-frequency analysis and time series analysis was performed to discover the clinical meaning of the parameters in VOLUME 11, 2023 WCA, which showed the great highest coherent frequency in the posterior SCC was statistically correlated to the high VOR gain, the low overt saccade percentage, and the low total saccade percentage. The MSWC was an implement of signal processing to measure the correlation between the head velocity resulted from passive rotation and the corresponding eye velocity in their time-frequency domain. During wavelet transform of vHIT, not merely VOR gain but also saccades may alter the time-frequency plot for eye movement. The WCA took advantage in comprehensive comparison of these sensitive responses of the time-frequency plots between eye and head movement, and the MSWC 0.9 further presented the significantly linear dependance between the paired signals of non-stationary impulses [34]. To clarify the advantages of WCA of vHIT over traditional analysis of VOR gain with an example, Figure 5 illustrated the time series plots and time-frequency plots with scalograms of head and eye velocities from a patient with NR of hearing outcome who had normal VOR gain (0.78), several saccades, and evidently lower highest coherent frequency (3.28) in the posterior SCC. In the time series plots, the head and eye impulses had similar highest rotational velocities, so the normal VOR gain was noted. However, many saccades in time series plot of eye were recorded. Since the frequency of head rotation ranged about 4-8Hz, the highest coherent frequency could evaluate the function of VOR more all-inclusive than time series parameters if the frequency of responsive eye motion was altered by not only VOR gain but also corrective saccades.
In ANN analysis, we found the mean accuracy was higher in the way of feature extraction in CNN and proceeding SVM than pure CNN classification. with Alexnet in the horizontal image cropping style. The accuracy variability of standard deviation was much more stable in CNN feature extraction with proceeding SVM classification than pure CNN analysis. Since we used image tiling to comprise the WCAs in all SCCs, the pure CNN structure was quite sensitive to differences of pixel intensity among regional areas in nature images, which may lead to misclassifications and unstable performance of model in tiled images [35]. Besides, among different styles of image cropping, the mean accuracy was higher in the horizontal image cropping style with much more stable variability of accuracy than in the vertical style and then the subtotal style, sequentially. The medians of highest coherent frequency were 5.94Hz, 6.73Hz, and 6.26Hz for the horizontal, anterior, and posterior SCCs in our cohort. The overt and covert saccades after passive head motions were probably associated with the recovery of VOR function [7]. From the point of view of the principal component of analysis, compared with other image cropping styles, the horizontal image cropping style reserved not only the sharp edge of the highest frequency in the narrower frequency window (4-8Hz) but also the information of corrective saccades in more sufficient time window (0.25s after a head motion) without pixel compression after image tiling.

D. WEAKNESSES OF THE STUDY AND FUTURE WORK
There were some limitations in the current study. First and foremost, this was a retrospective study using pretrained ANN in a single institute with limited case numbers. For balanced classes and robust feature extractions, more subjects and an ANN originally designed should be adopted in the future to obtain more accurate results of classification and more stable performance of ANN models. Second, although the WCA of vHIT data could better evaluate VOR function and predict hearing prognosis in the patients with SSNHLV, the exact pathophysiology and the hypothesis of vascular insults related to poor hearing recovery and abnormal function of the posterior SCC may need further investigations in evidence of radiology or pathology. Last but not least, for clinical application, the cut-off value of the highest coherent frequency associated with complete recovery of hearing could be evaluated in the future with greater cohort.

V. CONCLUSION
The high coherent frequency was a significant independent factor that was associated with good hearing prognosis in the patients who have SSNHLV. WCA demonstrated comprehensive ability in VOR function evaluation and was more robust than time series variables in prognostic predictions. CNN could be utilized to classify WCA, predict treatment outcomes, and facilitate vHIT interpretation. Feature extraction in CNN with proceeding SVM and horizontal cropping style of wavelet coherence plot performed better accuracy and offered more stable model for hearing outcomes in patients with SSNHLV than pure CNN classification.