Effect of simulated hearing loss on automatic speech recognition for an android robot-patient

The importance of simulating patient behavior for medical assessment training has grown in recent decades due to the increasing variety of simulation tools, including standardized/simulated patients, humanoid and android robot-patients. Yet, current android robot-patients still need to be improved to simulate patient behavior accurately, and accounting for the patient's hearing loss is of particular importance. This paper is the first to consider hearing loss simulation in an android robot-patient, and its results provide valuable insights for future developments. For this purpose, an open-source dataset of audio data and audiograms from human listeners was used to simulate the effect of hearing loss on an automatic speech recognition (ASR) system. The performance of the system was evaluated in terms of both word error rate (WER) and word information preserved (WIP). Comparing different ASR models commonly used in robotics, it appears that model size alone is insufficient to predict ASR performance in the presence of simulated hearing loss. However, although absolute values of WER and WIP do not predict the intelligibility for human listeners, they correlate highly with it and could thus be used, for example, to compare the performance of hearing aid algorithms.


Introduction
Life expectancy is increasing in most regions of the world (Buskens et al., 2019). As a consequence, despite the decrease in birth rates, the global population is both expanding and aging (Gu et al., 2021). This demographic shift towards an aging population necessitates greater attention to medical care. This care should be adapted to the needs of the elderly, including hearing loss, which affects more than half of them (Dalton et al., 2003).
The prevalence of hearing loss is even higher among patients suffering from heart failure or delirium (Morandi et al., 2021; Baiduc et al., 2023). Delirium is a significant neurocognitive disorder that may arise from a medical condition, a drug-induced psychotic disorder, or surgical procedures performed under anesthesia on geriatric patients (Association. and Association., 2013; Devlin et al., 2018; Ely et al., 2004; Ely et al., 2001a; Rudolph et al., 2010; 2009). The Confusion Assessment Method for the Intensive Care Unit (CAM-ICU) is an established assessment for the diagnosis of delirium (Ely et al., 2001b; Guenther et al., 2010). Training in assessment methods such as the CAM-ICU is complex and time-consuming. In addition to requiring a sufficient number of patients with different types of delirium and without delirium, training is only possible on a small scale because of the stress it places on the patients. An alternative to real patients are standardized human patients/simulated patients (SPs) (Barrows, 1968; Pourebadi and Riek, 2022). SPs, i.e., specially trained actors, are considered an effective learning method, but they are scarce and expensive (Tengiz et al., 2022; Cleland et al., 2009). Beyond the value of SPs, there are significant concerns about the comparability of the experiences of different training groups. For example, evaluations of SPs have found numerous differences in SP behavior between cases over a number of simulations, so that the experiences of one group cannot simply be compared with those of another (Austin et al., 2006). As a result of these concerns, educational institutions have a strong interest in the robotic simulation of patient behavior to reduce reliance on patients who are suitable for medical assessment training (Buchanan, 2001; Gaba, 2004).
Robotic systems and android robot-patients (ARPs) have been introduced for teaching purposes (Abe et al., 2018; Tanzawa et al., 2012; Tanzawa et al., 2013; Hashimoto et al., 2013; Pourebadi and Riek, 2022; Gaumard Scientific Company, 2022a; Gaumard Scientific Company, 2022b; CAE, 2022; Haley et al., 2017; Schwarz and Hein, 2023; Röhl et al., 2022; 2023). With a focus on medical dental education, especially communication and risk management, an ARP was evaluated using a student questionnaire, which showed that 95% of the students recognized the usefulness of training risk management with the ARP (Tanzawa et al., 2012; Tanzawa et al., 2013). Another ARP, called SAYA, simulates a depressed patient for diagnostic training (Hashimoto et al., 2013). With a focus on nursing procedures and communication with patients, various simulators can move their head or simulate human facial expressions, vital signs, and specific diseases, and are promising tools for clinical training (Pourebadi and Riek, 2022; Gaumard Scientific Company, 2022a; Gaumard Scientific Company, 2022b; CAE, 2022; Haley et al., 2017). Furthermore, there are already efforts to use high-end robots like AMECA in training and continuing education programs for medical staff with a focus on depression (Schwarz and Hein, 2023). In ongoing work on an ARP (see Figure 1) that simulates a critically ill nonverbal patient, an initial simulation has already shown that the ARP is able to reproduce human behavior (Röhl et al., 2022). Since the detection of delirium is important, there have been efforts to simulate patients with and without delirium via an ARP for the education of medical staff in delirium-assessment methods (Röhl et al., 2023).

Röhl et al. 10.3389/frobt.2024.1391818

FIGURE 2 Overview of the signal processing chains used in the experiments. (A) Binaural anechoic signals are processed through the hearing loss simulator before applying ASR. (B) Multichannel noisy and reverberant signals are enhanced through hearing aid processing before being input to the hearing loss simulator and applying ASR.

TABLE 1 Full names and labels of the Vosk models used throughout the paper. Larger models are typically expected to yield better performance, as shown here with their respective size and WER using the LibriSpeech corpus.

TABLE 2
Category             Number
Normal to moderate   8
Moderately severe    17
Severe               22
Profound             2
Total                50

As delirium assessment is a verbal (medical experts) to nonverbal (patient) communication, the ARP should be able to listen. Therefore, an automatic speech recognition (ASR) system was implemented. ASR is often implemented in robotic simulators using lightweight models, typically with Vosk (Glauser et al., 2023; Fadel et al., 2022; Paul et al., 2022). Since elderly patients often present with hearing loss, the effect of hearing loss on ASR performance has to be carefully considered. The evaluation of this impact is the focus of this paper, the remainder of which is structured as follows. The methodology is described in Section 2. This includes the description of the dataset, of the hearing loss simulation and ASR implementation, and of the considered evaluation metrics. The results, obtained using simulated hearing loss based on audiograms from real listeners, are presented in Section 3, followed by the conclusions.

Methods
The experimental framework presented in this paper had two objectives. First, it aimed at quantifying the impact that hearing loss, simulated using measurements from real listeners, might have on the performance of an ASR system in the presence of clean anechoic speech. Second, it aimed at evaluating the joint impact of hearing loss and hearing aid processing on the performance of an ASR system in the presence of noisy reverberant speech, i.e., in realistic conditions. The experiments were conducted using data made available as part of the second edition of the clarity prediction challenge (CPC) (Graetzer et al., 2021) and publicly available ASR models to be used with the Vosk toolkit (Shmyrev, 2023). An overview of the signal processing chains used for both objectives is depicted in Figure 2 and summarized in the remainder of this section.

Auditory scene generation
The audio signals from the CPC used in the experiments were generated as follows. Anechoic speech signals s(n), sampled at f_s = 48 kHz, where n denotes the sample index, each containing 7 to 10 words and for which text prompts are known, were used as the target signals to be recognized (Graetzer et al., 2022). Input signals y_m(n) were generated as described in Equation 1,

$$y_m(n) = h_m(n) * s(n) + v_m(n) = x_m(n) + v_m(n), \tag{1}$$

where * denotes convolution, h_m(n) denotes the room impulse response (RIR) between the source and the m-th of M microphones, x_m(n) denotes the clean reverberant signal and v_m(n) denotes the additive noise signal. When considering clean binaural signals, h_m(n) denotes a RIR between the speech source and the M = 2 eardrums of the listener. In this case v_m(n) = 0 ∀ n. When considering noisy reverberant signals to be processed by hearing aids, M = 6 and h_m(n) denotes a RIR between the speech source and one of the front, middle or back microphones of either the left or the right hearing aid. In this case v_m(n) is generated from recordings of daily noises, e.g., a washing machine, scaled to obtain various signal-to-noise ratios (SNRs) ranging from −6 to 6 dB. In all cases the reverberant signal x_m(n) is generated using geometric models of rooms with various characteristics using the method described in (Schröder and Vorländer, 2011) and binaural RIRs from (Denk et al., 2018).

FIGURE 3 Example of audiograms, left and right ear, for one listener of each of the four considered hearing loss categories. Each of these four examples was randomly selected from the audiograms available in the dataset. The individuality of hearing loss can clearly be seen, though the fact that hearing loss is typically larger at higher frequencies is apparent in all examples.
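The signal model described above (a target signal convolved with a RIR, plus scaled noise) can be illustrated with a minimal NumPy sketch for a single microphone. The function name and the definition of the SNR as a broadband power ratio between x_m(n) and v_m(n) are assumptions made for illustration; the CPC data generation may define and apply the SNR differently.

```python
import numpy as np


def simulate_microphone_signal(s, h, noise=None, snr_db=None):
    """Return (y, x, v) for one microphone, with y = x + v and x = h * s.

    Hypothetical helper: the SNR is assumed to be the broadband power
    ratio between the reverberant speech x and the scaled noise v.
    """
    s = np.asarray(s, dtype=float)
    h = np.asarray(h, dtype=float)
    x = np.convolve(s, h)  # clean reverberant signal x_m(n)

    if noise is None:
        # Clean binaural case: v_m(n) = 0 for all n.
        v = np.zeros_like(x)
        return x + v, x, v

    noise = np.asarray(noise, dtype=float)
    if len(noise) < len(x):
        raise ValueError("noise recording is shorter than the reverberant signal")
    v = noise[: len(x)]

    # Scale the noise so that 10 * log10(P_x / P_v) equals snr_db.
    p_x = np.mean(x ** 2)
    p_v = np.mean(v ** 2)
    gain = np.sqrt(p_x / (p_v * 10.0 ** (snr_db / 10.0)))
    v = gain * v
    return x + v, x, v
```

Calling the function once per microphone with the corresponding RIR, and with `snr_db` set between −6 and 6, would mimic the noisy reverberant condition; omitting `noise` corresponds to the clean binaural condition.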

Hearing aid processing
Modern hearing aids are typically equipped with multiple microphones whose input is processed to obtain the signal to be played in each ear of the listener. When considering the noisy reverberant signals from the CPC, the M = 6 channel signal y_m(n) (3 microphones per hearing aid) was reduced to two channels to be played to the left and right ear. All 20 algorithms considered in this paper were submitted to the clarity enhancement challenge (CEC) (Graetzer et al., 2021), 10 during its first edition (CEC1) and 10 during its second edition (CEC2). This selection of speech enhancement algorithms covers a wide range of approaches, including single-channel source separation, multichannel beamforming and various deep-learning based methods. All algorithms aimed at improving the speech intelligibility of the signals and their performance was evaluated using listening tests. They all targeted realistic hearing aid applications and used causal signal processing with a maximum algorithmic latency of 5 ms. Most of these algorithms used the audiogram (see Subsection 2.5) to tailor the processing to each hearing impaired listener in the considered corpus. The same audiograms were used in this paper to simulate the effect of hearing loss.

FIGURE 4 WER and WIP obtained after applying hearing loss simulation to the clean binaural signals, using the four considered models. The black dots denote the score obtained when applying no simulator, i.e., without hearing loss.

Hearing loss simulator
The hearing loss simulator aims at simulating the detrimental effect of the hearing loss of each particular listener on the processed signal z_m(n). The simulator used in this paper relies on the implementation provided as part of the CPC, which is based on the well-recognised Cambridge MSBG hearing loss model, named after the authors of the various papers describing it (Moore and Glassber, 1994; Baer and Moore, 1993; 1994; Nejime and Moore, 1997; 1998). This simulator can be briefly described as follows. First, a filter is applied to simulate the acoustic effect of sound propagating to the eardrum, before spectral smearing is applied to mimic the reduced frequency selectivity of hearing impaired listeners. Then, loudness recruitment simulates the reduced response in the speech frequency range, typical of hearing impairment. A gammatone filterbank is used to extract envelopes in different frequency bands and each envelope is compressed according to the audiogram of the target listener. These compressed envelopes are finally used as gains to adjust the amplitudes of the input signal before resynthesizing the time-domain signal ŝ_m(n).

Automatic speech recognition
This paper focuses on the application of ASR in an ARP. Consequently, the ASR system was chosen with the limitations typically present in such systems in mind. First, speech is often recorded using a single microphone. Consequently, for each recording, both channels ŝ_1(n) and ŝ_2(n) are input separately to the ASR system, as depicted in Figure 2. Additionally, ASR in an ARP often has to rely on models that can be used offline, potentially using hardware of limited capabilities. For this purpose, the Vosk toolkit (Shmyrev, 2023) is chosen in this paper due to its capabilities and its ubiquity in robotic applications. The Vosk toolkit provides numerous models for 20 different languages. Four of the available English language models are used in this paper. They are referred to as A, B, C, and D in the remainder of this paper and their full names, sizes and performance using the clean test data from the LibriSpeech (Panayotov et al., 2015) corpus are summarized in Table 1. Larger models are typically expected to yield better performance. It should also be noted that, in order to conform to the requirements of the Vosk toolkit, all signals were downsampled to a sampling frequency of 16 kHz prior to the ASR stage.
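The ASR stage described above (downsampling to 16 kHz, then recognizing each channel separately with Vosk) could look roughly like the following sketch. It is an illustration under stated assumptions, not the processing used to obtain the reported results: the windowed-sinc decimator, the helper names and the `model_dir` argument (a local path to an unpacked Vosk model) are hypothetical, and Vosk is imported lazily so that the preprocessing helpers can be used without it.

```python
import json

import numpy as np


def downsample_48k_to_16k(x, num_taps=121):
    """Low-pass filter at the new Nyquist frequency, then keep every 3rd sample."""
    x = np.asarray(x, dtype=float)
    n = np.arange(num_taps) - (num_taps - 1) / 2
    fc = 1.0 / 6.0  # cutoff in cycles per input sample (8 kHz at 48 kHz)
    taps = 2 * fc * np.sinc(2 * fc * n) * np.hamming(num_taps)
    taps /= taps.sum()  # unit gain at DC
    return np.convolve(x, taps, mode="same")[::3]


def to_pcm16(x):
    """Convert a float signal in [-1, 1] to 16-bit little-endian PCM bytes."""
    clipped = np.clip(np.asarray(x, dtype=float), -1.0, 1.0)
    return (clipped * 32767).astype("<i2").tobytes()


def transcribe_channel(model_dir, samples_16k):
    """Transcribe one channel with Vosk; model_dir is a hypothetical local path."""
    from vosk import KaldiRecognizer, Model  # requires the vosk package

    recognizer = KaldiRecognizer(Model(model_dir), 16000)
    recognizer.AcceptWaveform(to_pcm16(samples_16k))
    return json.loads(recognizer.FinalResult())["text"]
```

Each of the two channels would be passed through `downsample_48k_to_16k` and `transcribe_channel` independently, which mirrors the single-microphone constraint mentioned above. In practice, loading the model once and reusing it across utterances would be preferable to the per-call loading shown here.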

Evaluation
The performance of the ASR system using the four considered models was assessed in terms of WER and WIP, defined in Equations 2 and 3, respectively. The WER is defined as

$$\mathrm{WER} = \frac{S + D + I}{N}, \tag{2}$$

where S, D, I, and N denote the number of substitutions, deletions, insertions and number of words to be recognized, respectively, and the WIP is defined as

$$\mathrm{WIP} = \frac{C}{N} \cdot \frac{C}{P}, \tag{3}$$

where C and P denote the number of correctly recognized words and the number of words in the predicted utterance, respectively. The design of ASR systems aims at a lower WER but a higher WIP. In the case of many insertions, the WER can be higher than 100%. Both WER and WIP are computed using the output from the whole dataset.
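To make the two metrics concrete, the following sketch computes them from pairs of reference and recognized sentences, using the quantities S, D, I, N, C and P defined above. The counts are obtained with a standard word-level edit-distance alignment and pooled over all utterances before computing the scores, which is one reading of "using output from the whole dataset"; the function names and the whitespace tokenization are assumptions.

```python
def count_errors(ref, hyp):
    """Return (S, D, I) for a minimum edit-distance alignment of two word lists."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # dp[i][j] = (cost, S, D, I) for aligning ref[:i] with hyp[:j]
    dp = [[None] * cols for _ in range(rows)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        dp[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, cols):
        dp[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            cost, s, d, ins = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                best = (cost, s, d, ins)  # correct word
            else:
                best = (cost + 1, s + 1, d, ins)  # substitution
            cost, s, d, ins = dp[i - 1][j]
            if cost + 1 < best[0]:
                best = (cost + 1, s, d + 1, ins)  # deletion
            cost, s, d, ins = dp[i][j - 1]
            if cost + 1 < best[0]:
                best = (cost + 1, s, d, ins + 1)  # insertion
            dp[i][j] = best
    return dp[-1][-1][1:]


def corpus_wer_wip(pairs):
    """Pool counts over (reference, hypothesis) string pairs; return (WER, WIP)."""
    S = D = I = N = P = 0
    for ref_text, hyp_text in pairs:
        ref, hyp = ref_text.split(), hyp_text.split()
        s, d, i = count_errors(ref, hyp)
        S, D, I = S + s, D + d, I + i
        N += len(ref)
        P += len(hyp)
    if N == 0:
        raise ValueError("at least one reference word is required")
    C = N - S - D  # correctly recognized words
    wer = (S + D + I) / N
    wip = (C / N) * (C / P) if P else 0.0
    return wer, wip
```

For the reference "the cat sat on the mat" and the hypothesis "the cat sit on mat", there is one substitution and one deletion, so WER = 2/6 and WIP = (4/6)·(4/5). A one-word reference recognized as four words containing it gives WER = 3/1 = 300%, illustrating how insertions push the WER above 100%.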
When WER and WIP are reported for a particular category of hearing loss, the previously described methods are applied to the subset of listeners whose audiogram falls into this category. In this paper, the category was determined by averaging the loss over both ears and all frequencies present in the audiogram. The resulting average loss was then categorized according to the scale proposed in (Clark, 1981). Normal to moderate degrees of hearing loss (−10 to 55 dB) were grouped into a single category. Three other categories were considered, namely moderately severe (56 to 70 dB), severe (71 to 90 dB) and profound (≥ 91 dB) hearing loss. The number of listeners per category of hearing loss is given in Table 2 and audiogram examples, for each hearing loss category, are depicted in Figure 3. Correlations were reported using the Pearson coefficient ρ and the adequacy of linear fits was assessed using the coefficient of determination R².
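The categorization rule described above can be sketched in a few lines. The function names and the example thresholds are hypothetical, and the treatment of averages that fall between the integer ranges (e.g., 55.5 dB) is an assumption: each upper bound is treated as inclusive and anything above it moves to the next category.

```python
def average_hearing_loss(left_db, right_db):
    """Average the audiogram levels (in dB) over both ears and all frequencies."""
    levels = list(left_db) + list(right_db)
    return sum(levels) / len(levels)


def hearing_loss_category(average_db):
    """Map an average loss in dB to one of the four categories used in the paper."""
    if average_db <= 55:
        return "normal to moderate"
    if average_db <= 70:
        return "moderately severe"
    if average_db <= 90:
        return "severe"
    return "profound"


# Made-up audiogram levels (dB) at a few frequencies, for illustration only.
left = [40, 50, 65, 75]
right = [45, 55, 70, 80]
category = hearing_loss_category(average_hearing_loss(left, right))
```

With these made-up levels the average is 60 dB, which falls into the moderately severe category.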

Results
This section presents the performance of the ASR system mentioned above using the four considered models.
First, we observed the performance using the clean binaural signals. Next, we examined the performance using the processed, noisy, reverberant signals. Finally, we studied the relation between the WER from the ASR system and the WER calculated from the responses of human listeners.

FIGURE 6 WER and WIP obtained after applying hearing loss simulation to the processed reverberant and noisy signals, using the four considered models.

Clean binaural signals
The WER for the four models using clean binaural signals is shown in Figure 4. The largest model, D, consistently performs best, whether hearing loss simulation is applied (with a WER of 28.1%) or not (with a WER of 17.5%). The performance of all four models degrades when hearing loss simulation is applied, with the largest difference observed for A, the smallest model, for which the WER degrades from 19.3% to 40.8% when hearing loss simulation is applied. This confirms that the effect of hearing loss simulation, even on clean binaural signals, is detrimental to the performance of ASR systems.
The effect of the degree of hearing loss on performance is shown in Figure 5. For all four models, performance declined with increasing severity of hearing loss. Again, the largest discrepancy is found for model A, with a WER of 23.1% for "normal to moderate" hearing loss, increasing to 53.1% for "profound" hearing loss.
For all considered categories of hearing loss, the WER decreases as ASR models get larger, with the minor exception of models A and B in the presence of "normal to moderate" hearing loss. In this case, the WER was measured at 23.1% for A and at 25.6% for B. This suggests that model size alone is not always enough to predict the performance of ASR models. It should be noted that overall performance was poor, suggesting that the evaluated corpus could pose a challenge for ASR. Even the best performing model, D, only achieved a WER of 19.3% for moderate hearing loss.

Frontiers in Robotics and AI, frontiersin.org

FIGURE 7 WER (top) and WIP (bottom) per hearing loss category obtained after applying hearing loss simulation to the processed reverberant and noisy signals, using the four considered models.
The same trends are seen when analyzing the WIP. Based on the WIP shown in Figure 4, model D performed best, with a WIP of 70.3% on unprocessed binaural signals and 55.6% when the hearing loss simulation was applied. For all considered categories of hearing loss, the WIP increased with the size of the ASR model. The effect of hearing loss severity on the WIP, shown in Figure 5, makes it clear that performance declined with increasing severity for all four models. Again, the most significant contrast was exhibited by model A, which displayed a WIP of 62.3% for a hearing loss categorized as "normal to moderate", reduced to 28.2% for a hearing loss categorized as "profound".

Processed noisy and reverberant signals
The WER achieved by the four models under consideration while using processed noisy and reverberant signals is illustrated in Figure 6.
The performance of all models significantly decreased compared to the clean binaural case. Model D (the largest) yielded the best performance with a WER of 57.7%, while model A yielded a WER of 71.4%. Given these high WER values, this was interpreted as an unsatisfactory performance of all models rather than a true superiority of model D. The human listeners recognized words much more accurately than any of the models, with a WER of 36.1%.
The effect of hearing loss severity is evident in Figure 7, which demonstrates the degradation of performance across all four models as hearing loss severity increases. When considering the effect on the intelligibility for the listeners, it is noteworthy that the highest WER does not always occur in cases of profound hearing loss, which was unexpected. However, this is most likely an anomaly due to the fact that only two listeners with profound hearing loss are present in the considered dataset (see Table 2). The same trends appeared when considering the WIP. The results achieved by the four examined models using processed noisy and reverberant signals are shown in Figure 6.
In this case, model D yielded the best performance with a WIP of 26.8%, while model A had the lowest WIP at 17.4%. The effect of the severity of the hearing loss on the WIP, as shown in Figure 7, indicated that the WIP decreased as the severity of the hearing loss increased, except in the case of "profound" hearing loss, for which ASR performance appears better than for "moderate" hearing loss.
Examining the recognition performance of human listeners in terms of both WER and WIP, as depicted in Figures 6 and 7, similar trends appear, but with a large difference in absolute value compared to the performance of the ASR models. These findings imply that ASR system performance may not accurately replicate a patient's performance in the studied situations regarding absolute values of WER or WIP. Even so, it is worth considering the correlation between the performance of ASR systems and the WER calculated from the responses of human listeners.

FIGURE 8 Relationship between WER predicted by ASR models and listener response for each of the 20 hearing aid algorithms in the dataset. For each algorithm, all audiograms available in the dataset were used. The line depicts a linear fit, and the shaded area covers 3 standard deviations above and below this line.

Relation between ASR performance and intelligibility
The relationships between the WER and WIP of the ASR system and those derived from human listeners' responses are presented in Figures 8 and 9, respectively. The human listeners (see Table 2) had to recognize the speech from the signals processed with hearing aid algorithms as part of the challenge evaluation (Barker et al., 2022; Graetzer et al., 2022). Figures 8 and 9 depict the values of these metrics obtained when considering the signals processed with each of the 20 hearing aid processing algorithms included in the dataset. Each listener had to listen to a few hours of processed speech. Both WER and WIP for all four models displayed a high correlation with those computed from the listeners' responses, with ρ ranging from 0.88 to 0.96 when considering WER, and from 0.85 to 0.94 when considering WIP. Furthermore, the relationship was well described by a linear regression, as indicated by the high R² values ranging from 0.78 to 0.91 when considering WER, and from 0.73 to 0.88 when considering WIP. One hearing aid processing algorithm consistently produced results that did not match the linear relationship. This is the algorithm described in (Cornell et al., 2023), which was the most successful algorithm during the CEC2.

FIGURE 9 Relationship between WIP predicted by ASR models and listener response for each of the 20 hearing aid algorithms in the dataset. For each algorithm, all audiograms available in the dataset were used. The line represents a linear fit, and the shaded area covers 3 standard deviations above and below this line.
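The correlation analysis reported in this subsection, i.e., a Pearson coefficient ρ and the R² of a linear fit between ASR scores and listener scores with one point per hearing aid algorithm, can be sketched as follows. The function name and the choice of the ASR score as the independent variable are assumptions, and any numbers passed to it here would be placeholders rather than values from the paper.

```python
import numpy as np


def correlation_and_linear_fit(asr_scores, listener_scores):
    """Return (rho, slope, intercept, r2) for listener scores regressed on ASR scores."""
    x = np.asarray(asr_scores, dtype=float)
    y = np.asarray(listener_scores, dtype=float)

    rho = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient

    slope, intercept = np.polyfit(x, y, 1)  # least-squares line y = slope * x + intercept
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot  # coefficient of determination
    return rho, slope, intercept, r2
```

For a simple least-squares line with an intercept, R² equals ρ², which agrees with the reported ranges (for example, 0.88² ≈ 0.77 and 0.96² ≈ 0.92 for the WER).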

Conclusion
The simulation of disease-specific patient behavior by ARPs will become increasingly important in the coming years. Details such as the patient's hearing loss are critical to achieving the correct ARP behavior for realistic training and education of medical staff. Therefore, the effects of hearing loss and hearing enhancement algorithms on ASR systems were evaluated in this paper.
Experiments were conducted using both clean binaural signals and noisy reverberant signals processed using hearing aid speech enhancement algorithms. The impact of hearing loss was simulated using audiograms measured on real human listeners. All data is available as part of the CPC, and the ASR transcriptions were obtained with publicly available models to be used with the Vosk toolkit. The performance of these different Vosk models was evaluated using WER and WIP.
In the initial experiment, using binaural signals with and without hearing loss simulation, the largest considered model outperformed all other models, with the smallest model coming in second place. Notably, all models yielded lower performance in the presence of hearing loss simulation. When the hearing loss simulation was applied to processed, reverberant, and noisy signals, all four models performed worse than human listeners. The biggest model performed best. Furthermore, a strong correlation was observed between the WER and WIP of all four models and the responses of the listeners. Therefore, it can be concluded that the hearing loss simulation significantly impacts ASR. Moreover, model size did not appear to play a decisive role in this experiment, as performance did not increase consistently with model size. Nevertheless, the biggest model outperformed the smaller models.
Aiming to use data that was both realistic and publicly available, all results were obtained using the data from the Clarity Challenge dataset. However, this choice does come with some limitations. This dataset does not include reverberant conditions without the use of hearing aid algorithms, nor the recognition scores of the listeners for the clean data, both of which would be beneficial for future experiments. Additionally, though realistic, the text content of this dataset was not designed specifically for patient simulation, i.e., it has no relation to the patient simulation that motivates this paper; creating such content could be a future target. Furthermore, a dedicated dataset of speech utterances would allow future work to use clean unprocessed speech that could also be used to generate speech under various acoustic conditions. This would also allow the evaluation to be extended to speech that better matches the target use case of the ARP.
Focusing on the future use of ARPs for medical education and verbal medical assessments, clinical background noise, weak voices, and the choice of words used in the assessment should be considered in future work.

FIGURE 5 WER (top) and WIP (bottom) per hearing loss category obtained after applying hearing loss simulation to the clean binaural signals, using the four considered models.

TABLE 2 Number of listeners per category of hearing loss.