Comparable Encoding, Comparable Perceptual Pattern: Acoustic and Electric Hearing

Perception with electric neuroprostheses can sometimes be simulated using properly designed physical stimuli. Here, we examined a new acoustic vocoder model of electric hearing with cochlear implants (CIs) and hypothesized that comparable speech encoding can lead to comparable perceptual patterns for CI and normal-hearing (NH) listeners. Speech signals were encoded using FFT-based signal processing stages, including band-pass filtering, temporal envelope extraction, maxima selection, and amplitude compression and quantization. These stages were implemented in the same manner by an Advanced Combination Encoder (ACE) strategy in CI processors and by Gaussian-Enveloped Tone (GET) or Noise (GEN) vocoders for NH listeners. Adaptive speech reception thresholds (SRTs) in noise were measured using four Mandarin sentence corpora. Initial-consonant (11 monosyllables) and final-vowel (20 monosyllables) recognition was also measured. Naïve NH listeners were tested using speech vocoded with the proposed GET/GEN vocoders as well as with conventional vocoders (controls). Experienced CI listeners were tested using their daily-used processors. Results showed that: 1) there was a significant training effect on GET-vocoded speech perception; and 2) the GEN-vocoded scores (SRTs with four corpora and consonant and vowel recognition scores), as well as the phoneme-level confusion patterns, matched the CI scores better than the controls did. The findings suggest that the same signal-encoding implementations may lead to similar perceptual patterns simultaneously across multiple perception tasks. This study highlights the importance of faithfully replicating all signal processing stages when modeling perceptual patterns in sensory neuroprostheses. This approach has the potential to enhance our understanding of CI perception and to accelerate the engineering of prosthetic interventions. The GET/GEN MATLAB program is freely available at https://github.com/BetterCI/GETVocoder.


I. INTRODUCTION
Neural prostheses restore or improve sensory perception for many patients by directly stimulating sensory neurons with specifically designed devices. Among them, cochlear implants (CIs) are the most successful neural prostheses, having restored hearing to about one million hearing-impaired people [1], [2]. People with normal sensory functions cannot directly perceive either the impaired or the artificially restored sensation, and the challenges faced by patients after prosthesis intervention are likewise difficult to comprehend. This limits demonstration of, and education about, neuroprostheses. In addition, the engineering and development of prostheses require simulation tools to effectively predict the performance of various implementations for target patients.
To meet these needs, simulators of sensory (e.g., auditory and visual) impairments and prostheses have been proposed [3], [4], [5], [6], [7], [8]. Prostheses for different sensory domains (e.g., vision and audition) are designed to mimic healthy neural pathways, as are their simulators. Simulators for prostheses often adopt similar principles during their development; however, comparative studies across sensory prostheses are few. In 2007, Hallum and colleagues published a review comparing auditory and visual prostheses [9]. While there have been hundreds of papers using CI acoustic models since the 1980s, image models for visual prostheses have been few [6], [7], [9]. The authors concluded with an outlook on the modeling of visual prostheses against the background of the well-known advantages and limitations of acoustic modeling of CIs [9].
In this paper, we propose a critical principle for the development of prosthesis simulators: the encoding stages of the simulator and the encoding stages of the prosthesis should mirror each other, i.e., the signal characteristics they produce should be the same or as close as possible. We argue that this principle could help minimize the performance discrepancy between actual and simulated implantees. To examine it, CIs and CI simulators were studied.

II. RELATED WORKS
In the 1990s, open speech communication with CIs became possible due to advances in signal processing. One important milestone is the continuous interleaved sampling (CIS) strategy [10]. The success of CIS was attributed to an effective combination and exact implementation of a series of signal processing stages [11]. First, the input microphone signal is filtered into a number of frequency bands covering a wide range of frequencies thought to be important for speech intelligibility. Second, the temporal envelope of each band is extracted and compressed to fit the electrical dynamic range of the individual user. Then, the compressed envelopes modulate the amplitudes of high-rate, biphasic, charge-balanced electric pulses delivered to the corresponding electrodes. Across bands, the electric pulses are fired non-simultaneously, i.e., in an interleaved-sampling manner. The CIS strategy coarsely mimics the frequency analysis of the basilar membrane and allows for simplification and optimization of an engineered implementation.
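The envelope-extraction step described above can be sketched as follows. This is an illustrative sketch only, not any processor's implementation: the function name is ours, and a simple moving average (window of roughly 1/cutoff seconds) stands in for the real low-pass filter.

```python
import numpy as np

def cis_envelope(band_signal, fs, cutoff=250.0):
    """Sketch of CIS-style envelope extraction: full-wave rectification
    followed by low-pass smoothing (here a moving average standing in
    for a real low-pass filter)."""
    rect = np.abs(band_signal)                 # full-wave rectification
    win = max(1, int(fs / cutoff))             # ~1/cutoff seconds long
    kernel = np.ones(win) / win
    return np.convolve(rect, kernel, mode="same")
```

For a 1-kHz tone amplitude-modulated at 5 Hz, the smoothed rectified output tracks the slow modulator while the carrier fine structure is averaged out.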
CIS abandoned the philosophy of explicitly extracting phonetic cues, such as fundamental frequency (F0) and formants. Instead, temporal envelopes from fixed bands are extracted where phonetic cues are found to be implicitly represented. For example, F0 can be found in the periodicity on the envelope [12] and formants can be estimated by comparing the relative power of all bands [13]. More recent strategies have inherited key features of CIS, and the approach is still an option for almost all modern CI products.
Modern CIs are still far from perfect, which means CI users face challenges in many sound perception tasks. Simulators of CIs have been investigated throughout the history of CI technology development, e.g., [14], [15], [16]. They were proposed to model the electric hearing of CI recipients using acoustic stimuli presented to normal-hearing (NH) people. Electric hearing refers to the perception of sound through electrical stimulation of the auditory nerve by a CI, rather than through air vibration.
The most widely used CI simulators are temporal-envelope-based vocoders [15]. In these vocoders, temporal envelopes from multiple bands are extracted and then directly used to modulate band-limited noise or sine-wave carriers. The modulated carriers are summed to obtain the output signal for stimulation. Because of the similarities in temporal envelope extraction, these kinds of vocoded sounds were assumed to transmit similar information as CI devices. This is a widely accepted assumption, and consistent performance trends between the two hearing modes have often been reported [8], [17]. The advantage of these conventional continuous-carrier vocoders is obvious: they can be used to predict the overall trend of intelligibility with CIs.
However, there are significant simulation-to-real disparities in absolute perceptual test results [8], [18]. Some degradation methods with good physiological hypotheses can be used to decrease the disparities. One solution is to adjust the degree of current spread in the vocoder [19]. The current spread between CI electrodes refers to the distribution of electrical stimulation delivered by the electrodes to the auditory nerve fibers in the cochlea. This current spread can affect the frequency resolution, potentially causing speech degradation. Another solution is to include shallow insertion or frequency shift in the simulations [20].
Instead, here we focus on the encoding, i.e., signal processing, implementations of CI strategies. According to the proposed critical principle, the encoding stages of a CI simulator and the encoding stages of a target CI strategy should be the same or as close as possible. However, the signal processing of conventional temporal-envelope-based vocoders and that of the CIS strategy described above differ in many obvious ways. They may share the same envelope-extraction methods, but the later stages of CIS, including envelope compression and pulsatile carrier modulation, are not included in the vocoders' processing. This could be another reason behind the simulation-to-real disparities.
We developed an advanced simulator, the GET vocoder, in [21], and propose a variant, the Gaussian-Enveloped Noise (GEN) vocoder, in the current work. The idea is to transform the pulsatile CI electric stimuli directly into acoustic sound in a pulse-to-pulse manner. Technically, GET/GEN-vocoded sound can maintain the same information as a CI strategy, as they use the same process to encode the original sound. For an in-depth introduction to the connections and differences between GET and other related vocoders, see [21].
In preliminary studies [21], [22], a theoretical analysis based on the time-frequency uncertainty principle was performed,1 and the advantage of GET in a single SRT test was also verified.
In this study, the GET vocoder and its newly proposed variant, the GEN vocoder, are used to examine the proposed critical principle for prosthesis simulators. We argue that, following this principle, the GET/GEN vocoders can yield comparable perceptual patterns across NH and CI listeners in multiple tasks: SRTs as well as consonant and vowel confusions. We selected CI users of the advanced combination encoder (ACE) strategy in their clinical processors as the simulation target, as this kind of processor is estimated to be used by half of CI users worldwide [2]. In Sec. III, several algorithms are introduced: a standard ACE strategy, the GET vocoder and the variant GEN vocoder for ACE-CI simulation, and the control simulators (i.e., conventional vocoders with continuous sine-wave carriers and current spread manipulation). In the following sections, the speech intelligibility simulation performance of GET/GEN is systematically validated. To conclude, CI simulators and their implications for general neuroprosthesis simulation are further discussed. The GET/GEN MATLAB program is freely available at https://github.com/BetterCI/GETVocoder.

1 According to the time-frequency uncertainty principle, acoustic GET or GEN pulses cannot be made extremely short: there is a trade-off between the temporal duration of a signal and its frequency content. This limits the effectiveness of GET/GEN in simulating electric pulses with microsecond widths in CIs [21].

III. ALGORITHMS AND MODELS
The proposed framework for a comprehensive simulation of CI processing is conceptually demonstrated in Fig. 1. As shown in Fig. 1A, the incoming sound captured by a microphone is pre-processed and then passes through a CI signal processing strategy. For a real CI, the information to be represented at the electrodes is determined by the strategy, so the signal then branches and is delivered either to an NH acoustic ear (by sound synthesis and electroacoustic devices) or to a CI electric ear (by the CI implant device and electrodes). The GET and GEN vocoders follow the acoustic branch and differ only in the carrier signals used for synthesis. They simulate individual electric pulses using individual acoustic pulses, with the assumption of delivering the same sound information to the same group of neurons (see Fig. 1B). They were used with NH listeners to simulate the perception of CI users of the ACE strategy, and the results were compared with those of actual ACE CI users. Conventional sine-wave vocoders with current spread manipulation were also used as controls.
A. ACE Strategy

ACE and CIS processing share features such as band-pass filtering, temporal envelope representation, and interleaved sampling. Their main difference is that ACE uses n-of-m maxima selection, i.e., in each time frame, among the total m (≤ 22) bands, only the n maxima (default: n = 8) with the highest energies generate electric pulses on the corresponding electrodes [22], [23].
The incoming sound x is sampled at a rate f_s = 16000 Hz and pre-emphasized by a high-pass filter (Eq. (1)). The pre-emphasized signal y is then processed in an overlapped-frame manner. Each frame, denoted v[n], has 128 sampling points (corresponding to 8 ms; n ∈ {0, 1, 2, . . . , 127}) and is windowed by a Hanning window. The frame shift is determined by a pre-selected stimulation rate; here the stimulation rate r was 900 pps (pulses per second), so the frame shift was s = ⌈ f_s/r ⌉ = 18 points.

A discrete Fourier transform (realized by a fast Fourier transform, FFT) transfers v[n] to V[k] in the frequency domain, and only the left half of the symmetric bins is preserved (Eq. (2)). The bins between k = 2 and 63 are combined into 22 bands; the bin counts of the low-to-high bands are [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 8]. The power u of each band is calculated from the weighted bin powers (Eq. (3)); the weight matrix w ∈ R^{22×65} is determined by the frequency response of the window function.

For each frame, u has 22 values, among which the 14 lowest are rejected (i.e., set to zero); in this way, the so-called "n-of-m" selection is realized. The preserved maxima u′ are then compressed into g (Eq. (4)), and g is quantized with eight bits into the range between the threshold (T) and comfortable (C) current levels. The quantized g′ is then used to control the magnitude of the individual charge-balanced biphasic electric pulses (phase duration: 25 µs; inter-phase gap: 8 µs; see the inset in Fig. 1). The ACE electrodogram of a sentence is demonstrated in the left panels of Fig. 2.
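The per-frame FFT analysis and n-of-m maxima selection described above can be sketched as follows. This is a minimal illustration: the band-combination weight matrix passed in is a placeholder (the real one follows the window's frequency response), and the compression and quantization stages are omitted.

```python
import numpy as np

def ace_frame(v, weights, n_maxima=8):
    """One frame of a simplified ACE analysis (illustrative sketch).

    v       : 128-point Hanning-windowed frame (fs = 16 kHz, 8 ms)
    weights : 22x65 band-combination matrix (placeholder; the real
              matrix is derived from the window's frequency response)
    """
    V = np.fft.rfft(v)                 # 65 non-negative-frequency bins
    power = np.abs(V) ** 2
    u = weights @ power                # 22 band energies
    # n-of-m selection: keep the n_maxima largest bands, zero the rest
    u_sel = np.zeros_like(u)
    idx = np.argsort(u)[-n_maxima:]
    u_sel[idx] = u[idx]
    return u_sel
```

The surviving 8 band energies per frame would then be compressed and quantized before driving the electrodes.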

B. Gaussian-Enveloped Tone/Noise Vocoders
We have argued that faithfully replicating the signal processing stages of a CI strategy is critical for a good CI simulation. The GET vocoder [21] meets this requirement. In this study, we further propose a new variant of GET, the GEN vocoder. The ACE strategy is used as the simulation target. Each electric pulse calculated by the ACE strategy is directly mapped into an envelope pulse according to

p_{n_0,t_0}(t) = g_{n_0,t_0} \exp\left( -\frac{(t - t_0)^2}{2\sigma^2} \right),   (5)

where n_0 and t_0 denote the band number and time center of the electric (and hence the acoustic) pulse, g_{n_0,t_0} is an element of g recovered from g′, and the standard deviation σ controls the duration of the acoustic pulse; σ = 3/f_c in our experiments.
To synthesize a sound for the simulation, an envelope pulse p_{n_0,t_0} with unit amplitude (i.e., g_{n_0,t_0} = 1 in Eq. (5)) is used as an impulse response for the electric stimuli, so that a convolution transforms the electric band signal into an envelope band signal P_{n_0}(t). In this convolution, a traditional simplified CI electrodogram was used rather than the detailed one demonstrated in Fig. 2, i.e., each electric pulse was represented by a δ function (a single vertical line) rather than the detailed biphasic waveform. The final acoustic band signal is

x_{n_0}(t) = P_{n_0}(t) \sin(2\pi f_c t + \phi_0),   (6)

where f_c is the center frequency of the current band and φ_0 is an arbitrary initial phase, set to zero in our experiments. The band signals are demonstrated in the right panels of Fig. 2; they are summed to generate the GET-vocoded sound.

Although there is a "tone" in the name of GET, the carrier is intrinsically not limited to sine waves. Here we add another possibility, the noise carrier of the GEN vocoder. This implementation simply replaces the sine carrier in Eq. (6) with a sum of sinusoids,

\sum_{l=1}^{L_{n_0}} \sin(2\pi f_{n_0,l} t + \phi_{n_0,l}),   (7)

where L_{n_0} = ⌈0.1 B_{n_0}⌉, i.e., 0.1 times the bandwidth B_{n_0} of band n_0, rounded up to a whole number. In Eq. (7), each frequency f_{n_0,l} is a random value uniformly distributed within the corresponding frequency band, and each initial phase φ_{n_0,l} is a random value uniformly distributed in [0, 2π].
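A minimal sketch of the GET/GEN building blocks is given below, assuming the definitions stated in the text (Gaussian envelope with σ = 3/f_c; GEN carriers built from ⌈0.1 × bandwidth⌉ random sinusoids). The function names and the pulse-duration argument are illustrative choices, not part of the published program.

```python
import numpy as np

def get_pulse(fs, dur, t0, fc, g, phi0=0.0):
    """One Gaussian-enveloped tone pulse: a Gaussian envelope centred at
    t0 with sigma = 3/fc, modulating a sine carrier at fc (GET case)."""
    t = np.arange(int(dur * fs)) / fs
    sigma = 3.0 / fc
    env = g * np.exp(-((t - t0) ** 2) / (2.0 * sigma ** 2))
    return env * np.sin(2.0 * np.pi * fc * t + phi0)

def gen_carrier(fs, dur, f_lo, f_hi, rng):
    """GEN noise carrier: ceil(0.1 * bandwidth) sinusoids with random
    frequencies in [f_lo, f_hi] and random phases in [0, 2*pi)."""
    n_sines = int(np.ceil(0.1 * (f_hi - f_lo)))
    t = np.arange(int(dur * fs)) / fs
    f = rng.uniform(f_lo, f_hi, n_sines)
    phi = rng.uniform(0.0, 2.0 * np.pi, n_sines)
    return np.sin(2.0 * np.pi * np.outer(f, t) + phi[:, None]).sum(axis=0)
```

A full vocoder would place one such pulse per selected electric pulse, per band, and sum the band signals.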

C. Conventional Sine-Wave Vocoders With Current Spread Manipulation
To evaluate the potential superiority of GET or GEN, conventional vocoders [15], [16] were used as controls. The currents spreading from individual CI electrodes may interact with one another, and the naive implementation in conventional vocoders, without current spread simulation, often overestimates CI performance. Previous studies have shown that a certain degree of current spread simulation can approximate average CI performance on speech-in-noise intelligibility [19]. The assumed spread of current from each electrode to remote neurons can be simulated by adding the weighted envelope energies of each channel to those of the other channels. Here, we implemented conventional sine-wave vocoders with current spread simulation as follows.
The incoming sound x[n] is sampled at a rate f_s = 16000 Hz. The pre-emphasized signal y[n] is obtained by Eq. (1). A bank of 22 sixth-order band-pass Butterworth filters is then used to process y. The cutoff frequencies equal those of the FFT-based band-pass filters in the ACE strategy.
Each band signal is then filtered by an eighth-order low-pass Butterworth filter with a cutoff frequency of 250 Hz; the filter output is the temporal envelope of the k-th band, denoted E_k[n]. The degree of current spread, in dB/octave, is denoted by w. The spectral smearing effect is realized by attenuating the envelope of each band, with center frequency f_m, by w dB per octave of spectral distance from the receiving band and summing the attenuated envelopes across bands. Following the literature [19], [24], we used w = 8, 10, and 12 dB/octave in our experiments. The smeared envelopes are then used to synthesize the vocoded output, with an overall gain A adjusted to keep the root-mean-square of the whole signal unchanged after the spectral smearing.
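The current spread manipulation can be sketched as follows. The exact per-band weighting in the paper's equation is an assumption here (attenuation of w dB per octave of center-frequency separation, applied to the envelopes), so this is an illustrative sketch rather than the reference implementation.

```python
import numpy as np

def smear_envelopes(env, centers, w_db_oct):
    """Current-spread simulation sketch: each band's envelope leaks into
    every other band, attenuated by w dB per octave of spectral distance
    between centre frequencies.

    env     : (n_bands, n_samples) temporal envelopes E_k[n]
    centers : (n_bands,) band centre frequencies in Hz
    """
    centers = np.asarray(centers, dtype=float)
    oct_dist = np.abs(np.log2(centers[:, None] / centers[None, :]))
    gains = 10.0 ** (-w_db_oct * oct_dist / 20.0)   # 0 dB on the diagonal
    return gains @ env
```

With a large w the matrix approaches the identity (no smearing); smaller w values mix neighboring channels more strongly, mimicking broader current spread.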

D. Spectrograms of Vocoded Sounds
One Mandarin sentence from the MSP corpus was processed by seven algorithms: ACE, GET50, GET150, GEN50, Sine12, Sine10, and Sine8. The electrodogram or spectrograms of the output signals are illustrated in Fig. 3. These algorithms were used in the two experiments (ACE, GET50, and GET150 in the first; ACE, GET50, GEN50, Sine12, Sine10, and Sine8 in the second).

IV. EXPERIMENTS

A. Overview
Neural prostheses are expected to improve performance in all aspects of the corresponding sense rather than in a single task. Here, we examine the proposed critical principle for effective prosthesis simulators in the context of CIs: faithfully following the signal processing details of the prosthesis is critical for obtaining the best simulation performance across multiple tasks.
The framework of the recently proposed GET vocoder adheres to this critical principle. It has been shown to yield means and variances of SRTs in noise on a sentence corpus similar to those of actual CIs [21], [22]. In this work, we designed a GET variant (i.e., GEN) and developed a systematic test battery to validate the GET/GEN model. Training effects (Exp. 1), multiple sentence corpora (Exps. 1 and 2), and vowel and consonant confusions (Exp. 2) were covered by the two experiments. All algorithms in Sec. III were tested.
If one vocoder outperforms the others in most tasks of the battery, it would be recognized as optimal among the included vocoders. In total, nine CI users (see Table I) and 66 NH listeners who listened to vocoded speech were recruited. All NH participants self-reported no otological disease or hearing loss. The study comprised 242 person-visits, totaling about 125 hours of testing. All participants received financial compensation or course credit for their participation. Written informed consent was obtained from all participants before the experiments, and all procedures were approved by the ethical review board of Shenzhen University.

B. Experiment 1: SRT in Noise-Training Effects and Multiple Corpora
Difficulty in speech perception in noise is one of the most common complaints about the CI listening experience. The signal-to-noise ratio (SNR) at which a listener can recognize a certain percentage (e.g., 50%) of the words in a sentence is defined as the SRT. To quantitatively estimate speech-in-noise perception ability, SRTs for target sentences in steady or babble noise are usually measured through an adaptive psychophysical procedure [25].
In two previous studies by the authors [21], [22], we measured SRTs with a GET simulation using a single corpus. Specifically, in [22] we evaluated the effects of the electrical dynamic range (EDR) of CIs with both actual CI users and GET vocoders. EDR is defined as the difference between the comfortable (C) level and the detection threshold (T) level at individual electrodes of individual implantees. Three EDRs were compared: 30, 100, and 150 CL (current level, defined by a CI manufacturer). We found that a narrower EDR within the perceivable range leads to worse speech performance (i.e., higher SRT in noise). GET simulation with EDR = 150 CL yielded SRTs comparable to those of actual CI users with much narrower EDRs (typically 20-80 CL). We argue that this discrepancy could be attributed to differences in listening experience. For newly implanted CI users, training and rehabilitation may take months or longer to help their brains learn to use this new type of stimulation to represent sound [26]. The CI listeners had months or years of CI listening experience, whereas the NH participants were naive to the vocoded sound, with only the limited training received during the experiment itself.
In this experiment, GET vocoders with EDR = 50 and 150 CL were used in two groups of NH participants. SRTs were measured daily using four different corpora. The NH participants visited the laboratory on five consecutive days, and the CI participants visited only once. It was hypothesized that, given enough training, speech encoded with the same strategy and parameters would yield similar intelligibility on any corpus for both electric CI listeners and acoustically simulated CI listeners.
1) Methods: Nine CI users (see Table I) and 16 NH listeners (college students; age range: 20 to 26; 9 males) were recruited in Experiment 1.
CI users were all tested with their daily-used processors and the ACE strategy (Sec. III-A). NH listeners were tested with GET-vocoded (Sec. III-B) sound delivered through a sound card and headphones. The ACE and GET algorithms were implemented as described in Sec. III. The clinical EDRs of the CI users were in the range of 30-100 CL. For NH listeners with GET-vocoded sound, EDR = 50 CL and 150 CL were tested in two groups (each with eight listeners). The four Mandarin sentence corpora were the Mandarin speech perception corpus (MSP) [27], the Mandarin hearing in noise test (MHINT) [28], an in-house corpus developed by our group at the South China University of Technology (SCUT), and the Mandarin Chinese matrix test (CMNmatrix) [29]. More details about the SCUT corpus are provided in the supplementary materials. There are seven, ten, and ten monosyllabic words per sentence in MSP, MHINT, and SCUT, respectively. CMNmatrix sentences contain five disyllabic words with the fixed structure "name-verb-number-adjective-object". All corpora were provided by their developers. The MHINT materials were recorded by a male speaker; the other corpora were recorded by a female speaker.
During each visit, listeners were tested using four 20-sentence lists, one from each of the four corpora, in the order MSP, SCUT, MHINT, and CMNmatrix. For each participant, no list was used more than once. An adaptive run on one list generated an SRT50, i.e., the SNR at which the subject has a 50% probability (tracked by a 1-down-1-up procedure) of recognizing the criterion percentage of the words in a sentence. The percentage criteria for MSP, SCUT, MHINT, and CMNmatrix were 70%, 75%, 75%, and 50%, respectively. The noise was a 20-talker (10 male, 10 female) babble, generated using the same method as in [22]. The adaptive procedure for CMNmatrix also followed the method in [22], while the others followed the method in [30]. For training purposes, at the end of each trial, the sentence text was shown and the corresponding audio was replayed to the listener.
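The 1-down-1-up track described above can be sketched as follows. The starting SNR, step size, and reversal-averaging rule here are illustrative assumptions; the actual procedures follow [22] and [30].

```python
def staircase_srt(respond, start_snr=10.0, step=2.0, n_trials=30):
    """1-down-1-up adaptive sketch: the SNR drops after a correct trial
    and rises after an incorrect one, converging on the SNR at which the
    trial criterion is met 50% of the time.

    respond(snr) -> bool : True if the listener met the trial criterion.
    The SRT is taken as the mean SNR over the final reversals (a common
    convention; the cited procedures may average differently)."""
    snr, prev = start_snr, None
    track, reversals = [], []
    for _ in range(n_trials):
        correct = respond(snr)
        if prev is not None and correct != prev:
            reversals.append(snr)       # response flipped: a reversal
        prev = correct
        track.append(snr)
        snr += -step if correct else step
    tail = reversals[-6:] if len(reversals) >= 6 else track
    return sum(tail) / len(tail)
```

With a deterministic listener who is correct whenever the SNR is above 0 dB, the track oscillates around the threshold and the reversal average lands near it.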
2) Results: SRT results are shown in Fig. 4. To assess the effects of training and EDR on the simulated SRTs of NH listeners, a two-way mixed-design analysis of variance (ANOVA) was run separately for each corpus, with EDR as the between-subjects factor and day number as the within-subjects factor. Data normality was confirmed by the Shapiro-Wilk test (p > 0.05). Significant main effects of EDR and training were found for all four corpora. There was no significant interaction between EDR and day number, except for the CMNmatrix corpus. Detailed statistical results are provided in Table II. The interaction between EDR and day number for the CMNmatrix corpus was mainly driven by the comparable performance with EDR = 150 CL on the first two days; when the first day was excluded, the interaction became non-significant (F(3, 42) = 0.648, p = 0.589).
Overall, for the simulation results in NH listeners, the wider EDR (150 CL) yielded better performance than the narrower EDR (50 CL), and both EDRs showed a training effect, i.e., a significant improvement as the number of training days increased.
It is of interest to see how many days of training NH listeners with vocoder simulations need before reaching performance comparable to actual CI users. Independent t-tests were used to examine the significance of the mean SRT difference between each simulated condition and the corresponding CI condition (see Fig. 4 and detailed statistical results in the supplementary materials). From the third to the fifth day of training, for all four corpora, GET with EDR = 50 CL yielded mean SRTs comparable to those of the CI group.

Regarding the absolute mean SRTs, the mean SRTs of the CI users and of the NH group with EDR = 50 CL on the fifth day were higher than 10 dB for MSP, SCUT, and MHINT, but close to zero for CMNmatrix. The reasons include: 1) learning effects, i.e., participants had more practice because CMNmatrix was tested after the other three corpora during each visit; and 2) CMNmatrix was tested with a much different psychophysical procedure from the others, i.e., a lower percentage criterion for trial correctness and closed-set rather than open-set testing.

C. Experiment 2: Consonant/Vowel Identification and Confusion
The first experiment demonstrated that, after about two days of training, the GET vocoder with the same strategy and parameters as actual CI users could yield mean SRTs on multiple corpora similar to those of experienced CI users. However, this result alone is not sufficient to establish GET as an optimal CI simulator. Oxenham and colleagues have shown that conventional sine-wave vocoders can also yield SRTs close to CI results if the current spread is manipulated to degrade the signal to a proper degree [19], [24]. To further investigate the potential benefits of the GET vocoder and its variants, as well as the proposed framework for prosthesis simulation, it is necessary to include these sine-wave vocoders with current spread manipulation as controls.
Clear speech signals are highly redundant and robust to many kinds of distortion, such as clipping, band-limiting, compression, and vocoder processing. In noise, however, intelligibility may be degraded (i.e., SRTs increased) by all kinds of distortion. Toward our study aim, two monosyllabic tests were added: vowel identification and consonant identification. We assumed that SRT in noise and clear-phoneme discrimination may rely on different acoustic cues, which may be distorted in different ways by different vocoders, e.g., the GET vocoder versus the sine-wave vocoder. The identification scores and the confusion patterns were compared between CI users and five simulation groups of NH listeners. If a vocoder provides a better simulation in all tests (i.e., SRTs in noise, and vowel and consonant identification), it would be recognized as the better simulator.
1) Methods: Nine CI users (see Table I) and 50 NH listeners (college students; age range: 19 to 24; 22 males) were recruited in Experiment 2. The NH listeners were assigned to five groups of ten, each tested with one of the five vocoders. GET50 and GEN50 were implemented according to Sec. III-B with EDR = 50 CL. Sine12, Sine10, and Sine8 were implemented using the conventional 22-channel sine-wave vocoder described in Sec. III-C with three degrees of current spread simulation: 12, 10, and 8 dB/octave, respectively. SRTs in noise were measured in all six groups (one CI + five NH) with the MSP, SCUT, and MHINT corpora using the same procedure as in Experiment 1 (Sec. IV-B). Consonant and vowel identification was measured in a closed-set alternative forced-choice manner. There were 11 initial consonants embedded in a /Ca/ syllable in Tone 1, the first of the four tones of Mandarin [31], and 20 final vowels embedded in a /dV/ syllable in Tone 1 [32], [33]. We used recordings from three female and three male speakers, resulting in 66 (= 11 × 6) consonant tokens and 120 (= 20 × 6) vowel tokens in total. For the CI group, the SRT results from Exp. 1 were used for comparison with the GET- and GEN-processed speech in Exp. 2.
The first experiment showed that NH subjects need at least two days of training to become familiar with vocoded speech. In this experiment, NH participants did the tests remotely on three consecutive days, while CI participants were tested in the laboratory only once. The remote tests were carried out through a customized website that played audio stored on a server and collected the subjects' responses. On each day, the task order was the three SRT tasks (MSP, then SCUT, then MHINT), followed by the consonant test and then the vowel test. In the SRT tasks, answer correctness was provided in the same way as in Experiment 1; in the consonant and vowel tasks, no feedback about answer correctness was provided.
2) Results: An initial examination of the data showed that the remotely collected data from Exp. 2 sometimes contained odd results and exhibited larger variances than the laboratory data from Exp. 1. Two of the 50 participants, S14 (GET50) and S54 (Sine10), were excluded from the analysis. S14 was excluded due to missing data. S54 was excluded because the subject consistently selected a fixed answer in the vowel and consonant tasks (see the supplementary materials), indicating that the subject may not have fully understood the tasks or may not have been fully engaged, making the data unreliable. For each of the five tests with each participant, the better result (i.e., the lower SRT or the higher percentage score) of the second and third days was retained for further analysis.
The mean results are illustrated in Fig. 5. Consistent with the first experiment, all actual and simulated CI SRTs were higher than the theoretically much lower SRTs that NH listeners would obtain with non-vocoded original signals (not tested in this experiment). The mean scores for both consonants and vowels were below 60%, far below the typical performance of NH individuals listening to non-vocoded speech.
One-way ANOVAs revealed significant differences among the outcomes produced by the six algorithms. Post-hoc pairwise comparisons without correction showed that, in all five tasks, the mean results with GET50 and GEN50 did not differ significantly from those of the actual CI group. Sine8 yielded the poorest performance among the six groups. The mean results of Sine12 and Sine10 showed no significant difference; the underlying cause remains unclear. They outperformed the CI group in the SRT tasks (by more than 2.0 dB; statistically significant for Sine12 with MHINT, but not for the others) but were significantly worse than CI in the monosyllabic phoneme recognition tasks (by more than 12%). This indicates that conventional sine-wave vocoders degrade speech information in a different manner from ACE and GET/GEN. Detailed descriptive statistics are given in Table III.
To further examine the detailed confusion patterns, the mean vowel and consonant confusion matrices are shown in Fig. 6. The distance between each simulated matrix and the CI matrix was calculated as the root-mean-square (RMS) of the differences between corresponding matrix elements. The distances between the five simulated matrices (GET50, GEN50, Sine12, Sine10, and Sine8) and the CI matrices were respectively 6.6, 5.7, 7.9, 7.5, and 9.6 for the vowel results, and 11.2, 9.1, 9.5, 9.6, and 10.9 for the consonant results. GEN processing produced the smallest RMS difference and was thus closest to actual CI processing, suggesting that GEN50 is the best simulator of those tested. Based on these distance results, GET50 performed better (i.e., lower distance) than the SineX vocoders in vowel recognition but worse (i.e., higher distance) in consonant recognition.
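The RMS distance between confusion matrices can be computed as below. The two 3×3 matrices are hypothetical toy examples (rows: presented phoneme, columns: responded phoneme), not the study's 20×20 vowel or 11×11 consonant matrices.

```python
# RMS distance between two confusion matrices of identical shape,
# as used in the text to compare simulated and actual CI patterns.
from math import sqrt

def rms_distance(sim, ref):
    """Root-mean-square of element-wise differences between two
    equally sized confusion matrices (given as lists of rows)."""
    diffs = [s - r
             for srow, rrow in zip(sim, ref)
             for s, r in zip(srow, rrow)]
    return sqrt(sum(d * d for d in diffs) / len(diffs))

# Hypothetical 3x3 matrices for illustration only
ci_mat  = [[80, 10, 10],
           [ 5, 90,  5],
           [10, 10, 80]]
gen_mat = [[75, 15, 10],
           [10, 85,  5],
           [10, 15, 75]]

print(rms_distance(gen_mat, ci_mat))  # ≈ 4.08
```

A smaller value indicates a confusion pattern closer to that of the actual CI group, which is how GEN50 was identified as the closest simulator.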

A. CI Simulator
In this work, two experiments were carried out to experimentally validate the GET/GEN vocoders and the proposed prosthesis simulation principle. The most widely used CI signal processing strategy, ACE, and corresponding CI users were included. After about two days of training, NH participants listening to sounds processed by the GET or GEN vocoders showed mean results consistent with those of CI users in all tests. However, the conventional sine-wave vocoders with 12 and 10 dB/oct current spread simulation outperformed the CI group in SRTs but performed worse than the CI group in consonant and vowel recognition. This interaction indicates that conventional sine-wave vocoders do not transmit speech information in the same way as CIs. For example, the dynamic range was not compressed in the sine-wave vocoders. A wide dynamic range is critical to speech-in-noise intelligibility [22] and would partly offset the degradation introduced by the simulated current spread. Identification of clear phonemes may be more sensitive to current spread, or may not be substantially supported by a wide dynamic range. Regarding the consonant and vowel confusion patterns, the newly proposed GET variant, i.e., the GEN vocoder, also performed better. The timbre of the "tone" carriers might strongly influence consonant discrimination with GET50, a limitation compensated by the noise carrier in GEN50.
This study is limited in several ways. The sample size is small (N ≤ 10). Experiment 1 revealed that after three to five days of half-hour daily training sessions, NH listeners were able to comprehend GET/GEN-processed speech at a level comparable to that of CI listeners. While performance appeared to level off by the fifth day, it cannot be ruled out that further training might lead to lower speech recognition thresholds (SRTs). Further research is needed to simulate the rehabilitation process of CI users following activation, including their performance at 1, 3, and 12 months after activation. Additionally, current spread [34] and frequency-place shift [20] were neither quantified nor manipulated in the GET/GEN vocoders. The GET/GEN vocoders are highly effective in simulating the group performance of the CI users tested. In future studies, more individual parameters will be considered to determine their impact on predicting individual performance.
Instead of human listeners, computational algorithms have also been proposed for CI performance prediction. Brochier and colleagues combined biophysical models of the electric field in the cochlea, a neural model of signal processing in the auditory nerve, and automatic speech recognition to predict the perception and misperception of phonemes and SRTs with CIs [35]. That method and our GET/GEN vocoders share a similar philosophy of utilizing as much information as possible from the CI processing chain to encode sounds. Both human and machine simulation have their own advantages. Human simulation can assume that the cognitive and language abilities in the brain are equivalent, at least for some post-lingually deafened CI users, and the time course of learning and rehabilitation can be simulated in human subjects. Human simulation can also serve not only as a tool for CI performance prediction but also for demonstration and education. Machine simulation is much less time-consuming, a benefit to the industry. In this work, our focus is on human simulation.

B. Neuroprosthesis Simulators
The lessons from CIs for other neuroprostheses have been discussed in previous literature [7], [9], [36]. These limited works mainly compared bionic hearing and vision. Among them, the most systematic comparison, with balanced and detailed treatment of both sides, was provided in [9], published in 2007. This status might be due to the limited overlap between the researchers, knowledge, and tools of the two fields. In recent years, many simulation studies have added new features such as gaze contingency [37], temporal features [38], infrared imaging [39], end-to-end optimization [40], semantic face image translation [41], and electrode-retina distance [6] to supplement previously unsimulated features or to add new capabilities to current prosthetic vision techniques. Similar work has also appeared in other prosthetic fields, e.g., prosthetic arms and hands [42]. These experiments usually involved only normal subjects and no actual implant users, mainly because of the small population of implanted patients.
Most previous CI simulators can predict trends in actual performance as simulation parameters vary. For example, [43] examined the effects of channel number (from 1 to 9) on sentence recognition in quiet. However, the absolute quantitative scores of CI users could not be easily predicted by these simulators [8], [9]. Some studies have introduced certain degrees of current spread to derive intelligibility in noise similar to that of CI users [19]. However, Experiment 2 of this work showed that this method cannot produce satisfactory simulation results simultaneously in SRT and phoneme recognition tasks. We argue that simulating only key features (e.g., the temporal envelope) is not enough. The current work verified that faithfully replicating the signal processing details of the real implant can produce comparable perceptual results and patterns across multiple psychophysical tasks.

VI. CONCLUSION
The GEN vocoder, a variant of our recently proposed GET vocoder, can be used, after sufficient training, to simulate the most widely used ACE strategy. The perceptual patterns in SRTs with multiple corpora and in vowel/consonant phoneme recognition were quantitatively consistent between simulated and actual CI users. This successful simulation demonstrates that faithfully replicating the signal processing (or degradation) details of the real prosthesis is a necessary step in prosthesis simulation. This lesson from the present audition study may also be helpful for other sensory neuroprostheses, considering the great similarities in their electrode-to-brain interfaces and in brain sensation and cognition.

ACKNOWLEDGMENT
The authors would like to thank all volunteers who participated in their experiments. They thank XU Rui and CAO Teng for their assistance with data collection in Experiment 1. They thank Li Xu for providing the speech materials for the vowel and consonant recognition tests. Language polishing in several parts of the article during revision was assisted by OpenAI's ChatGPT.