The Factor Analysis of Speech: Limitations and Opportunities for Cochlear Implants


Introduction
Current spread limits the benefit of increased numbers of electrodes in cochlear implants, because ganglion cells at a given location on the spiral ganglion are stimulated by multiple electrodes (e.g. Abbas et al. [1]). Friesen et al. [2] showed that the percent-correct sentence recognition in noise of cochlear implant users plateaued once about seven electrodes were activated. They attributed this effect to the influence of current spread. In order to research this effect, Grange et al. [3] developed a novel acoustic simulation of a cochlear implant, the SPIRAL vocoder. SPIRAL modulates a large, and fixed, number of sinusoidal carriers according to the mixed influences of an independently controlled number of electrodes. Using SPIRAL, Grange et al. confirmed a limiting effect of current spread on the benefit of additional electrodes. However, in the present study, we use SPIRAL to show that the increasing benefit of additional electrodes shows a marked inflection at around seven electrodes even in the absence of current spread, indicating the existence of a more fundamental limiting factor. We postulated that this second limiting factor might be informational redundancy in speech itself. In order to investigate this possibility, we factor analysed speech envelopes extracted from an auditory filterbank. Factor analysis allows one to quantify the number of independent modulators in the speech signal and, through the factor loadings, to characterise the spectral extents of those modulators. These loading patterns indicated that conventional logarithmic spacing of analysis channels may not provide an optimal frequency map for cochlear implant processors.
Received 28 February 2018, accepted 10 July 2018.

Materials and vocoding
Speech and noise were mixed and vocoded using SPIRAL [3], a vocoder designed to simulate listening through a cochlear implant (CI) with normally-hearing listeners. In one experiment, the target speech consisted of IEEE sentences (Rothauser et al., 1969) from the M.I.T. recordings, spoken by a male speaker ('DA'). In the second experiment, digit triplets were used, spoken by a female speaker and recorded in our laboratories. Each set of target material was mixed with speech-shaped noise, which was spectrally filtered to match the long-term spectrum of the target speech for that experiment.
SPIRAL employed 80 sinusoidal carriers equally distributed along an ERB scale in the 20-20000 Hz range. Input signal analysis was performed by rectangular bandpass 512-point finite impulse-response filters uniformly distributed along an ERB scale. The bandpass filters covered a 120-8658 Hz frequency range fully and without overlap, such that filter widths increased as the number of activated channels decreased. To extract temporal envelopes, the filtered waveforms were half-wave rectified and low-pass filtered with a 50 Hz cut-off. The centre frequency of a band was the place frequency of the corresponding simulated CI electrode. With no simulated current spread (spread of excitation set to −200 dB/oct), the temporal envelope extracted within a band modulated only a small number of tone carriers, whose frequencies were closest to the place of the simulated electrode.
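The ERB spacing and envelope extraction described above can be illustrated with a minimal Python sketch. It is not the SPIRAL implementation. The ERB-number formula (Glasberg and Moore, 1990), the fourth-order zero-phase Butterworth low-pass filter and all function names are assumptions made for illustration; the text specifies only the ERB spacing, the frequency ranges, half-wave rectification and the 50 Hz cut-off.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def hz_to_erb_number(f_hz):
    """ERB-number scale (assumed here: Glasberg & Moore, 1990)."""
    return 21.4 * np.log10(4.37 * np.asarray(f_hz, dtype=float) / 1000.0 + 1.0)


def erb_number_to_hz(erb):
    """Inverse of hz_to_erb_number."""
    return (10.0 ** (np.asarray(erb, dtype=float) / 21.4) - 1.0) * 1000.0 / 4.37


def erb_spaced(f_lo, f_hi, n):
    """n frequencies equally spaced on the ERB-number scale."""
    erb = np.linspace(hz_to_erb_number(f_lo), hz_to_erb_number(f_hi), n)
    return erb_number_to_hz(erb)


def band_edges(f_lo, f_hi, n_channels):
    """Edges of n_channels contiguous, non-overlapping analysis bands."""
    return erb_spaced(f_lo, f_hi, n_channels + 1)


def extract_envelope(band_signal, fs, cutoff_hz=50.0, order=4):
    """Half-wave rectify, then low-pass filter at cutoff_hz.

    Filter type and order are assumptions; the text gives only the cut-off.
    """
    rectified = np.maximum(np.asarray(band_signal, dtype=float), 0.0)
    sos = butter(order, cutoff_hz, btype="low", fs=fs, output="sos")
    return sosfiltfilt(sos, rectified)


# 80 carriers over 20-20000 Hz; band edges for, e.g., 8 activated channels.
carriers = erb_spaced(20.0, 20000.0, 80)
edges = band_edges(120.0, 8658.0, 8)
```

Because the band edges always span the full 120-8658 Hz range, reducing `n_channels` widens every band, which matches the behaviour described in the text.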

Procedure
SRTs were measured using a one-up/one-down adaptive tracking method that kept the combined level of speech and noise at 65 dB A. For the IEEE sentences (which contain five key words), the adaptive track converged on the 30% point in the psychometric function by increasing the signal-to-noise ratio (SNR) by 2 dB if fewer than two key words in a given sentence were identified and otherwise decreasing it by 2 dB. For the digit triplets, the adaptive track converged on the 50% point by increasing the SNR by 2 dB if fewer than 2 digits were correctly reported, and decreasing it by 2 dB if not. The IEEE-sentence SRT measurements started with a low SNR (set 12 dB lower than SRTs measured in practice runs); the SNR was then increased in 4 dB steps until at least one word from the first target sentence was correctly identified, at which point the adaptive phase started. For digit triplets, the initial SNR was high (set 12 dB above practice-run SRTs) and was decreased in 4 dB steps until fewer than 2 digits were correctly identified. Sentence SRTs employed lists of 10 sentences and were computed as the mean of the last 8 computed SNRs. Digit-triplet SRT tracks stopped after 10 reversals and SRTs were computed as the mean of all the computed SNRs of the adaptive phase. Each SRT condition had a different number of activated channels. These were 4, 5, 6, 8, 10, 15 or 20 for IEEE sentences and 3, 4, 6, 8, 10, 15 or 20 for digit triplets. For sentence SRTs, the sentence order was fixed and the condition order was quasi-randomized and rotated against the material. 210 sentences were used in total. The experiments included conditions with current spread that are not reported here. Each participant performed four practice SRT runs prior to the experimental runs.
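The adaptive phase of the sentence track can be sketched as follows. This is a simplified illustration, not the software used in the experiments. The function name, the omission of the initial 4 dB approach phase, and the choice to average the SNRs computed after each response are assumptions.

```python
def adaptive_track(n_correct_per_trial, start_snr_db,
                   criterion=2, step_db=2.0, n_average=8):
    """One-up/one-down track (adaptive phase only, hypothetical helper).

    n_correct_per_trial: number of key words (or digits) reported correctly
    on each trial. The SNR rises by step_db when fewer than `criterion`
    items are correct and falls by step_db otherwise.
    Returns (srt_estimate, list_of_computed_snrs).
    """
    snr = start_snr_db
    computed = []
    for n_correct in n_correct_per_trial:
        if n_correct < criterion:
            snr += step_db  # too hard: make the next trial easier
        else:
            snr -= step_db  # criterion met: make the next trial harder
        computed.append(snr)
    tail = computed[-n_average:]
    return sum(tail) / len(tail), computed


# Ten made-up sentence scores (key words correct out of five).
srt, snrs = adaptive_track([3, 1, 2, 0, 4, 1, 3, 1, 2, 1], start_snr_db=0.0)
```

With `criterion=2` the same function covers the digit-triplet rule; the stopping rule after 10 reversals is not modelled here.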
For each experiment, 21 young adults participated, recruited from the Cardiff University undergraduate population and self-reported as normally hearing (age mean 19, range 18-21). Briefing, consent and debriefing followed the rules set out by the institutional review board.

Factor analysis
Factor analysis (FA) was conducted on the concatenation of all IEEE sentences spoken by the M.I.T. speaker DA, in the 200-8000 Hz range. The FA followed the procedure used by Ueda and Nakajima [4]: temporal modulation envelopes were extracted (by gammatone filtering every 1/4 ERB, half-wave rectification and low-pass filtering at 50 Hz), then squared and converted to z-scores such that a co-modulation analysis across frequency bands would operate on a correlational basis. The resulting power envelopes were fed through a principal component analysis (PCA Matlab program from Brian Moore, University of Michigan, Nov. 2016), which performed a singular value decomposition of the power envelope matrix and allowed the user to specify the number of retained components. The derived component loadings were then subjected to a varimax rotation to maximize factor independence and obtain factor loading curves. Each factor loading curve represents a co-modulating region of the speech spectrum that modulates most independently of the others. In order to generate a "scree" plot of reducing normalized Eigenvalue as a function of the number of retained factors, normalized Eigenvalues were derived for each number of retained factors, through matrix multiplication of the normalized Eigenvectors and the z-scored power-envelope correlation matrix, followed by point-by-point division by the normalized Eigenvectors. The resulting scree plot represents the amount of information each additionally retained factor adds in the explanation of the temporal envelope modulations that carry speech information.
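The analysis chain (squaring, z-scoring, PCA by singular value decomposition, varimax rotation) can be sketched in NumPy as below. This is an illustrative reconstruction, not the Matlab program cited above. The function names are hypothetical, the varimax routine is the standard SVD-based iteration, and the normalized eigenvalues are obtained directly from the singular values rather than by the eigenvector multiplication and division described in the text.

```python
import numpy as np


def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation of a (bands x factors) loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        col_ss = np.sum(rotated ** 2, axis=0)
        grad = loadings.T @ (rotated ** 3 - rotated * col_ss / p)
        u, s, vt = np.linalg.svd(grad)
        rotation = u @ vt
        new_criterion = s.sum()
        if new_criterion < criterion * (1.0 + tol):
            break
        criterion = new_criterion
    return loadings @ rotation


def factor_loadings(envelopes, n_factors):
    """envelopes: (time samples x frequency bands) array of band envelopes.

    Returns (varimax-rotated loadings, normalized eigenvalues for a scree plot).
    """
    power = np.asarray(envelopes, dtype=float) ** 2         # power envelopes
    z = (power - power.mean(axis=0)) / power.std(axis=0)    # z-score per band
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    eigvals = s ** 2 / z.shape[0]       # eigenvalues of the correlation matrix
    unrotated = vt[:n_factors].T * np.sqrt(eigvals[:n_factors])
    return varimax(unrotated), eigvals / eigvals.sum()
```

Each column of the returned loading matrix corresponds to one factor loading curve across frequency bands; plotting the normalized eigenvalues against factor index gives a scree plot of the kind described above.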

Results
Digit-triplet and IEEE-sentence SRT outcomes are displayed in the upper two panels of Figure 1. In both experiments, SRTs improved significantly as the number of channels was increased [digit triplets: F(6, 120) = 76.38, p < 0.0001; IEEE sentences: F(6, 120) = 58.25, p < 0.0001]. In both experiments, the improvement slowed down markedly beyond around 7 channels. Bilinear fitting of SRTs on a logarithmic number-of-electrode scale was used to establish the position of the knee point. It was found to be at 7 channels regardless of speech material. The bottom panel of Figure 1 shows a scree plot of reducing normalized Eigenvalue as a function of the number of retained factors. There, bilinear fitting also showed a knee point occurring at around 7 factors.
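One way to locate such a knee point is to fit two straight-line segments that meet at an adjustable breakpoint. The sketch below assumes a continuous two-segment model and a grid search over the breakpoint; the text does not state which fitting procedure was used, and the function name is hypothetical.

```python
import numpy as np


def fit_knee(x, y, n_grid=200):
    """Fit y = a + b*x + c*max(x - knee, 0) by least squares.

    x must be sorted in ascending order (e.g. log10 of channel count).
    Candidate knees are searched between the second and second-to-last
    data points. Returns (knee, coefficients [a, b, c]).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    best_sse, best_knee, best_coef = None, None, None
    for knee in np.linspace(x[1], x[-2], n_grid):
        design = np.column_stack(
            [np.ones_like(x), x, np.maximum(x - knee, 0.0)]
        )
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        sse = float(np.sum((design @ coef - y) ** 2))
        if best_sse is None or sse < best_sse:
            best_sse, best_knee, best_coef = sse, knee, coef
    return best_knee, best_coef


# Usage with the channel counts of the sentence experiment and measured SRTs:
#   knee, _ = fit_knee(np.log10([4, 5, 6, 8, 10, 15, 20]), srts)
#   knee_in_channels = 10 ** knee
```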

Discussion
SRTs employing CI simulations with the SPIRAL vocoder showed that, even with no simulated current spread, and across two very different sets of speech material, speech intelligibility improves less steeply beyond a knee point at around 7 channels. This was also noticeable in Grange et al. ([3], experiment 1), where normalised, percent-correct intelligibility of IEEE sentences was presented as a function of number of activated channels and severity of current spread; even with no current spread simulated, intelligibility plateaued. The knee points in both percent-correct and SRT measures in the absence of current spread alerted us to the possibility that speech statistics might fundamentally limit the number of effective channels for speech intelligibility with CIs.
The bottom panel of Figure 1 presents the FA scree plot for speech in the absence of noise. There is a strong similarity between the scree plot and the IEEE-sentence SRT trends, since both exhibit a knee point around 7 factors/activated electrodes. This similarity suggests that the number of effective (or independent) channels in speech intelligibility with CIs is fundamentally limited by speech statistics. As demonstrated in Grange et al., the effective number of frequency channels can be further reduced by the spectral smearing caused by current spread. It should be noted that the comparison made here is between FA of speech in quiet and SRTs in noise. The FA in quiet demonstrates the distribution of information across frequency in the speech, but takes no account of the robustness of this information to noise contamination. A further development of this work would therefore be to conduct FA using speech-noise mixtures at appropriate SNRs.
Figure 2 shows how the factor loadings deviate from logarithmic spacing. The number of factors increases as one descends the panels. For the most part, the factors are discrete spectral bands, but the three-factor panel shows a factor that is split into two spectral peaks. Those two peaks become separate factors once a fourth factor is considered. A broader factor, centred on 900 Hz, is present in most of the panels. From five to twelve factors, the factors vary significantly in their widths. In contrast, most commercial CIs analyse sound through filters that have, or approximate, equal widths on a log-frequency scale. All have filters whose bandwidths increase monotonically with frequency.
The fact that speech information is grouped across frequency in factors whose widths are not logarithmically spaced raises the question of whether current CIs analyse speech signals optimally. Indeed, allocation of FA-inspired channels to an array of CI electrodes may well do a better job at transmitting speech information.
Ming and Holt [5] explored the same possibility, but using a quite different method. They used a computational model of efficient auditory encoding to optimize a set of functions for the information carried by speech on the basis of the sound waveform (i.e. not limited to speech envelopes). They analysed speech into a set of six kernel functions that, once optimized, displayed band-pass characteristics reminiscent of a six-channel filterbank, but with more channels concentrated at low frequencies than in a cochleotopic map. Moreover, a vocoder based on this new filterbank produced better speech recognition than a cochleotopic one. Our FA-derived channels look rather different to those produced by Ming and Holt; the six-factor solution shows finest resolution at mid frequencies (Figure 2). Comparisons between the two designs, and the logarithmic channel distribution conventionally used in cochlear implants, have yet to be made.
The information falling within these FA-inspired channels can be allocated to the electrode array in two ways. First, information from a channel can be allocated to the electrode whose place frequency is nearest to that of the channel centre frequency. This would result in a spectrally 'natural' allocation of information. However, such an FA-inspired strategy may be undermined by the effects of current spread; neighbouring narrow channels could excite closely grouped electrodes, causing information in these channels to be blended once again by current spread. A second strategy, which counters current spread, could use activated electrodes that are spaced out as much as possible along the cochlea, producing a warped frequency map designed to spatially separate the independent modulators. This strategy involves significant spectral warping, which may render the sound less natural, and consequently require more listener adaptation.
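The two allocation strategies can be contrasted with a small sketch. All numbers in it are hypothetical: the 16 electrode place frequencies and the six channel centre frequencies are made up for illustration and do not come from the factor analysis or from any commercial device. Measuring nearness on a log-frequency axis is likewise an assumption.

```python
import numpy as np


def nearest_electrode_allocation(channel_cf_hz, electrode_place_hz):
    """'Natural' strategy: each channel drives the electrode whose place
    frequency is closest (on a log-frequency axis) to its centre frequency."""
    ch = np.log(np.asarray(channel_cf_hz, dtype=float))[:, None]
    el = np.log(np.asarray(electrode_place_hz, dtype=float))[None, :]
    return np.argmin(np.abs(ch - el), axis=1)


def spread_out_allocation(n_channels, n_electrodes):
    """'Warped' strategy: channels in frequency order drive electrodes that
    are spaced as evenly, and hence as far apart, as the array allows."""
    return np.round(np.linspace(0, n_electrodes - 1, n_channels)).astype(int)


# Hypothetical 16-electrode array and six hypothetical channel centres.
electrodes_hz = np.geomspace(250.0, 8000.0, 16)
channels_hz = [300.0, 600.0, 900.0, 1300.0, 2000.0, 5000.0]

natural = nearest_electrode_allocation(channels_hz, electrodes_hz)
warped = spread_out_allocation(len(channels_hz), len(electrodes_hz))
```

In this made-up case the natural allocation places two mid-frequency channels on adjacent electrodes, whereas the warped allocation keeps every pair of active electrodes three contacts apart, at the cost of shifting channels away from their natural place.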
The spectrally warped FA-inspired strategy would address current spread optimally by minimising interaction between channels. From an informational point of view, such a strategy should improve intelligibility by providing more information to a CI patient's brain, and this improvement should manifest itself as steeper improvement in intelligibility with increasing numbers of channels than seen with logarithmic spacing. The trade-off between the number of channels used and counteracting the effect of current spread will still be present, however, and may lead to a number of channels for which speech intelligibility reaches a maximum. Speech FA strongly suggests that that optimum number will be lower than the 12-22 electrodes commercial CIs currently employ. The evaluation of FA-inspired strategies such as those proposed above will be the object of a follow-up study.
An important caveat is that the FA output is illustrated here for a specific voice and hence, filters inspired by such FA are optimized for that voice. The robustness of a single FA-inspired strategy can be assessed by the analysis of the effect different voices have on channel boundaries. One approach could be to establish the effect of changes in fundamental frequency and vocal tract length. Speaking style (e.g., ordinary vs. clear speech, emotional vs. neutral, varying rate of speech) and material type (e.g. connected discourse vs. short utterances) may also impact channel boundaries. The preliminary analysis of 6 different voices (three male and three female) uttering the IEEE sentences showed that for 6 to 12 channels, channel boundaries typically varied by less than ±20% in frequency terms. It is unclear at this point whether this variability may be too great for a single mapping to work for different voices.
Previous efforts made to reduce the effect of current spread, at source (e.g. Srinivasan et al. [6]) or by deactivation of the least effective channels (Noble et al. [7]), have yielded very modest improvements. FA-inspired strategies provide a new handle to mitigate current spread, which merits careful investigation.

Figure 1. Top two panels: SRTs for digit triplets and IEEE sentences heard through the SPIRAL CI simulator without current spread simulation. Bottom panel: scree plot from the factor analysis of speech modulations in the IEEE sentences spoken by M.I.T. talker 'DA'. The lines going through the data or scree plot points are best bilinear fits. Error bars are standard errors of the means.

Figure 2. Outputs of the factor analysis of IEEE sentences spoken by the M.I.T. talker 'DA': factor loading curves as a function of frequency. From the top to the bottom panel, the number of factors is increased to illustrate how speech information is distributed as a function of retained factors.