Decoding of selective attention to continuous speech from the human auditory brainstem response

Humans are highly skilled at analysing complex acoustic scenes. The segregation of different acoustic streams and the formation of corresponding neural representations is mostly attributed to the auditory cortex. Decoding of selective attention from neuroimaging has therefore focussed on cortical responses to sound. However, the auditory brainstem response to speech is modulated by selective attention as well, as recently shown through measuring the brainstem’s response to running speech. Although the response of the auditory brainstem has a smaller magnitude than that of the auditory cortex, it occurs at much higher frequencies and therefore has a higher information rate. Here we develop statistical models for extracting the brainstem response from multi-channel scalp recordings and for analysing the attentional modulation according to the focus of attention. We demonstrate that the attentional modulation of the brainstem response to speech can be employed to decode the attentional focus of a listener from short measurements of ten seconds or less in duration. The decoding remains accurate when obtained from three EEG channels only. We further show that out-of-the-box decoding that employs subject-independent models, as well as decoding that is independent of the specific attended speaker, is capable of achieving similar accuracy. These results open up new avenues for investigating the neural mechanisms for selective attention in the brainstem and for developing efficient auditory brain-computer interfaces.


Introduction
Humans have an extraordinary capability to analyse crowded auditory scenes. We can, for instance, focus our attention on one of two competing speakers and understand her or him despite the distractor voice (Middlebrooks et al., 2017). People with hearing impairment such as sensorineural hearing loss, however, face major difficulty with understanding speech in noisy environments, and this difficulty persists even when they wear auditory prostheses such as hearing aids or cochlear implants (Armstrong et al., 1997). Auditory prostheses could potentially aid with understanding speech in noise through selectively enhancing a target speech signal, for instance based on its direction, using algorithms such as beamforming (Kidd et al., 2015). However, such selective enhancement requires knowledge of which sound the user aims to attend to. Current research therefore attempts to decode an individual's focus of selective attention to sound from non-invasive brain recordings [...] the sound processing in an auditory prosthesis. It could also form the basis of a non-invasive brain-computer interface for motor-impaired patients with brain injury, for instance, who may not be able to respond behaviourally. Moreover, such decoding of selective attention could be employed clinically for a better understanding and characterization of hearing loss.

Neural activity in the cerebral cortex, especially in the delta (1-4 Hz) and theta (4-8 Hz) frequency bands, tracks the amplitude envelope of a complex auditory stimulus such as speech (Ding and Simon, 2012; Giraud [...]

[...] The model's coefficients can be assembled into complex coefficients $\beta_{j,k} = \beta_{j,k}^{(r)} + i\beta_{j,k}^{(i)}$ that accordingly encode the amplitude of the brainstem response, the temporal delay as well as the phase difference between stimulus and response. We thus obtained T = 25 temporal delays that, together with the N = 64 recording channels, led to 1,600 complex model coefficients.

The model coefficients were then estimated for each subject using a regularised ridge regression as $\beta = (X^T X + \lambda I)^{-1} X^T y$, in which $X$ is the design matrix of dimension $M \times 2NT$ with $M$ the number of samples available in the recording, $y$ is the fundamental waveform, and $\lambda$ is a regularisation parameter (Hastie et al., 2009). In particular, the columns of the design matrix are the neural recordings $x_j(t_n + \tau_k)$ at the different time points $t_n$ as well as their Hilbert transforms $x_j^h(t_n + \tau_k)$. To normalise for differences between datasets, $\lambda$ can be written as $\lambda = \lambda_n e_m$ where $e_m$ is the mean eigenvalue of $X^T X$ and $\lambda_n$ is a normalised regularisation coefficient (Biesmans et al., 2016).

A five-fold cross-validation procedure was implemented to evaluate the model. In each of five iterations, and for each participant, four folds of the ten-minute data were used to compute the model coefficients, yielding about eight minutes of training data. The remaining fifth fold, two minutes of testing data, served to estimate the fundamental waveform and to compute the performance of the model. The performance was quantified by dividing the reconstructed ($\hat{y} = X\beta$) and the actual ($y$) fundamental waveforms obtained on the testing data into ten-second segments and computing Pearson's correlation coefficient between these waveforms for each segment. The correlation values obtained over the five testing folds were pooled to determine the mean and standard error of the reconstruction performance. This performance was determined for 50 different normalised regularisation parameters with values ranging from $10^{-15}$ to $10^{15}$. For each subject, the regularisation parameter that yielded the largest reconstruction performance was chosen as the optimal regularisation parameter.
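The estimation described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' published code: the function names, the column ordering of the design matrix (EEG columns first, Hilbert-transform columns second), the logarithmic spacing of the 50 regularisation values and the synthetic data are all hypothetical.

```python
import numpy as np


def fit_ridge(X, y, lam_n):
    """Solve beta = (X^T X + lam I)^{-1} X^T y with lam = lam_n * e_m."""
    XtX = X.T @ X
    # The mean eigenvalue of X^T X equals its trace divided by its size.
    e_m = np.trace(XtX) / XtX.shape[0]
    lam = lam_n * e_m
    return np.linalg.solve(XtX + lam * np.eye(XtX.shape[0]), X.T @ y)


def to_complex(beta):
    """Assemble complex coefficients, assuming (hypothetically) that the
    first half of beta weights the EEG columns and the second half weights
    their Hilbert transforms."""
    half = beta.shape[0] // 2
    return beta[:half] + 1j * beta[half:]


def segment_correlations(y_true, y_pred, seg_len):
    """Pearson's r between actual and reconstructed waveform per segment."""
    n_seg = y_true.shape[0] // seg_len
    return np.array([
        np.corrcoef(y_true[s * seg_len:(s + 1) * seg_len],
                    y_pred[s * seg_len:(s + 1) * seg_len])[0, 1]
        for s in range(n_seg)
    ])


# Synthetic stand-in for real recordings (illustration only).
rng = np.random.default_rng(0)
X_train = rng.standard_normal((4000, 8))
X_test = rng.standard_normal((1000, 8))
w = rng.standard_normal(8)
y_train = X_train @ w + 0.5 * rng.standard_normal(4000)
y_test = X_test @ w + 0.5 * rng.standard_normal(1000)

# Scan 50 normalised regularisation values between 1e-15 and 1e15
# (log spacing is an assumption) and keep the best-performing one.
lam_grid = np.logspace(-15, 15, 50)
scores = [
    segment_correlations(
        y_test, X_test @ fit_ridge(X_train, y_train, lam_n), 100
    ).mean()
    for lam_n in lam_grid
]
best_lam_n = lam_grid[int(np.argmax(scores))]
```

Expressing the regularisation relative to the mean eigenvalue makes the same grid of normalised values usable across recordings whose overall signal power differs.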

The procedure above, including the use of the Hilbert transform of the EEG data, was employed both when reconstructing the fundamental waveform obtained from EMD and when estimating the fundamental waveform obtained from band-pass filtering the speech signal (see below).

The Python code for computing the complex coefficients of the backward model, together with a sample of a fundamental waveform and the corresponding EEG recordings, is available on GitHub (Kegler et al.).

Significance of the fundamental waveform reconstruction
To determine if the linear backward models showed a significant brainstem response to the fundamental frequency, we also computed, for each subject, one noise model as a linear backward model that attempted to reconstruct the fundamental waveform of an unrelated speech segment from the same female speaker. The noise models were computed using the same methodology we employed for determining the actual brainstem response, including the same cross-validation procedure and the same determination of the optimal regularisation parameter per subject.

We then assessed whether the correct linear backward model outperformed the noise model, or the opposite, by comparing the correlation coefficients obtained on the ten-second segments through a two-tailed Wilcoxon signed-rank test. The results of the statistical tests are indicated for each subject in Figure 1-A through asterisks: no asterisk is given when results are not significant (p > 0.05), one asterisk when results are significant (*, 0.01 < p ≤ 0.05), two asterisks when significance is high (**, 0.001 < p ≤ 0.01), and three asterisks when significance is very high (***, p ≤ 0.001).
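The asterisk convention above amounts to a simple threshold mapping, sketched here for concreteness. The function name is hypothetical; the thresholds are the ones stated in the text.

```python
def significance_stars(p):
    """Map a p-value to the asterisk notation used for Figure 1-A."""
    if p <= 0.001:
        return "***"  # very high significance
    if p <= 0.01:
        return "**"   # high significance
    if p <= 0.05:
        return "*"    # significant
    return ""         # not significant
```

The p-value passed to such a function would come from the two-tailed Wilcoxon signed-rank test on the per-segment correlation coefficients.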

Estimation of the neural response (forward model)
To gain further information about the neural origin of the response we also computed a linear forward model that estimated the EEG responses from the fundamental waveform. The coefficients of the forward model, as opposed to those of a backward model, allow for a neurobiological interpretation of their spatio-temporal characteristics (Haufe et al., 2014). The forward model relates the EEG recording [...] (2) in which $\alpha_k^{(r)}$ and $\alpha_k^{(i)}$ are the model's real coefficients. They can be interpreted as real and imaginary parts of the complex coefficients $\alpha_k = \alpha_k^{(r)} + i\alpha_k^{(i)}$. To investigate the temporal dynamics of the neural response, we considered a broader range of time lags than for the backward model. Specifically, we employed a set of T = 201 possible delays $\tau_k$ that ranged from -50 ms up to 150 ms with an increment of 1 ms. Although we did not expect a neural response at negative delays or at delays larger than 20 ms, we included those nevertheless to verify the absence of a significant response there. The model coefficients were estimated by concatenating the data from all subjects that showed a significant brainstem response to the speech signal as assessed earlier (generic or subject-averaged model) and using a regularised ridge regression as previously described.

As for the backward model, we made the Python code for computing the complex coefficients of the forward model available on GitHub as well (Kegler et al.).

Significance of the auditory brainstem response
We sought to investigate at which latencies significant neural responses emerged. We therefore compared the obtained forward model to noise models. One thousand forward noise models were computed analogously to the forward model, except that the fundamental waveform of the actual speech signal was replaced with a fundamental waveform of an unrelated speech stimulus from the same female speaker. We constructed these unrelated speech stimuli by randomly picking four parts, each with a duration of 2.5 minutes, from the eight parts that constituted the female speech material used in the competing speaker condition. This procedure was repeated to create 1,000 surrogate waveforms (out of all 1,680 possible combinations). We then employed a mass-univariate analysis to identify the significant time delays (Groppe et al., 2011). In particular, we computed the average magnitude of the responses over the EEG channels, yielding a single real time-varying function for the actual neural response and for each of the noise responses. We then pooled the values from the 1,000 noise responses over the time lags to establish a single empirical null distribution. We used this distribution to determine a critical value corresponding to a p-value of 0.05 to which the actual neural response was compared at each time lag from -50 ms to 150 ms (Bonferroni correction for multiple comparison).

In addition, we analysed the topography of the forward model at the peak latency $\tau_0$ of the average magnitude of the responses over the EEG channels. To this end, the forward noise models were used to build an empirical null distribution for each channel. For each noise model, the peak latency of the average magnitude was determined, and the magnitude of each channel's response at this latency was used to establish the null distribution of that channel. Finally, the forward model at time $\tau_0$ was compared to the corresponding empirical null distribution at the respective channel at a significance level of p = 0.05, with FDR correction for multiple comparison over channels.
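The pooled null distribution and the corrected critical value can be sketched as follows. The sketch rests on two assumptions that the text does not spell out: the comparison is one-sided (a response counts as significant when its magnitude exceeds the critical value), and the Bonferroni correction is applied by dividing the 0.05 level by the number of time lags. Function and variable names are hypothetical and the data are synthetic. The first lines also check that ordered selections of four parts out of eight give the 1,680 combinations quoted above.

```python
import math

import numpy as np

# Ordered picks of 4 parts out of 8: 8 * 7 * 6 * 5 (requires Python >= 3.8).
n_orderings = math.perm(8, 4)


def significant_lags(actual, noise, alpha=0.05):
    """Compare the actual response magnitude with a pooled empirical null.

    actual: array (n_lags,), channel-averaged magnitude of the real model.
    noise:  array (n_models, n_lags), the same quantity for the noise models.
    """
    n_lags = actual.shape[0]
    pooled = noise.ravel()  # pool the noise values over models and lags
    # Assumed one-sided, Bonferroni-corrected threshold over the time lags.
    critical = np.quantile(pooled, 1.0 - alpha / n_lags)
    return actual > critical, critical


# Synthetic illustration: a single clear peak at a delay of 9 ms.
rng = np.random.default_rng(0)
lags_ms = np.arange(-50, 151)                    # 201 delays, 1 ms steps
noise = rng.random((1000, lags_ms.size))         # stand-in noise magnitudes
actual = np.zeros(lags_ms.size)
actual[lags_ms == 9] = 2.0
mask, critical = significant_lags(actual, noise)
```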

Stimulus artifacts
We also computed the cross-correlation between the EEG responses to speech in quiet and the corresponding broad-band speech signal, with the purpose of checking for stimulus artifacts. To this end the speech stimulus was resampled from 44,100 Hz to 1,000 Hz, the sampling frequency of the EEG data. The cross-correlation functions were then analysed for statistically significant peaks at delays between -200 ms and 200 ms following the same procedure as described above for the forward model. Briefly, the cross-correlations were first averaged over subjects, and the absolute values of the resulting functions were then averaged over electrodes, yielding the average neural response as a function of latency. To establish a chance level, the same calculations were reproduced when replacing the speech stimulus by a different one from the same speaker. This procedure was repeated 1,000 times, yielding 1,000 noise responses. These stimuli were constructed as described above. These noise responses were pooled over time lags to build a single null distribution that was then used to assess the significance of the actual averaged neural responses as described above for the forward model (p = 0.05, Bonferroni-corrected for multiple comparison over time lags between -200 ms and 200 ms).
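A lagged cross-correlation of this kind can be sketched as below. This is an illustration under assumptions rather than the authors' implementation: the function name is hypothetical, a Pearson correlation is computed at each integer lag, both signals are taken to have the same length, and a positive lag means that the EEG follows the stimulus. At a sampling rate of 1,000 Hz one sample corresponds to 1 ms.

```python
import numpy as np


def lagged_correlation(eeg, stim, max_lag):
    """Pearson correlation between one EEG channel and the stimulus at
    integer lags from -max_lag to +max_lag samples."""
    n = stim.shape[0]
    lags = np.arange(-max_lag, max_lag + 1)
    values = np.empty(lags.size)
    for idx, lag in enumerate(lags):
        if lag >= 0:
            a, b = eeg[lag:], stim[:n - lag]      # EEG delayed by `lag`
        else:
            a, b = eeg[:n + lag], stim[-lag:]     # EEG leading by |lag|
        values[idx] = np.corrcoef(a, b)[0, 1]
    return lags, values


# Synthetic illustration: the "EEG" is the stimulus delayed by 9 samples
# plus noise, so the peak should appear at a lag of 9 (9 ms at 1,000 Hz).
rng = np.random.default_rng(0)
stim = rng.standard_normal(5000)
eeg = np.roll(stim, 9) + rng.standard_normal(5000)
lags, cc = lagged_correlation(eeg, stim, 200)
peak_lag = int(lags[np.argmax(np.abs(cc))])
```

A stimulus artifact would show up as a peak at the instantaneous delay, whereas a neural response peaks several milliseconds later.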

Attentional modulation of the auditory brainstem response
To analyse the attentional modulation of the brainstem response to one of two competing speakers, we [...] were significantly different.

Differences between brainstem responses to attended and to ignored speech
We sought to determine whether the difference in the brainstem response to attended and to ignored speech reflected merely a difference in the strength of the response, or if there were other differences as well. To this end, we compared the magnitudes and the phases of the complex coefficients of the forward model for an attended voice to those for an ignored voice. Because the forward models for the male and for the female voice reflected the different fundamental frequencies of both speakers, we performed this analysis separately for the male and for the female voice. Regarding the magnitude, we computed the ratio of the amplitude of the attended and of the unattended model, at the peak delay of their average amplitude (9 ms, for both the male and female voices). We then employed a two-tailed Wilcoxon signed-rank test to determine for which electrodes the ratio was significantly different from unity (p < 0.05, FDR-corrected for multiple comparison over electrodes). To compare the phase, we computed the phase difference between the attended and the ignored model at each electrode at this same peak latency. We considered the wrapped phase differences that were mapped to the range of [-π, π]. We then determined the statistical significance of the phase difference through the Rayleigh test for non-uniformity of circular data (p < 0.05, FDR-corrected for multiple comparison over electrodes). The Rayleigh test assesses the null hypothesis that the phase differences are uniformly distributed around the circle. However, it does not inform on the value of the phase differences. Therefore, we derived 95% confidence intervals for the mean phase difference by pooling the data across all electrodes that exhibited significant phase clustering. All circular statistics were performed using the Circular Statistics Toolbox for Matlab (Berens, 2009).
Finally, we compared the latency of peak amplitude between the attended and ignored models using a Wilcoxon signed-rank test.

In order to enable a direct comparison with our previous related work, we also computed the difference between the TRF at electrode CPz and the average TRF of the two mastoids to produce one dipolar response (Forte et al., 2017). CPz was selected due to its central location, similar to the one used in our previous study, and because it emerged in our present study as one of the central electrodes that displayed a significant response to speech in quiet (Figure 1-C). We then computed the ratio of this dipolar response between the attended and the ignored condition.
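For readers who want to reproduce the phase analysis outside Matlab, the wrapped phase difference and the Rayleigh test can be sketched in NumPy. The function names are hypothetical, and the closed-form p-value is a widely used approximation for the Rayleigh test; that it matches the toolbox's internal computation exactly is an assumption.

```python
import numpy as np


def wrapped_phase_difference(c_attended, c_ignored):
    """Phase of the attended coefficient relative to the ignored one,
    wrapped to the interval (-pi, pi]."""
    return np.angle(c_attended * np.conj(c_ignored))


def rayleigh_test(phases):
    """Rayleigh test for non-uniformity of circular data.

    Returns the test statistic z and an approximate p-value."""
    n = phases.size
    r = np.abs(np.mean(np.exp(1j * phases)))  # mean resultant length
    R = n * r
    z = R ** 2 / n
    # Closed-form approximation of the p-value (assumed, see lead-in).
    p = np.exp(np.sqrt(1 + 4 * n + 4 * (n ** 2 - R ** 2)) - (1 + 2 * n))
    return z, p


# Synthetic illustration with 18 values, one per subject.
clustered = np.full(18, -0.5 * np.pi)                    # identical phases
spread = np.linspace(0, 2 * np.pi, 18, endpoint=False)   # evenly spread
z_clustered, p_clustered = rayleigh_test(clustered)
z_spread, p_spread = rayleigh_test(spread)
```

Tightly clustered phase differences give a small p-value, whereas phases spread evenly around the circle give a p-value close to one.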

Decoding of auditory attention
We investigated how attention could be decoded from short segments of neural data that were obtained in response to competing speakers. We first trained and assessed the performances of the two pairs of speaker-specific linear backward models (MA, MI, FA, FI, as described above) using five-fold cross-validation. For all the attention decoding procedures presented hereafter, the normalised regularisation coefficient of the backward models was fixed to the value that yielded the best reconstruction for speech in quiet, " = 10 >Q.R .

The testing fold was divided into testing segments with a duration of 0.5, 1, 2, 4, 8, 16 and 32 s. For each testing segment we therefore obtained four different correlation coefficients: the correlation coefficient $r_{MA}$ between the fundamental waveform of the male speaker and its reconstruction based on the MA model, the correlation coefficient $r_{MI}$ between the fundamental waveform of the male speaker and its reconstruction based on the MI model, as well as the correlation coefficients $r_{FA}$ and $r_{FI}$ between the fundamental waveform of the female speaker and its reconstruction based on the FA and FI model, respectively. The computed correlation coefficients were then employed to decode attention on each segment. We thereby explored two different avenues (Figure 6-A).

First, we based the decoding on the attended models MA and FA only. To this end, we compared the correlation coefficients from both models. If $r_{MA}$ exceeded $r_{FA}$ we concluded that the male speaker was attended, and otherwise that the female speaker was the focus of attention. Second, we considered the ignored models MI and FI only. If $r_{MI}$ was larger than $r_{FI}$ attention was decoded as having been directed at the female speaker, and vice versa if $r_{MI}$ was smaller than $r_{FI}$.
The decoding of attention using these two different methods was performed using all 64 EEG channels as well as based on three EEG channels only (vertex and mastoids: Cz, TP9, TP10). The decoding of attention based on the attended models was also performed using the fundamental waveform obtained by band-pass filtering.

We sought to compare the performance of the obtained attentional decoding to that of a random classifier. A random binary classifier can achieve a high accuracy by chance. This is especially true when the number of testing data is small, which in our case occurs when the duration of the testing segments is long. To account for this effect, we determined the 95% chance level, that is, the accuracy that a random classifier would not exceed in at least 95% of cases. This 95% chance level was computed using a binomial distribution (Combrisson and Jerbi, 2015).
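The two decision rules and the binomial chance level can be written down compactly. The sketch below uses only the Python standard library; the function names are hypothetical, and the handling of exact ties in the ignored-model rule is an assumption, since the text does not specify it.

```python
from math import comb


def decode_from_attended(r_ma, r_fa):
    """Attended models: the male speaker is decoded if r_MA exceeds r_FA,
    otherwise the female speaker."""
    return "male" if r_ma > r_fa else "female"


def decode_from_ignored(r_mi, r_fi):
    """Ignored models: a better reconstruction by the male-ignored model
    points to the female speaker being attended. Ties are assigned to the
    male speaker here, which is an assumption."""
    return "female" if r_mi > r_fi else "male"


def chance_level(n_segments, alpha=0.05, p=0.5):
    """Accuracy that a random classifier with success probability p does
    not exceed in at least (1 - alpha) of cases, from the binomial
    distribution over n_segments independent decisions."""
    cdf = 0.0
    for k in range(n_segments + 1):
        cdf += comb(n_segments, k) * p ** k * (1.0 - p) ** (n_segments - k)
        if cdf >= 1.0 - alpha:
            return k / n_segments
    return 1.0
```

With only ten test segments, for example, `chance_level(10)` returns 0.8, so a decoder would need more than 80% correct decisions to exceed chance at this level. With more segments the threshold approaches 50%, which is why short segments, being more numerous, have a lower chance level.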

Subject-independent attention decoding
In real-life situations, the decoding of auditory attention may be required for a subject for whom training data is not available. This situation requires training a decoder on other people for whom training data is at hand, and then applying it to the subject under consideration. We refer to such decoders as out-of-the-box models since, once trained on the data from a set of volunteers, they can be readily applied to other subjects. To assess how well these out-of-the-box models decode auditory attention, we trained linear backward models on the pooled data from all subjects and quantified their performances using a leave-one-subject-out cross-validation coupled with a five-fold cross-validation regarding the auditory stimuli (i.e. testing on data from a subject and from a part of the stimulus unused during training). To train the model, the training data from all-but-one participants was concatenated and used to obtain the model coefficients. The unseen part of the data from the remaining subject was used to assess the performance of the model. In particular, we assessed the classifier that compared the performances of the MA and the FA model. Its classification accuracy was evaluated as described above.
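The nested scheme, leaving out one subject and one stimulus fold at the same time, can be made explicit with a small generator. This is a sketch with hypothetical names; it only enumerates which (subject, fold) pairs would be used for training and testing and does not fit any model.

```python
def loso_fold_splits(n_subjects, n_folds):
    """Yield (train_pairs, test_pair) for leave-one-subject-out
    cross-validation coupled with a fold-wise split of the stimulus.

    Training pairs exclude both the held-out subject and the held-out
    stimulus fold, so the test data are unseen in both respects."""
    for test_subject in range(n_subjects):
        for test_fold in range(n_folds):
            train_pairs = [
                (subject, fold)
                for subject in range(n_subjects) if subject != test_subject
                for fold in range(n_folds) if fold != test_fold
            ]
            yield train_pairs, (test_subject, test_fold)


# Example with 18 subjects and five stimulus folds.
splits = list(loso_fold_splits(18, 5))
```

Each of the resulting 90 splits trains on 17 subjects times four folds and tests on the remaining fold of the held-out subject.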

Speaker-averaged attention decoding
We also wondered how well selective attention could be decoded from the brainstem response if the specific models of the brainstem responses to the individual voices were not available. We therefore followed a similar analysis as used for decoding auditory attention based on the speech envelope [...] attended. An equal proportion of data from each attention condition was included in each cross-validation fold. To determine the focus of attention, we then considered short testing segments as described above. For each testing segment we computed the correlation coefficient between the reconstructed fundamental waveform and the actual ones of the two speakers. If the reconstruction matched the fundamental waveform of the male speaker more closely than that of the female one, we concluded that the subject had attended the male speaker. Otherwise we determined that the focus of attention was on the female voice. The performance of the classifier was evaluated as described above.
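The per-segment decision of the speaker-averaged decoder can be sketched as follows. The function name and the synthetic signals are hypothetical; the rule itself follows the description above, with ties resolved in favour of the female voice as implied by "otherwise".

```python
import numpy as np


def decode_segment(reconstruction, f0_male, f0_female):
    """Decode the attended speaker for one testing segment by comparing
    how well a single reconstruction correlates with each speaker's
    fundamental waveform."""
    r_male = np.corrcoef(reconstruction, f0_male)[0, 1]
    r_female = np.corrcoef(reconstruction, f0_female)[0, 1]
    return "male" if r_male > r_female else "female"


# Synthetic illustration: the reconstruction weakly follows one waveform.
rng = np.random.default_rng(0)
f0_male = rng.standard_normal(2000)
f0_female = rng.standard_normal(2000)
noise = rng.standard_normal(2000)
decoded_m = decode_segment(0.2 * f0_male + noise, f0_male, f0_female)
decoded_f = decode_segment(0.2 * f0_female + noise, f0_male, f0_female)
```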

Response to a single speaker
We first measured neural responses to a single non-repetitive speech signal from 64-channel EEG. We employed empirical mode decomposition to obtain a fundamental waveform from the speech signal (Forte et al., 2017), and linear regression with regularisation to reconstruct the fundamental waveform from the multi-channel EEG data for each individual subject (linear backward model, Methods). The performance of the reconstruction was assessed through the mean Pearson's correlation coefficient, over ten-second segments, between the reconstructed fundamental waveform and the actual one (Figure 1-A).

We verified that the linear backward models did extract a significant brainstem response to speech. To this end we also constructed models that attempted to reconstruct the fundamental waveform of unrelated speech signals from the neural data. For almost all subjects that we assessed (15 out of 18), the model that reconstructed the actual fundamental waveform significantly outperformed the one that attempted to reconstruct an unrelated fundamental waveform, showing that the former was able to extract a meaningful brainstem response (Figure 1-A, two-tailed Wilcoxon signed-rank test).

To investigate the spatio-temporal characteristics of the brainstem response we computed a generic linear forward model that estimated the EEG recordings from the fundamental waveform using the data from all the subjects that yielded significant reconstructions in the previous test presented in [...] were approximately in antiphase to those near the mastoids, reflecting the direction of the brainstem's dipole sources (Figure 1-D).

We also computed linear forward models for single subjects (Figure 2). We found that they yielded peak responses at similar latencies, and showed similar topographies, although these were noisier than the ones obtained from the average over all subjects.

Absence of stimulation artifacts
To determine if stimulus artifacts were present in the recordings, we computed a cross-correlation between the EEG data and the broadband speech signal. Broadband speech elicits neural responses from the brainstem to the cortex, at latencies ranging from 5 ms to a few hundred ms (Maddox and Lee, 2018). A stimulus artifact would arise, in contrast, instantaneously, at a delay of -1 ms. This delay reflects the fact that, in our analysis, we compensated for the earphone's 1 ms delay of delivering the sound to the ears. The responses that we recorded contained, however, only significant contributions between 9 and 12 ms delays, firmly in the range of subcortical neural activity (Figure 3). We could accordingly not detect stimulus artifacts in our EEG recordings.

Attentional modulation of the response to competing speakers
We then investigated how attention modulates the brainstem response. Following a classic auditory attention paradigm we presented subjects with a male and a female voice diotically and simultaneously, instructing them to attend to either the male or the female speaker, while recording their neural activity [...] respectively. We observed that the performance of the two models that reconstructed the fundamental waveform of a speaker when that speaker was attended was, in most subjects, significantly better than that of the corresponding model for the ignored voice (Figure 4, two-tailed Mann-Whitney rank test). The average ratio between the reconstruction performance of the model for the attended male voice to that for the ignored male voice was 1.22, significantly larger than one (Z(17) = 7, p < 0.001, two-tailed Wilcoxon signed-rank test). The ratio was 1.15 in the case of the female voice, which was significantly above one as well (Z(17) = 38, p = 0.039, two-tailed Wilcoxon signed-rank test). The two ratios did not differ significantly (Z(17) = 69, p = 0.47, two-tailed Wilcoxon signed-rank test). The better reconstruction performance of the fundamental waveform of an attended speech signal demonstrates the attentional modulation of the brainstem response to speech that we described previously (Forte et al., 2017).

We wondered if the difference between the attended and the ignored brainstem response reflected merely a difference in the strength of the response, or if there were other differences as well. To investigate the nature of these differences, we compared the coefficients of the attended forward models to those of the ignored models, at the peak delay of their average amplitude (9 ms).
We found that the ratio of the magnitude of the coefficients did not differ statistically from unity, neither for the male nor for the female voice (Figure 5-A,C; Wilcoxon signed-rank test, FDR correction for multiple comparison over electrodes). However, we found a statistically significant clustering of phase differences between the attended and the ignored models at several electrodes near the midline as well as near the mastoids (Figure 5-B,D; Rayleigh test for non-uniformity of circular data, FDR correction for multiple comparison over electrodes). For the male voice, the mean phase difference was found to be -0.51 π (95% confidence interval: [-0.56 π; -0.47 π]), while it was -0.12 π for the female voice (95% confidence interval: [-0.17 π; -0.08 π]). This shows that the ignored models were not merely a scaled version of the attended models, but that the brainstem response to ignored speech occurred at a different phase from that to attended speech.

Due to the range of frequencies that constitute the fundamental waveform, the phase shift between the attended and the ignored models did not equate to a consistent temporal shift. We did indeed not find a statistically significant difference in the timing between the peak amplitude of the attended and the ignored models across the different subjects, for the male or female voice (p = 0.17 and p = 0.69 respectively, two-tailed Wilcoxon signed-rank test).

To facilitate comparison with previous work we also computed the difference of the mastoid electrodes and the electrode at CPz, yielding a dipolar response (Forte et al., 2017). We found that the response's ratio between the attended and ignored condition was significantly greater than unity, for both the male and female voices (p = 0.016 and p = 0.003 respectively, Wilcoxon signed-rank test).

Decoding of auditory attention
Having verified the attentional modulation of the brainstem response to speech using high-density EEG recordings and linear backward models, we sought to investigate whether this approach could be used to decode auditory attention. We expected the focus of attention to emerge, for instance, from the difference in the performances of the models MA and FA. This difference should typically be positive when the subject attended to the male voice and be negative otherwise (Figure 6-A). Similarly, attention could potentially be decoded from the difference of the reconstruction performance of the models FI and MI. A subject's attention to the male voice should mostly lead to a positive difference, and a focus on the female voice to a negative difference.

We tested the accuracy of the decoding on samples of a duration that varied from half a second to over 30 seconds (Figure 6-B). The averaged decoding accuracy based on the attended models (MA, FA) remained significantly above chance even for very short samples that lasted only half a second. It was, for instance, 59% and 69% for two-second and sixteen-second samples, respectively. In contrast, the models MI and FI by themselves did not allow for a decoding of the attentional focus with an accuracy that was better than chance. In the following we therefore discuss decoding obtained from the attended models only.

Practical applications of the decoding of auditory attention benefit from a small number of required recording channels. We therefore investigated how well the developed decoding works if the linear backward models use only three EEG channels, the left and right mastoid as well as the central channel Cz.
Strikingly, the subject-averaged decoding accuracy was barely smaller than that of the 64-channel model; for instance, it remained at 69% for a sixteen-second sample when the classifier based on the attended models was used (Figure 6-C).

Both for the 64-channel as well as for the 3-channel decoding we observed variation in the decoding accuracy from subject to subject (Figure 7-A). For a duration of 16 s, for instance, some subjects showed decoding accuracy close to 90%, whereas other subjects exhibited significantly lower decoding accuracies that did not exceed the chance level. However, even for short testing segments and for the majority of subjects, the decoding remained above chance level. We note in addition that the subjects that did not allow for significant decoding include those for whom we did not obtain significant brainstem responses to speech in quiet (Figure 1-A).

Because of the complexity of empirical mode decomposition (EMD), the computation of the fundamental waveform through this method cannot typically be performed online. We therefore wondered if attention could be decoded based on a similar waveform obtained through band-pass filtering the audio signal in the range of the fundamental frequency. Band-pass filtering is indeed a comparatively simple operation that can run in real time. We found that decoding based on the band-pass filtered audio has a similar accuracy to the one based on the waveform obtained from EMD, which is encouraging for real-time applications (Figure 7-B).

Real-world settings will often feature voices that have not been encountered before and for which no speaker-specific model of the brainstem response is available. In an attempt to generalise our results, we computed a speaker-averaged backward model for any attended speaker, irrespective of whether it was the male or the female one.
We then decoded attention from the performance of this speaker-averaged model in reconstructing the fundamental waveform of either the male or the female speaker. The averaged decoding accuracies that we obtained were slightly lower than those from the speaker-specific models but were above chance level for durations down to 0.5 s (Figure 7-C).

The decoding described above utilized linear backward models that were subject specific and hence required prior training from EEG recordings for each individual. Such subject-specific training may, however, not always be available. We thus assessed the performance of a linear backward model that was trained on the whole population of subjects, and thus represented an average model that could be used out-of-the-box to decode attention. As expected, the decoding accuracies were then lower than those for the subject-specific models. While the decoders based on the attended models with all 64 channels remained above the chance level for all the tested durations, the 3-channel setup yielded worse performance, only slightly exceeding the chance level for all but the longest duration. For a duration of 16 s, for instance, the 64-channel setup yielded 65% accuracy, while the 3-channel setup yielded only 63% (Figure 7-D). Although the accuracy of this decoding when averaged across subjects was not very high, we note that this average was significantly reduced by a few subjects that showed particularly poor accuracies of around 50%, reflecting poor brainstem recordings from these subjects. The majority of the subjects, in contrast, yielded decoding accuracies that exceeded the chance level.

Discussion
We showed that the brainstem response to the fundamental frequency of speech can be measured reliably from high-density EEG recordings in most subjects through a statistical modelling approach. The response is most evident in the differences between the electrodes near the mastoids and those close to the vertex, in agreement with the dipolar structure of scalp-recorded auditory brainstem activity (Ono et al., 1984; Grandori, 1986; Norrix and Glattke, 1996; Bidelman, 2015). Moreover, the response latency of 8 ms evidenced a subcortical origin.

The frequency-following response (FFR) to simpler acoustic signals such as long vowels has recently been found in an MEG study to contain cortical contributions (Coffey et al., 2016). However, when measured through EEG, the cortical contributions emerge at a latency of 20 ms at the earliest, are smaller than the subcortical ones, and are mostly apparent for frequencies up to about 100 Hz (Bidelman, 2018). The response to the fundamental frequency of running speech that we have measured here does not show a measurable signal at latencies longer than 14 ms and was recorded in response to a fundamental waveform high-pass filtered above 100 Hz. While contributions from cortical structures cannot be entirely ruled out, we did not observe any within our measurement accuracy.

When subjects switched attention from one to the other of two competing speakers, we found that the fundamental frequency of each speaker was better encoded in the brainstem response when that speaker was attended rather than ignored. These results align with those that we obtained previously from different recording equipment and with a different analysis procedure that did not involve statistical modelling and that did not address attention decoding (Forte et al., 2017).
Here we found, however, that the ratio of the attended to the ignored temporal response functions, as obtained from the forward models, did not differ significantly between the male and the female voice. Indeed, although the scalp maps that we derived largely showed a larger response to the attended than to the ignored speaker (Figure 5-A, C), the modulation was not statistically significant. This presumably reflected the inclusion of all electrodes in the forward model, including many electrodes that displayed a poor signal-to-noise ratio. The backward models, in contrast, employed a weighting of the contribution from each electrode which boosted those with a large signal-to-noise ratio and thus led to a more significant result. To further investigate this issue, we also computed the response at a single channel that was obtained as the difference between the electrodes at the mastoids and at CPz, mimicking our previous bipolar recordings (Forte et al., 2017). The amplitude of this response was significantly modulated by selective attention, in agreement with our previous results.

The modelling work that we developed here allowed us to further investigate the origin of the difference in the brainstem response to attended and to ignored speech. We thereby found a significant difference between the phases of the response to attended versus ignored speech. Such a phase shift could in principle emerge from a difference in latency between the attended and ignored model. However, we found no statistically significant difference in peak latency of the attended and ignored responses. The phase shift might instead signify different relative contributions of different parts of the brainstem to the scalp-recorded response. The different values of the phase shift that we obtained for the male and female voice may reflect the differences in the fundamental frequencies of the two stimuli.
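The relation between a latency difference and a phase shift can be made explicit with a short worked equation. The symbols $f_0$ (fundamental frequency), $\Delta\tau$ (latency difference) and $\Delta\varphi$ (phase shift) are introduced here for illustration only and are not taken from the analysis above. For a response that follows a sinusoidal component at frequency $f_0$, a pure delay of $\Delta\tau$ changes the phase by

```latex
\Delta\varphi = 2\pi f_0 \,\Delta\tau
```

As a purely hypothetical example, at $f_0 = 100$ Hz a latency difference of $\Delta\tau = 1$ ms would correspond to $\Delta\varphi = 0.2\pi$ rad, that is, 36 degrees. Under this relation alone, any given delay would map onto different phase shifts for voices with different fundamental frequencies.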
Most importantly, we developed a procedure to decode the attentional focus of a subject to speech based on her or his brainstem response as measured from as few as three recording channels. This will enable the future characterization and investigation of the subcortical mechanisms through which the brain solves the cocktail party problem. Potential practical applications include brain-computer interfaces, such as neuro-steered auditory prostheses, as well as clinical assessments of supra-threshold hearing impairments that cannot be identified from pure-tone audiometry. Any of these applications will benefit from a decoding method that is fast and requires only a small number of recording channels.

We showed that the best decoding is achieved when linear models that relate the neural recording to the speech signal are computed for each subject individually. Such subject-specific models may cause difficulty in practice as sufficient training data per subject may not always be obtainable. The out-of-the-box models reflect the generalized version of the models obtained from the data pooled over many subjects and can be readily applied to other subjects for which no training data is available. We have shown that while the decoding performance of the out-of-the-box models is below that of the subject-specific models, the average decoding accuracy still exceeds the noise level for the high-density EEG setup. This suggests a consistency of the brainstem responses to speech across the participants. We also note that the out-of-the-box models were fitted using the data from all subjects, including those that did not yield a significant reconstruction of the fundamental waveform in the speech-in-quiet condition.

Potential real-world applications will also often require the decoding of attention to a speaker that has not been encountered before.
As an important step in this direction, we showed that speaker-averaged models that are trained on both attended speech signals, thereby computing an attended model that was averaged over the different voices, still performed well and allowed attention to be decoded. Future work could investigate how well these models generalise to speakers for which no training data is available.
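One way in which a single linear model can be obtained from several training sets, such as recordings from different subjects or from different attended speakers, is sketched below. This is an assumption about how pooling could be implemented, not a description of the procedure used in this study: it fits one ridge regression to all sets by summing their normal equations, which gives the same solution as fitting to the concatenated data. The function name `fit_pooled`, the ridge parameter and the random example data are hypothetical.

```python
import numpy as np

def fit_pooled(datasets, ridge=1.0):
    """Ridge solution over several (features, target) sets without concatenating them."""
    n_feat = datasets[0][0].shape[1]
    xtx = np.zeros((n_feat, n_feat))
    xty = np.zeros(n_feat)
    for X, y in datasets:
        xtx += X.T @ X  # accumulate the feature covariance of each set
        xty += X.T @ y  # accumulate the feature-target cross-covariance
    return np.linalg.solve(xtx + ridge * np.eye(n_feat), xty)

# Hypothetical example: two training sets with the same 6 features.
rng = np.random.default_rng(1)
set_1 = (rng.standard_normal((200, 6)), rng.standard_normal(200))
set_2 = (rng.standard_normal((300, 6)), rng.standard_normal(300))
w_pooled = fit_pooled([set_1, set_2])

# Reference: the same ridge fit on the concatenated data.
X_all = np.vstack([set_1[0], set_2[0]])
y_all = np.concatenate([set_1[1], set_2[1]])
w_concat = np.linalg.solve(X_all.T @ X_all + np.eye(6), X_all.T @ y_all)
print(np.allclose(w_pooled, w_concat))
```

Accumulating the two sums set by set keeps memory use independent of the number of subjects or speakers, which may be convenient when the pooled recordings are long.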
Another important feature for real-time attention decoding is that the whole computational pipeline, from the processing of the audio signal to the computation of reconstructed waveforms and the attention decoding, can run online. Our reconstruction of the fundamental waveform through a backward model, the assessment of its performance as well as the subsequent attention decoding were all based on linear operations that can easily run in real time. However, the EMD that we employed for the computation of the fundamental waveform comes with large computational costs. We therefore explored how a computationally much simpler operation, band-pass filtering of the audio signal, performed regarding the decoding of attention. Promisingly, we found that this method still allowed attention to be decoded from very short segments of data, evidencing the potential for real-time decoding. While two band-pass filters with different corner frequencies were applied to the male and female voice, this approach could be extended to use filter banks or online pitch-estimation algorithms.

The decoding procedure that we developed relies on the correlation between the reconstructed fundamental waveform from the brainstem response and the actual fundamental waveform of the speech signal. The obtained correlation coefficients are small, typically between 0.05 and 0.1 (Figure 1 [...] rapidness of the brainstem response: because the brainstem response to speech occurs at the fundamental frequency of a voice, it is ten- to hundredfold faster than the cortical response to the speech envelope. This rapidness appears to compensate for the smaller magnitude of the response.

Although brainstem responses and cortical responses allow for similarly efficient attention decoding when high-density EEG is available, the decoding based on the brainstem response to speech may have advantages when only a few channels are available.
The accuracy of attention decoding based on the speech envelope indeed drops below 80% for a trial of at least 20 seconds when relying on subject-specific five-electrode montages (Mirkovic et al., 2015; Fuglsang et al., 2017). Similarly, the attention decoding based on the brainstem response that we have developed here achieves an averaged accuracy of 69% when based on three electrodes (TP9, TP10 and Cz) and on 16 seconds of data, and reaches 72% when 32 seconds of data are available (Figure 5-B). This good decoding performance from a few EEG channels may be due to the effective capturing of the brainstem response by sparse montages, as well as due to a consistent dipole orientation across subjects (Dale and Sereno, 1993). Importantly, we employed only band-pass filtering as a pre-processing step for the EEG data. The simplicity of this attention decoding method and its good accuracy when based on a few EEG channels may make this method attractive for practical applications.
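Band-pass filtering, which serves here both as the only EEG pre-processing step and as the real-time substitute for EMD on the audio signal, can be run sample by sample with very little computation. The following is a minimal sketch under stated assumptions, not the filter used in this study: it implements a generic causal second-order band-pass section using the widely used "audio EQ cookbook" coefficient formulas. The class name `BandPassBiquad`, the helper `filter_stream`, the sampling rate, the centre frequency of 150 Hz and the quality factor are all hypothetical choices for illustration.

```python
import math
import numpy as np

class BandPassBiquad:
    """Causal second-order band-pass filter with unity gain at the centre frequency."""

    def __init__(self, centre_hz, q, fs):
        w0 = 2.0 * math.pi * centre_hz / fs
        alpha = math.sin(w0) / (2.0 * q)
        a0 = 1.0 + alpha
        # Feed-forward and feedback coefficients, normalised by a0.
        self.b = (alpha / a0, 0.0, -alpha / a0)
        self.a = (-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
        self.x1 = self.x2 = self.y1 = self.y2 = 0.0

    def step(self, x):
        """Process one incoming sample using only past inputs and outputs."""
        y = (self.b[0] * x + self.b[1] * self.x1 + self.b[2] * self.x2
             - self.a[0] * self.y1 - self.a[1] * self.y2)
        self.x2, self.x1 = self.x1, x
        self.y2, self.y1 = self.y1, y
        return y

def filter_stream(samples, filt):
    """Feed samples one at a time, as they would arrive in an online setting."""
    return np.array([filt.step(float(x)) for x in samples])

fs = 8000                       # hypothetical sampling rate in Hz
t = np.arange(fs) / fs          # one second of signal
in_band = np.sin(2 * np.pi * 150 * t)    # tone at the assumed centre frequency
out_band = np.sin(2 * np.pi * 1000 * t)  # tone well above the pass band

def rms(x):
    return float(np.sqrt(np.mean(x ** 2)))

# Compare steady-state output levels (second half of each signal).
rms_pass = rms(filter_stream(in_band, BandPassBiquad(150.0, 2.0, fs))[fs // 2:])
rms_stop = rms(filter_stream(out_band, BandPassBiquad(150.0, 2.0, fs))[fs // 2:])
print(rms_pass, rms_stop)
```

Because each output sample depends only on the current and the two previous samples, the cost per sample is constant. A steeper filter could be obtained by cascading several such sections, at the price of a longer group delay.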

The mixed-speaker stimuli that we employed were obtained by superimposing two speech signals, and our decoding was based on the knowledge of these separate voices. The individual components of a complex acoustic scene are, however, in general not available and need to be estimated from the acoustic mixture. The application of our method for decoding attention to steer an auditory prosthesis towards an attended voice, for instance, will thus require first segregating the different voices that are present in the acoustic space, and then determining the focus of the user's attention. [...]

The decoding that we have described here is based on linear backward models that reconstruct the fundamental waveform of the speech signal from the EEG recordings. This method determined the brainstem response to the voiced parts of speech, and in particular to its pitch, but did not measure the brainstem response to the voiceless speech components (Maddox and Lee, 2018). Improved performance may be obtained through canonical correlation analysis that relates the neural recording to [...]

The subject-averaged decoding accuracy obtained from the models MA and FA reaches 73% at a duration of 32 seconds and remains above chance level (grey) for very short durations of 500 ms. [...] accuracies from the speaker-specific models achieved per individual subject (coloured lines, consistent across panels) varies by up to approximately 50% around the average (bold black line). However, for each individual subject the decoding based on 64 channels (top) is similar to that achieved from three channels (bottom). Here, the decoding is based on the difference between the attended models (same data as presented on the population level in Figure 6 [...] reduced, yet better than chance, decoding accuracies for most subjects.