EEG-based detection of the attended speaker and the locus of auditory attention with convolutional neural networks

When multiple people talk simultaneously, the healthy human auditory system is able to attend to one particular speaker of interest. Recently, it has been demonstrated that it is possible to infer to which speaker someone is attending by relating the neural activity, recorded by electroencephalography (EEG), with the speech signals. This is relevant for effective noise suppression in hearing devices, which requires detecting the target speaker in a multi-speaker scenario. Most auditory attention detection algorithms use a linear EEG decoder to reconstruct the attended stimulus envelope, which is then compared to the original stimulus envelopes to determine the attended speaker. Classifying attention within a short time interval remains the main challenge. We present two convolutional neural network (CNN)-based approaches to address this problem. One aims to select the attended speaker from a given set of individual speaker envelopes; the other extracts the locus of auditory attention (left or right) without knowledge of the speech envelopes. Our results show that it is possible to decode attention within 1-2 s, with a median accuracy of around 80%, without access to the speech envelopes. This is promising for neuro-steered noise suppression in hearing aids, which requires fast and accurate attention detection. Furthermore, the possibility of detecting the locus of auditory attention without access to the speech envelopes is promising for scenarios in which per-speaker envelopes are unavailable, and will also enable a fast and objective attention measure in future studies.

The aim of this paper is to further explore the possibilities of CNNs for EEG-based AAD. As opposed to de Taillez et al. (2017) and Ciccarelli et al. (2019), who aim to decode the attended speaker (for a given set of speech envelopes), we aim to decode the locus of auditory attention (left/right). When the locus of attention is known, a hearing aid can steer a beamformer in that direction to enhance the attended speaker.

A. Experiment setup
The dataset used for this work was gathered previously (Das et al., 2016). EEG data was collected from 16 normal-hearing subjects while they listened to two competing speakers and were instructed to attend to one particular speaker. Every subject signed an informed consent form approved by the KU Leuven ethical committee.
The EEG data was recorded using a 64-channel BioSemi ActiveTwo system, at a sampling rate of 8192 Hz, in an electromagnetically shielded and soundproof room. The auditory stimuli were low-pass filtered with a cutoff frequency of 4 kHz and presented at 60 dBA through Etymotic ER3 insert earphones. APEX 3 was used as stimulation software (Francart et al., 2008).
The auditory stimuli consisted of four Dutch stories, narrated by three male Flemish speakers (DeBuren, 2007). Each story was 12 min long and split into two parts of 6 min each. Silent segments longer than 500 ms were shortened to 500 ms. The stimuli were set to equal root-mean-square intensities and were perceived as equally loud.
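As an illustration, the stimulus preparation can be sketched as follows; the amplitude-threshold silence detection, the function names, and all parameter values are our own assumptions, not the exact procedure used in Das et al. (2016):

```python
# Sketch of the stimulus preparation described above; the amplitude-threshold
# silence detection and all parameter values are assumptions.
import numpy as np

def shorten_silences(audio, fs, max_sil=0.5, thresh=0.01):
    """Shorten silent segments longer than max_sil seconds to max_sil."""
    max_len = int(max_sil * fs)
    silent = np.abs(audio) < thresh
    keep = np.ones(len(audio), dtype=bool)
    start = None
    for i, is_sil in enumerate(silent):
        if is_sil and start is None:
            start = i                               # a silent run begins
        elif not is_sil and start is not None:
            if i - start > max_len:
                keep[start + max_len:i] = False     # drop the excess samples
            start = None
    if start is not None and len(audio) - start > max_len:
        keep[start + max_len:] = False              # trailing silence
    return audio[keep]

def match_rms(audio, target_rms):
    """Scale a stimulus to a given root-mean-square intensity."""
    return audio * (target_rms / np.sqrt(np.mean(audio ** 2)))
```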
The experiment was split into eight trials, each 6 min long. In every trial, subjects were presented with two competing speech streams (see Table I).

TABLE I: First eight trials for a random subject. Trials are numbered according to the order in which they were presented to the subject. Which ear was attended to first was determined randomly. After that, the attended ear was alternated. Presentation (Dichotic/HRTF) was balanced over subjects with respect to the attended ear. Adapted from Das et al. (2016).

B. Data preprocessing
The EEG data was bandpass filtered; the maximal passband attenuation was 0.5 dB, while the stopband attenuation was 20 dB (at 0-1 Hz) and 15 dB (at 32-64 Hz). After the bandpass filtering, the EEG data was downsampled to 20 Hz (for the linear model) or 128 Hz (for the CNN). Artifacts were removed with the generic MWF-based removal algorithm described in Somers et al. (2018).
Data of each subject was divided into a training, validation and test set. Per set, data segments were generated with a sliding window equal in size to the chosen window length and with an overlap of 50 %. Data was normalized on a subject-by-subject basis, based on statistics of the training set only, and in such a way that proportions between EEG channels were maintained. Concretely, for each subject we calculated the power per channel, based on the 10 % trimmed mean of the squared samples. All channels were then divided by the square root of the median of those 64 values (one for each EEG channel). Data of each subject was thus normalized based on a single (subject-specific) value.
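A minimal sketch of this normalization, assuming a [channels × samples] EEG array; the trimming convention (10 % of the samples cut from each side) is an assumption, as the text does not specify it:

```python
# Sketch of the per-subject normalization described above; the trimming
# convention (10 % per side) is an assumption.
import numpy as np
from scipy.stats import trim_mean

def normalization_constant(train_eeg):
    """train_eeg: [64 channels x samples], training set of one subject."""
    # Power per channel: 10 % trimmed mean of the squared samples.
    channel_power = trim_mean(train_eeg ** 2, proportiontocut=0.1, axis=1)
    # Single subject-specific value: sqrt of the median over the 64 channels.
    return np.sqrt(np.median(channel_power))

# All channels of all sets (training, validation, test) are divided by this
# one constant, so proportions between EEG channels are maintained.
```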

C. Convolutional neural networks
A convolutional neural network (CNN) consists of a series of convolutional layers and non-linear activation functions, typically followed by pooling layers. In convolutional layers, one or more convolutional filters slide over the data to extract local data features. Pooling layers then aggregate the output by computing, for example, the mean. Similarly to other types of neural networks, a CNN is optimized by minimizing a loss function, and the optimal parameters are estimated with an optimization algorithm such as stochastic gradient descent.
Our proposed CNN for decoding the locus of auditory attention is shown in Fig. 1 where 64 is the number of EEG channels in our dataset and T is the number of samples in the decision window.
(We tested multiple decision window lengths, as discussed later.) The first step in the model is a convolutional layer, indicated in blue. Five independent [64 × 17] spatio-temporal filters are shifted over the input matrix; since the first dimension of each filter is equal to the number of channels, each filter yields a time series of dimensions [1 × T].
Note that 17 samples correspond to 130 ms at 128 Hz, and 130 ms was found to be an optimal filter width; that is, longer or shorter filters gave a higher loss on a validation set. A rectified linear unit (ReLU) activation function is applied after the convolution step.
In the average pooling step, data is averaged over the time dimension, thus reducing each time series to a single number. After the pooling step, there are two fully connected (FC) layers. The first layer contains five neurons (one for each time series) and is followed by a sigmoid activation function, and the second layer contains two (output) neurons. These two neurons are connected to a cross-entropy loss function. Note that with only two directions (left/right), a single output neuron (coupled with a binary cross-entropy loss) would have sufficed as well. The two-neuron setup, however, is easier to extend to more than two locations. The full CNN consists of approximately 5500 parameters.
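A minimal sketch of this architecture in PyTorch; padding, initialization, and other implementation details not given in the text are assumptions:

```python
# A minimal PyTorch sketch of the CNN described above; unspecified
# implementation details are assumptions.
import torch
import torch.nn as nn

class AttentionLocusCNN(nn.Module):
    def __init__(self, n_channels=64, n_filters=5, filter_width=17):
        super().__init__()
        # Spatio-temporal convolution: each of the 5 filters spans all 64
        # channels and 17 samples (130 ms at 128 Hz), so sliding it over the
        # [64 x T] input yields one time series per filter.
        self.conv = nn.Conv2d(1, n_filters, kernel_size=(n_channels, filter_width))
        self.fc1 = nn.Linear(n_filters, n_filters)  # 5 neurons, one per series
        self.fc2 = nn.Linear(n_filters, 2)          # two output neurons (L/R)

    def forward(self, x):
        # x: [batch, 1, 64, T] -- one EEG decision window per example
        h = torch.relu(self.conv(x))     # ReLU after the convolution
        h = h.mean(dim=(2, 3))           # average pooling over time -> [batch, 5]
        h = torch.sigmoid(self.fc1(h))   # first FC layer + sigmoid
        return self.fc2(h)               # logits for the cross-entropy loss

# Roughly 5*(64*17 + 1) + (5*5 + 5) + (5*2 + 2) = 5487 parameters,
# matching the ~5500 quoted in the text.
```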

D. CNN training and evaluation
The model was trained on data of all subjects, including the subject it was tested on (but without using the same data for both training and testing). This means we are training a subject-specific decoder, where the data of the other subjects can be viewed as a regularization or data augmentation technique to avoid overfitting on the (limited) amount of training data of the subject under test.
To prevent the model from overfitting to one particular story, we cross-validated over the four stories (resulting in four folds): we held out one story and trained on the remaining three (illustrated in Table II). Such overfitting is not an issue for simple linear models, but our experiments showed that it is for the CNN proposed here. Indeed, even seeing the EEG responses to only part of a story could allow the model to learn story-specific characteristics, which could then lead to overly optimistic results when the model is presented with the EEG responses to a different part of the same story. Similarly, since each speaker has their own "story-telling" characteristics (for example, speaking rate or intonation) and a distinct voice timbre, EEG responses to different speakers may differ. The model could therefore gain an advantage from having "seen" the EEG response to a specific speaker, so we retained only the folds in which the same speaker was never part of both the training and the test set. In the end, only two folds remained (see Table II).
We refer to the combined cross-validation approach as leave-one-story+speaker-out.
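The fold-selection logic can be sketched as follows; the speaker labels "A"-"C" are hypothetical, as the text only states that two of the four stories (stories 3 and 4, see Section III.A) share a narrator:

```python
# Sketch of the leave-one-story+speaker-out fold selection. Speaker labels
# are hypothetical; the text only states that stories 3 and 4 share a narrator.
story_speaker = {1: "A", 2: "B", 3: "C", 4: "C"}

folds = []
for test_story, test_speaker in story_speaker.items():
    train_stories = [s for s in story_speaker if s != test_story]
    # Keep a fold only if the test speaker never appears in the training set.
    if all(story_speaker[s] != test_speaker for s in train_stories):
        folds.append((train_stories, test_story))

# Only two folds survive: testing on story 1 or on story 2.
print(folds)  # [([2, 3, 4], 1), ([1, 3, 4], 2)]
```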
In an additional experiment we investigated the subject-dependency of the model: in addition to cross-validating over story and speaker, we also cross-validated over subjects. That is, we no longer trained and tested on N subjects, but instead trained on N − 1 subjects and tested on the held-out subject. Such a paradigm has the advantage that new subjects do not have to undergo potentially expensive and time-consuming re-training, making it more suitable for real-life applications. Whether it is actually a better choice than subject-specific retraining depends on the difference in performance between the two paradigms: if that difference is sufficiently large, subject-dependent retraining might be a price one is willing to pay.
We trained the network by minimizing the cross-entropy between the network outputs and the corresponding labels (the attended ear). We used mini-batch stochastic gradient descent with an initial learning rate of 0.3 and a momentum of 0.9. We applied a step-decay learning rate schedule that decreased the learning rate after epochs 10 and 35 to 0.15 and 0.075, respectively, to ensure convergence. The batch size was set to 20, partly because of memory constraints, and partly because we did not see much improvement with larger batch sizes. Weights and biases were initialized by drawing randomly from a normal distribution with a mean of 0 and a standard deviation of 0.5. Training ran for 100 epochs, as early experiments showed that the optimal decoder was usually found between epochs 70 and 95. Regularization consisted of weight decay with a value of 5 × 10⁻⁴ and, after training, of selecting the decoder from the epoch at which the validation loss was minimal. Note that the addition of data of the other subjects can also be viewed as a regularization technique that further reduces the risk of overfitting.
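Under these settings, the training configuration could look as follows (a PyTorch-style sketch using the architecture sketched in Section C; the data loader is an assumed helper yielding batches of 20 decision windows with left/right labels):

```python
# Sketch of the training configuration described above; train_loader is an
# assumed DataLoader, and the validation loop is omitted.
import torch
import torch.nn as nn

model = AttentionLocusCNN()

def init_weights(m):
    # Weights and biases drawn from N(0, 0.5), as described in the text.
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(m.weight, mean=0.0, std=0.5)
        nn.init.normal_(m.bias, mean=0.0, std=0.5)

model.apply(init_weights)

optimizer = torch.optim.SGD(model.parameters(), lr=0.3, momentum=0.9,
                            weight_decay=5e-4)
# Step decay: 0.3 -> 0.15 after epoch 10, -> 0.075 after epoch 35.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[10, 35], gamma=0.5)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(100):
    for eeg, label in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(eeg), label)
        loss.backward()
        optimizer.step()
    scheduler.step()
# After training, the decoder from the epoch with minimal validation loss
# is selected.
```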
All hyperparameters given above were determined by running a grid search over a set of reasonable values.
Performance during this grid search was measured on the validation set.
Note that in this work the decoding accuracy is defined as the percentage of correctly classified decision windows on the test set, averaged over the two folds mentioned earlier (one for each story narrated by a different speaker).

E. Linear baseline model (Stimulus reconstruction)
A linear stimulus reconstruction model (Biesmans et al., 2017) was used as baseline. In this model, a spatio-temporal filter was trained and applied on the EEG data and its time-shifted versions up to a 250 ms delay, based on least-squares regression, in order to reconstruct the envelope of the attended stimulus. The reconstructed envelope was then correlated (Pearson correlation coefficient) with each of the two speaker envelopes over a data window with a pre-defined length, denoted as the decision window (different lengths were tested). The classification was made by selecting the position corresponding to the speaker that yielded the highest correlation in this decision window. The envelopes were calculated with the "powerlaw subbands" method proposed by Biesmans et al. (2017); that is, a gammatone filter bank was used to split the speech into subbands, and per subband the envelope was calculated with a power law compression with exponent 0.6. The different subbands were then added again (each with a coefficient of 1) to form the broadband envelope. Envelopes were filtered and downsampled in the same way as the EEG recordings.
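A minimal sketch of this baseline, assuming EEG of shape [samples × channels] at 20 Hz so that 6 taps cover lags of 0-250 ms (the exact lag count is an assumption); regularization and the envelope extraction are omitted:

```python
# Minimal sketch of the linear stimulus-reconstruction baseline; the lag
# count and the absence of regularization are assumptions.
import numpy as np

def build_lagged(eeg, n_lags=6):
    """Stack the EEG and its time-shifted versions: [T, channels * n_lags]."""
    T, C = eeg.shape
    X = np.zeros((T, C * n_lags))
    for lag in range(n_lags):
        X[lag:, lag * C:(lag + 1) * C] = eeg[:T - lag]
    return X

def train_decoder(eeg, attended_env, n_lags=6):
    """Least-squares spatio-temporal filter reconstructing the attended envelope."""
    X = build_lagged(eeg, n_lags)
    d, *_ = np.linalg.lstsq(X, attended_env, rcond=None)
    return d

def classify_window(eeg_win, env1, env2, d, n_lags=6):
    """Pick the speaker whose envelope correlates best with the reconstruction."""
    rec = build_lagged(eeg_win, n_lags) @ d
    r1 = np.corrcoef(rec, env1)[0, 1]   # Pearson correlation per envelope
    r2 = np.corrcoef(rec, env2)[0, 1]
    return 1 if r1 > r2 else 2
```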
For a fairer comparison with the CNN, the linear model was also trained in a leave-one-story+speaker-out way.
In contrast to the CNN, however, the linear model was not trained on any data other than that of the subject under test, since including data of other subjects harms the performance of the linear model.
Note that the results of the linear model here merely serve as a representative baseline, and that a comparison between the two models should be treated with care: in part because the CNN is non-linear, but also because the linear model can only relate the EEG to the envelopes of the recorded audio, while the CNN is free to extract any feature it finds optimal (though only from the EEG, as no audio is given to the CNN). Additionally, the preprocessing is slightly different for the two models. However, the preprocessing was chosen such that each model performs optimally; using the same preprocessing for both would, in fact, negatively impact one of the two models.

F. Minimal expected switch duration
For some of the statistical tests below, we use the minimal expected switch duration (MESD) proposed by Geirnaert et al. (2020) as a relevant metric to assess AAD performance. The goal of the MESD metric is to have a single value as measure of performance, resolving the trade-off between accuracy and the decision window length. The MESD was defined as the expected time required for an AAD-based gain control system to reach a stable volume switch between both speakers, following an attention switch of the user. The MESD is calculated by optimizing a Markov chain as a model for the volume control system, which uses the AAD decision time and decoding accuracy as parameters. As a by-product, it provides the optimal volume increment per AAD decision.
One caveat is that the MESD metric assumes that all decisions are taken independently of each other, but this may not be true when the window length is very small, for example, smaller than 1 s. In that case the model behind the MESD metric may slightly underestimate the time needed for a stable switch to occur. However, it can still serve as a useful tool for comparing models.

A. Decoding performance
Seven different decision window lengths were tested: 10, 5, 2, 1, 0.5, 0.25, and 0.13 s. The decision window length defines the amount of data used to make a single left/right decision. In the AAD literature, decision windows range from approximately 5 to 60 s; in this work, the focus lies on shorter decision windows. This is done for practical reasons: in neuro-steered hearing aid applications the detection time should ideally be short enough to quickly detect attention switches of the user.
To capture the general performance of the CNN, the reported accuracy for each subject is the mean accuracy of 10 different training runs of the model, each with a different (random) initialization. All MESD values in this work are based on these mean accuracies.
The linear model was not evaluated at a decision window length of 0.13 s, since its kernel has a width of 0.25 s, which places a lower bound on the possible decision window length.
It is not entirely clear why the CNN fails for two of the 16 subjects. Our analysis shows that the results depend heavily on the story being tested on: for the two subjects with below-50 % accuracy, the CNN performed poorly on stories 1 and 2, but performed well on stories 3 and 4 (80 % and higher). Our results are based on stories 1 and 2, however, since stories 3 and 4 are narrated by the same speaker and we wanted to avoid having the same speaker in both the training and the test set. It is possible that the subjects did not comply with the task in these conditions.

B. Effect of decision window length
Shorter decision windows contain less information and should therefore result in poorer performance compared to longer decision windows. Figure 4 visualizes the relation between window length and detection accuracy.
A linear mixed-effects model fit for decoding accuracy, with decision window length as fixed effect and subject as random effect, shows a significant effect of window length for both the CNN model (df = 96, p < 0.001) and the linear model (df = 94, p < 0.001). The analysis was based on the decision window lengths shown in Fig. 4; that is, seven window lengths for the CNN and six for the linear model.
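Such a model can be fit, for example, with statsmodels; this is a hypothetical sketch, and the data frame and its column names are assumptions:

```python
# Hypothetical sketch of the linear mixed-effects analysis; the data frame
# "df" and its column names are assumptions.
import statsmodels.formula.api as smf

# df: one row per (subject, decision window length) with decoding accuracy.
model = smf.mixedlm("accuracy ~ window_length",   # window length: fixed effect
                    data=df,
                    groups=df["subject"])         # subject: random effect
result = model.fit()
print(result.summary())
```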

C. Interpretation of the results
Interpreting the mechanisms behind a neural network remains a challenge. In an attempt to understand which EEG frequency bands the network uses, we retested (without retraining) the model in two ways: (1) by filtering out a certain frequency range (Fig. 5, left); (2) by filtering out everything except a particular frequency range (Fig. 5, right). The frequency ranges are defined as follows: δ = 1-4 Hz; θ = 4-8 Hz; α = 8-14 Hz; β = 14-32 Hz. Figure 5 shows that the CNN mainly uses information from the beta band, in line with Gao et al. (2017). Note that the poor results for the other frequency bands (Fig. 5, right) do not necessarily mean that the network does not use those bands, but rather that, if it does, it does so only in combination with other bands.
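A sketch of the band-removal (left panel) and band-isolation (right panel) retests; the zero-phase Butterworth design is an assumption, as the text does not specify the filters used for this analysis:

```python
# Sketch of the band-removal and band-isolation retests; the filter design
# (4th-order zero-phase Butterworth) is an assumption.
from scipy.signal import butter, sosfiltfilt

BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 14), "beta": (14, 32)}

def remove_band(eeg, band, fs=128):
    """Filter out one frequency range before re-testing the trained model."""
    sos = butter(4, BANDS[band], btype="bandstop", fs=fs, output="sos")
    return sosfiltfilt(sos, eeg, axis=-1)

def keep_band(eeg, band, fs=128):
    """Filter out everything except one frequency range."""
    sos = butter(4, BANDS[band], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, eeg, axis=-1)
```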
We additionally investigated the weights of the filters of the convolutional layer, as they give an indication of which channels the model finds important. We calculated the power of the filter weights per channel and, to capture the general trend, computed a grand average over all models (i.e., all window lengths, stories, and runs). The results are shown in Fig. 6. We see activations primarily in the frontal and temporal regions, and to a lesser extent in the occipital lobe. Activations also appear to be slightly stronger on the right side. This result is in line with Ciccarelli et al. (2019), who also saw stronger activations in the frontal channels (mostly for the "Wet 18 CH" and "Dry 18 CH" systems). Additionally, Gao et al. (2017) found the frontal channels to differ significantly from the other channels within the beta band (Fig. 3 and Table 1 in Gao et al. (2017)). The prior MWF artifact-removal step in the EEG preprocessing and the importance of the beta band in the decision making (Fig. 5) suggest that the focus on the frontal channels is not attributable to eye artifacts. Note that the filters of the network act as backward decoders, and therefore care should be taken when interpreting topoplots related to the decoder coefficients. As opposed to a forward (encoding) model, the coefficients of a backward (decoding) model are not necessarily predictive of the strength of the neural response in those channels. For example, the network may perform an implicit noise-reduction transformation, thereby involving channels with low SNR as well.
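A sketch of the per-channel weight-power computation described above; whether the powers are summed or averaged over filters and taps is an assumption:

```python
# Sketch of the per-channel weight-power computation; averaging (rather than
# summing) over filters and taps is an assumption.
import numpy as np

def channel_power(conv_weights):
    """conv_weights: [n_filters, n_channels, filter_width] from one model."""
    return (conv_weights ** 2).mean(axis=(0, 2))   # one value per channel

# Grand average over all trained models (all window lengths, stories, runs):
# grand_avg = np.mean([channel_power(w) for w in all_conv_weights], axis=0)
```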

D. Effect of validation procedure
In all previous results we used a leave-one-story+speaker-out scheme to prevent the CNN from gaining an advantage by already having seen EEG responses elicited by the same speaker or different parts of the same story.
However, it is noted that in the majority of the AAD literature, training and test sets often do contain samples from the same speaker or story (albeit from different parts of the story).
To investigate the impact of cross-validating over speaker and story, we trained the CNN again, but this time using data of each trial: the training set consisted of the first 70 % of each trial, the validation set of the next 15 %, and the test set of the last 15 %. Figure 7 shows the results of both strategies for decision windows of 1 s; other window lengths show similar results. Clearly, even though the test and training sets were always disjoint, performance was significantly better when the network had already seen other examples from the same story or speaker during training. For decision windows of 1 s, using data from all trials resulted in a median decoding accuracy of 92.8 % (Fig. 7, left), compared to only 80.8 % when leaving out both story and speaker (Fig. 7, right). A Wilcoxon signed-rank test shows this difference to be significant (W = 91, p = 0.0134).

E. Subject-independent decoding
In a final experiment we investigated how well the CNN performs on subjects that were not part of the training set. Here, the CNN is trained on N − 1 subjects and tested on the held-out subject, but still in a leave-one-story+speaker-out manner, as before. The results are shown in Fig. 8. For windows of 1 s, a Wilcoxon signed-rank test shows that leaving out the test subject results in a significant decrease in decoding accuracy, from 80.8 % to 69.3 % (W = 14, p = 0.0134). Surprisingly, for one subject the network performs better when its data was not included during training. Other window lengths show similar results.

IV. DISCUSSION
We proposed a novel CNN-based model for decoding the direction of attention (left/right) without access to the stimulus envelopes, and found it to significantly outperform a linear decoder that was trained to reconstruct the envelope of the attended speaker.

A. Decoding accuracy
The CNN model resulted in a significant increase in decoding accuracy compared to the linear model: for decision windows as short as 1 s, the CNN's median performance is around 81 %. This also compares favorably with the entropy-based direction classification presented in the literature (Lu et al., 2018).
Despite the impressive median accuracy of our CNN, there is clearly more variability between subjects than with the linear model. Figure 4, for example, shows that some subjects reach an accuracy of more than 90 %, while others are at chance level, and two subjects even perform below chance level. While this increase in variance could be due to our dataset being too small for the large number of parameters in the CNN, we observed that the poorly performing subjects do better on stories 3 and 4, which were excluded as test sets in the cross-validation. Why our system performs poorly on some stories, and why this effect differs from subject to subject, is not clear, but it nevertheless impacts the per-subject results. This story effect is not present in the linear model, probably because that model has far fewer parameters and is unable to pick up certain intricacies of stories or speakers.
As expected, we found a significant effect of decision window length on accuracy. This effect is, however, clearly different for the two models: the performance of the CNN is much less dependent on window length than that of the linear model, which relies on comparisons with the underlying speech envelopes. The latter requires computing correlation coefficients between the stimulus and the neural responses, which are only sufficiently reliable and discriminative when computed over long windows.
Lastly, we evaluated our system using a leave-one-story+speaker-out approach, which is not commonly done in the literature. While for linear models this is probably fine, our results demonstrate that for artificial neural networks a stricter cross-validation is required, as they appear to be able to overfit to the characteristics of a specific story or speaker (despite the use of non-overlapping training and test sets, and despite the fact that the network only sees the neural responses), thereby yielding results that are possibly overly optimistic.

B. Future improvements
We hypothesize that much of the variation within and across subjects and stories currently observed is due to the small size of the dataset. The network probably needs more examples to learn to generalize better. However, a sufficiently large dataset, one which also allows for the strict cross-validation used in this work, is currently not available.
Partly as a result of the limited amount of data available, the CNN proposed in this work is relatively simple. With more data, more complex CNN architectures would become feasible. Such architectures may benefit more from regularization techniques such as dropout and batch normalization, which were not explored in this work.
Also, for a practical neuro-steered hearing aid, it may be beneficial to make soft decisions: instead of translating the continuous softmax outputs into binary decisions, the system could output a probability of left or right being attended, and the corresponding noise suppression system could adapt accordingly. In this way the integrated system could exploit temporal relations or knowledge of the current state to predict future states. The CNN could, for example, be extended with a long short-term memory (LSTM) network.

C. Applications
The main bottleneck in the implementation of neuro-steered noise suppression in hearing aids thus far has been the detection speed (state-of-the-art algorithms only achieve reasonable accuracies when using long decision windows).
This can be quantified through the MESD metric, which captures both the effect of detection speed and decoding accuracy. While our linear baseline model achieves a median MESD of 22.6 s, our CNN achieves a median MESD of only 0.819 s, which is a major step forward.
Moreover, our CNN-based system has an MESD of 5 s or less for 11 out of 16 subjects (8 subjects even have an MESD below 1 s), which is what we assume to be the minimum for an auditory attention detection system to be feasible in practice. (For reference, an MESD of 5 s corresponds to a decoding accuracy of 70 % at 1 s.) On the other hand, one subject has an MESD of 33.4 s, and two subjects have an infinitely high MESD due to below-50 % performance. Inter-subject variability thus remains a challenge, since the goal is an algorithm that is both robust and able to quickly decode attention within the assumed limits for all subjects.
Another difficulty in neuro-steered hearing aids is that the clean speech envelopes are not available. This has so far been addressed using sophisticated noise suppression systems (Aroudi et al., 2018; O'Sullivan et al., 2017; Van Eyndhoven et al., 2017). If the speakers are spatially separated, our CNN might elegantly solve this problem by steering a beamformer towards the direction of attention, without requiring access to the envelopes of the speakers at all. Note that in practice the system would need to be extended to more than two possible directions of attention, depending on the desired spatial resolution.
For application in hearing aids, a number of other issues need to be investigated, such as the effect of hearing loss (Holmes et al., 2017), acoustic circumstances (for example, background noise, speaker locations, and reverberation (Aroudi et al., 2019; Das et al., 2016, 2018; Fuglsang et al., 2017)), mechanisms for switching attention (Akram et al., 2016), etc. The computational complexity would also need to be reduced. Especially if deeper, more complex networks are designed, CNN pruning will be necessary (Anwar et al., 2017). A hardware DNN implementation, or even computation on an external device such as a smartphone, could then be considered. Another practical obstacle is the large number of electrodes needed for the EEG measurements. Similar to earlier work (Fiedler et al., 2016; Mirkovic et al., 2015; Montoya-Martínez et al., 2019; Narayanan Mundanad and Bertrand, 2018), it should be investigated how many and which electrodes are minimally needed for adequate performance.
In addition to potential use in future hearing devices, fast and accurate detection of the locus of attention can also be an important tool in future fundamental research. Thus far it has not been possible to measure whether subjects comply with the instruction to direct their attention to one ear. Not only may the proposed CNN approach enable this, it will also allow tracking the locus of attention in near real time, which can be useful for studying attention in dynamic situations and its interplay with other elements such as eye gaze, speech intelligibility, and cognition.
Note that for this application, the leave-one-story+speaker-out strategy is not required, and much higher performance can be achieved by training on the same material used for testing.
In conclusion, we proposed a novel EEG-based CNN for decoding the locus of auditory attention (based only on the EEG), and showed that it significantly outperforms a commonly-used linear model for decoding the attended speaker. Moreover, we showed that the way the model is trained impacts the results significantly, and that a thoughtful cross-validation where both the speaker and story are unseen by the model is important-at least when dealing with models that are more complex than typical linear models. Although there are still some practical problems, the proposed model approaches the desired real-time detection performance. Furthermore, as it does not require the clean speech envelopes, this model has potential applications in realistic noise suppression systems for hearing aids.
V. ACKNOWLEDGEMENTS