EEGminer: discovering interpretable features of brain activity with learnable filters

Objective. The patterns of brain activity associated with different brain processes can be used to identify different brain states and make behavioural predictions. However, the relevant features are not readily apparent and accessible. Our aim is to design a system for learning informative latent representations from multichannel recordings of ongoing EEG activity. Approach: We propose a novel differentiable decoding pipeline consisting of learnable filters and a pre-determined feature extraction module. Specifically, we introduce filters parameterized by generalized Gaussian functions that offer a smooth derivative for stable end-to-end model training and allow for learning interpretable features. For the feature module, we use signal magnitude and functional connectivity estimates. Main results. We demonstrate the utility of our model on a new EEG dataset of unprecedented size (i.e. 721 subjects), where we identify consistent trends of music perception and related individual differences. Furthermore, we train and apply our model in two additional datasets, specifically for emotion recognition on SEED and workload classification on simultaneous task EEG workload. The discovered features align well with previous neuroscience studies and offer new insights, such as marked differences in the functional connectivity profile between left and right temporal areas during music listening. This agrees with the specialisation of the temporal lobes regarding music perception proposed in the literature. Significance. The proposed method offers strong interpretability of learned features while reaching similar levels of accuracy achieved by black box deep learning models. This improved trustworthiness may promote the use of deep learning models in real world applications. The model code is available at https://github.com/SMLudwig/EEGminer/.


INTRODUCTION
H UMAN behavior arises from the interactions between many different brain regions [1] [2].Hence, revealing and decoding functional brain connectivity enables making predictions of behavioral states [3], identifying subjects and predicting fluid intelligence [4], as well as facilitates predicting sustained attention [5] and emotional responses to music [6] [7], to mention but a few examples of important applications.However, the underlying brain connectivity is not readily accessible from EEG recordings, which simply record electric potential differences at multiple locations on the skull.
Several conventional measures have been proposed for estimating underlying connectivity from EEG recordings [8], such as signal correlations and phase locking values [9].Tuning these measures, however, involves extensive manual feature engineering.Indeed, in practice, the search space of informative features explodes exponentially when one considers coordinated brain activity and employs functional interactions between distinct brain areas.Therefore, manual feature engineering is a suboptimal process.In contrast, deep neural networks [10] enable learning features from large amounts of annotated training data.Concretely, such deep models involve trainable parameters (usually arranged into a series of processing layers), the values of which are estimated using backpropagation of gradients [11] from the training data.However, features extracted by deep models often lack in interpretability, which can present significant detriments in real-world applications [12].Besides interpretability, off-the-shelf deep learning models require a significant amount of annotated data, hindering their applicability in data-scarce scenarios such as EEG analysis.
The aforementioned limitations of deep neural networks regarding interpretability and the need for large training datasets can be reduced by introducing appropriate inductive biases.The most common approach in the EEG domain is to use convolutional neural networks (CNN) [13] [14].A prominent example is EEGNet, which takes cues from conventional EEG signal processing techniques to create a compact and relatively interpretable model [15].There are very limited constraints on the model however, such as fully trainable temporal filter functions.Convolutional filters have been constrained by parameterized sinc functions in the audio domain [16].This constrains the model but could reduce trainability, especially in the low signal-tonoise ratio regime of the EEG domain.Further, CNNs are usually thought of as processing spectral-temporal features.An attempt to introduce constraints towards using EEG functional connectivity features has been undertaken with SyncNet [17], which however does not make these constraints fully explicit and thereby limits its interpretability.
Distinct from the above models, we propose to implement an end-to-end differentiable EEG decoding pipeline (coined as EEGminer), making use of explicit neuroscientific measures of brain activity, such as signal correlations.The type of feature itself is therefore untrainable, but by using a differentiable implementation, the frequencies over which Fig. 1.Visualization of the EEGminer architecture.Independently trainable filters are applied on each input channel (five signal channels are given here over four seconds).As an example of a differentiable feature module used in this paper, the filtered signals are then correlated with each other to get connectivity matrices.As they are symmetric, the upper triangular is extracted as a feature vector.Multiple filters per input channel result in multiple independent feature maps (shown here are three connectivity feature maps).The feature vectors resulting from all feature maps are concatenated, standardized (not shown) and used by the linear classifier for logistic regression.
the connectivity is computed can be learned.The proposed model consists of learnable temporal filters parameterized by generalized Gaussians, a differentiable feature module and a linear layer for classification (Fig. 1).Our approach can easily incorporate multiple types of features (local activity, covariation, phase synchrony, etc.) and keeps close relation with the topological information about the sensors, which in turn guarantees the direct interpretation of the trained model.Our approach can be interpreted both from the perspectives of conventional neuroscience on the one hand and deep learning on the other.The steps that make up an EEG pipeline (e.g.signal filtering, connectivity measure, classification) are analogous to consecutive layers in a deep neural network.Conversely, viewed from a deep learning perspective, our proposal for a differentiable EEG pipeline can be seen as introducing strong inductive biases into an end-to-end trainable model.By choosing an appropriate model architecture and training procedure, the interpretability of learned features can be made explicit while maintaining competitive decoding accuracy.
Learnable Gaussian filters have been used before as part of deep learning models, both as a trainable filterbank [18] as well as in their time-domain representation as Morlet wavelets to parameterize a convolutional filter [19].In this work, we propose to design a more general filter function, based on the generalized Gaussian, which introduces a shape parameter.During training, the filter shape can adapt from a Gaussian (i.e.Morlet wavelet) with good training be-havior to a rectangular filter (i.e.sinc) with good frequency selectivity.
To demonstrate the generalizability of our approach, we apply it on two datasets and three distinct classification tasks, in all cases searching for subject-independent features.The chosen tasks put a focus on spontaneous brain activity that does not exhibit event-locked responses.The main dataset used was collected in collaboration with the Science Museum London, providing recordings from an unprecedented number of participants during passive music listening.We train our model to discriminate between resting state and music listening and to differentiate female and male subjects based on their brainwaves during music listening.Generalizing to a different setting, we use the SEED dataset to classify positive and negative emotions elicited by watching emotional film clips [20] [21].For all three tasks we compare model classification performance to conventional and deep learning approaches and extract the learned magnitude and functional connectivity features and the importance attributed to them by the classifier.
In summary the contributions of this paper are as follows:

•
We introduce generalised Gaussian filters that allow for end-to-end learning of frequency bands of interest

•
We propose a modular deep learning architecture for decoding EEG signals with explicitly interpretable features, without sacrificing accuracy

•
We provide novel insights into brain states underlying music listening and emotional responses The remainder of the paper is organized as follows: Section 2 describes related work in the literature and section 3 outlines the proposed method.Section 4 describes the baselines and datasets employed in this study and presents results on performance benchmarks as well as interpretations of the learned features and their statistical validation.The last section summarizes and concludes our work.

RELATED WORK
Two widely cited convolutional neural networks developed for EEG decoding are ShallowConvNet and DeepConvNet [22].ShallowConvNet employs a single temporal filtering layer with trainable convolution kernels shared across electrode channels, followed by a spatial filter, temporal pooling and linear classifier.The composition of temporal and spatial filters is inspired by the filter bank common spatial pattern algorithm [23].Using similar operations, DeepConvNet targets more complex feature extraction with additional network layers.Both models have a large number of trainable parameters due to missing constraints and are relatively difficult to interpret as a consequence.
EEGNet is another important convolutional decoding model developed for the EEG domain [15].The EEGNet architecture is comprised of trainable convolutional temporal filters very similar to ShallowConvNet.This is followed by spatial filters, which create projections from signal channels filtered in the same frequencies.Constraining the spatial filter to work within a single temporal filter channel makes the model more parameter efficient and facilitates learning from relatively small datasets of EEG recordings.The model applies a further convolution layer for higher-level feature extraction and a linear classifier.Being similar in its architecture to ShallowConvNet, the interpretability is limited.
In order to reduce model complexity, parameterized sinc bandpass filters can be employed instead of arbitrary convolutional filters (i.e., SincNet [16] and related architectures [24] [25]).The kernel of a convolutional sinc bandpass filter is constructed as the difference between two sinc lowpass filters as where the sinc function is defined as sinc(x) = sin(x)/x.The trainable parameters are here only the low and high cutoff frequencies f 1 and f 2 as opposed to the full filter function.Limiting the temporal filters to sinc functions improves interpretability, since each filter is defined by a specific frequency band, while still allowing for learnable filters.However, an important difference between audio and EEG signals is the significantly lower signal-to-noise ratio in the latter, which is expected to make training of the bandpass filters more difficult.Looking at sinc bandpass filters in the frequency domain, a weakness with regard to its use in deep learning is apparent: the ideal sinc bandpass filter has a rectangular frequency response, which has a flat pass-band and essentially no transition band and therefore a zero derivative at essentially every point, meaning it cannot give useful gradients when training a deep learning model (see Fig. 2).The convolutional sinc filter still works, since it has to be truncated in the time domain, which introduces a transition band in the frequency domain and therefore some useful gradients, but this leaves room for improvement.Applying a similar constraint on the first-level filters, recent work on inductive biases for the EEG domain has been done with SyncNet, a simple deep learning model targeting EEG synchrony with claims towards interpretability [17].The model is based on a trainable convolutional layer parameterized as the real part of a complex Morlet wavelet.The time-domain filter is defined as where b is the amplitude of the wavelet, ω is the center frequency, φ is the phase-shift and β controls the frequencytime precision tradeoff.An approximation of cross-spectral density is then achieved by a linear combination of the filtered input channels.This linear combination is identical in its nature to the spatial projection used in EEGNet and is thereby limited in its approximation of synchrony.Furthermore, it insufficiently constrains the feature space towards an approximation of synchrony.Convolutional neural networks are commonly thought to work with spectral features.Recently there has been growing interest in adapting deep learning architectures to work on symmetric positive definite (SPD) matrices, such as spatial covariance matrices of EEG signals.SPDNet was developed as a simple deep learning model to be applied on SPD matrices [26] and has been used for EEG decoding [27].The focus with this model is however on the classification of SPD matrices, not on creating the specific SPD matrix from input signals in the first place.In the case of EEG decoding, usually a covariance matrix derived from wide-band signals is used.The interpretation of trained features is difficult.
Considering the limitations of the aforementioned models, a well trainable and directly interpretable model should have the following characteristics:

•
The types of temporal filters should be explicitly defined and have good gradient support on all relevant frequencies

•
The feature extraction should be a close approximation or direct computation, and well constrained to the specific feature

•
The training procedure should enable direct interpretation of classifier weights

PROPOSED METHOD
The proposed model consists of trainable temporal filters with specific constraints, followed by a pre-defined differentiable feature module (band magnitude, functional connectivity) and leading to a linear classifier (see Fig. 1).Each of these components will be presented in the following subsections, accompanied by our strategy for feature interpretation and the training procedure.

Temporal Filters
As mentioned in section 2, a prominent example of using parameterized filter functions in a deep learning model are sinc bandpass filters [16].Sinc filters however have a very narrow transition band, which means the derivative of the filter function with regard to most frequencies is zero (see Fig. 2).The filter will therefore get limited gradients during training.In order to improve trainability, we propose generalized Gaussian filters in the frequency domain, which have a wide transition band for good gradient support.
Filtering in the frequency domain allows for full control of the filter function [28] and has previously been used in a deep learning model [29], confirming that it does not impede differentiability.The generalised Gaussian is given by where Γ(n) = (n−1)! is the gamma function, µ is the center frequency, α is the scale and β is the shape parameter.We fix the gain of all filters at one and accordingly drop the normalization factor of the generalised Gaussian.The bandwidth of a filter is conventionally defined as the point where the filter function reaches half height.Generalizing Cohen's reparameterisation of the normal distribution to replace the standard deviation with a full-width halfmaximum (FWHM) parameter [30], we reparameterise the scale α with the FWHM parameter h.To derive the new form, we set the function value equal to 0.5 and solve for the solutions of x (i.e.points/frequencies at which the function is at half height), depending on the center frequency µ and function shape β: resulting in the two points x 1 and x 2 The FWHM can then be defined as the distance between the two points at which the function reaches half height as This allows us to use the filter bandwidth to reparameterize the scale (α) of the generalised Gaussian, which is more intuitive for signal processing applications.The filter F is then finally defined as At β = 2 the generalised Gaussian equals the normal distribution, which corresponds to a Morlet wavelet in the time domain [30].For higher values of the shape parameter, the filter moves towards a rectangular function, similar to a sinc filter in the frequency domain (see Fig. 2).In fact, the sinc filter is a special case of the generalized Gaussian in the limit of β = ∞.This can be shown by inserting the FWHM reparameterization based on the bandwidth h (equation 6) for the scale parameter α into the generalised Gaussian function centered around µ as It is then straightforward to analyze the function in the limit of β → ∞: This precisely coincides with the frequency domain representation of an ideal sinc bandpass filter given by two heavy-side step functions centered around µ with bandwidth h.We clamp the shape parameter at a minimum of β = 2, since the generalised Gaussian moves towards the Laplace distribution for lower values, which is expected to interfere with training due to its non-continuous derivative.
The cost of a wide transition band is reduced frequency selectivity, which makes the function a less ideal filter.This potential problem can be addressed by making the shape parameter of the generalised Gaussian trainable, which can then move the filter from the Gaussian towards a rectangular shape over the course of training.
Designing the filter in the frequency domain also allows for full control over the phase response.For simplicity we use linear-phase filters here with a group delay of 20 ms, but a zero-phase or any linear or non-linear phase response could be used depending on the feature in question.The group delay of a linear-phase setup could also be added as a trainable parameter to allow for phase alignment between signals (not reported here).
We initialize all filters at 23 Hz with a wide bandwidth (44 Hz) and shape β = 2, forming a normal distribution.The filter then has a non-zero derivative for all relevant frequencies.During training, filters narrow down to specific frequencies of interest.We accelerate the training of the shape parameter by rescaling it to For numerical stability in the generalised Gaussian function, we clamp the shape parameter at a maximum of β = 3 (corresponding to β rescale = 10).
The filters are trained independently for each electrode, except when using phase-locking value (PLV) connectivity, where each filter is shared across all electrodes to limit the model to within-frequency band connectivity as required by the definition of the metric.Whenever multiple filters per electrode are used (and, hence, multiple feature maps are derived from the same multichannel signal, as shown in Fig. 1), the feature module handles each feature map separately and in our correlation model inter-electrode connections are considered solely within individual maps.This results in multiple independent feature maps of connectivity, which are then concatenated into a single feature vector for classification.

Differentiable Feature Module
The proposed model is compatible with a variety of features, as long as they are defined for specific frequency bands and can be implemented in a differentiable way.For this study, we focus on band magnitude and signal correlations, and also perform initial tests on a phase synchrony descriptor, namely PLV.
The band-limited magnitude measure employed in this study is computed as the magnitude of the bandpass filtered signal, given by as the sum of magnitudes over frequency bins ω ∈ Ω, given the filtered signal x in frequency domain.For simplicity, we average the magnitude over all frequency bins instead of just the bins within the limits of the bandpass filter.
Using the magnitude instead of the power was observed to be more successful, which is likely due to the nonlinearity of squaring, which also leads to skewed feature standardizations.
To assess functional connectivity between the filtered signals, a connectivity measure has to be chosen and implemented to be a well-behaved differentiable function in order to allow gradients to backpropagate through the measure into the trainable bandpass filters.A particularly suitable candidate are signal correlations due to their simplicity and relative success [31], computed as the Pearson correlation between two bandpass filtered signals.Signal correlations are defined both as coupling within a frequency band as well as across different frequency bands, which makes them a natural fit to the independently trainable bandpass filters proposed here.The correlation function itself is well differentiable, as can be seen from the fact that it is even sometimes used as the loss function [32].The correlation measure between two signals X and Y is given by over time t ∈ T , where x and ȳ are the signal means.This can be efficiently implemented using an inner product matrix multiplication for the numerator.Signal correlations are defined in the range [-1, 1], with larger absolute values indicating stronger coupling.Since a phase shift between two signals can lead to negative correlations, we take the absolute value (|r XY |).
As an alternative estimate of functional connectivity, with a particular focus on the phase relationships between signals, we test phase locking values (PLV) [9].PLV is determined by the difference of the instantaneous phases ∆φ of two signals as Two signals with a strong coupling exhibit stable phase differences, leading to a high PLV, defined in the range [0, 1].We build on the computationally efficient implementation proposed in [33] (see also [34]), while explicitly handling calculations with complex numbers as required by the PyTorch version used in this study.To decompose phase locking values (PLV) into efficient inner matrix products, we make use of the identity between the polar and Cartesian representation of the complex analytical signal, given by where H(x) is the Hilbert transform and A = x 2 + iH(x) 2 is the envelope of the analytical signal.We obtain the oscillatory component e iφ by normalizing the analytical signal by the amplitude A as and then define the real component u and the imaginary component iv of e iφ as The differences in instantaneous phase ∆φ used in the PLV metric can then be expressed making use of the complex representation u + iv as The resulting equation can be implemented based on four inner matrix products, handling the real and imaginary components separately.At this point, it is important to notice that functional brain connectivity features, derived from EEG recordings in sensor space, are generally expected to be contaminated by volume conduction effects [35], which make signals recorded at nearby sensors to appear correlated even in the absence of actual neural connectivity [36].Spurious connectivity, due to volume conduction, is expected to have only minimal impact on the features extracted via EEGminer for the following reasons.First, by training the classifiers to discriminate between two recording conditions (or groups), we extract only discriminative connectivity features and hence disregard spurious connectivity due to volume conduction, which is expected to behave like a constant/common factor [37].Aligned with this consideration, in our connectivity plots we present the mined edges weighted by classifier weights and not weighted by the absolute measure of connection strength.Second, by encompassing cross-frequency correlations in our correlation models, the influence of volume conduction is further reduced [37].Particularly for the results included in section 4, it is the spatial resolution of electrodes array in the case of MyBrainTunes dataset used for music vs rest and female vs male classification (see section 4.1) and the cross-frequency character of the detected dependencies in SEED datasets that make dispensable the further signal manipulation for overcoming volume conduction.Additional steps that could be considered for mitigating the effects of volume conduction are appropriate preprocessing (e.g. by means of Laplacian transform [38] or beamformers [39]) or modifying the differentiable feature module so as to overlook zero-lag correlations (e.g. by adopting the imaginary PLV [33] version of eq.17).

Classifier and Feature Interpretation
As a classifier we use a linear layer with sigmoid activation, essentially performing logistic regression.This has the advantage that the trained classifier weights can directly be used as importance weights attributed to the features.For non-linear classifiers the interpretation strategy would have to be adapted, e.g. by using prediction gradients with respect to the features [40].Since different features can have different distributions, they need to be standardized in order to allow for the usage of the regression coefficients directly as an importance measure.In a machine learning setting with mini-batch gradient descent, standardization is performed via batch normalization [41], which keeps a running mean and variance over mini-batches to approximate dataset statistics.In our case we use non-affine batch normalization, meaning the target mean and standard deviation are not trainable.We use a batch size of 256, since a larger batch size leads to a better estimation of the dataset statistics.To further improve model convergence and feature standardization, we perform cosine decay to zero on the learning rate [42].This ensures that the magnitude of model updates is reduced over time, allowing the batch normalization to capture final statistics of the discovered features.
For the band magnitude model we can investigate the full search space by plotting the relative difference in frequency magnitudes between the two classes.This allows us to check whether the discovered features are actually represented in the dataset.This is not possible for the connectivity model due to the vastly larger search space.We therefore save class-wise feature activations over the dataset and perform statistical t-tests to see whether the difference of feature means between the classes is statistically different from zero.

Training Procedure
The training of EEGminer models is performed via minibatch stochastic gradient descent (SGD) with Nesterov momentum [43].We found that heavy momentum on the bandpass filters helps training and improves performance.The momentum on the filters is set to 0.99, while it is set to the default value of 0.9 on the classifier.The increased momentum on the generalized Gaussian filters was beneficial for their convergence during training.The learning rate is initialized at 2e-3 and decayed to zero according to a cosine schedule.The learning rate decay is necessary for the batch normalization to get final estimates of statistical feature distributions and thereby provide standardized classification features.We train the EEGminer models for 5000 epochs on music vs rest, 300 on female vs male classification and 1200 epochs on positive vs negative emotion (these roughly correspond to the same number of total update steps).For the loss L we use the mean squared error (MSE) function between the targets t and model predictions x for simplicity.To sparsify discovered features, we use lasso regularization, which adds the L1 norm of the classifier weights w to the loss with scaling factor γ, as defined by The L1 penalty is tuned manually per model and classification task (EEGminer magnitude γ=2e-3 on all tasks, EEGminer signal correlations on music vs rest and female vs male classification γ=3e-3, on positive vs negative emotion γ=3e-2).The L1 penalty is mostly tuned for the desired sparsity of extracted features, while having only minor effects on classification performance.Note that no parameter regularization is performed on the bandpass filters.

EXPERIMENTAL EVALUATION
The usefulness of discovered features can be evaluated by the classification performance of EEGminer models on specific tasks.We benchmark the performance of the proposed models against conventional machine learning models and deep learning models that are commonly used in the EEG domain.This serves to illustrate whether the discovered features have similar discriminative power as conventional approaches with manual feature engineering as well as using modern black box classifiers.The aim is not to directly improve on decoding performance, but to provide a research tool for discovering features of brain activity that have discriminative power close to current state-of-the-art decoding models.

Datasets
The dataset introduced here was formed from the data recorded during a very recent large-scale experiment, organized by two of the authors of this study and collected in cooperation with the Science Museum London 1 .The experiment was approved by the Science, Engineering and Technology Research Ethics Committee (SETREC) at Imperial College London.It consists of brain activity recorded from 761 volunteer subjects (400 female, 361 male) aged 12 and above (mean=28.7,std=11.0).The experiment followed a passive music-listening design, similar to [7] [44].A 30 second EEG baseline was recorded for each subject, after which 30 second segments of 30 songs were played in random order while recording EEG activity, for a total of 30x30=900 seconds of music listening EEG per subject.During EEG recording, participants were asked to remain still to reduce recording artifacts, to keep their eyes open and to focus on a dot in the middle of a screen to avoid introducing noise from the muscular activity of eye movements.The recordings were conducted with Emotiv EPOC+ wet electrode headsets with 14 channels at 256 Hz (AF3, F7, F3, FC5, T7, P7, O1, O2, P8, T8, FC6, F4, F8, AF4 in accordance with the 10-20 system).The song playlist was pre-defined and comprised of songs from the top of the UK charts of the last decades including songs from various genres (e.g.pop, rock, ballads, hip-hop, etc.).The female vs male classification task uses all recordings of the MyBrainTunes dataset.For the music vs rest task a single song recording is randomly chosen per subject to provide an equal number of song and resting state trials.
The SEED dataset contains EEG recordings from 15 participants while watching emotional film clips [20] [21].Each subject participated in 3 recording sessions, viewing the same 15 clips with negative, neutral or positive emotional content.The EEG data provided include recordings from 62 electrodes at 200 Hz sampling rate.The power line noise at 50 Hz has been removed.For the positive vs negative emotion task we discard clips with neutral emotion.We break up the long recordings into windows of 20 seconds.Windows are sampled randomly and treated independently for training samples.To account for this windowing, every recording is sampled 9 times per epoch, which increases the total number of update steps.For validation samples, 1. https://blog.sciencemuseum.org.uk/music-and-the-brain/ each recording is split into consecutive non-overlapping windows and the model predictions are averaged across these windows.
Both datasets have been preprocessed as follows: the multichannel signals were filtered (1-45 Hz for our dataset, 0.5-99.5 Hz for SEED) using a zero-phase band-pass filter (3rd order Butterworth).To remove artifacts we resorted to independent component analysis (ICA) [45].Artifact suppression was carried out separately for each trial, based on an in-house implementation of wavelet-enhanced ICA (wICA [46]).Specifically, independent components (ICs) were extracted from the multichannel signal by means of the EEGLAB Matlab toolbox 2 .Subsequently, wavelet decomposition based on wavelets of the biorthogonal family and wavelet shrinkage with a hard threshold based on false discovery rate [47] were applied to each one of the ICs.The multichannel signal was then reconstructed based on the artifact-free ICs.Hence, in this study, the use of "wICAcleaned" brain activity is implied.The EEG signals were rereferenced by common average referencing.We downsample the MyBrainTunes dataset to 128 Hz.For the deep learning baseline models, the SEED dataset was also resampled to 128 Hz.For all models, inputs are standardized by a total trial mean and standard deviation.For SPDNet, the input is bandpass filtered in the 4-40Hz range before computing the spatial covariance matrix.

Evaluation Procedure
To equalize small class imbalances during training, we perform oversampling of underrepresented classes.Crossvalidation was performed for all models, using 10 random folds with held-out subjects for the respective validation set.This means that for each fold, none of the samples of the subjects in the corresponding validation set were seen during training (also considering oversampling and the three sessions per subject on SEED), and results should be seen as subject-independent.The main performance metric is given as the unweighted average recall (UAR) at the end of training, reporting the mean and standard deviation across folds.The UAR for classes c ∈ C, |C| = N is given by

SVM and Deep Learning Baselines
As a performance baseline from the conventional neuroscience perspective, we employ a linear support vector machine (SVM) on standardized features spanning the whole search space.Regarding band magnitude, the features are the signal magnitudes of all channels filtered in canonical EEG bands 3 .In the signal correlation case, the features are all possible connections across all sensors filtered within and across all EEG bands (3,486 for our dataset, 69,006 for SEED).Performance was evaluated in the same way as for the EEGminer and deep learning models, using 10-fold subjectindependent cross-validation.To benchmark our approach against deep learning models, we train ShallowConvNet, EEGNet, SyncNet and SPDNet on all three tasks.DeepConvNet is omitted from the baselines here due to inferior performance when compared to ShallowConvNet on the studied tasks.We adjust the filter kernel sizes of ShallowConvNet and EEGNet for input signals sampled at 128Hz.All four deep learning models are trained for 100 epochs using an Adam optimizer.In the case of the SPDNet, an Adam optimizer with manifold constraint was used 4 .The optimizer learning rate is 1e-3 for ShallowConvNet and EEGNet and 2e-3 for SyncNet, as 4. https://geoopt.readthedocs.io/en/latest/optimizers.html proposed by the respective authors.The learning rate used for SPDNet is 1e-3.The loss function is binary cross-entropy between targets and predictions for all deep learning models.ShallowConvNet and EEGNet use a batch size of 16, while SyncNet uses 10 and SPDNet uses 30.The dropout rates for ShallowConvNet and EEGNet are 0.5 and 0.25 respectively.SyncNet uses dropout of input channels, which we employ with a rate of 0.25 only on the SEED dataset, as it gave inferior performance when used on the MyBrainTunes dataset (likely due to the small number of electrodes).No weight decay is used.For EEGNet we use 4 temporal filters on music vs rest and positive vs negative emotion as the number of training samples is somewhat limited, while we Connections are scaled by the mean model weights across folds, with the colors indicating the feature direction.Filter lines show the averaged filters across folds scaled by the model weights on the strongest adjacent connection, with shaded areas denoting the maximum weight.Since two freely moving filters were used per electrode, averaging of the filters over folds is performed according to the respective center frequency (i.e.average all low-center filters and all high-center filters), and the stronger of the two model weights is used for the connection.The inset shows the filter initialization.The limits of canonical EEG bands are given as dotted lines.use 8 temporal filters on female vs male classification.For SPDNet we use 3 BiMap layers, with hidden size of 14 on the MyBrainTunes tasks and 16 on SEED.

Performance Evaluation
We evaluate the performance of all tested models using cross-validation and the unweighted average recall (UAR) metric.Table 1 illustrates the cross-validation performance on all three tasks.The performance of the proposed EEGminer models is on par with a linear support vector machine (SVM) [48] trained on pre-computed signal magnitude or signal correlations using canonical EEG bands.This shows that EEGminer models can discover the relevant features to match the performance of a conventional machine learning model that operates on a large pre-computed search space, while having a focus on clear feature interpretability.
Furthermore, the classification performance of EEGminer models is on a comparable level to the deep learning baselines, performing marginally better or worse depending on the given task.The temporal resolution that convolutional neural networks offer is well-suited for stimuluslocked recordings, such as event-related potentials (ERPs) or motor imagery (MI), but might have relatively limited use on the spontaneous nature of the classification tasks studied here.This indicates that our proposed model limits trainable components to the relevant feature space, cutting unnecessary capacity and thereby reducing the potential for overfitting.
Regarding connectivity features, both signal correlations and PLV achieved similar performance, with signal correlations having a small edge.This disparity could be due to the fact that PLV is limited to within-band connectivity, while signal correlations are also defined across frequencies and therefore have a larger feature space.We therefore focus on band magnitudes and signal correlations when discussing the mined features.

Mined Features
Benefiting from the clear interpretability of EEGminer models, we can investigate the trained features in terms of relevant frequencies and functional connectivity.On the music vs rest classification task we present the full feature space as utilized by the respective models, while we focus on the most impactful features for female vs male classification during music listening and positive vs negative emotion recognition.When plotting top features on male vs female and positive vs negative emotion classification, averaging across folds is difficult.We reverted to the fold with highest validation performance.Using two filter channels can result in duplicate features, which we manually removed.
Looking at the trained magnitude EEGminer model on music vs rest classification (Fig. 3), it is apparent that the proposed generalised Gaussian filters exhibit strong trainability of the center frequency, bandwidth and shape, departing substantially from their initialized state.Many of the filters with high model weight narrow down to specific EEG bands (e.g.alpha in O2, theta in P7 and high beta in AF3).Multiple bands of interest can be found for the same sensor when using a model with multiple filter channels (e.g.P7, F8).Features are sparsified by L1 regularization on the linear classifier, forcing the model to disregard locations with minor contribution (e.g.FC5, FC6 and F3).The trained filters also exhibit good stability over folds, as seen in the small difference of mean and max filter importance, meaning the discovered features are consistent across reruns on other training folds.
Relevant band magnitude features as discovered by the EEGminer model on music vs rest classification are decreased activity of lower frequencies in frontal electrodes (AF3, AF4, F7 and F8) during music listening and increased activity in high beta and gamma (AF3, AF4 and F8).Also used by the model are increased alpha activity in occipital electrodes (O1 and O2) during music listening and decreased gamma as well as increased theta activity in P7.Decreased gamma activity in T7 and in low frequencies in T8 are also associated with music listening.
When using signal correlations as the feature module to classify music listening vs resting state (Fig. 4), the most relevant features are cross-hemispheric connections between the left temporal/parietal with the right frontal areas.Most specifically during music listening there is increased coupling between high-frequency activity in T7 and low frequencies in F8.Also relevant is decreased local connectivity of gamma brainwaves between FC5 and F7 and of high frequencies in FC6 with low frequencies in F8.It can be observed that the model identifies a hemispheric specialisation of the temporal areas, focusing on cross-hemispheric connections of the left temporal area and within-hemisphere connections of the right temporal area.This agrees with the previously proposed respective specialisation of the temporal lobes regarding music perception [49].
Regarding female vs male classification during music listening, the magnitude EEGminer model focuses on increased low beta activity in occipital areas (O1 and O2) for female subjects, as well as increased high frequencies in anterior frontal electrodes (beta in AF3 and gamma in AF4; see Fig. 5).Increased activity of higher frequencies in female subjects has previously been reported in studies of resting state EEG [50].The model also finds decreased theta and alpha activity for female subjects in P8.On the other hand, the correlation EEGminer model mostly focuses on decreased connectivity in lower frequencies for females and increased connectivity in the gamma band, particularly in cross-hemispheric symmetric connections.Specifically, reduced connectivity between alpha in P7 and low beta in O1 is associated with female subjects, as well as reduced connectivity between FC5 and FC6 and reduced beta connectivity between O1 and O2 and FC5 and F7.Increased connectivity in female subjects is found in gamma activity between F3 and F4.
The magnitude EEGminer model uses differences in gamma activity to classify emotion (see Fig. 5).Positive emotion is associated with increased gamma activity in temporal areas (T7, T8 and FT8) compared to negative emotion, as well as decreased gamma activity in parietal electrodes (CP3 and P6).The association of emotional state and high frequency activity in the temporal areas has previously been reported on the SEED dataset [51] [52].The most important couplings detected by the correlation EEGminer model are increased within-hemisphere connectivity during positive emotion between central and frontal areas (C3 with FC5 and C4 with F8) and decreased within-hemisphere connectivity of temporal with adjacent electrodes (FT8 with T8, FT7 with T7 and TP7 with T7).

Feature Validation
To validate the features found by EEGminer models on our dataset (music vs rest, female vs male classification during music listening), we created the relative change of grand averaged magnitude profiles for the magnitude models, and violin plots with t-test statistics showing the distributions of the features used by signal correlation models.Since the search space for connectivity models is exceedingly large, we limit ourselves to the top features here.
Fig. 6 (a) illustrates the relative change in grand averaged magnitude profiles during music listening relative to resting states for all sensors and available frequencies.Most of the features mined by the model are strongly represented in the grand average (reduced gamma in P7, increased alpha in O1 and O2, decreased delta to alpha in F7 and F8 and increased high beta and gamma in F8).Some mined features are ambiguous in the grand average, specifically increased theta in P7, which is shown in the alpha band by the grand average, and increased high beta in AF4, which is shown in low beta by the grand average.
The relative change in grand averaged magnitude profiles for female relative to male subjects during music listening is given in Fig. 6 (b).All five of the top features discovered by the EEGminer model are strongly represented by the differences in grand averages (increased low beta in O1 and O2 for females, increased beta in AF3 and gamma in AF4, reduced theta and alpha in P8).
Violin plots showing the distributions of music listening and resting state EEG across all subjects (rest and one song each) of top features discovered by the EEGminer signal correlation model are given in Fig. 7.All feature directions agree between the EEGminer model and the feature distributions in the dataset, although some of the features show low statistical significance in the difference of means.This can be attributed to the limited number of training examples using only one randomly chosen song per subject to equalize class balance, as well as the somewhat limited classification performance of the model (0.68±0.02).
Fig. 8 shows the distributions of top features discovered by the EEGminer signal correlation model for male and female subjects during music listening.All feature directions agree between the model and the feature distributions and all of the top seven features show strong statistical mean differences in the t-tests.

CONCLUSION
We presented trainable filters parameterized by the generalised Gaussian function, which include Morlet wavelets and sinc filters as special cases and allow the model to select the appropriate filter during end-to-end training.Using differentiable implementations of neuroscientifically plausible EEG features like band magnitude and signal correlations, our proposed model can discover classification-relevant frequency bands and functional connectivity patterns among a  large repertoire of possible features, matching the decoding accuracy of established deep learning models.The trained classifier further offers clear insights into feature importance and interpretations.In future work, going beyond the static nature of features used here, a feature mining mechanism with higher time resolution should be developed for eventrelated tasks.
The current application of deep learning models on SPD matrices, like SPDNet, in the context of EEG is limited by the choice of input matrix.In the context of this paper, a wide-band covariance matrix was used as a reasonable choice.This approach could benefit significantly from the trainable filters and connectivity estimation developed for EEGminer, which in the case of signal correlations produces SPD matrices.This exploration is left for future work.
Implementing an EEG pipeline as a differentiable deep learning model could bridge the gap between conventional research in neuroscience and difficult to interpret black-box methods commonly employed in the deep learning field.Such a model promises comparable performance with deep learning models as well as neuroscientific insights.This has important implications in practical applications as well, where model interpretability is often a strong requirement.

Fig. 2 .
Fig. 2. Frequency domain representations of a) ideal sinc bandpass filter, b) sinc bandpass filter truncated in time domain to kernel size=129 with Hamming window, c) generalised Gaussian filter with shape parameter β=2, corresponding to a Morlet wavelet, and d) generalised Gaussian with β=10.Shaded areas denote regions of the filter functions that provide gradients, i.e. non-zero derivatives.

Fig. 4 .
Fig. 4. Trained correlation EEGminer model on music listening vs resting state classification, indicating important features of brain activity.Connections are scaled by the mean model weights across folds, with the colors indicating the feature direction.Filter lines show the averaged filters across folds scaled by the model weights on the strongest adjacent connection, with shaded areas denoting the maximum weight.Since two freely moving filters were used per electrode, averaging of the filters over folds is performed according to the respective center frequency (i.e.average all low-center filters and all high-center filters), and the stronger of the two model weights is used for the connection.The inset shows the filter initialization.The limits of canonical EEG bands are given as dotted lines.

Fig. 5 .
Fig. 5.The five most important features discovered by the EEGminer models.The height of filter lines and scale of connections respectively indicate the model weight for each feature, with colors indicating the feature direction.Panel a) shows the magnitude EEGminer features on female vs male classification during music listening and b) shows the respective correlation EEGminer features.Panel c) shows the magnitude EEGminer features on positive vs negative emotion classification and d) shows the respective correlation EEGminer features.The insets show the respective filter initializations.The limits of canonical EEG bands are given as dotted lines.

Fig. 6 .
Fig. 6.Relative change in grand averaged magnitude profiles for a) music-listening relative to resting state and b) female relative to male subjects during music listening.

Fig. 7 .
Fig. 7. Violin plots of unstandardised top features discovered by the correlation EEGminer model trained on music vs rest classification.Dashed lines indicate the quartiles of each feature distribution on all samples, including train and test set.The associated statistical difference between classes is given by an independent two-samples t-test above each distribution.

Fig. 8 .
Fig. 8. Violin plots of unstandardised top features discovered by the correlation EEGminer model trained on female vs male classification during music listening.Dashed lines indicate the quartiles of each feature distribution on all samples, including train and test set.The associated statistical difference between classes is given by an independent two-samples t-test above each distribution.