M3CV: A multi-subject, multi-session, and multi-task database for EEG-based biometrics challenge

EEG signals exhibit commonality and variability across subjects, sessions, and tasks. However, most existing EEG studies focus on mean group effects (commonality) by averaging signals over trials and subjects, while the substantial intra- and inter-subject variability of EEG has often been overlooked. Recent technological advances in machine learning, especially deep learning, have brought innovations to many aspects of EEG signal application, but great challenges remain in cross-session, cross-task, and cross-subject EEG decoding. In this work, an EEG-based biometric competition based on the large-scale M3CV (A Multi-subject, Multi-session, and Multi-task Database for investigation of EEG Commonality and Variability) database was launched to better characterize and harness the intra- and inter-subject variability and to promote the development of machine learning algorithms in this field. In the M3CV database, EEG signals were recorded from 106 subjects, of whom 95 repeated two sessions of the experiments on different days. The whole experiment consisted of 6 paradigms, including resting-state, transient-state sensory, steady-state sensory, cognitive oddball, motor execution, and steady-state sensory with selective attention, yielding 14 types of EEG signals and 120,000 epochs. Two learning tasks (identification and verification), performance metrics, and baseline methods were introduced in the competition. In general, the proposed M3CV dataset and the EEG-based biometric competition aim to provide the opportunity to develop advanced machine learning algorithms for achieving an in-depth understanding of the commonality and variability of EEG signals across subjects, sessions, and tasks.


Introduction
Since the discovery of electroencephalography (EEG) by Hans Berger in 1924 (Berger, 1929), EEG has evolved for use in a wide range of applications (Alotaiby et al., 2014; Cahn and Polich, 2006; Dietrich and Kanso, 2010; Wolpaw et al., 1991) for almost one hundred years. Typically, event-related potential (ERP) studies focus on the significant common or mean effects of a cohort, in which the intra- and inter-subject variability are treated as noise and filtered out by averaging over trials and subjects (Seghier and Price, 2018). Hence, ERPs have been widely used for investigating the neurological functions of sensory, motor, and cognitive processes (Kappenman et al., 2021). However, the conventional practice of ERP analysis takes the mean value of the EEG signals across trials and/or subjects to achieve a higher signal-to-noise ratio, and the resulting group-level commonality may not have a reliable effect at the individual level (Boshra et al., 2019; Fröhner et al., 2019; Hu et al., 2022; Infantolino et al., 2018).
The intra- and inter-subject variability pose a great challenge to individual-level explanation and decoding of EEG signals. Multiple factors may contribute to intra- and inter-subject variability, making it difficult to transfer a model trained on one session or subject to another (Krusienski et al., 2011; Satti et al., 2010). [Table 1: Open-access EEG databases with characteristics of multi-session, multi-subject, or multi-task.] Although there are meaningful studies concerning cross-session and cross-subject transfer learning for BCI, the theory of common feature space construction in BCI applications is still being studied (Autthasan et al., 2022; Lotte and Guan, 2011; Rodrigues et al., 2019). Similarly, severe overfitting has been observed when EEG-based biometrics are trained and evaluated on within-session recognition.
In summary, traditional ERP analysis and statistical testing help us identify the group-level commonality of EEG, while recently developed machine learning techniques provide a way to further explore the individual-level variability of EEG. However, the lack of generalizability across subjects, sessions, and tasks is still a major challenge for machine learning in neuroimaging. A large-scale multi-subject, multi-session, multi-task EEG database is highly desirable to support this branch of research. Hence, in this study, we established the M3CV (Multi-subject, Multi-session, Multi-task Commonality and Variability) EEG database to support cross-subject, cross-session, and cross-task EEG studies.
The M3CV database contains 14 types of EEG tasks in 6 experiment paradigms from 106 healthy young adults, in which 95 subjects completed two experimental sessions repeated on different days. To record the EEG data for as many tasks as possible within the limited recording time and to ensure the data quality of each task, we designed the experiments based on suggestions from five experts in neuroscience and psychology (see Acknowledgments).
Each session included the following six experimental paradigms, comprising 14 task types of EEG signals (task abbreviations given in parentheses).
• Paradigm 1: Resting-state with eyes closed and eyes open (EC and EO, each lasting 2 min).
• Paradigm 2: Transient-state sensory with visual, auditory, and somatosensory stimulation (VEP, AEP, and SEP, each having 60 trials).
• Paradigm 3: Steady-state sensory with steady-state visual, auditory, and somatosensory stimulation (SSVEP 1 min, SSAEP 2 min, and SSSEP 2 min).
• Paradigm 4: P300 with oddball experiment (target P300 with 30 trials, and nontarget P300 with 570 trials).
• Paradigm 5: Motor execution with movement of the right foot, right hand, and left hand (FT, RH, and LH, each having 80 trials).
• Paradigm 6: SSVEP with selective attention (SSVEP-SA, six classes, each class having 12 trials, each trial lasting 10 s).

When surveying existing open-access EEG databases (Table 1), we found few databases with multi-task and multi-session EEG signals. The number of subjects in most databases was less than 50. Given their specific research objectives, the number of tasks was typically only one or a small number. Multi-session EEG recordings are even harder to collect because it is difficult to have all subjects repeat the experiment after a certain period. We also found that existing databases with some attributes of multi-task or multi-session were mainly used for reliability analysis, BCI, and EEG-based personal identification. Reliability analysis (Gaspar et al., 2011) focuses on the cross-session reproducibility of EEG signals, which plays a fundamental role in EEG research. BCI studies (Brunner et al., 2008; Goldberger et al., 2000; Jeong et al., 2020; Korczowski et al., 2019; Kumar et al., 2021; Lee et al., 2019) often require the development of cross-session and cross-subject transfer learning algorithms. EEG-based personal identification techniques (Arnau-Gonzalez et al., 2021; Kumar et al., 2021) use EEG features to identify a certain person among a large number of samples, and should be robust across tasks and sessions.
In addition, SEED IV (Zheng et al., 2019) has four sessions of EEG data for EEG-based emotion studies; Langer et al. (2017) presented a dataset combining electrophysiology and eye-tracking intended as a resource for the investigation of information processing in the developing brain; and Kappenman et al. (2021) developed ERP Core with six tasks for ERP teaching studies.
Compared with existing biometric traits, such as iris, fingerprint, and face, the EEG signal could potentially provide a more secure biometric method, as it is confidential, hard to mimic, and almost impossible to steal (Chan et al., 2018; Bidgoly et al., 2022; Gui et al., 2015). Research on EEG-based biometrics has received increased attention in recent years (Wang et al., 2020; Chan et al., 2018; Debie et al., 2021; Jin et al., 2021; Marcel and Millan, 2007). However, there remain many challenges for an EEG-based biometric system, such as temporal permanence and robustness to mental state changes (Cahn and Polich, 2006; La Rocca et al., 2013; Maiorana, 2021a; Maiorana and Campisi, 2018). At the same time, there has been no open EEG-based biometric competition, and the lack of a unified test benchmark and platform hinders the development of this field. To this end, we decided to open the M3CV dataset and launch an EEG-based biometric competition.
The remainder of this paper is organized as follows. Section II introduces the M3CV dataset. The details of the EEG-based biometrics competition are presented in Section III. Results are provided in Section IV. The discussion and conclusion are given in Section V.

Subjects
A total of 106 healthy subjects from Shenzhen University participated in this experiment. Of these, 95 subjects (age: 21.3 ± 2.2 years; gender: 73 males) participated in two sessions of the experiment, scheduled on different days; the between-session interval ranged from 6 to 139 days, with a mean of 20 days. All subjects had normal hearing, normal or corrected-to-normal vision, and no history of neurological injury or disease (as indicated by self-report). During the experiment, the subjects were seated in comfortable chairs about one meter from the screen.
Ethical approval of the study was obtained from the Medical Ethics Committee, Health Science Center, Shenzhen University (No. 2019053). All subjects were informed of the experimental procedure and signed informed consent documents before the experiment, in which they agreed to make their data openly accessible for research purposes on the premise of concealing their personal information. Fig. 1 shows that the whole experiment was arranged in two sessions on separate days. The entire experiment, with 6 paradigms, 15 runs, and 14 tasks, was completed within 2 h (around 50 min of recording time and 70 min for experiment preparation and rest between consecutive runs).

Experimental paradigm
The descriptions of the 6 experimental paradigms are given below:
(1) Resting-state: EEG with eyes closed (EC) and eyes open (EO) was recorded, each condition lasting 2 min.
(2) Transient-state sensory: EEG elicited by visual, auditory, and somatosensory stimuli was recorded in Runs 04 and 11. For each run, 30 trials each of VEP, AEP, and SEP were arranged in random order. Each stimulus lasted 50 ms, and the interstimulus interval (ISI) was set at 2-4 s. On average, each run lasted 4.5 min.
(3) Steady-state sensory: Trains of visual, auditory, and somatosensory stimuli were delivered in Runs 05, 09, and 12, respectively. Considering the different frequency responses of the different stimulus modalities, and the higher signal-to-noise ratio of SSVEP compared to SSAEP and SSSEP, the stimulation frequencies and recording times differed: 10 Hz (Herrmann, 2001) and 1 min for SSVEP in Run 05, 45.38 Hz (Galambos et al., 1981) and 2 min for SSAEP in Run 09, and 22.04 Hz (Snyder, 1992) and 2 min for SSSEP in Run 12.
(4) Visual oddball: A visual oddball experiment was arranged in Run 07, with a red square as the target stimulus and a white square as the nontarget stimulus on the screen. Each square lasted 80 ms with an ISI of 200 ms. Six hundred trials of stimuli were presented, in which the target stimuli had a probability of 5% (30 trials of target and 570 trials of nontarget stimulation). The subjects were asked to count the number of red squares and report it after the run, so that their attention remained on the screen.
(5) Motor execution: Executed movement was performed in Runs 03, 06, 10, and 13. During these runs, the subjects were instructed to respond to a visual cue by gripping their left hand (LH) or right hand (RH), or by lifting their right ankle (FT), for a duration of 3 s, i.e., until the cue offset. No feedback was provided during the online recording. To ensure that their motor areas were fully activated, the subjects were required to perform the real executed movements of FT, RH, and LH at a rate of twice per second or faster, at approximately 80% of their maximum voluntary contraction. There was no external tool, like a metronome or a hint on the screen, to pace the subject, because it might have produced unnecessary evoked potentials as an external stimulus. During the experiment, the experimenter monitored the movements of the subjects.
If the subjects did not move fast enough, or if other body parts were found to move simultaneously, the experimenter would abandon the recording of the current run, correct the subject's movement, and let them practice several more trials. The experimenters continuously monitored whether the movements of the subjects met these standards and corrected them when necessary. To minimize the intra- and inter-subject variability, we tried to ensure the consistency of the movements across all subjects. Hence, executed movement was used instead of the motor imagery of classical BCI experiments, and no feedback or training was given to the subjects.
(6) SSVEP with selective attention: Six white squares flashed simultaneously at different frequencies in Run 08. First, the target square turned red for 200 ms. After 500 ms, all squares began to flash at frequencies of 7 Hz, 8 Hz, 9 Hz, 11 Hz, 13 Hz, and 15 Hz, lasting 10 s. The subject was asked to focus on the target square during the 10 s by covert visual attention, with fixation on the middle of the screen. In total, 12 trials were arranged in Run 08.
For each session of the experiment, the subjects performed the 6 experimental paradigms arranged in 15 runs, yielding the 14 tasks with rest, sensory, cognitive, and motor-related EEG signals listed in Table 2.

Experimental platform
The continuous EEG signals were recorded using an EEG amplifier (BrainAmp, Brain Products GmbH, Germany) and multichannel EEG caps (64 channels, Easycap). The signals were recorded at a sampling rate of 1000 Hz from 64 electrodes placed at the standard 10-20 positions. The electrodes FCz and AFz served as reference and ground, respectively. Before data acquisition, the contact impedance between the EEG electrodes and the scalp was brought below 20 kΩ to ensure the quality of the EEG signals during the experiments.
An Arduino Uno platform was programmed to release the visual, auditory, and somatosensory stimuli in Paradigms 2 and 3; it communicated with the Matlab program (The MathWorks Inc., Natick, USA) on a PC through a serial port.
• Visual stimuli for VEP and SSVEP were delivered by a 3 W light-emitting diode (LED) with a 2 cm diameter circular light shield, placed at the center of the visual field 45 cm away from the subjects' eyes. The mean LED intensity was 1074 lux as measured by a light meter (TES-1332A, TES). • Auditory stimuli for AEP and SSAEP were presented via a Nokia WH-102 headphone. A 1000-Hz pure tone was used for the stimuli.
The intensity was set at a comfortable level (75 dB SPL on average) for all subjects as measured by a digital sound level meter (Victor 824, Double King Industrial Holdings Co., Ltd., Shenzhen, China). • Somatosensory stimuli for SEP and SSSEP were generated by a 1027 disk vibration motor. Since there was no effective tool to measure the output intensity of the vibrator directly, we report the product parameters instead: a rated power of 3 W, an efficiency of 80%, and dimensions of 10 mm × 2.7 mm.
For the other tasks, a 24.5-inch screen (1920 × 1080) with a 240-Hz refresh rate (Alienware AW2518H, Miami, USA) was used to present the visual stimuli or cues, programmed using Psychtoolbox-3 (http://psychtoolbox.org/) in Matlab. The red and white squares were delivered in a sequence at the center of the screen with a black background for the visual oddball. Three white squares were also used in the motor execution paradigm to indicate the movement. Six white squares flashed simultaneously at different frequencies to deliver the stimulus in the SSVEP with selective attention paradigm. The size of each square was set to 300 × 300 pixels in these paradigms.

EEG signal pre-processing
The data pre-processing pipeline for the M3CV database is illustrated in Table 3. The raw EEG signals were recorded in the BrainVision core data format (each recording consisting of a .vhdr, .vmrk, .eeg file triplet) and managed with the Brain Imaging Data Structure (BIDS) (Gorgolewski et al., 2016; Pernet et al., 2019). For each recording, the bad channels were interpolated first. Channel FCz (the reference) was added back, and channel IO was removed. Then all signals were filtered by a 0.01-200 Hz band-pass filter and a 50-Hz notch filter; a 2nd-order Butterworth zero-phase filter was used for both filtering steps. After re-referencing to TP9/TP10, artifacts produced by eye blinks or eye movements were identified and removed manually by Independent Component Analysis (ICA) (Huang et al., 2020).
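The two filtering steps above can be sketched with SciPy as follows. This is a minimal illustration under the stated parameters, not the exact pipeline: bad-channel interpolation, re-referencing, and ICA are omitted, and the function name `filter_eeg` is our own.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, filtfilt, iirnotch

FS = 1000  # sampling rate (Hz) of the M3CV recordings

def filter_eeg(eeg):
    """Zero-phase 0.01-200 Hz band-pass plus 50-Hz notch filtering.

    eeg: array of shape (n_channels, n_samples).
    """
    # 2nd-order Butterworth band-pass, run forward and backward (zero phase);
    # second-order sections keep the very low 0.01-Hz edge numerically stable
    sos = butter(2, [0.01, 200], btype="bandpass", fs=FS, output="sos")
    out = sosfiltfilt(sos, eeg, axis=-1)
    # Notch filter to suppress 50-Hz power-line interference, also zero phase
    bn, an = iirnotch(w0=50, Q=30, fs=FS)
    return filtfilt(bn, an, out, axis=-1)
```

Applied to a signal containing a 10-Hz and a 50-Hz component, the 10-Hz component passes through nearly unchanged while the 50-Hz component is strongly attenuated.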
During signal preprocessing, we interpolated bad channels for 22 subjects, and one of the 95 subjects who completed both sessions was removed due to strong 10-Hz artifacts. For motor execution, preventing interference from EMG artifacts caused by the actual movement was crucial during data collection. Firstly, we asked the subjects to keep their torso still during both foot and hand movements; this is also why right-foot movement, rather than two-foot movement, was used in the experimental design. Secondly, the experimenters continuously monitored the subjects' movements and the online EEG display to ensure the quality of data collection. Thirdly, since the amplitude of the EMG signal is much higher than that of the EEG signal, contamination by EMG artifacts is highly visible, and we did not see strong EMG artifacts during signal preprocessing. Furthermore, no strong EMG artifact was observed in the time-frequency analysis: the main EEG responses came from C3, C4, and Cz in the α and β bands, while the response in the γ band, the main frequency band for EMG artifacts, was much smaller. It should be mentioned that we did not perform bad-epoch rejection in pre-processing. The machine learning task differs from conventional ERP analysis, and robustness to outliers is an important feature of a machine learning algorithm. Besides, the criteria for rejecting epochs differ across research groups, which would affect the reproducibility of the algorithms.

Basic EEG signal visualization
To show the basic characteristics of the EEG signals in M3CV, we performed time-domain analysis on the ERP signals, frequency-domain analysis on the resting-state EEG and steady-state evoked potentials, and time-frequency-domain analysis on the motor-related signals.

Time domain analysis
The ERPs of VEP, SEP, AEP, and target and nontarget P300 were analyzed in the time domain. A 0.5-30 Hz Butterworth bandpass filter was applied to the preprocessed EEG signal. It should be noted that the 0.5-30 Hz bandpass filter was used only for ERP visualization; the dataset provided in the biometrics competition was filtered with a 0.01-200 Hz bandpass filter. After segmentation and averaging, baseline correction was performed from −0.5 to 0 s to obtain the ERP for each subject.
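The segment-average-baseline procedure can be sketched in NumPy as follows. This is a minimal illustration: the function name `erp_from_events`, the event list, and the 0.8-s post-stimulus window are our own illustrative choices, not values from the paper.

```python
import numpy as np

FS = 1000               # sampling rate (Hz)
BASELINE = (-0.5, 0.0)  # baseline window (s), as described above

def erp_from_events(eeg, events, tmin=-0.5, tmax=0.8):
    """Segment continuous EEG around events, average, and baseline-correct.

    eeg:    (n_channels, n_samples) preprocessed signal
    events: sample indices of stimulus onsets
    """
    n0, n1 = int(tmin * FS), int(tmax * FS)
    # Stack one epoch per event: (n_events, n_channels, n_times)
    epochs = np.stack([eeg[:, e + n0 : e + n1] for e in events])
    erp = epochs.mean(axis=0)  # average over trials
    # Subtract the mean of the pre-stimulus baseline window per channel
    b0, b1 = (int((b - tmin) * FS) for b in BASELINE)
    erp -= erp[:, b0:b1].mean(axis=1, keepdims=True)
    return erp
```

For a constant (DC) signal, the baseline subtraction removes the offset entirely, leaving a flat zero ERP.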
For the time-domain analysis, the grand-averaged waveforms of VEP, AEP, SEP, and target and nontarget P300 are shown in Fig. 2(A-D), with topographies at their corresponding peaks. The VEP response was mainly concentrated in the occipital area. AEP and SEP shared a similar N2/P2 waveform in the central area. The classical N75-P100-N135 complex, typically observed with pattern-reversal VEP using high-contrast black-and-white checkerboards as stimuli, is not seen here. As the vibrator was placed on the subject's left hand, the N2 response of the SEP appeared over the contralateral primary sensory area. The topographies of the P2 components of AEP and SEP are quite similar. In the oddball experiment, the target P300 response was mainly concentrated at channel POz at around 300 ms. By comparison, the nontarget response mainly came from the occipital area, because visual stimuli were used in the oddball experiment.

Frequency domain analysis
For the resting-state task with eyes closed and eyes open, the EEG signals were first re-referenced to a common average reference after preprocessing. Welch's method with a 2-s window and 50% overlap was applied. After transforming the total power level into decibels and averaging, the frequency-domain responses for resting-state EEG with eyes closed and eyes open were obtained for each subject. The processing pipeline for EEG signals from the steady-state sensory and SSVEP with selective attention tasks was similar, except that no re-referencing was performed, with TP9/TP10 retained as the reference, and the FFT was applied instead of Welch's method to obtain a sharper frequency response for the SSVEP, SSAEP, SSSEP, and six classes of SSVEP-SA signals.
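The Welch step above can be sketched with SciPy as follows. This is a minimal illustration of the 2-s window with 50% overlap and decibel conversion; the function name `resting_psd_db` and the synthetic test signal are ours.

```python
import numpy as np
from scipy.signal import welch

FS = 1000  # sampling rate (Hz)

def resting_psd_db(eeg):
    """Welch PSD with a 2-s window and 50% overlap, returned in decibels.

    eeg: (n_channels, n_samples) resting-state signal.
    Returns (freqs, psd_db) with psd_db of shape (n_channels, n_freqs).
    """
    nper = 2 * FS  # 2-s window
    f, pxx = welch(eeg, fs=FS, nperseg=nper, noverlap=nper // 2)
    return f, 10 * np.log10(pxx)  # power in dB

# Synthetic eyes-closed-like signal: a 10-Hz alpha component plus weak noise
rng = np.random.default_rng(0)
t = np.arange(0, 20, 1 / FS)
x = np.sin(2 * np.pi * 10 * t) + 0.1 * rng.standard_normal(t.size)
f, p = resting_psd_db(x[np.newaxis, :])
```

On this synthetic signal, the spectral peak lands at the 10-Hz alpha frequency, mirroring the occipital alpha peak discussed for the EC condition.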
For the frequency-domain analysis of resting-state EEG with EC and EO (Fig. 3A), the difference was concentrated in the occipital area. With increasing frequency, the amplitude of the EEG response decreased at a rate of approximately 1/f for both EC and EO. The main difference between EC and EO was the three peaks that appeared in sequence around 10 Hz, 20 Hz, and 30 Hz. For the first peak, in the interval of 8-12 Hz, the main response was in the occipital area; the frequency power at Cz (0.45 dB) was larger than at C3 (0.17 dB) and C4 (0.21 dB) in the EO condition, while the power at Cz (−1.05 dB) was smaller than at C3 (−0.88 dB) and C4 (−0.73 dB) in the EC condition. The topographies of the second peak, in the interval of 18-22 Hz, had a similar shape, with the difference mainly coming from the magnitude. For the SSEP analysis in Fig. 3B, the brain areas and frequency bands of the induced SSEPs differed for visual, auditory, and somatosensory stimuli: around 10 Hz in the occipital area for SSVEP, around 45 Hz in the frontal-central area for SSAEP, and around 22 Hz in the primary sensory area for SSSEP. Fig. 3C shows how attention modulated the fundamental frequency (black triangles) and harmonic (gray triangles) responses at different frequency points. With selective attention, the responses at the target frequency (downward-pointing triangles) were larger than those at the nontarget frequencies (upward-pointing triangles).

Time-frequency domain analysis
For the motor execution tasks, since the EEG responses are complex across time intervals and frequency bands, a continuous wavelet transform was applied for the time-frequency transform, using the complex Morlet wavelet with a center frequency of 1.5 Hz and a bandwidth of 1 Hz. To reduce the computational cost, the EEG signal was downsampled to 200 Hz before the transform. After the continuous wavelet transform, the time-frequency response was further downsampled to 50 Hz to reduce the storage required on the hard disk and in computer memory.
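One way to implement a complex Morlet transform with these parameters is sketched below, following the cmor(fb, fc) convention with bandwidth fb = 1 and center frequency fc = 1.5; the scale-to-frequency mapping, the function name `cmor_tfr`, and the ±4-width support are our own choices, and the exact implementation behind Fig. 4 may differ.

```python
import numpy as np

FS = 200  # Hz, after the downsampling described above

def cmor_tfr(x, freqs, fc=1.5, fb=1.0):
    """Time-frequency power via convolution with complex Morlet wavelets.

    cmor convention: psi(t) = (pi*fb)**-0.5 * exp(2j*pi*fc*t) * exp(-t**2/fb),
    evaluated at scale s = fc * FS / f so the wavelet oscillates at f Hz.
    Returns power of shape (len(freqs), len(x)).
    """
    out = np.empty((len(freqs), len(x)))
    for k, f in enumerate(freqs):
        s = fc * FS / f                        # scale matching frequency f
        t = np.arange(-4 * s, 4 * s + 1) / s   # wavelet support (+-4 widths)
        psi = (np.pi * fb) ** -0.5 * np.exp(2j * np.pi * fc * t - t**2 / fb)
        # Correlate the signal with the conjugate wavelet (CWT coefficient)
        coef = np.convolve(x, np.conj(psi)[::-1], mode="same") / np.sqrt(s)
        out[k] = np.abs(coef) ** 2
    return out
```

For a pure 10-Hz oscillation, the power concentrates at the 10-Hz row of the output, which is the behavior exploited when reading ERD/ERS patterns off maps like Fig. 4.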
Time-frequency analysis was performed for the motor tasks because foot and hand movements lead to rich neural-oscillation changes in different time-frequency-spatial regions. Fig. 4 gives a comprehensive illustration of the changes caused by the three types of movements: FT, RH, and LH. Furthermore, no strong EMG artifact was observed in the time-frequency analysis: the main EEG responses came from C3, C4, and Cz in the α and β bands, while the response in the γ band, the main frequency band for EMG artifacts, was much smaller.
The corresponding topographies from six regions of interest (ROI) are detailed below.
• ROI #1 (from 0.1 to 0.3 s, 2 to 7 Hz): The response in this region was due to the motor-related cortical potential (MRCP), a very low-frequency negative shift in the EEG recording. The MRCP may precede the movement because cortical processes are engaged in the planning and preparation of movement; hence, the MRCP can be used to predict movement in BCI applications. In this experiment, the subjects did not know in advance which movement they were going to execute, or when, so the MRCP response emerged after the zero point. The brain area of its response matched the motor area of the corresponding action.
• ROI #2 (from 1 to 3 s, 9 to 13 Hz): The ERD response of the μ rhythm oscillation was observed in this region. By carefully comparing the responses of RH and LH, we found that the ERD phenomenon occurred in the motor areas of both the left and right hemispheres, but the ERD in the contralateral hemisphere was larger than that in the ipsilateral hemisphere.
• ROI #3 (from 1 to 3 s, 20 to 30 Hz): The ERD response of the β rhythm oscillation is also well reported for motor-based BCI applications. Similar to ROI #2, the ERD phenomenon occurred in the motor areas of both hemispheres for RH and LH. The interhemispheric difference was mainly located in the parietal-central area in ROI #2 and in the frontal-central area in ROI #3.
• ROI #4 (from 1 to 3 s, 55 to 90 Hz): During the movement, the high-frequency γ oscillation also showed an interhemispheric difference for RH and LH, although the magnitude was weaker than in ROI #2 and ROI #3. The magnitude change in ROI #4 was ERD for both RH and LH, but ERS for FT; in all cases, the magnitude was quite small for the high-frequency EEG recording.
• ROI #5 (from 4 to 4.5 s, 12 to 14 Hz): After the movement, the μ rhythm ERS phenomenon occurred in the corresponding motor area for FT, RH, and LH. With the large sample size of our dataset, some more detailed results can be observed. For example, the μ rhythm ERS for FT occurred earlier and at a higher frequency band (around 12.5 Hz and 4.2 s) than the ERS for RH and LH (around 9 Hz and 5 s).
• ROI #6 (from 4 to 4.5 s, 20 to 30 Hz): The β rhythm ERS phenomenon was seen in this region. We found that from 3.7 to 4.2 s, the μ rhythm ERD and β rhythm ERS occurred simultaneously for RH and LH. The μ rhythm ERS and β rhythm ERS did not occur at the same time for RH and LH, but occurred over almost the same time interval for FT.

Literature review
Several studies have reported that EEG-based identification and verification systems are capable of high recognition accuracy (Kong et al., 2019; Koike-Akino et al., 2016). However, many of them have not been evaluated under more rigorous conditions, such as testing on a database with a large population, across sessions, and across tasks, which are detailed as follows:
• Uniqueness: Although many studies have revealed the uniqueness of EEG signals and their ability to discriminate between subjects, the tested databases normally have a small population of fewer than 50 subjects (Delpozo-Banos et al., 2018; Arnau-Gonzalez et al., 2021; Chen et al., 2016). To employ EEG as a biometric trait, it must be tested on a larger database.
• Permanence: EEG signals can differ across days even for the same task, which poses a major challenge to decoding subject identity (La Rocca et al., 2013; Maiorana et al., 2016; Maiorana and Campisi, 2018). Further, in single-session settings, machine learning algorithms may recognize the particular electrode placement and preprocessing parameters rather than an EEG-based personal trait.
According to the points mentioned above, we conducted a short review of existing studies on multi-session and multi-task EEG-based biometrics, as shown in Table 4. Compared with our proposed database, the population sizes of these studies are quite small, especially for La Rocca et al. (2013) and Zeynali and Seyedarabi (2019). As for feature extraction, hand-crafted features, such as AR, MFCC, and PSD, are still the dominant choice, while only Maiorana (2020) and Kumar et al. (2021) applied deep learning methods. In terms of performance metrics, recognition accuracy and the equal error rate were the most used metrics for identification and verification tasks, respectively (Arnau-Gonzalez et al., 2021; Kumar et al., 2021). It should be noted that although these studies were performed on multi-session and multi-task EEG databases, only Kumar et al. (2021) applied cross-session and cross-task evaluation to their proposed algorithm.

Learning task
Based on the M3CV database, the EEG-based biometric competition was launched focusing on the problems of personal identification and verification (Fig. 5). The contestants need to train the classification model of the biometric system with the enrollment dataset. After the model has been trained, the contestant is asked to complete the following two classification tasks: (1) Identification: determine whether a given EEG signal comes from the enrollment set and, if the epoch is not judged to be from an intruder, further determine the subject ID; and (2) Verification: determine whether a given EEG signal comes from a certain claimed subject.

Feature extraction
According to the literature review, three types of commonly used features were selected for EEG-based biometrics, computed on the pre-processed EEG signals: (1) Power Spectral Density (PSD): To estimate the power spectrum of each channel of each epoch, we used the Welch periodogram algorithm. Specifically, we divided the whole 4-s epoch into 1-s segments with 0.5-s overlap, applied a Hamming window to each segment, and averaged the resulting periodograms. The PSD spectrum from 2 to 45 Hz was evenly divided into 12 frequency bands, and the band power of each frequency band was extracted.
(2) Mel-Frequency Cepstral Coefficients (MFCC): cepstral coefficients were computed from the log energies of a mel-scaled triangular filter bank applied to each channel, where c_i is the i-th cepstral coefficient, with i = 1, 2, 3, …, L; M is the number of triangular band-pass filters in the mel-scaled filter bank, and L is the number of the first cepstral coefficients used to obtain the signal representation. In this study, M = 18 and L = 12 were employed.
(3) Autoregressive Coefficients (AR): AR features are commonly used for EEG-based identification and verification (Maiorana and Campisi, 2018). In this study, AR features were extracted from each epoch using a 12th-order autoregressive model fitted by solving the Yule-Walker equations; thus, 12 AR coefficients were obtained for each EEG channel.
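The three feature extractors can be sketched as follows. This is an illustrative sketch, not the competition code: a linear 2-45 Hz filter bank stands in for the mel-scaled triangular bank, Welch band energies stand in for the filter-bank energies, and the function names are ours. The epoch length (4 s), segment length (1 s), overlap (0.5 s), Hamming window, M = 18, L = 12, and AR order 12 follow the text.

```python
import numpy as np
from scipy.signal import welch
from scipy.linalg import solve_toeplitz
from scipy.fft import dct

FS = 1000  # sampling rate (Hz); epochs are 4 s long

def _band_energies(epoch, n_bands):
    """Welch PSD (1-s Hamming segments, 0.5-s overlap), summed over
    n_bands equal-width bands spanning 2-45 Hz."""
    f, pxx = welch(epoch, fs=FS, window="hamming", nperseg=FS, noverlap=FS // 2)
    edges = np.linspace(2, 45, n_bands + 1)
    return np.stack(
        [pxx[:, (f >= lo) & (f < hi)].sum(axis=1) for lo, hi in zip(edges[:-1], edges[1:])],
        axis=1,
    )

def psd_band_power(epoch):
    """PSD feature: 12 band powers per channel -> (n_channels, 12)."""
    return _band_energies(epoch, 12)

def cepstral_features(epoch, M=18, L=12):
    """MFCC-style feature: DCT of M log band energies, keeping the first L
    coefficients (linear bands here instead of a true mel bank)."""
    logE = np.log(_band_energies(epoch, M) + 1e-12)
    return dct(logE, type=2, axis=1, norm="ortho")[:, :L]

def ar_features(epoch, order=12):
    """AR feature: Yule-Walker AR coefficients per channel -> (n_channels, order)."""
    feats = []
    for x in epoch:
        x = x - x.mean()
        # Biased autocorrelation at lags 0..order
        r = np.correlate(x, x, "full")[x.size - 1 : x.size + order] / x.size
        # Solve the Toeplitz Yule-Walker system R a = r
        feats.append(solve_toeplitz((r[:-1], r[:-1]), r[1:]))
    return np.stack(feats)
```

Each extractor maps a (channels × samples) epoch to a (channels × 12) feature matrix, so the three can be concatenated or used separately as classifier inputs.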

Classification
In the biometric system, the verification task is a one-to-one classification problem, while the identification task is a one-to-N classification problem; with the presence of intruders, identification becomes a one-to-(N + 1) classification problem. Considering the extensibility of the enrollment set, composing the one-to-N classifier from N one-to-one weak classifiers, one per subject, is a practical design for a biometric system, as it avoids retraining the whole classifier when new subjects enroll.
In this study, one-to-one classifiers, including a similarity-based method using Euclidean distance (L2) and the one-class Support Vector Machine (SVM), were applied as the baseline classifiers for both the verification and identification tasks in the subsequent analysis, for simplicity. More specifically, after feature extraction, an L2-based or SVM classification model was trained on the epochs of each subject in the enrollment set.
In the verification task, the corresponding classifier confirms the subject ID claimed for a test epoch when the acquired score is greater than the rejection threshold θ. In an offline system, this threshold is varied to obtain the equal error rate (EER), a commonly used threshold-independent measure in biometric verification tasks (Bidgoly et al., 2022), defined as the intersection point of the false acceptance rate (FAR) curve and the false rejection rate (FRR) curve. In an online system, the threshold has to be pre-determined, normally by obtaining the EER on the training set (Marcel and Millan, 2007; Kang et al., 2018).
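The threshold sweep that yields the EER can be sketched as follows. This is a minimal illustration assuming higher scores mean a better match; the function name `eer` and the grid of candidate thresholds are our own choices.

```python
import numpy as np

def eer(scores_genuine, scores_impostor):
    """Equal error rate: sweep the rejection threshold over all observed
    scores and return (EER, threshold) at the FAR/FRR crossing point.

    scores_genuine:  match scores of genuine (claimed-ID-correct) epochs
    scores_impostor: match scores of impostor epochs
    """
    thresholds = np.sort(np.concatenate([scores_genuine, scores_impostor]))
    # FAR: impostors wrongly accepted; FRR: genuine epochs wrongly rejected
    far = np.array([(scores_impostor >= th).mean() for th in thresholds])
    frr = np.array([(scores_genuine < th).mean() for th in thresholds])
    i = np.argmin(np.abs(far - frr))  # closest crossing of the two curves
    return (far[i] + frr[i]) / 2, thresholds[i]
```

When the genuine and impostor score distributions are perfectly separable, the EER is zero; as they overlap, the EER rises toward the overlap probability.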
In the identification task, the intruder test was performed first: the test epoch was fed into each subject's classification model to generate a score, and when the maximal score obtained over all subjects' models was below the threshold, the test epoch was judged to be from an intruder. For the L2-based model, the rule for judging the test epoch as an intruder was defined by Eq. (2),

min_{i = 1, …, N} || x − x̂_i(x) ||_2 > θ,   (2)

where N is the number of subjects in the enrollment set, x is the test epoch, x̂_i(x) is the nearest epoch to x among all epochs of the i-th subject, with i = 1, 2, 3, …, N, and θ is the rejection threshold determined by the EER. If the test epoch passed the intruder detection, the subject ID was assigned based on the maximum score over all subjects. For an SVM-based classifier, the likelihood was used in place of the distance of the L2-based classifier.
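The L2-based intruder rejection and identification steps can be sketched together as follows. This is an illustrative sketch: the function name `identify`, the dictionary layout of the enrollment set, and the "intruder" sentinel are our own choices.

```python
import numpy as np

def identify(test_epoch, enrolled, theta):
    """L2-based identification with intruder rejection.

    test_epoch: feature vector of the epoch to identify
    enrolled:   dict mapping subject_id -> (n_epochs, n_features) array
    theta:      rejection threshold (e.g., tuned at the EER on training data)
    Returns the nearest subject's ID, or "intruder" if every subject's
    nearest enrolled epoch is farther than theta.
    """
    best_id, best_dist = None, np.inf
    for sid, feats in enrolled.items():
        # Distance to this subject's nearest enrolled epoch
        d = np.linalg.norm(feats - test_epoch, axis=1).min()
        if d < best_dist:
            best_id, best_dist = sid, d
    return "intruder" if best_dist > theta else best_id
```

Because each subject contributes an independent one-to-one comparison, enrolling a new subject only adds one entry to the dictionary, matching the extensibility argument above.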

Performance metrics
To evaluate the performance of the different biometric models, Accuracy (ACC), Equal Error Rate (EER), Precision, F1-score, Sensitivity (1 − FRR), Specificity (1 − FAR), training time (T_train), testing time (T_test), and the final leaderboard score were applied as performance metrics.
For offline evaluation, ACC, the recognition accuracy, is the most commonly used metric for the identification task. For the verification task, EER, the equal error rate of the False Acceptance Rate (FAR) and the False Rejection Rate (FRR), is the most commonly used ( Maiorana, 2021b ;Bidgoly et al., 2022 ;Kang et al., 2018 ). As the rejection threshold of the classifier grows, FRR increases and FAR decreases; EER is defined as the intersection point of the FAR and FRR curves ( Wu et al., 2018 ;Yingnan et al., 2019 ). Hence, for offline evaluation, EER is not only used to determine the rejection threshold during the training phase, but also used as the metric to evaluate the performance of the biometric algorithm.
For online evaluation, since EER is determined during the training phase, it could not be used as a performance metric for online testing. Hence, the most common and traditional performance metrics, ACC, Precision, F1-score, Sensitivity (1 − FRR), Specificity (1 − FAR), T_train, and T_test, are applied for both the identification task ( Fig. 6 A) and the verification task ( Fig. 6 B) ( Kong et al., 2019 ;Abolfazl et al., 2019 ). It should be noticed that the definitions of ACC, Precision, F1-score, Sensitivity, and Specificity differ between the multi-class identification task in Fig. 6 A and the two-class verification task in Fig. 6 B. T_train and T_test represent the time of training and testing. In this study, T_train represents the time of training the classifiers with the epochs from Sub11-Sub18 separately, and T_test represents the time of testing all epochs of Sub11-Sub20. All the tests were run on a computer with an Intel(R) Core(TM) i9-9900KF CPU.
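The two-class verification metrics above follow directly from the confusion-matrix counts. The following is a generic sketch with hypothetical names, not the competition's evaluation code:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Verification-style (two-class) metrics from 0/1 labels.

    Sensitivity = 1 - FRR (genuine epochs correctly accepted) and
    Specificity = 1 - FAR (impostor epochs correctly rejected).
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    acc = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0   # 1 - FRR
    specificity = tn / (tn + fp) if tn + fp else 0.0   # 1 - FAR
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return dict(ACC=acc, Precision=precision, Sensitivity=sensitivity,
                Specificity=specificity, F1=f1)
```

For the multi-class identification task, the same counts would instead be computed per subject and averaged, which is why the definitions differ between Fig. 6 A and Fig. 6 B.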
To avoid information leakage, the performance metrics ACC, Precision, F1-score, Sensitivity, and Specificity are only calculated in the mock online test, since for the real online competition the evaluation of these metrics would require the true labels of the testing set. Hence, only the leaderboard score, the weighted sum of the recognition accuracies of the identification and verification tasks (ACC_ide and ACC_ver) in Fig. 6 C, was reported for online submission, as provided by the Kaggle platform. The leaderboard score was used as the unique ranking in the competition to evaluate each contestant's overall performance in both tasks and to decide the final winner.
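Since the exact weights of the weighted sum are defined in Fig. 6 C and computed by the Kaggle platform, the following one-liner only illustrates the idea with assumed equal weights:

```python
def leaderboard_score(acc_ide, acc_ver, w_ide=0.5, w_ver=0.5):
    """Weighted sum of identification and verification accuracy.

    The 0.5/0.5 default weights are illustrative assumptions; the
    actual weighting is fixed by the competition (Fig. 6C).
    """
    return w_ide * acc_ide + w_ver * acc_ver
```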

Dataset composition
As is shown in Fig. 7 , the whole set for this competition was divided into three parts: Enrollment Set, Calibration Set, and Testing Set (containing 11 intruders).
• Enrollment Set: it provides the 1 st -session EEG of 95 subjects. The enrollment set is used by the contestants to train the biometric model.
• Calibration Set: the EEG data of the 2 nd session from 20 subjects is provided in the calibration set, which is used to help the contestants get familiar with the data format, refine effective feature extraction, and tune the hyper-parameters of the machine learning model. The number of subjects in the calibration set is relatively small compared with the enrollment set and testing set.
• Testing Set: it is used to evaluate the performance of the algorithms in the competition with the public and private leaderboards; which epochs belong to the public or private leaderboard is hidden. The testing epochs come from the 2nd session of 86 subjects, among which 11 subjects are treated as intruders.

Organization of the competition
The challenge has already been launched and will be closed on Apr 30 th , 2023. The Kaggle platform is used to host the competition. During the challenge, contestants can submit their solutions on Kaggle, and we provide them with the score computed on the public leaderboard. At the end of the challenge, contestants are ranked based on the score of their submissions computed on the private leaderboard. Late submissions are still allowed on the Kaggle platform for researchers to evaluate their methods.

Competition platform
Kaggle is an online community platform for data scientists and machine learning enthusiasts, which provides the world's largest data science community with powerful tools and resources. The link to the platform for our competition is as follows, • Kaggle: https://www.kaggle.com/competitions/eeg-biometriccompetition

Data and code availability
The public dataset with the code of the baseline methods and an example submission file are available at • Kaggle: https://www.kaggle.com/competitions/eeg-biometriccompetition/data .

Benchmarks
To have a comprehensive understanding of the challenges in the EEG-based biometrics competition, we divided the benchmark test into three levels.
(1) Offline evaluation aims to reveal the challenges of the competition arising from increasing population sizes and from cross-session and cross-task tests; (2) the mock online test aims to evaluate the performances of different baseline models without information leakage; (3) the online submission provides an example code and a public baseline score.

Offline evaluation
Fig. 6. Illustration of performance metrics for identification task, verification task, and leaderboard ranking.

For offline evaluation, three analyses were performed to show the challenges of increasing population sizes and of cross-session and cross-task tests in this EEG-based biometrics competition. These analyses were performed on the data of the 20 subjects in the calibration set and enrollment set, whose subject IDs for the second session were known. To avoid any possible information leakage, all the analyses were limited to these 20 subjects; no data or label information from the testing set was used, and intruders were not considered. The train/test pipelines and performance metrics for these analyses were as follows: (1) Testing on increasing population sizes: to evaluate the influence of increasing population sizes on the performance of EEG-based biometrics, we performed within-task, cross-session testing in which the number of subjects was increased from 2 to 20 in steps of 2. To avoid training bias, subjects were randomly selected 10 times for each population size. The epochs from the different EEG tasks were used for testing separately. For each task, the data from the first session was used for training and the data from the second session for testing.
(2) Testing across tasks: to compare the performance between within-task and cross-task settings, we performed cross-session testing with a population size of 20. EEG data of a particular task from one session (session 1/session 2) was used to train the model, and EEG data of another task from the other session (session 2/session 1) was used for testing; the training set and the testing set were then exchanged. Since AEP, SEP, and VEP were recorded in the same runs, we treated them as one condition/task, EP (i.e., Evoked Potential). Also, the term ME (motor execution) was used to represent the LH, RH, and RF conditions. (3) Testing across sessions: to compare the performance between within-session and cross-session settings, we performed the within-task test with a population size of 20. To make a fair comparison between within-session and cross-session tests, the training set was the first half of the epochs of the first session, which was the same for both tests. For the within-session test, the testing set was the second half of the epochs of the first session; for the cross-session test, the testing set was all epochs of the second session.
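The protocol in (1), growing subject subsets with 10 random draws per size, training on session 1 and testing on session 2, could be sketched as follows. The data layout and the `classify(train_X, train_y, test_X)` interface are assumptions for illustration, not the actual analysis code:

```python
import numpy as np

def population_size_curve(features_s1, labels_s1, features_s2, labels_s2,
                          classify, sizes=range(2, 21, 2), repeats=10, seed=0):
    """Cross-session identification accuracy vs. population size.

    For each population size, randomly draws that many subjects
    `repeats` times, trains on their session-1 epochs and tests on
    their session-2 epochs via the user-supplied `classify` function,
    then averages the accuracies.
    """
    rng = np.random.default_rng(seed)
    subjects = np.unique(labels_s1)
    curve = {}
    for n in sizes:
        accs = []
        for _ in range(repeats):
            chosen = rng.choice(subjects, size=n, replace=False)
            tr = np.isin(labels_s1, chosen)   # session-1 epochs -> training
            te = np.isin(labels_s2, chosen)   # session-2 epochs -> testing
            pred = classify(features_s1[tr], labels_s1[tr], features_s2[te])
            accs.append(np.mean(pred == labels_s2[te]))
        curve[n] = float(np.mean(accs))
    return curve
```

The same skeleton covers analyses (2) and (3) by swapping which session/task supplies the training and testing masks.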
In the offline evaluation, ACC was adopted as the performance metric for the identification task (larger is better), and EER was used for the verification task (smaller is better) ( Maiorana, 2021b ;Bidgoly et al., 2022 ;Kang et al., 2018 ).

Mock online test
To evaluate the performances of different models without any information leakage from the testing set, a mock online test was performed on the 20 subjects in the true calibration set and enrollment set. Similar to the dataset composition of the online competition, we divided these 20 subjects into a mock enrollment set, a mock calibration set, and a mock testing set.
• The mock enrollment set consisted of epochs from the first sessions of the first 18 subjects (Sub001-Sub018).
• The mock calibration set consisted of epochs from the first sessions of the first 10 subjects (Sub001-Sub010).
• The mock testing set consisted of epochs from the second sessions of 10 subjects (Sub011-Sub020), in which Sub019 and Sub020 were set as intruders. The epochs from the first sessions of Sub019 and Sub020 were not used.
For the mock online test, the performance metrics Accuracy (ACC), Precision, F1-score, Sensitivity (1 − FRR), Specificity (1 − FAR), T_train, T_test, and the final leaderboard score were applied for both identification and verification tasks. As is shown in Fig. 6 , ACC, Precision, F1-score, Sensitivity, and Specificity were adopted to comprehensively evaluate the prediction results (larger is better). T_train and T_test evaluate the time consumption of training and testing (smaller is better). More specifically, T_train represents the time of training the classifiers with the epochs from Sub11-Sub18 in the first session, and T_test represents the time of testing all epochs of Sub11-Sub20 in the second session. Finally, the leaderboard score is the overall accuracy considering both tasks, which decides the ranking of the different baseline models in the mock online test (larger is better).

Online submission
For online submission, an example code was provided with different extracted features (PSD, AR, MFCC) and classifiers (L2 and SVM), in which the optimal threshold was determined by making FAR = FRR on the 20 subjects in the calibration set and enrollment set, whose subject IDs for both sessions were known. An example code and a submission file using the PSD + L2 method were provided on the Kaggle website ( https://www.kaggle.com/competitions/eegbiometric-competition/data ).
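As an illustration of one of the provided feature types, a simple log-PSD feature could be computed per channel with a plain periodogram. The sampling rate, band edge, and windowing below are illustrative assumptions, not the competition's actual preprocessing:

```python
import numpy as np

def psd_features(epoch, fs=250.0, fmax=45.0):
    """Log-power spectral density features for one EEG epoch.

    `epoch` is (n_channels, n_samples); a Hann-windowed periodogram
    per channel (via rFFT) is kept up to `fmax` Hz, log-transformed,
    and the channels are concatenated into one feature vector.
    """
    epoch = np.asarray(epoch, dtype=float)
    n = epoch.shape[1]
    spec = np.fft.rfft(epoch * np.hanning(n), axis=1)
    psd = (np.abs(spec) ** 2) / n
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    band = freqs <= fmax                      # keep the low-frequency band
    return np.log(psd[:, band] + 1e-12).ravel()

# A 64-channel, 1-second epoch at 250 Hz gives a fixed-length vector.
feat = psd_features(np.random.randn(64, 250))
```

The resulting vectors could then be fed to the L2 or one-class SVM baselines described above; MFCC and AR features would simply replace this extraction step.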
To avoid leakage of the subject IDs of the epochs in the testing set, only the leaderboard of the total recognition accuracy, obtained from the Kaggle platform, was used to evaluate performance. We did not set multiple leaderboards for the competition: a unified and comprehensive index makes the competition focused and competitive, so the leaderboard was used to evaluate each contestant's overall performance in both identification and verification tasks and to decide the final winner.

Offline evaluation
Since this section only aims to evaluate the performance degradation resulting from larger population sizes, cross-session, and cross-task tests, only the MFCC feature and the SVM classifier were adopted. The results for the uniqueness of individuals, the robustness to mental state changes, and the permanence over recording sessions of EEG-based biometrics are shown in Fig. 8 (A-C), respectively. ACC is used to evaluate the performance in the identification task; a larger value of ACC indicates a better result. EER, the equal error rate, is used to evaluate the performance in the verification task; a smaller value of EER indicates a better result. Chance levels in Fig. 8 A and C are marked by black curves. The influence of increasing population sizes (A) and cross-task tests (B) were both evaluated on the data from the second session. These results were obtained on the 20 subjects in the enrollment set and calibration set (left: identification results; right: verification results). ACC, Precision, F1-score, Sensitivity (1 − FRR), and Specificity (1 − FAR) were applied to evaluate the recognition performance for both identification and verification tasks (larger is better). T_train and T_test were the time consumption for training and testing (smaller is better). The leaderboard metric is the overall accuracy considering both tasks (larger is better). The numbers in bold indicate the best results.
In terms of the uniqueness of individuals, it can be observed that the accuracy decreased for each task as the population size increased from 2 to 20; the averaged recognition accuracy decreased from 0.81 to 0.34. It should be noted that the chance level differs with the population size. As the population size rises to a larger scale (e.g., the 106 subjects in our testing set), the identification problem becomes even more challenging for machine learning algorithms. Since the verification task is a one-to-one matching problem, increasing the population size does not degrade its performance. Fig. 8 B shows the robustness of EEG-based biometrics to mental state changes, in which the row represents the task used for training and the column the task used for testing. The blue square borders on the diagonal elements in Fig. 8 B indicate the results of the within-task tests, to be compared with the results of the cross-task tests in the off-diagonal elements. In the left panel of Fig. 8 B, the black square border on the off-diagonal element (where EP was used to train and SSSEP to test) shows a higher classification accuracy than the blue square border (where EP was used to train and test) in the third row, 0.44 versus 0.40. There was a similar case for the verification task: when SSAEP was used to train, the best result was obtained when SSSEP was used to test. Apart from these two cases, the best performance in each row was achieved on the diagonal, where the training task and the testing task are the same. This result is consistent for both identification and verification tasks. Due to the unequal number of training samples for each task, it is unfair to compare the error rate across different rows. The within-task and cross-task results were both obtained in cross-session settings.
As for the permanence across recording sessions in Fig. 8 C, the averaged EER increased from 0.16 to 0.31, and the averaged identification accuracy decreased from 0.70 to 0.31.

Mock online test
To make a comprehensive comparison of the different methods without information leakage, we held the mock online test using the first 20 subjects in the enrollment set and calibration set to mimic the online biometric competition. Table 5 shows the model performances on the mock online test for the combinations of extracted features (PSD, MFCC, AR) and classifiers (L2 and SVM). The leaderboard score, which decides the ranking of the six models on the mock leaderboard, is also provided. PSD + L2 provided the best result, with a leaderboard score of 0.721. It can be observed that the precision for the verification task is quite low, which is caused by the class imbalance in the mock online test.
In terms of extracted features, the AR features have the worst performance no matter which classifier is used. The PSD features have better or similar performance across each metric compared with more elaborate features such as MFCC. It is interesting to note that, with the threshold determined by the equal error rate (i.e., FAR = FRR) on an independent training set, the MFCC and AR features behave very differently on the metrics of specificity (1 − FAR) and sensitivity (1 − FRR). In particular, the specificity of the AR features is 0.438 and 0.389 for the L2 and SVM classifiers, but the corresponding sensitivity is 0.797 and 0.880, exhibiting strong instability for AR-based verification systems.
In terms of classifiers, the SVM classifier performed slightly better than the L2 classifier with the PSD and MFCC features. As for time consumption, since the L2-based classifier has to compare each new input with all points in the training set, its evaluation time is longer than that of the SVM classifier. Since the training and testing times depend on the code, the hardware, and the software environment, it is difficult to evaluate them consistently in the online competition.

Online submission
For the example submission, the PSD + L2 method obtained a leaderboard score of 0.4418 on the public leaderboard. It should be noted that a method with a better result on the mock online test may not necessarily lead to a higher score in the online submission, because different samples are used for training and testing. For the same reason, the score may also vary between the public and private leaderboards.

Challenges in EEG-based biometrics competition
In the field of EEG-based biometrics, many studies have mentioned that their major limitation is the lack of a database with a large population size recorded under different conditions and in different sessions ( Wang et al., 2020 ;Chan et al., 2018 ;Delpozo-Banos et al., 2018 ). Based on our proposed M3CV database, we launched the EEG-based biometric competition. In view of the current development of EEG-based biometric technology, this competition poses the following challenges for machine learning algorithms:
1. Challenge with large population size: the population size of the testing database is an important indicator of whether current EEG-based biometric technology can be practically applied. As illustrated in Fig. 8 (A), the identification accuracies decreased greatly with increasing population size. The identification task in the online competition, with a population size of 106, will be even more challenging for machine learning algorithms.
2. Challenge of cross-session tests: a practical biometric system should be robust to cross-session variability. However, as illustrated in Fig. 8 (C), a machine learning model can easily achieve good performance in the within-session test but degrades severely in the cross-session test. Hence, the cross-session test is one big challenge in this competition.
3. Challenge of cross-task tests: similar to the cross-session tests, the robustness of the biometric system to cross-task variability is also a challenge in this competition. For real applications, the biometric system should recognize subjects even when they are in different mental states.
4. Challenge of intruders: whether the biometric system can prevent attacks from intruders outside the enrollment set is determined by the identification performance on an open set ( Kumar et al., 2021 ).
That would lead to very different strategies for the classifier design in terms of machine learning techniques ( Gunther et al., 2017 ).
To the best of our knowledge, no previous study has tested the permanence and robustness of EEG-based biometric features over 100 subjects in an open-set setting. Hence, the M3CV database can facilitate the development of advanced machine learning algorithms for EEG-based biometrics.
It should be mentioned that there are samples from the same subjects in both the testing set and the calibration set, which may make those epochs in the testing set easy to recognize. However, we chose this scheme for constructing the testing and calibration sets for two reasons. First, the large population size is one of the major challenges of this competition. We had to set aside part of the dataset, 20 subjects, for calibration; if these subjects were excluded from the testing set entirely, the number of test subjects would decrease by almost 20%, making the competition less challenging. Second, this type of information leakage may also occur in practical applications: once the biometric model has been trained, users can use the biometric system immediately, and they do not need to re-wear the EEG cap.

From competition to real application
Due to the limitations of the competition setting, there may still be some distance between the competition and real applications.
Firstly, the competition needs a unique ranking. Hence, the leaderboard score was applied to comprehensively evaluate each contestant's performance in the identification and verification tasks and to decide the final winner. However, the evaluation of a system's performance should consider multiple aspects. On the calibration set, accuracy, precision, sensitivity, specificity, F1-score, T_train, and T_test have been proposed as performance metrics for a comprehensive evaluation of the system. For each team's submission, the error rate under different conditions can be calculated offline, but the computational complexity of training and testing is difficult to evaluate consistently. Hence, contestants are required to report their training and testing times together with their hardware and software environments.
Furthermore, the baseline models provided in this study are all one-class classifiers (L2 and one-class SVM). If a new subject is added to the enrollment set, the biometric system only needs to add a model for the newly enrolled subject; it does not need to retrain the whole classifier. We believe a multi-class classifier would achieve better performance in the competition, but a one-class classifier would be more practical in real applications.
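This extensibility argument can be made concrete with a per-subject model registry. The sketch below uses the L2 nearest-neighbour score and hypothetical class/method names, not the actual baseline code:

```python
import numpy as np

class OneClassEnrollment:
    """Per-subject one-class models, extensible without retraining.

    Each subject gets an independent model (here just the stored
    feature matrix scored by negative nearest-neighbour distance);
    enrolling a new subject adds one entry and never touches the
    existing models, unlike a monolithic multi-class classifier.
    """
    def __init__(self):
        self.models = {}

    def enroll(self, subject_id, epochs):
        self.models[subject_id] = np.asarray(epochs, dtype=float)

    def score(self, subject_id, epoch):
        # Higher is better: negative distance to the nearest enrolled epoch.
        d = np.linalg.norm(self.models[subject_id] - epoch, axis=1)
        return -float(np.min(d))

    def identify(self, epoch):
        return max(self.models, key=lambda sid: self.score(sid, epoch))
```

A one-class SVM variant would store a fitted model per subject instead of the raw epochs, with the same enroll-without-retraining property.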

Intra- and inter-subject variability
Intra- and inter-subject variability poses a great challenge for the interpretation and decoding of EEG signals ( Seghier and Price, 2018 ;Wei et al., 2021 ). Traditional ERP analysis and statistical tests are typically used to analyze the group-level commonality of EEG signals, while machine learning techniques are considered more powerful in dealing with intra- and inter-subject variability. The dataset for this competition is not restricted to the study of EEG-based biometric algorithms; it can also be used for other applications, such as cross-session reliability analysis of EEG signals from different tasks and transfer learning in cross-subject and cross-session BCI studies. Furthermore, cross-run and cross-trial variability within subjects is also a meaningful topic to study. The multi-run EEG recordings, which include resting-state, transient-state sensory, and motor execution paradigms, make the M3CV database useful for the studies mentioned above.

Advanced machine learning methods
For the competition, we only provide a few simple models as baselines to help the contestants calibrate their work; no advanced machine learning approaches have been applied in this study. Nevertheless, different strategies have been proposed to deal with intra-subject variability caused by mental state changes and by external factors such as varying electrode placement and electrical impedance. For example, some studies have started to explore evidence of task-independent person-specific signatures in EEG. Kong et al. (2019) claimed that the phase synchronization of EEG signals has task-free biometric properties, but their study lacked cross-session evaluation; Valizadeh et al. (2019) conducted a cross-session evaluation of various connectivity measures, but only 5 participants were measured twice. Besides, some advanced machine learning methods, such as low-rank learning ( Kong et al., 2018 ) and adversarial deep learning ( Ozdenizci et al., 2019 ), have been applied to deal with mental state changes. Also, strategies such as multi-task learning ( Sun, 2008 ) and instance-based learning ( Yang et al., 2022 ) have been proposed to obtain better cross-session results.
Undoubtedly, deep learning is the biggest advance in machine learning in recent years and has also been applied in EEG-based biometrics studies ( Behrouzi and Hatzinakos, 2022 ;Jin et al., 2021 ;Wang et al., 2019 ;Maiorana, 2021b ). However, due to the lack of support from large datasets, the generalizability of deep learning has always been questionable in this field. On the other hand, most newly proposed transfer learning ideas are based on the framework of deep learning. Based on this competition on the M3CV database, it is expected that cross-session and cross-task deep learning methods will emerge. Furthermore, it is expected that these methods will not only work well in the field of EEG-based biometrics but also provide a universal solution for dealing with intra- and inter-subject variability of EEG signals.

Conclusion
The lack of a large-scale comprehensive EEG database containing data recorded from multiple subjects, in multiple sessions, and over multiple tasks limits our understanding of intra- and inter-subject variability and hinders the development of advanced machine learning algorithms. In this study, we established the M3CV database, with 106 subjects, 2 sessions, and 6 tasks, to reveal the commonality and variability of EEG signals across subjects, sessions, and tasks. Based on the M3CV dataset, a machine learning competition on EEG-based biometrics was launched to help this field grow. Beyond advancing the development of machine learning algorithms for EEG decoding (such as biometrics and BCI), we believe the M3CV database can help researchers gain a deeper understanding of the relationships between different types of EEG signals, such as resting-state, motor, sensory-related, and cognitive-related EEG signals.

Data and code availability statement
This is an open-access database. All the data of the Enrollment Set, Calibration Set, and Testing Set, together with the baseline methods and an example submission file, can be downloaded.
A detailed description of the EEG-based biometric competition can be referred to