Real-time inference of word relevance from electroencephalogram and eye gaze

Objective. Brain-computer interfaces can potentially map the subjective relevance of the visual surroundings, based on neural activity and eye movements, in order to infer the interest of a person in real-time. Approach. Readers looked for words belonging to one out of five semantic categories, while a stream of words passed at different locations on the screen. It was estimated in real-time which words and thus which semantic category interested each reader based on the electroencephalogram (EEG) and the eye gaze. Main results. Words that were subjectively relevant could be decoded online from the signals. The estimation resulted in an average rank of 1.62 for the category of interest among the five categories after a hundred words had been read. Significance. It was demonstrated that the interest of a reader can be inferred online from EEG and eye tracking signals, which can potentially be used in novel types of adaptive software, which enrich the interaction by adding implicit information about the interest of the user to the explicit interaction. The study is characterised by the following novelties. Interpretation with respect to the word meaning was necessary in contrast to the usual practice in brain-computer interfacing where stimulus recognition is sufficient. The typical counting task was avoided because it would not be sensible for implicit relevance detection. Several words were displayed at the same time, in contrast to the typical sequences of single stimuli. Neural activity was related with eye tracking to the words, which were scanned without restrictions on the eye movements.

applied in settings where semantic content has to be interpreted. Readers looked for words belonging to one out of five semantic categories, while a stream of words passed at different locations on the screen (see figure 1). The words were dynamically replaced (when they had been fixated with the eye gaze) by new words fading in. It was estimated in real-time during the experiment which words and thus which semantic category interested the reader, based on information implicitly contained in the measured EEG and eye tracking signals. The estimates were visualised for demonstration purposes on the edge of the screen, and were updated as soon as a new word had been read. In this way, the reader could learn about the current estimates (for each of the five categories), and could observe how evidence was accumulated over time. Prior to the online inference (see section 2.2), a classifier had to be trained to estimate the word relevance based on the signals (see section 2.1).
In contrast to recent investigations with similar objectives [9][10][11][12], several words were displayed at the same time on the screen. The participants could scan the words without restrictions on the eye movements. Neural activity was related with eye tracking to the respective word looked at, like in studies on reading (e.g. [13][14][15][16][17]) and on visual search, which have shown that sought-for items evoke a detectable neural response when they are fixated with the eye gaze (see [18][19][20][21][22][23][24][25]).
The subjective relevance of the visual surrounding can be mapped with this approach by assigning relevance scores to the single items in view. The obtained information could be aggregated in order to characterise the current interest of the individual person. The resulting dynamic user interest profile would render possible novel types of adaptive software and personalised services, which enrich the interaction between human and computer by adding implicit information to the explicit interaction (see [11][12][21][25][26][27][28][29][30]). Less obtrusive and more convenient EEG systems with sufficient signal quality are prerequisite for the application in practice (see [31][32][33][34][35][36]).

Calibration
Labelled EEG and eye tracking data were recorded in order to train a classifier that could predict the relevance of the single words in the subsequent online phase (see section 2.2). The participants selected one out of five given semantic categories. Subsequently, twenty-two words were drawn randomly from the five categories, with a contribution of 20% per category on average. Words faded in on the screen at predefined positions in random order (see figure 1), and were faded out when they had been fixated with the eye gaze (with a delay of one second).
Examples of the categories and words are:
• Astronomy: orbit, galaxy, universe, meteorite.
The participants were requested to remember the words that belonged to the chosen category. When the participants had looked at all words, they were asked to recall the relevant words from their memory. For this purpose, the words reappeared truncated (to about 40% of the original number of letters) at shuffled positions. Relevant words had to be selected with the mouse. Subsequently, the accuracy of the recall was checked and reported. This procedure helped to involve the participants in the task in order to mimic intrinsic interest in certain words but avoided interference of motor activity during the acquisition of the EEG data.
For the study, a corpus had been generated of seventeen semantic categories with twenty words each, both in English and German depending on the language skills of the participant (see section 3.1). The seventeen categories were: animals, furniture, transportation, body parts, family, food, literature, country names, astronomy, music, finance, buildings and structures, healthcare, sports, time, clothes, and visual art. The calibration phase consisted of seventeen blocks with four repetitions each (see figure 2). At the beginning of each block, a semantic category (out of five options) could be chosen. The categories offered for selection changed during the course of the experiment. It was possible, but not necessarily the case, that each of the seventeen categories was chosen once, because the selection was not restricted. During the recording, it was tracked which category had been chosen by the participant and thus which single words were relevant.
Feature vectors were extracted from the recorded EEG and eye tracking data with the intention to capture processes related to word reading and categorisation (details below). The feature vectors were labelled depending on whether the word fixated at this moment was relevant or irrelevant to the chosen category of interest. Subsequently, a classification function was trained with regularized linear discriminant analysis [37] to discriminate the feature vectors of the 'relevant' and the 'irrelevant' class [1]. The shrinkage parameter was calculated with an analytic method [38,39].
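The training step can be illustrated with a minimal sketch of a two-class linear discriminant whose covariance estimate is shrunk towards a scaled identity matrix. This is not the code used in the study: the function names are illustrative, and the shrinkage strength `gamma` is passed in as a fixed, hypothetical value, whereas the study determined it with an analytic method [38,39] that is not reproduced here.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def pooled_covariance(class_a, class_b):
    """Pooled within-class covariance of two lists of feature vectors."""
    dim = len(class_a[0])
    cov = [[0.0] * dim for _ in range(dim)]
    for rows in (class_a, class_b):
        mu = mean_vector(rows)
        for x in rows:
            d = [xi - mi for xi, mi in zip(x, mu)]
            for i in range(dim):
                for j in range(dim):
                    cov[i][j] += d[i] * d[j]
    dof = len(class_a) + len(class_b) - 2
    return [[c / dof for c in row] for row in cov]


def shrink(cov, gamma):
    """Shrink towards a scaled identity: (1 - gamma) * cov + gamma * nu * I."""
    dim = len(cov)
    nu = sum(cov[i][i] for i in range(dim)) / dim
    return [[(1 - gamma) * cov[i][j] + (gamma * nu if i == j else 0.0)
             for j in range(dim)] for i in range(dim)]


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def train_shrinkage_lda(relevant, irrelevant, gamma=0.1):
    """Return weight vector and bias; gamma is a hypothetical fixed value."""
    mu_r, mu_i = mean_vector(relevant), mean_vector(irrelevant)
    cov = shrink(pooled_covariance(relevant, irrelevant), gamma)
    w = solve(cov, [r - i for r, i in zip(mu_r, mu_i)])
    b = -0.5 * sum(wi * (r + i) for wi, r, i in zip(w, mu_r, mu_i))
    return w, b


def decision_value(w, b, x):
    """Positive values point to the 'relevant' class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b
```

The bias places the decision boundary halfway between the two class means, which assumes equal class priors; with roughly 20% relevant words, a different offset might be preferable in practice.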
2.1.1. Feature extraction. The multi-channel EEG signal was re-referenced to the linked mastoids and lowpass filtered (with a second order Chebyshev filter; 42 Hz pass-band, 49 Hz stop-band). The continuous signal was segmented by extracting the interval from 100 ms to 800 ms after the onset of every eye fixation. Slow fluctuations in the signal were removed by baseline correction (i.e. by subtracting the mean of the signal within the first 50 ms after fixation onset from each epoch). The signal was downsampled from the original 1000 Hz to 20 Hz in order to decrease the dimensionality of the feature vectors to be obtained (14 values per channel). A low dimensionality in comparison to the number of available samples is beneficial for the classification performance, because the risk of overfitting to the training data is reduced [1]. The multi-channel signal was vectorised by concatenating the values measured at the 62 scalp EEG channels at the 14 time points resulting in a 62 × 14 = 868 dimensional vector per epoch. The fixation duration was concatenated as additional feature to the EEG feature vector.
Note that other eye tracking features, e.g. the gaze velocity, could not be exploited, because they are not provided in real-time by the application programming interface of the device, and that two additional EEG electrodes, which were not situated on the scalp and served for re-referencing and electrooculography, were excluded from the set of 64 electrodes in total. The distance between the words and the font size were chosen such that the words had to be fixated for reading, which made it possible to relate the continuous EEG signal to the respective word looked at. However, it cannot be excluded that some words could be recognised also in peripheral vision (see [23]). Eye-movement-related signal components were not removed from the EEG, which makes online operation simpler. Moreover, the employed multivariate methods can project out artefacts of various kinds.
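The epoching steps described above can be sketched as follows, assuming the signal has already been re-referenced and lowpass filtered. The function name is hypothetical. Two details are assumptions that the text does not specify: downsampling is done here by averaging non-overlapping blocks of 50 samples, and the feature vector is ordered channel by channel.

```python
FS = 1000              # original sampling rate in Hz; one sample per millisecond
EPOCH_START_MS = 100   # epoch interval relative to fixation onset
EPOCH_END_MS = 800
BASELINE_MS = 50       # baseline: first 50 ms after fixation onset
BLOCK = FS // 20       # 50 samples are merged into one value at 20 Hz


def extract_features(eeg, onset, fixation_duration_ms):
    """Build one feature vector for one eye fixation.

    eeg: list of channels, each a list of samples at 1000 Hz.
    onset: sample index of the fixation onset.
    """
    features = []
    for channel in eeg:
        baseline = sum(channel[onset:onset + BASELINE_MS]) / BASELINE_MS
        epoch = [v - baseline
                 for v in channel[onset + EPOCH_START_MS:onset + EPOCH_END_MS]]
        # Assumption: downsample by averaging non-overlapping blocks.
        for start in range(0, len(epoch), BLOCK):
            block = epoch[start:start + BLOCK]
            features.append(sum(block) / len(block))
    # The fixation duration is appended as one additional feature.
    features.append(fixation_duration_ms)
    return features
```

With 62 scalp channels, each epoch yields 62 × 14 = 868 EEG values plus the fixation duration, i.e. 869 features in total.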

Online prediction
The subjective relevance of words to a semantic category was inferred online with the previously trained classifier. Again, the participants read words and were asked to look for words related to one out of five semantic categories. The words faded in and out similar to the calibration phase but vacant positions were replaced by new words fading in. In this way, all hundred words of the five involved categories were shown per iteration (see figure 2). Usually, several words were present on the screen at the same time. The classifier predicted online for each fixated word if it was relevant to the category of interest or not, based on the incoming EEG and eye tracking data.
Figure 2. Overview of the procedure during the experiment. During the calibration phase, there were seventeen blocks with different semantic categories of interest. Each block was split into four repetitions. In each repetition, twenty-two words were viewed. After each repetition, words related to the respective category of interest had to be recalled (symbolised by black squares). During the online phase, there were seventeen iterations with different semantic categories of interest. In each iteration, a hundred words were viewed, while feedback on the estimated interest was given in real-time.
The class membership probability estimates for the single words were assigned to the corresponding semantic category and all estimates obtained so far were averaged per category. The resulting five-dimensional vector indicated how likely each category was of interest. The vector was normalised to unit length, determined the font size and luminance of the visualisation of the five category names on the right side of the screen (see figure 1), and was updated when a new word had been fixated with the eye gaze. It was initialised with neutral values for the initial period when only few words had been read and not every category was captured. The participants were informed about the predictive mechanism underlying the adaptive visualisation in order to foster task engagement. The feedback may have driven strategies in the participants that would not have occurred otherwise. However, if there had been no feedback, the participants would have had hardly any intrinsic motivation to look for words 'of interest'. Besides, relevance feedback would also be part of the envisioned novel types of adaptive software (see section 1).
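The accumulation of evidence per category can be sketched as below. The class and its parameters are hypothetical. Two assumptions are made: 'unit length' is taken to mean that the five scores sum to one, which agrees with the chance score of 0.2 mentioned in section 4.2, and categories without any estimate yet receive a neutral placeholder value of 0.5, since the text does not state the neutral values that were actually used.

```python
from collections import defaultdict


class InterestEstimator:
    """Accumulates word-level relevance estimates per semantic category."""

    def __init__(self, categories, neutral=0.5):
        self.categories = list(categories)
        self.neutral = neutral            # assumed placeholder for unseen categories
        self.estimates = defaultdict(list)

    def update(self, category, probability):
        """Store the classifier output for one fixated word."""
        self.estimates[category].append(probability)

    def scores(self):
        """Average per category, then normalise so that the scores sum to one."""
        means = []
        for c in self.categories:
            values = self.estimates[c]
            means.append(sum(values) / len(values) if values else self.neutral)
        total = sum(means)
        if total == 0:
            return [1.0 / len(means)] * len(means)
        return [m / total for m in means]

    def reset(self):
        """Clear all estimates, e.g. at the start of a new iteration."""
        self.estimates.clear()
```

Calling `update` after every fixated word and reading `scores()` afterwards gives the vector that would drive the font size and luminance of the category names.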
A memorisation task like in the calibration phase (see section 2.1) was not included in view of the objective to exploit only implicit information. Otherwise, the detected information may be related to the memorisation and not to the subjective experience of finding a relevant word. The procedure was iterated seventeen times with new combinations of five categories (see figure 2). At the beginning of each iteration, the participants indicated the selected category of interest for later validation, and the previously collected relevance estimates were cleared.
The participants became more familiar with the corpus of words during the course of the experiment. Nevertheless, the participants had to read each word, interpret the word meaning, and decide if the word belonged to the chosen category of interest. In contrast, only a small set of shapes/colours is repeatedly flashed in brain-computer interfacing and stimulus recognition is sufficient (see figure 1 in [4]).
The difference in the stimulus presentation between the calibration and the online phase has the following two reasons. (a) The words were not replaced during the calibration phase in order to limit the number of words to remember and thus the difficulty of the memory task. (b) The replacement during the online phase allowed for accumulating evidence over more data during one task iteration. Note that the spatial resolution of the eye tracker limits the words that can be displayed at the same time. For maximal similarity with the online phase, words faded in and out in the calibration phase as well.
Remark for the sake of completeness: the classifier output was dichotomised to zero or one in the actual visualisation during the experiment. In contrast, class membership probability estimates ranging between zero and one were employed for the figures presented in this paper.

Experimental setup
An apparatus was developed that allowed for making inferences from combined EEG and eye tracking data in real-time and displaying this information in an adaptive graphic visualisation.

2.3.1. Key constituents of the system. The system comprised an EEG device, an eye tracker, two computers and a screen that the test person was looking at (see figure 3). EEG was recorded with 64 active electrodes arranged according to the international 10-20 system (ActiCap, BrainAmp, BrainProducts, Munich, Germany; sampling frequency of 1000 Hz). The ground electrode was placed on the forehead and electrodes at the linked-mastoids served as references. An eye tracker, connected to a computer (PC 1), detected eye fixations in real-time (RED 250, iView X, SensoMotoric Instruments, Teltow, Germany; sampling frequency of 250 Hz; details of the online fixation detection algorithm were not disclosed by the manufacturer). A second computer (PC 2) acquired raw signals from the EEG device (with the software BrainVision Recorder, BrainProducts, Munich, Germany), and obtained preprocessed eye tracking data from PC 1 over network using the iView X API and a custom server written in Python 2.7 (https://python.org). EEG and eye gaze data were then streamed to in-house software written within the framework of the BBCI-Toolbox (https://github.com/bbci/bbci_public) running in Matlab 2014b (MathWorks, Natick, USA). The graphic visualisation was computed with custom software written in Processing 3 (https://processing.org) and displayed on the screen (60 Hz, 1680 × 1050 pixels, 47.2 cm × 29.6 cm).

Synchronisation of EEG and eye tracking signals. When the data acquisition started, the Python server sent a sync-trigger into the EEG signal and transmitted the current time stamp of the eye tracker to the BBCI-Toolbox. These simultaneous markers allowed for synchronising the two measurement modalities.
situated on a word displayed on the screen, according to the received x-y-coordinates. During the calibration phase, the feature vectors were labelled, depending on whether the word belonged to the category of interest or not. Labels and feature vectors were matched according to a unique identifier (ID) of each eye fixation. During the online phase, the graphic visualisation adapted according to the incoming predictions. The architecture of the system is modular and the visualisation module can easily be replaced by other software for novel applications that depend on making real-time inferences from EEG and eye tracking signals. The communication protocol that enables the visualisation module to interact with the other parts of the system offers three types of interactions. The visualisation module can (a) switch between the calibration phase, the online phase and an initial adjustment of the eye tracker, (b) receive relevance estimates from the BBCI-Toolbox, and (c) mark events and stop data acquisition by sending markers into the EEG.
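The use of the simultaneous markers and of the fixation IDs can be sketched as follows. Both functions are hypothetical illustrations and not part of the BBCI-Toolbox or the iView X API. The sketch assumes that the eye tracker time stamps are given in microseconds and that the two clocks do not drift relative to each other during a recording.

```python
def tracker_time_to_eeg_sample(t_tracker_us, sync_tracker_us, sync_eeg_sample, fs=1000):
    """Convert an eye tracker time stamp into an EEG sample index.

    sync_tracker_us: tracker time stamp transmitted together with the sync-trigger.
    sync_eeg_sample: position of the sync-trigger in the EEG recording.
    Assumption: time stamps in microseconds, no clock drift.
    """
    elapsed_s = (t_tracker_us - sync_tracker_us) / 1_000_000
    return sync_eeg_sample + round(elapsed_s * fs)


def match_by_fixation_id(features_by_id, labels_by_id):
    """Pair feature vectors and labels that share the same fixation ID."""
    shared = sorted(features_by_id.keys() & labels_by_id.keys())
    return [(features_by_id[i], labels_by_id[i]) for i in shared]
```

Fixations that have a feature vector but no label (or vice versa) are silently dropped in this sketch; a real system might log them instead.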

Data acquisition
Experiments with three female and twelve male participants with normal or corrected-to-normal vision, no report of eye or neurological diseases and ages ranging from 21 to 40 years (median of 28 years) were conducted while EEG, eye tracking and behavioural data were recorded. Ten people performed the experiment in their mother tongue of German and five people with other first languages accomplished the task in English, which was not their mother tongue. The subjects gave their informed written consent (a) to participate in the experiment and (b) to the publication of the recorded data in anonymous form without personal information. The study was approved by the ethics committee of the Department of Psychology and Ergonomics of the Technische Universität Berlin (reference BL_03_20150109).

Calibration
The participants recalled the words that were relevant to the category of interest with an average accuracy of 80%, ranging from 72% to 84% in the individuals. Classifiers were trained individually for each participant to detect relevant words with EEG and eye tracking data recorded during the calibration phase (see section 2.1). In the subsequent online phase, the classifiers were applied to the data incoming in real-time (see section 2.2).
Additionally, the performance of the classifiers was assessed in ten-fold cross-validations using only the data recorded during the calibration phase. The area under the curve (AUC) of the receiver operating characteristic served as performance metric [40]. An AUC of 0.63 ± 0.01 (mean ± standard error of the mean) was measured for the single-trial classifications with EEG feature vectors from the calibration phase, which was significantly better than the chance level of 0.5 (Z = 3.37, p < 0.05). Adding the fixation duration as extra feature did not improve the results; the AUC remained at the same level (significantly better than chance; Z = 3.37, p < 0.05). When only the fixation duration served as feature, an AUC of 0.51 ± 0.01 was obtained, which was not significantly better than chance (Z = 1.05, p > 0.05, Bonferroni corrected for the three Wilcoxon signed rank tests on the population level).
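For readers unfamiliar with the metric, the AUC can be computed directly from the classifier outputs of the two classes, as in the following minimal sketch (the function name is illustrative). It uses the equivalence between the AUC and the probability that a randomly drawn relevant word receives a higher output than a randomly drawn irrelevant word, with ties counted as one half.

```python
def auc(scores_relevant, scores_irrelevant):
    """Area under the ROC curve via pairwise comparison; ties count one half."""
    wins = 0.0
    for r in scores_relevant:
        for i in scores_irrelevant:
            if r > i:
                wins += 1.0
            elif r == i:
                wins += 0.5
    return wins / (len(scores_relevant) * len(scores_irrelevant))
```

A value of 0.5 corresponds to chance and 1.0 to perfect separation. The pairwise loop is quadratic in the number of epochs; a rank-based implementation would scale better for large data sets.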
Furthermore, the EEG patterns corresponding to relevant and irrelevant words were characterised in order to understand which processes the classification success was based on (see figure 4). The EEG signal that followed the landing of the eye gaze on the words was inspected. The onset of the eye fixation was situated at t = 0 ms. Early components (until about 150 ms) were related to the saccade offset (respectively the fixation onset) and occurred equally in both conditions. Later components differed depending on whether the word was relevant or irrelevant. Relevant words evoked a left lateralised posterior negativity in comparison to irrelevant words and a positivity that shifted from fronto-central to parietal sites on both hemispheres. For this analysis, all EEG epochs of all participants were averaged separately for relevant and irrelevant words (see figure 4, top) and the difference between the two classes was assessed with signed squared biserial correlation coefficients (see figure 4, centre and bottom). Each time point measured at each EEG electrode was treated separately in order to characterise the spatio-temporal evolution. A significance threshold was not applied in order to show also subtle differences that can potentially be detected by a multivariate classifier.
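The class difference measure can be sketched for a single channel and time point as follows. The sketch assumes the point-biserial form of the coefficient, which equals the Pearson correlation between the signal value and the binary class label; whether this matches the exact implementation used for figure 4 is an assumption. The function name is illustrative.

```python
import math


def signed_r_squared(relevant, irrelevant):
    """Signed squared point-biserial correlation for one channel and time point.

    relevant, irrelevant: signal values of the epochs of each class.
    Positive results mean larger values for relevant words.
    """
    n1, n2 = len(relevant), len(irrelevant)
    values = list(relevant) + list(irrelevant)
    mean_all = sum(values) / len(values)
    # Population standard deviation over both classes together.
    std_all = math.sqrt(sum((v - mean_all) ** 2 for v in values) / len(values))
    if std_all == 0:
        return 0.0
    diff = sum(relevant) / n1 - sum(irrelevant) / n2
    r = math.sqrt(n1 * n2) / (n1 + n2) * diff / std_all
    return math.copysign(r * r, r)
```

Applying the function to every channel and time point yields a channels × time matrix that can be plotted like the centre and bottom panels of a figure such as figure 4.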
Relevant words were fixated for about 227.4 ms ± 8.7 ms and irrelevant words for about 216.8 ms ± 7.8 ms during calibration (mean ± standard error of the mean). A paired t-test detected a significant difference between the two classes on the population level; t(14) = 4.3, p < 0.05.
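The test statistic of such a paired comparison can be sketched as below, where each participant contributes one mean fixation duration per class. The function name is illustrative and the sketch returns only the t value and the degrees of freedom; the p value would be looked up in the t distribution, which is not part of the Python standard library.

```python
import math


def paired_t(durations_relevant, durations_irrelevant):
    """Paired t statistic and degrees of freedom for per-participant means."""
    d = [a - b for a, b in zip(durations_relevant, durations_irrelevant)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((v - mean_d) ** 2 for v in d) / (n - 1)   # sample variance
    return mean_d / math.sqrt(var_d / n), n - 1
```

With fifteen participants, the function returns fourteen degrees of freedom, in line with the reported t(14).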

Online prediction
The previously trained classifiers were applied during the online phase to the incoming data and it was predicted for each word if it was relevant to the category of interest or not. The class membership probability estimates were averaged per semantic category and the obtained five-dimensional vector was normalised to unit length (see section 2.2). Figure 5 displays the evolution of the resulting scores corresponding to the category of interest and to the four other categories, which were sorted according to the respective final score (combined EEG and gaze features; average over all participants). With more words being read by the participant, the score of the category of interest grew in comparison to the other categories. Note that the (blue) score curves of the four 'other' categories in figure 5 diverge due to a selection effect: for each iteration, the 'other' categories were ranked according to their final score (other #1, ..., other #4) and the statistics were calculated separately for each of those ranks, across iterations. The ranking allows for comparing the score curve of the category of interest (red) with the best competitor per iteration (top blue curve). Without the ranking, the blue curves would look alike. Figure 6 shows the evolution of the rank of the category of interest among the five semantic categories (combined EEG and gaze features; average over all participants). The category of interest started with an average rank of three and moved towards the top of the ranking with more words being read (note the direction of the y-axis). Table 1 lists the average final rank of the category of interest for each single participant (i.e. when all hundred words per iteration had been read; see section 2.2). The predictions were based on feature vectors including either the EEG data or the fixation duration, or a combination of the two measurement modalities (columns in the table).
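The two ranking operations described above can be sketched as follows. The function names are illustrative. The handling of ties is an assumption: tied categories share the mean of the tied ranks, so that five identical neutral scores give a rank of three, which matches the starting point of the rank curve.

```python
def rank_of_interest(scores, interest_index):
    """Rank of the category of interest; rank 1 is the highest score.

    Assumption: tied categories share the mean of the tied ranks.
    """
    target = scores[interest_index]
    higher = sum(1 for s in scores if s > target)
    tied = sum(1 for s in scores if s == target)   # includes the target itself
    return higher + (tied + 1) / 2


def sort_competitors(scores, interest_index):
    """Scores of the other categories, ordered from best to worst competitor."""
    others = [s for i, s in enumerate(scores) if i != interest_index]
    return sorted(others, reverse=True)
```

Sorting the competitors within every iteration before averaging is what makes the curves of the 'other' categories diverge, even if no category is systematically preferred.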
The final rank was below three in every single participant when only EEG features were used and even smaller when the fixation duration was added as extra feature. Deploying the fixation duration as single feature resulted in a comparably large final rank. On the population level, the final rank was significantly below three for all feature types (Bonferroni corrected for the three Wilcoxon signed rank tests). Figure 7 displays the EEG patterns during the online phase for relevant and irrelevant words. Relevant words evoked a posterior negativity and a central positivity in comparison to irrelevant words, which is similar to the calibration phase (see figure 4). Additionally, a negativity arose on the left hemisphere in the online phase, in contrast to the calibration phase. Relevant words were fixated for about 239.5 ms ± 12.4 ms and irrelevant words for about 208.2 ms ± 7.0 ms during the online phase (mean ± standard error of the mean). The two classes differed significantly on the population level according to a paired t-test; t(14) = 4.7, p < 0.05.

Calibration
All participants complied with the task instructions because they recalled the words that were relevant to the selected semantic category with an accuracy of at least 72% (giving random answers would result in an expected accuracy of about 20% due to the five possible categories). EEG and eye tracking signals recorded during the calibration phase were used to train classifiers (individually for each participant) to discriminate relevant words from irrelevant words.
The trained EEG-based classifiers were able to generalise to unseen data, because the cross-validation results with calibration data were significantly better than can be expected from random guessing (see section 3.1; note that the AUC served as straightforward metric here, in contrast to the online phase where the ranking of the categories provided a more descriptive metric). Classification was apparently possible because relevant words evoked a different neural response than irrelevant words (see section 3.1 and figure 4). In previous research on brain-computer interfacing, the stimuli of interest evoked a similar neural response with a left lateralised negativity and a central positivity (see figure 2, right panel, in [4]), even though the stimuli used in the cited study were not words but geometric shapes flashed on the screen while the eyes did not move. Hence, it was shown with the present investigation that the methods developed for brain-computer interfacing can be employed for inferring the relevance of words under unrestricted viewing conditions.
Concatenating the fixation duration to the feature vectors did not improve the predictive performance, and single-trial classification based on the fixation duration alone did not perform better than chance (when data from the calibration phase were used). Nevertheless, a small but significant difference of the fixation duration between the two classes was found on average (see section 3.1).

Online prediction
It was predicted in real-time which words were relevant for the reader, who was looking for words related to a semantic category of interest. The five categories were ranked according to the normalised five-dimensional average score vector. Perfect prediction of the category of interest would have resulted in a score of 1 and a rank of 1 for the category of interest. If each word were classified randomly as relevant or irrelevant, an average score of 0.2 and an average rank of 3 would be expected. The score and the rank of the category of interest started at this chance level, as expected. With more words being read, the score grew and the rank decreased (see figures 5 and 6).
Apparently, evidence could be accumulated by integrating information over the incoming single predictions.
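The chance levels quoted above can be checked with a small simulation, sketched here under simplifying assumptions: a classifier without any information assigns a uniformly distributed probability estimate to every word, each of the five categories contributes twenty words, and the scores are normalised so that they sum to one. The function name and all parameters are illustrative.

```python
import random


def simulate_chance(n_iterations=2000, n_categories=5, words_per_category=20, seed=0):
    """Average score and rank of one category under uninformative predictions."""
    rng = random.Random(seed)
    rank_sum = 0.0
    score_sum = 0.0
    for _ in range(n_iterations):
        # Random classifier: every word gets an uninformative probability estimate.
        means = [sum(rng.random() for _ in range(words_per_category)) / words_per_category
                 for _ in range(n_categories)]
        total = sum(means)
        scores = [m / total for m in means]
        interest = scores[0]   # category 0 plays the role of the category of interest
        rank_sum += 1 + sum(1 for s in scores[1:] if s > interest)
        score_sum += interest
    return score_sum / n_iterations, rank_sum / n_iterations
```

Because all five categories are exchangeable under random predictions, the expected score is 1/5 = 0.2 and the expected rank is (1 + 2 + 3 + 4 + 5)/5 = 3; the simulated averages should lie close to these values.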
The combination of EEG and fixation duration resulted in the best predictive performance (see table 1). The gaze did not contribute much to the relevance estimate because features from the EEG alone were more informative than when the fixation duration was used as single feature (while it has to be considered that information about the eye gaze is required for the EEG feature extraction, because the EEG signals had to be related to the corresponding words looked at; see section 2.1).
The successful transfer of the classifiers from the calibration phase to the online phase is reflected in the underlying data. The EEG patterns that made it possible to distinguish relevant and irrelevant words evolved similarly in the calibration and in the online phase in the first period after fixation onset (see figures 4 and 7). The later discrepancy is presumably a result of the different tasks, because the relevant words had to be memorised only in the calibration phase. Moreover, during the online phase, fixated words were replaced by new words fading in, the words were already familiar, and relevance feedback was displayed (see sections 2.1 and 2.2). Despite these differences, generalisation from the calibration to the online phase was possible. The discriminative EEG patterns may correspond to two components of the event related potential: the 'P300', which is associated with attention mechanisms (and subsequent memory processing) [41,42], and the 'N400', which is related to language processing [43]. The found fixation durations might be comparable to the average numbers reported in the literature, e.g. of 224 ms ± 25 ms in table 1 in [14].

Conclusion
The study demonstrates that the subjective relevance of words for a reader can be inferred from EEG and eye gaze in real-time. The methods employed are rooted in research on brain-computer interfacing based on event-related potentials, where stimulus recognition is usually sufficient, and where sequences of single stimuli are typically flashed. In contrast, the investigation presented here is characterised by the requirement to interpret words with respect to their semantics. Furthermore, several words were presented at the same time and neural activity was related with eye tracking to the respective word read. The typically employed counting task was avoided because it would not be sensible for implicit relevance detection (see [44]). The task instruction during the online phase was merely to look for (and not to count) words relevant to the category of interest. In this way, the subjective experience of encountering a relevant word should be approximated, which can be vague in comparison to the well-defined counting task.
Task engagement was additionally fostered by explaining the predictive mechanism underlying the adaptive visualisation. The experiment exploits a situation that allows for integrating implicit information across several single words. In a next step, the methods could be applied to a situation where sentences or entire texts are being read, which will entail a number of new challenges for the data analysis, because the single words are syntactically and semantically interdependent in this case. While this study serves as a proof-of-principle, the methods can potentially be used in the future for mapping the subjective relevance of the field of view in novel applications (see section 1). In summary, this study represents a further step towards inferring the interest of a person from information implicitly contained in neurophysiological signals.