Implicit relevance feedback from electroencephalography and eye tracking in image search

Objective. Methods from brain–computer interfacing (BCI) open direct access to the mental processes of computer users, which offers particular benefits in comparison to standard methods for inferring user-related information. The signals can be recorded unobtrusively in the background, which circumvents the time-consuming and distracting need for the users to give explicit feedback to questions concerning the individual interest. The obtained implicit information makes it possible to create dynamic user interest profiles in real time that can be taken into account by novel types of adaptive, personalised software. In the present study, the potential of implicit relevance feedback from electroencephalography (EEG) and eye tracking was explored with a demonstrator application that simulated an image search engine. Approach. The participants of the study queried for ambiguous search terms, having in mind one of the two possible interpretations of the respective term. Subsequently, they viewed different images arranged in a grid that were related to the query. The ambiguity of the underspecified search term was resolved with implicit information present in the recorded signals. For this purpose, feature vectors were extracted from the signals and used by multivariate classifiers that estimated the intended interpretation of the ambiguous query. Main result. The intended interpretation was inferred correctly from a combination of EEG and eye tracking signals in 86% of the cases on average. Information provided by the two measurement modalities turned out to be complementary. Significance. It was demonstrated that BCI methods can extract implicit user-related information in a setting of human-computer interaction. Novelties of the study are the implicit online feedback from EEG and eye tracking, the approximation to a realistic use case in a simulation, and the presentation of a large set of photographs that had to be interpreted with respect to the content.

Original content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

Introduction
Signals from the brain may contain implicit information about the users of computers, which can potentially be decoded with methods from brain-computer interfacing (BCI) [1][2][3][4]. Such direct access to the mental processes of the users offers particular benefits in comparison to standard methods for the inference of user-related information, e.g. asking the user for explicit feedback, or observing the user's interaction with the device. Physiological signals can be recorded unobtrusively in the background, and their analysis would circumvent the time-consuming and distracting need for the user to give explicit feedback to questions concerning the individual interest, as well as a possible response bias. The obtained implicit information could augment standard input devices (e.g. computer mouse and keyboard) for the interaction between human and machine.
Research on BCI has shown that humans can volitionally generate 'neural signatures' that can be detected in the electroencephalogram (EEG) with pattern recognition methods in real-time. The extracted information can be translated into a signal serving for control or communication [5][6][7][8][9]. Some BCI methods exploit the phenomenon that stimuli of interest, which are flashed in a stimulus sequence, elicit a detectable attention-related neural response [10][11][12][13]. Combining this BCI technique with eye tracking makes it possible to infer the subjective relevance of the single elements of the visual surrounding [14][15][16][17][18][19][20][21].
The present study demonstrates that it is possible to decode from EEG and eye tracking signals which images were subjectively relevant for the user of a simulated web image search engine (see 'Flickr' or 'Google Images'). The resulting relevance map of the computer screen, where numerous images were displayed at the same time in a grid, made it possible to characterise the current interest of the individual user. Implicit relevance information can be aggregated in dynamic user interest profiles that could be taken into account by novel types of adaptive, personalised software. This potential is explored here with a demonstrator application that infers the user interest online from implicit information hidden in the signals. Novelties of the study are the implicit online feedback from a combination of EEG and eye tracking signals, the approximation to a realistic use case in a simulation, and the presentation of a large set of photographs that had to be interpreted with respect to the content (which goes beyond the mere recognition of previously known simple stimuli that are typical for BCI paradigms based on event-related potentials). The demonstrator is not considered to be a final application in its own right, but may be an important step towards future applications that are informed by the insights gained.
The presented novel approach may show promise in light of the increasing interest of customers and large technology companies in wearable physiological sensors [22] and recently developed, deployable eye tracking and EEG systems, which will make the signal acquisition during daily life more and more feasible, in contrast to the bulky, expensive, inconvenient, and stationary equipment of the past. Examples of the technological innovations are affordable eye trackers [23] and mobile EEG systems [24][25][26] with gel-free [27][28][29][30], miniaturised [31] electrodes that can be placed hardly visible in/on/around the ear [32][33][34][35][36]. Moreover, in-ear headphones with different physiological sensors including EEG, which connect with a smartphone, are under development (e.g. 'The Aware' from 'United Sciences', Atlanta, USA).

Experimental design
The participants of the study queried for ambiguous terms in a simulated image search engine, and viewed different images that were related to the respective search term. During image viewing, the EEG was recorded and the eye movements were tracked. Feature vectors were extracted from the signals in order to train a classifier that estimated the intended interpretation of the ambiguous search term. First, the participants were asked to choose one of two possible interpretations (like 'animal-nature-wildlife' versus 'baseball-ball-sports') of an ambiguous search term (here 'bat'). Then, they viewed 24 square images arranged in a four-times-six grid on the screen that were related to either one or the other meaning of the query (see figure 1; non-square images were cropped). Finally, they were asked to report the number of the pictures belonging to the chosen category and got feedback on whether their response was correct. This procedure was repeated 154 times with different ambiguous search terms. Further examples of the queries are 'jam' with the possible interpretations 'cream-tea-scone' versus 'music-guitar-band', 'deck' ('ship-sea-boat' versus 'skateboard-skate-board'), and 'tick' ('macro-insect-bug' versus 'time-clock-tock'). The participants were instructed to quickly skim the images instead of prioritizing the correct accomplishment of the counting task, assuming that this behaviour is typical when browsing image search results. Before the appearance of the image mosaic, a fixation cross directed the gaze to the upper left corner of the screen. Each picture shown in the image mosaic was picked randomly from one of the two given categories with a probability of p = 11/24 per category. In addition, a few 'odd' pictures, which were not related to the query, were displayed with a probability of p = 2/24. The odd pictures were randomly selected from the remainder of the image collection.
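The composition of a result page described above can be illustrated with a minimal sketch. It assumes that each of the 24 grid positions is drawn independently, with a probability of 11/24 for each of the two categories and 2/24 for an odd picture (the three probabilities sum to one). The names `CATEGORY_WEIGHTS` and `sample_result_page` are hypothetical and not taken from the original software.

```python
import random

# Assumed reading of the design: weights out of 24 for every grid position.
CATEGORY_WEIGHTS = {"target": 11, "non-target": 11, "odd": 2}


def sample_result_page(rng, n_images=24):
    """Draw the category label of every grid position independently."""
    labels = list(CATEGORY_WEIGHTS)
    weights = list(CATEGORY_WEIGHTS.values())
    return rng.choices(labels, weights=weights, k=n_images)


# One simulated result page (four-times-six grid, flattened to a list).
page = sample_result_page(random.Random(0))
counts = {label: page.count(label) for label in CATEGORY_WEIGHTS}
print(counts)
```

Under this reading, a page contains on average 11 target, 11 non-target and 2 odd pictures, but the counts vary from page to page, which is what makes the counting task non-trivial.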

Experimental stimuli
All pictures were obtained from Flickr [37], a service for sharing pictures aimed at amateur and professional photographers. Flickr provides access to a large collection of user annotated pictures via an application programming interface ('API'; [38]). Flickr clusters the images into categories that contain images with similar content according to the user annotations (tags). These clusters can be accessed via the API with the 'cluster search' function. Called with a single search term, the function returns up to four clusters. Each cluster is described by a list of tags and named after the first three tags. Several lists of homonyms (e.g. [39]) served as query terms for the cluster search function, and a collection of 63 110 images related to 936 ambiguous terms was downloaded. Search terms were picked that generated two clusters with more than 18 pictures each that could be clearly associated with the name of the respective cluster. A manual review was necessary, because many pictures were hardly in any relation to the cluster name or query term.
Ambiguity was rarely the result of lexical homonymy, but more often due to underspecified search queries. The search term 'filter' resulted, for instance, in images of coffee filters, in pictures of filter lenses made of glass and in photographs processed by different digital filters. The two categories were illustrated for the participant by the first three tags and one example picture per cluster (see section 2.1 and figure 1). Some categories could be easily distinguished, others not. For instance, the categories 'hyacinth-flower-blue' and 'fruit-green-macro' of the search term 'grape' could be easily discerned. The former consisted of close-up photographs of blue hyacinth flowers in a grape-shaped form, the latter contained grapes and other fruits that were never blue. In contrast, it was difficult to distinguish the categories 'paint-art-painting' and 'makeup-eyeshadow-cosmetics' of the search term 'palette', because the images of both categories depicted colour palettes that contained either make-up or paint for drawing.

Data acquisition
Fourteen persons with normal vision and no report of eye or neurological diseases participated in the experiments. The age of the five female and nine male subjects ranged from 22 to 33 yr with a mean age of 27.7 yr (standard deviation: 2.96). The first subject viewed 123 result pages and all others 154 result pages. One recording session included giving an informed written consent to take part in the study, vision tests for eye dominance, preparation of the sensors, eye tracker calibration and validation, introduction to the task and the main experiment (with a duration of about 1.5 h). The study was approved by the ethics committee of the Department of Psychology and Ergonomics of the Technische Universität Berlin (application number BL_03_20150109).
The participant sat at a distance of 60 cm in front of a computer screen and entered the number of the counted target pictures with a keyboard. Physiological signals were recorded with two amplifiers with 62 active EEG electrodes (BrainAmp, ActiCap, BrainProducts, Munich, Germany; sampling frequency of 1000 Hz) and one active electrode for electrooculography (EOG). An eye tracker (RED 250, SensoMotoric Instruments, Teltow, Germany; sampling frequency of 250 Hz) was attached to the screen. A chin rest gave orientation for a stable position of the head. The screen had a resolution of 1680 pixels × 1050 pixels, a size of 47.2 cm × 29.6 cm and subtended a visual angle of 38.2° in horizontal and 26.3° in vertical direction.
EEG was acquired and analysed with Wyrm and Mushu [40,41]. The synchronously recorded EEG and eye tracking signals were aligned with the help of sync-triggers. Client-side JavaScript Ajax (asynchronous JavaScript and XML) calls sent HTTP requests every 500 ms that in turn called a function on the backend (Flask web server) that elicited the subsequent recording of EEG and eye tracking time-stamps. These time-stamps were used to estimate the parameters of a linear regression function for the mapping of eye-tracker time to EEG time. The EEG data were low-pass filtered with a second-order Chebyshev filter (42 Hz passband, 49 Hz stop band), down-sampled to 100 Hz, re-referenced to the digitally linked mastoids and high-pass filtered with a Butterworth filter at 0.5 Hz. The last 500 ms of each stimulus presentation were not considered for the analysis in order to avoid confounds from the terminating button press. The first three result pages were only used for practice and not for analysis.
The proper calibration of the eye tracker was re-validated at least four times during the experiment and more often if the subject was unsteady and moved a lot. A picture was considered as fixated if the location detected by the online algorithm of the eye tracker was situated within the borders of the picture plus 20 pixels (0.52°). The pictures had a side length of 186 pixels and subtended a visual angle of 5.0°. The size was picked to fit approximately into the area with high foveal resolution. The distance between the pictures was 35 pixels (0.9°) in horizontal direction and 40 pixels (1.1°) in vertical direction.
The stimuli were presented with web technologies in order to explore the compatibility of the BCI-based relevance detector with common software applications, which are not optimised for the presentation of experimental stimuli (frontend: HTML5, CSS, JavaScript, jQuery, Ajax, Bootstrap; backend: Flask). The experiment was interactive and not a static prearranged sequence of stimuli. The user could navigate between different menu pages (e.g. a page for trial selection) and could calibrate and validate the eye tracker inside the browser under the supervision of the experimenter. For demonstration purposes, it was additionally possible to train a classification model with the data recorded so far using a preliminary version of the classification procedure presented in section 2.4.1. This option was given to the participants after the end of the main recording session. Then, a 'feedback mode' could be launched that allowed for an online prediction of the respective category of interest (not described further in this paper).
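The mapping between the two clocks can be sketched as an ordinary least-squares fit on pairs of time-stamps. The following is a minimal illustration with synthetic time-stamps, not the original implementation; the function names and the simulated drift and offset values are assumptions chosen for the example.

```python
def fit_clock_mapping(tracker_times, eeg_times):
    """Least-squares fit of eeg_time = slope * tracker_time + intercept."""
    n = len(tracker_times)
    mean_x = sum(tracker_times) / n
    mean_y = sum(eeg_times) / n
    sxx = sum((x - mean_x) ** 2 for x in tracker_times)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(tracker_times, eeg_times))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def tracker_to_eeg_time(t, slope, intercept):
    """Map an eye-tracker time-stamp onto the EEG time axis."""
    return slope * t + intercept


# Synthetic sync-triggers every 500 ms; drift and offset are made up.
tracker_times = [0.5 * i for i in range(20)]
eeg_times = [1.0001 * t + 12.5 for t in tracker_times]

slope, intercept = fit_clock_mapping(tracker_times, eeg_times)
fixation_onset_eeg = tracker_to_eeg_time(3.25, slope, intercept)
```

Once the two parameters are estimated, every fixation onset reported by the eye tracker can be converted to EEG time before the epochs are cut out.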
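As a plausibility check of the stimulus geometry reported in this section, the visual angle of a single picture can be recomputed from the screen width, the horizontal resolution and the viewing distance. The sketch below assumes a flat screen viewed centrally and uses the standard formula 2·atan(size / (2·distance)); the function name is hypothetical.

```python
import math


def visual_angle_deg(size_px, screen_px, screen_cm, distance_cm):
    """Visual angle of an extent given in pixels, for a centrally viewed flat screen."""
    size_cm = size_px * screen_cm / screen_px  # pixel extent converted to cm
    return math.degrees(2 * math.atan(size_cm / (2 * distance_cm)))


# Picture side length of 186 pixels on the 1680-pixel, 47.2 cm wide screen at 60 cm.
picture_angle = visual_angle_deg(186, 1680, 47.2, 60.0)
print(round(picture_angle, 2))  # close to the reported 5.0 degrees
```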

Data analysis

2.4.1. Prediction of the category of interest.
Every result page contained pictures of the two possible interpretations of the ambiguous search term, which will be referred to as categories. In addition, a few odd pictures were mixed in, which did not belong to either of the two categories. The subjects selected one category of interest before the display of each result page and labelled it as target category by pressing a button. The respective other category was labelled as non-target category. The selected target category of every result page was inferred from feature vectors extracted from the EEG and eye tracking signals in two steps (see figure 2). First, EEG- and eye-tracking-based feature vectors were classified separately (details are set out below). Then, information from EEG and eye tracking was combined by averaging the classifier estimates of the two measurement modalities. The category with the larger average target estimate was considered to be the target category of the respective result page. Binary classifications were performed, because the odd images were not considered. Linear discriminant analysis (LDA) with shrinkage served as classifier, which regularises the estimated covariance matrix and, thereby, reduces the likelihood of overfitting in the case of high-dimensional data and a limited number of samples [42,43]. The optimal shrinkage parameter was calculated analytically using the closed form equation derived in [44], which is computationally less expensive than choosing the optimal parameter by cross-validation. Posterior probabilities were computed from the classifier scores because probabilities are well suited for combining different classifier estimates due to the clear upper and lower limit and the same scale. The predictive performance was assessed in ten-fold cross-validations using the classification accuracy as metric.
For the EEG-based prediction, feature vectors corresponding to each fixated image were classified as being either members of the target or the non-target category. The target probabilities of all feature vectors per category were averaged. The category with the larger average target probability can be assumed to be the selected category of interest of the respective result page. Feature vectors were extracted from the continuous multi-channel EEG signals as follows. One-second-long epochs aligned to the onsets of the longest eye fixations of each image were cut out (fixation-related potentials; 'FRPs') and downsampled to 20 Hz (which reduced the dimensionality of the feature vectors and thereby the risk of overfitting to the training data). The data of all 62 channels were concatenated in one feature vector with 1240 dimensions. The number of samples (longest fixations on either target or non-target images) ranged from 2821 to 5165 per single subject, with slightly unbalanced classes, because target images were fixated more often than non-target images. Note that only fixated images could contribute to the inference. For a performance comparison, the first and the last fixation were also tested as time markers of reference, in addition to the default usage of the longest fixation. Methods for artefact rejection were not applied in order to let the classifier learn to deal with potential artefacts in the signals. From experience, this approach is superior to artefact rejection/correction in laboratory experiments with artefacts that are not too severe. A robust classifier can deal with artefacts during online operation, while artefact rejection would lead to missing data, which is critical in many online applications.
For the eye-tracking-based prediction, feature vectors were extracted separately per category and result page, and were classified with shrinkage LDA. These screen-based eye tracking features comprised the mean dwell time, the median and maximum fixation duration and the average fixation number. The category with the larger target probability was considered to be the selected category of interest of the respective result page. In addition, an alternative classification strategy was examined, which resembled the procedure of the EEG-based prediction: each image was first classified as member of the target or non-target category based on the dwell time on each image (single-image eye tracking features). Then, the single probabilities were averaged per category (aggregated eye tracking probabilities). Shrinkage was not necessary in this case because covariances cannot be considered for this univariate feature.
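The second step, the combination of the two modalities, can be sketched as follows. The example assumes that each modality has already produced one posterior target probability per category for a given result page; the function name and the probability values are made up for illustration and do not come from the recorded data.

```python
def predict_target_category(eeg_probs, eye_probs):
    """Average the target probabilities of both modalities per category
    and return the category with the larger average."""
    combined = {cat: (eeg_probs[cat] + eye_probs[cat]) / 2 for cat in eeg_probs}
    predicted = max(combined, key=combined.get)
    return predicted, combined


# Hypothetical posterior target probabilities for one result page ('bat').
eeg_probs = {"animal-nature-wildlife": 0.58, "baseball-ball-sports": 0.47}
eye_probs = {"animal-nature-wildlife": 0.40, "baseball-ball-sports": 0.66}

predicted, combined = predict_target_category(eeg_probs, eye_probs)
print(predicted)
```

In this made-up example, the EEG-based estimate alone would favour the first category, whereas the averaged estimate favours the second one. Averaging is only meaningful because both classifiers output probabilities on the same scale.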
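The feature extraction can be sketched in a few lines. The example assumes continuous data at 100 Hz, stored as one list of samples per channel, and reduces the rate to 20 Hz by averaging non-overlapping blocks of five samples. The downsampling method is an assumption, since it is not specified here, and the function name is hypothetical.

```python
def extract_frp_features(eeg, onset_sample, fs=100, target_fs=20, epoch_s=1.0):
    """Cut a one-second epoch per channel, downsample it by block averaging
    (an assumed method) and concatenate all channels into one vector."""
    factor = fs // target_fs           # 5 input samples per output sample
    n_samples = int(epoch_s * fs)      # 100 samples per epoch at 100 Hz
    features = []
    for channel in eeg:
        epoch = channel[onset_sample:onset_sample + n_samples]
        if len(epoch) < n_samples:
            raise ValueError("epoch exceeds the end of the recording")
        features.extend(
            sum(epoch[i:i + factor]) / factor
            for i in range(0, n_samples, factor)
        )
    return features


# Dummy recording: 62 channels, 3 s at 100 Hz, constant value per channel.
eeg = [[float(ch)] * 300 for ch in range(62)]
features = extract_frp_features(eeg, onset_sample=50)
print(len(features))  # 62 channels x 20 samples = 1240
```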
In addition, feature vectors extracted from the EOG were classified in order to assess a possible contribution of eye movements to the EEG-based prediction (horizontal eye movements were captured by subtracting channel F10 from channel F9 and vertical eye movements by subtracting channel Fp1 from the signal of the electrode below the eye).

Characteristics of the EEG and eye tracking features.
The characteristics of the EEG epochs, which served as features for the classifications, were assessed separately for the three groups of the corresponding images (targets, non-targets, odds). Discriminative information between target versus non-target EEG epochs, between target versus odd EEG epochs, and between non-target versus odd EEG epochs was inspected for each time point and each EEG channel with the point biserial correlation coefficient, which was squared while retaining the sign (signed r²). The eye movements were characterised with fixation maps of the result pages, and by computing the statistics of the dwell time, of the number of fixations and of the median and maximum fixation duration of target, non-target and odd images.
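For a single channel and time point, the signed squared point biserial correlation can be computed as in the following minimal sketch. It uses the textbook definition r = (mean_a − mean_b) / s · sqrt(n_a · n_b) / n, where s is the standard deviation of the pooled values; the function name and the toy amplitudes are assumptions for illustration.

```python
import math


def signed_r_squared(class_a, class_b):
    """Point biserial correlation between amplitude and class membership,
    squared while retaining the sign."""
    n_a, n_b = len(class_a), len(class_b)
    n = n_a + n_b
    pooled = list(class_a) + list(class_b)
    mean_all = sum(pooled) / n
    std_all = math.sqrt(sum((v - mean_all) ** 2 for v in pooled) / n)
    if std_all == 0:
        return 0.0  # no variance, hence no discriminative information
    mean_a = sum(class_a) / n_a
    mean_b = sum(class_b) / n_b
    r = (mean_a - mean_b) / std_all * math.sqrt(n_a * n_b) / n
    return math.copysign(r * r, r)


# Toy amplitudes of one channel and time point for two image groups.
targets = [2.0, 3.0, 4.0]
non_targets = [0.0, 1.0, 2.0]
value = signed_r_squared(targets, non_targets)
```

A positive value indicates larger amplitudes for the first group, a negative value larger amplitudes for the second group. Applying the function to every channel and time point yields a matrix that can be displayed as a time course per channel or as scalp maps for selected intervals.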

Task performance.
The behavioural performance and the compliance of each participant with the task instructions were assessed by computing the percentage of correct answers, the deviation of the number entered by the subject from the true number of images belonging to the selected category, and the trial durations.

Prediction of the category of interest
The chosen category of interest of the ambiguous search term could be inferred with an accuracy of 85.9% ± 5.8%, when information from EEG and eye tracking was combined (mean ± standard deviation; the results of the single subjects ranged from 73% to 95%; see figure 3). This outcome is significantly better than the chance level of 50% that can be expected from random guessing (p < 0.05, Wilcoxon signed rank test on the population level). When only EEG features were used, the estimates were correct in 76.9% ± 8.7% (p < 0.05; ranging from 56.0% to 90.1%), and in 81.0% ± 6.7% for predictions with screen-based eye tracking features only (p < 0.05; ranging from 67.6% to 92.8%). The complementarity of information provided by the single modalities was evaluated separately for EEG and eye tracking. A subset of the samples was selected where the prediction based on the respective alternative modality was wrong (i.e. the full set of samples was reduced by about 81.0% and 76.9% respectively). The predictive performance on the subset decreased by only about five percentage points in comparison to the full set, and was still significantly better than random (EEG if eye tracking wrong: 71.5%, eye tracking if EEG wrong: 76.8%), which indicates complementarity (Wilcoxon signed rank tests, p ≤ 0.05). EOG features resulted in a predictive performance closer to the chance level of 50% in comparison to the other modalities (see figure 3).
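The complementarity analysis can be sketched as an accuracy computed only on those result pages where the respective other modality failed. The predictions below are toy values chosen for illustration, not data from the study, and the function name is hypothetical.

```python
def accuracy_where_other_failed(pred, other_pred, truth):
    """Accuracy of one modality on the pages misclassified by the other one."""
    subset = [(p, t) for p, o, t in zip(pred, other_pred, truth) if o != t]
    if not subset:
        return None  # the other modality made no mistakes
    return sum(p == t for p, t in subset) / len(subset)


# Toy labels per result page (0 and 1 denote the two categories).
truth = [0, 1, 0, 1, 1, 0, 1, 0]
eeg_pred = [0, 1, 1, 1, 0, 1, 1, 0]   # wrong on pages 2, 4 and 5
eye_pred = [0, 0, 0, 1, 1, 1, 1, 0]   # wrong on pages 1 and 5

eeg_if_eye_wrong = accuracy_where_other_failed(eeg_pred, eye_pred, truth)
eye_if_eeg_wrong = accuracy_where_other_failed(eye_pred, eeg_pred, truth)
```

If both modalities relied on the same information, they would tend to fail on the same pages, and the accuracy on these subsets would drop towards or below the chance level.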
The predictive performance based on EEG features only is shown in figure 4. The category of interest was estimated by aggregating the category membership probability estimates of the single images (see black and grey boxplots in figure 4). The class-wise normalised accuracy, which is insensitive to class imbalances, served as performance metric in the case of the single image classification, because target images were fixated more often than non-target images. Using the longest fixation as time marker of reference (for the feature extraction from the continuously recorded EEG) resulted in a slightly better accuracy in comparison to the usage of the first or the last fixation on an image.
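The class-wise normalised accuracy is understood here as the mean of the per-class accuracies (an assumed but common reading of the term). The following minimal sketch with toy labels shows why it is preferable to the plain accuracy when one class is fixated more often than the other; the function name is hypothetical.

```python
def classwise_normalised_accuracy(pred, truth):
    """Mean of the per-class accuracies, insensitive to class imbalance."""
    per_class = []
    for cls in sorted(set(truth)):
        indices = [i for i, t in enumerate(truth) if t == cls]
        per_class.append(sum(pred[i] == cls for i in indices) / len(indices))
    return sum(per_class) / len(per_class)


# Toy example: six target and two non-target fixations,
# and a degenerate classifier that always answers 'target'.
truth = ["target"] * 6 + ["non-target"] * 2
pred = ["target"] * 8

plain = sum(p == t for p, t in zip(pred, truth)) / len(truth)
normalised = classwise_normalised_accuracy(pred, truth)
```

The degenerate classifier reaches a plain accuracy of 75% on this imbalanced toy set, but only the chance level of 50% in terms of the class-wise normalised accuracy.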
The category of interest could be predicted better than random with screen-based eye tracking features, but not with single-image eye tracking features (also when the resulting probabilities were aggregated per category; see figure 5).

Characteristics of the EEG and eye tracking features
Characteristic neural responses were elicited when either target, non-target or odd images were fixated. An EEG component occurred at about 500 ms to 700 ms after the onset of the longest fixation, and allowed for discriminating targets from non-targets and odds (see figures 6 and 7). Differences between the corresponding EEG epochs were most prominent at central and parietal electrodes. For conciseness, we only display the results of the longest fixation, because the spatial distributions and time courses of the different fixations were very similar (with a small time lag).
Result pages were scanned in a systematic order, starting in the upper left corner (at the position of the fixation cross) and then continuing row by row to the bottom right (see figure 8 for a typical fixation map). A few subjects examined the pages column by column; almost all subjects applied the same search strategy to most of the search screens.
Different fixation patterns were observed for target, non-target and odd pictures (see figure 9). Dwell time, number of fixations and median and maximum fixation duration were significantly larger for targets than for non-targets.

Task performance
Correct answers were given in 45.7% ± 14.2% of the cases (mean ± standard deviation), ranging from 20% to 63%. Participants tended to miss a target rather than counting too many (see figure 10, bottom). The participants spent a median time of about 15 s and rarely more than 20 s on each result page with 24 images. Accordingly, single images were typically viewed for less than one second.

Prediction of the category of interest
Ambiguity in image search was resolved by inferring the intended meaning of the underspecified query term from information present in EEG and/or eye tracking signals. Predicting the category of interest was possible with both measurement modalities. Combining the modalities improved the predictive performance, which suggests that EEG and eye tracking provide complementary information (see section 3.1 and figure 3). The following findings give further evidence for this claim: testing only samples that were misclassified by the respective other modality resulted in an accuracy that was still significantly better than random (see section 3.1). Thus, the classifiers made different mistakes and exploited different information. Moreover, discriminative information present in the fixation-related EEG  epochs was found mainly at central electrodes, which are presumably less confounded by eye movements than electrodes at outer positions (see figure 7; eye movements may have influenced the EEG responses of the single classes; see the topographies in figure 6). Besides, differences in the EEG started at about 500 ms after fixation onset (see figure 7), and, therefore, mainly after the onset of the following eye movement (see figure 9).
Accumulating evidence (classifier probabilities) over several feature vectors considerably improved the EEG-based predictive performance (see section 3.1 and figure 4). Thus, the findings demonstrate that the inherent uncertainty of the single relevance estimates (here: for single images) can be overcome by including information about the membership to a more general category (here: possible interpretations of an ambiguous term). This insight can also be taken into account by future efforts that apply brain-computer interfacing to human-computer interaction.
Figure caption: Statistical differences (signed r² values) between target versus non-target EEG epochs (top), between target versus odd EEG epochs (centre), and between non-target versus odd EEG epochs (bottom). The epochs were aligned to the longest fixations of the images. The channels are ordered from the front to the back and from the left to the right side of the head. Averages over all subjects of the study are shown for all time points (left) and for two selected intervals as scalp maps (right). A significance threshold was not applied in order to keep also subtle differences that can potentially be exploited by the multivariate classifier (see section 2.4.1).
The effect of the number of test samples used for evidence accumulation on the certainty of the final prediction is inspected in more detail in [21]. In addition, the expected generalization performance of a predictive model typically grows with more training samples available, but has to be weighed against the effort and the duration needed to acquire more training samples. This trade-off depends on the specifics of the future application. We therefore decided not to investigate this dependency in more detail for the current study, which investigates merely a demonstrator.
The longest fixations may have resulted in the best EEG-based predictive performance (see section 3.1 and figure 4) because they presumably served for a closer inspection of informative spots of the picture (and were not only intermediate stops on negligible spots).

Characteristics of the EEG and eye tracking features
The fixation of non-target and odd images evoked a late positive complex, in contrast to target images (see section 3.2 and figures 6 and 7). The effect occurred later than can be expected from the EEG component 'P300', which is evoked by the oddball paradigm [45]. The stimuli were photographs that differed not only in low-level features, which could be quickly recognised (e.g. texture, contrast, colour), but also in high-level features, which had to be interpreted (e.g. scene or object depicted). Note that the experimental design does not exactly match the classic oddball paradigm, because the probabilities of target and non-target stimuli were equal. Non-target and odd images did not fit the expectations of the participant, stood out in the 'regular train of standard stimuli' [45], and might be compared to the so-called target stimuli of the classic oddball paradigm. For this reason, the late positive complex may appear to be inverted at first glance (see an alternative explanation below).
Images were often fixated only once (see section 3.2). Thus, the longest fixation was in many cases the first and the last fixation at the same time. The distributions of the eye tracking features corresponding to the three image categories (target, non-target, odd) overlap, but are clearly not the same (see figure 9). Target images were, in general, fixated longer and more frequently than non-target images. Thus, an image was more likely followed by a target image than by a non-target (or odd) image, even though the presentation probability was the same for target and non-target images (see section 2.1). Imbalanced dwell times and transition probabilities may have systematically distorted the event-related potentials at later time points, when the next image was already fixated, and could have resulted in the found late positive complex.

Task performance
The participants complied with the task instructions: the images were skimmed quickly and not inspected thoroughly, as suggested by the comparatively short time spent on each result page and the rather low counting accuracy (see section 3.3 and figure 10).

Conclusion
The study shows that EEG and eye tracking signals can be used to infer the subjective relevance of screen content. This implicit information can be extracted from the signals in the background and makes it possible to create dynamic user interest profiles in real time without explicit relevance feedback from the user. A whole new range of applications can be conceived on the basis of the introduced technologies, even though the purpose of use presented in this paper is rather specific (ambiguities in image search were resolved). Computer users could navigate rapidly through large data sets with little effort using novel interfaces tailored to the implicit relevance feedback from the sensors. Eye tracking is especially promising considering the progress made with regard to technology and cost [23]. Nevertheless, recently developed miniaturised EEG systems with dry electrodes can be set up quickly and hassle-free (see section 1), and a small set of electrodes may be sufficient, because central areas of the scalp were particularly informative (see section 3.2 and figures 6 and 7). While both measurement modalities turned out to be complementary (see sections 3.1 and 4.1), information provided by eye tracking might vanish in a more realistic setting (but is nevertheless required for the feature extraction from the EEG). Discriminative information present in fixation duration and dwell time could be corrupted when the user starts pondering and interrupts the flow of the eye movements. In contrast, spatio-temporal patterns in short fixation-related EEG epochs may remain unaffected. Besides, EEG contained information about the relevance of the single images, which could be used for more fine-grained user interest profiles (see figure 4), in contrast to eye tracking, which allowed only for estimating the relevance of the entire page (see figure 5).