Introduction

Eye movements are integral to human behavior, and gaze direction can be used to infer different aspects of cognitive function. As such, studies have been conducted on how infants model and navigate the world and categorize objects (Franchak et al., 2011; Gredebäck et al., 2009; Johnson et al., 2003; Oakes, 2012), how adults acquire information in specific situations (Ballard et al., 1995; Hayhoe, 2004; Hayhoe & Ballard, 2005), and how people interact with each other face-to-face (Hessels, 2020). In sum, there is an abundance of research investigating the role of gaze in different aspects of human behavior (see, e.g., Land & Tatler, 2009; Duchowski, 2017).

Current gaze research typically employs eye trackers using the pupil minus corneal reflection (p-CR) technique (see, e.g., Hooge et al., 2016, for an explanation). Although eye trackers can provide an accurate and reliable estimate of gaze direction, they need to be placed either in front of or on the head of the participant, usually need to be calibrated for each participant, and often only allow for limited head and upper body movement (see Holmqvist et al., 2011; Holmqvist & Andersson, 2017, for further discussion on the different types of p-CR eye trackers and setups). An alternative technique to estimate gaze direction, computer-vision-based gaze estimation, shows promise in mitigating these limitations.

Computer-vision-based gaze estimation refers to methods that estimate gaze direction solely from video recordings or images of the eyes and/or face, without the need for an eye tracker (see Hansen & Ji, 2009, for a detailed discussion of different gaze estimation methods). Computer-vision-based gaze estimation typically employs either a model- or an appearance-based approach (Zhang et al., 2017). Model-based approaches first detect the locations of specific eye-region landmarks and then fit them to a three-dimensional (3D) model of the eyeball to estimate gaze direction (e.g., Baltrušaitis et al., 2018; Park et al., 2018). Model-based approaches may also employ machine learning to improve detection performance. Appearance-based approaches, on the other hand, estimate gaze direction directly from how the eyes and/or face appear in a 2D video image using machine learning, without any underlying model of the eyeball (Tan et al., 2002). A recent development in the field has been the addition of deep learning (Pathirana et al., 2022), which has allowed these methods to improve on previous shortcomings such as low accuracy and poor generalizability to unconstrained environments and across people (Cheng et al., 2021).

Arguments in favor of computer-vision-based gaze estimation methods are typically based on their low initial cost, wide availability, and high scalability in terms of hardware (see, e.g., Krafka et al., 2016; Valliappan et al., 2020). First, cameras with video-recording capabilities can be quite cheap, especially when high spatial and temporal resolution are not required. Second, video cameras are already found nearly everywhere in modern society. They are used not only for consumer products such as smartphones and laptops, but also for surveillance, and are commonplace in many research settings as well. Third, cameras have a wide range of optics available. With a long-focus lens, a camera can be used to zoom in on a scene of interest, even from far away. This alone affords considerable freedom for where a camera can be placed to get a good view of the face. With these advantages, one can conceive many experimental situations where computer-vision-based gaze estimation could be applied, ranging from screen-based experiments to, for example, observations of parent–infant interactions. If the methods for computer-vision-based gaze estimation deliver what they promise, namely cheap, easily attainable, and accurate methods to estimate gaze, they could potentially be applied to the many situations where gaze behavior is of interest, or as Krafka et al. (2016) put it, “to bring the power of eye tracking to everyone” (p. 2183).

For the researcher without a background or experience in computer science, however, applying computer-vision-based gaze estimation poses two major problems. First, the developers of these methods either do not publish their code or publish it only for research-oriented purposes (namely, intended for other researchers in computer science) (Zhang et al., 2019). Second, the evaluations done thus far have largely been concerned with only absolute errors and comparisons with previous methods utilizing similar techniques (e.g., Baltrušaitis et al., 2018; Bao et al., 2021; Chen & Shi, 2018; Cheng et al., 2021; Chong et al., 2018; Fang et al., 2021; Krafka et al., 2016; Park et al., 2018; Wood & Bulling, 2014; Zhang et al., 2015, 2017), although a few notable exceptions with more comprehensive evaluations do exist (see Eschman et al., 2022; Kellnhofer et al., 2019; Valliappan et al., 2020; Zhang et al., 2019). In comparison, eye trackers bought from manufacturers are generally well documented. In addition, for many p-CR eye trackers, comprehensive performance evaluations, even with regard to several specific populations, are already readily available to researchers (e.g., Dalrymple et al., 2018; De Kloe et al., 2021; Hessels & Hooge, 2019; Morgante et al., 2012; Niehorster et al., 2020; and Table 2 from Holmqvist et al., 2022). When contrasted with what is available for p-CR eye tracking, the evaluations of computer-vision-based gaze estimation are severely lacking.

The goal of this paper is thus threefold. First, we introduce how computer-vision-based gaze estimation can be applied from the perspective of an experimental psychologist. Here, we define an experimental psychologist as someone with extensive experience conducting psychological experiments, but with very little or no experience in computer science. As it currently stands, it is not clear which methods are easily accessible and how they can be applied. Second, we provide a validation of the methods that can be used by researchers without expertise or a background in computer science. Third, we provide useful information for the developers of such methods so that they can become more aware of how their techniques can be improved and validated specifically for use in psychological research.

We evaluated computer-vision-based gaze estimation methods in two separate experiments. We first narrowed down the list of all the computer-vision-based gaze estimation methods we cite in this paper to those that (1) do not require calibration, (2) can be downloaded and installed without any required background knowledge in applying such methods, and (3) have proper documentation. This led to a final sample of two methods: those provided in the OpenFace (Baltrušaitis et al., 2018) and OpenGaze (Zhang et al., 2019) toolkits. In the first experiment, we assessed the gaze estimation performance of the OpenFace and OpenGaze toolkits under optimal conditions, meaning experienced adult participants, good camera optics, and a task in which the head of the participant remains as stationary as possible without being fixed in place with, e.g., a chin or forehead rest. We consider the methods viable under optimal conditions if they can be used to estimate the gaze position of the participant with good data quality. In the second experiment, we evaluated the gaze estimation performance of OpenFace in an experiment with infant participants. We were not able to assess the performance of OpenGaze in this experiment for several reasons, which we outline in the discussion. We consider OpenFace viable for infant gaze research if we can derive meaningful gaze-based measures that closely match measures based on manual coding of gaze direction. The results and considerations of these two experiments will allow researchers interested in using computer-vision-based gaze estimation in fields such as psychology and educational science to better judge whether the tested methods might be suitable for their specific experiments.

Experiment #1: Optimal conditions

Method

Participants

A total of nine colleagues from the department of Experimental Psychology at Utrecht University participated in the experiment. The study was approved by the Ethics Committee of the Faculty of Social and Behavioural Sciences, Utrecht University (protocol number 21-0367).

Setup and stimuli

A Basler ace acA2500-60um industrial camera equipped with a 25 mm Tamron C-Mount lens, recording at a resolution of 2592 by 2048 pixels and a frequency of 60 Hz, was used to record the participant’s face. It was placed on a table in front of the participant, filming from below at approximately 85 cm from the lens to the eyes. A 60 × 34 cm ASUS ROG computer screen with a resolution of 2560 by 1440 pixels was placed behind the camera at approximately 95 cm from the center of the screen to the eyes of the participant. There were some minor variations in these distances due to differences in participant height. MATLAB was used to communicate with the camera (Hooge et al., 2021), and the Psychophysics Toolbox extensions (Brainard, 1997; Kleiner et al., 2007; Pelli, 1997) were used to present stimuli on the screen. The stimuli consisted of nine points with a distance of 18.6 cm between adjacent points on the horizontal axis and 11.6 cm between adjacent points on the vertical axis. Assuming the fixed positions described above, these distances correspond to a difference in visual angle of roughly 11 and 7 degrees between adjacent points on the horizontal and vertical axis of the stimulus grid, respectively. The configuration and specific measures of the setup are illustrated in Fig. 1.
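For readers adapting these distances to their own setup, the angular separations quoted above follow from simple trigonometry. A minimal sketch in Python (the distances are those of our setup; the function name is ours, and one of the two points is treated as lying straight ahead):

```python
import math

def visual_angle(separation_cm: float, viewing_distance_cm: float) -> float:
    """Approximate visual angle (degrees) between two points separated by
    `separation_cm`, viewed from `viewing_distance_cm`, treating one of the
    points as lying straight ahead of the observer."""
    return math.degrees(math.atan2(separation_cm, viewing_distance_cm))

# Spacing of the stimulus grid (18.6 cm horizontal, 11.6 cm vertical) viewed
# from roughly 95 cm (screen center to eyes):
print(visual_angle(18.6, 95))  # ~11.1 degrees
print(visual_angle(11.6, 95))  # ~7.0 degrees
```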

Fig. 1

An illustration of the setup and stimuli for the first experiment. The Basler camera and computer screen to display the stimuli were placed in front of the participant at the distances specified in the figure. The participants were shown a grid of nine stimulus points one at a time in a left-to-right, top-to-bottom fashion. Assuming these fixed distances, the stimulus points on the grid were separated by a visual angle of roughly 11 degrees between adjacent points on the horizontal axis and 7 degrees between adjacent points on the vertical axis

Procedure

Participants were first instructed to seat themselves in a chair placed in front of the computer screen and camera. The experimenter then adjusted the camera angle to make sure that the whole face of the participant was visible in the camera image. Next, the experimenter started the recording script in MATLAB. The first stimulus point was then presented in the top left corner. Participants were instructed to look at a set of points with only their eyes, starting from the first point already presented on the screen, while keeping their head still and oriented toward the center of the screen, and to press the space key on the keyboard. This is hereafter referred to as the eyes-only condition. When the space key was pressed, the computer played a beep to signal that recording had started, then recorded a video for three seconds, followed by another beep to signal that recording had finished, after which the second point (i.e., the center top point) was presented on the screen. Each participant first fixated on all nine stimulus points in this fashion. After this, the experimenter instructed the participants to repeat the same procedure once more. Next, the experimenter instructed the participant to again fixate on the points, but this time with both the eyes and the head, hereafter referred to as the head-and-eyes condition. The procedure was similarly completed twice for all nine points. As each condition was done twice per participant, the experiment resulted in two recordings per participant per condition. With nine participants, this totaled 18 recordings per condition.

Gaze data

As mentioned in the introduction, the only computer-vision-based gaze estimation methods we found to fulfill our selection criteria were those provided in the OpenFace (Baltrušaitis et al., 2018) and OpenGaze (Zhang et al., 2019) toolkits. We further considered Gaze360 (Kellnhofer et al., 2019), for which the developers also provide step-by-step instructions for installation and use of a (beta version of a) demo code. Unfortunately, we were not able to successfully install the software even when following the step-by-step instructions and testing on multiple systems. Gaze360 was therefore not included in our analyses. OpenFace reports eye gaze direction estimates in world coordinates averaged for both eyes. In the OpenFace wiki, it is stated that looking straight forward should result in a gaze angle of roughly zero degrees on both axes, while looking from left to right should result in a change in gaze angle from positive to negative on the horizontal axis, and looking from top to bottom should result in a change in gaze angle from negative to positive on the vertical axis. OpenGaze, on the other hand, reports estimated gaze position in normalized coordinates. For further explanation of the different coordinate systems used for gaze estimation, we refer the reader to Pathirana et al. (2022).
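To illustrate how this output can be handled in practice, the sketch below loads an OpenFace output file and applies the sign convention used in our figures. The column names follow the OpenFace output format wiki as we used it (gaze angles in radians, averaged over both eyes); they should be verified against the installed version, and the file name is hypothetical:

```python
import numpy as np
import pandas as pd

# Load an OpenFace output file (one row per video frame). Verify the column
# names against the output format wiki of the installed OpenFace version.
df = pd.read_csv("participant01.csv")
df.columns = df.columns.str.strip()  # some OpenFace versions pad names with spaces

# Convert radians to degrees and flip the sign so that looks from left to
# right and from down to up produce increasingly positive angles, matching
# the convention used in our figures.
gaze_x_deg = -np.degrees(df["gaze_angle_x"])
gaze_y_deg = -np.degrees(df["gaze_angle_y"])
```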

Results

Our goal was to assess the gaze estimation performance of the computer-vision-based gaze estimation methods that met our criteria. We examined the performance of the methods by computing three commonly used measures to assess eye-tracking data quality: accuracy, precision, and data loss (Holmqvist et al., 2022). To note, we ran all analyses separately with data from only the first recording from all participants, with data from only the second recording from all participants, and with data from both recordings from all participants. Varying the number of recordings used per participant did not change our conclusions. Therefore, we used all 18 recordings per condition for the reported analyses.

Were the gaze estimates sufficiently accurate?

In eye-tracking research, accuracy is used to represent how close the estimated gaze position is to the true gaze position. To determine the accuracy of the gaze estimates in the present study, we would need to know the true (physical) position of each stimulus point in the world as well as the estimated gaze position for when each participant fixated on each point in the same coordinate system. The coordinate system of neither method was sufficiently clear for us to be able to do this. In the OpenFace output format wiki, it is stated that the estimated gaze angles represent eye gaze direction in world coordinates, converted into a format that is easier to use than gaze vectors. Despite a comprehensive output format wiki, examining absolute accuracy was not possible, as we were not able to determine what the origin of the coordinate system was and could therefore not compare the estimated gaze angles directly to the physical gaze angles observed in the setup. For OpenGaze, we were not able to find out what the assumed geometry of the setup was or how a custom setup could be defined. Because of this, it was not possible to directly relate the normalized coordinates to our stimulus grid. Thus, to answer our question, rather than computing absolute accuracy, we examined it on the level of the configuration and scale of the nine fixations as represented in the gaze estimates of both methods.

For the methods to be accurate on the level of the configuration, we would expect to be able to reliably distinguish between the nine fixations when plotting the horizontal and vertical gaze estimates of a recording against each other. Figure 2A represents our stimulus grid with the nine stimulus points on which participants fixated. Figure 2B represents what we would expect to see in the gaze estimates of an ideal recording, meaning nine clusters representing the nine fixations in a similar configuration as in our stimulus grid. Figure 2C–F illustrate the best and worst recordings as visually judged by the researchers. The best and worst recordings were defined as most and least visually similar to the ideal recording (Fig. 2B) in terms of separation between fixations, spread within fixations, and the orientation of the complete configuration of fixations. As Fig. 2C–F illustrate, it was possible to distinguish between the nine fixations for some but not all recordings. In the worst case, it was not possible to distinguish between any fixations (Fig. 2C). In the best case, however, the nine fixations could clearly be seen in the plotted gaze estimates. Notably, the best observed configuration for both conditions was with OpenGaze estimates (Fig. 2D and F), while the worst observed configuration for both conditions was with OpenFace estimates (Fig. 2C and E).

Fig. 2

The expected, best, and worst recordings. The different colors represent the three rows of the grid, while the shades of the colors represent the three columns. Estimates for OpenFace were in degrees of gaze angle, while estimates for OpenGaze were in normalized coordinates (NC). A The stimulus grid presented to the participants. B The ideal recording with little variation within fixations and a similar scale to the stimulus grid. C The worst recording for the eyes-only condition across both methods (an OpenFace estimate). D The best recording for the eyes-only condition across both methods (an OpenGaze estimate). E The worst recording for the head-and-eyes condition across methods (an OpenFace estimate). F The best recording for the head-and-eyes condition across methods (an OpenGaze estimate). Note that the sign of the gaze estimates for OpenFace has been flipped for increased interpretability. Looks from left to right and down to up result in a change in gaze angle from negative to positive

For OpenFace, three columns of fixations could be seen in the gaze estimates for 61% (11) of the 18 recordings for the eyes-only condition and in 56% (10) of the recordings for the head-and-eyes condition. Three rows of fixations could only be seen in 28% (5) of the 18 recordings for the eyes-only condition and for 0% (0) of the recordings in the head-and-eyes condition. OpenGaze performed better, as three columns could be seen in 89% (16) and 72% (13) of the 18 recordings for the eyes-only and head-and-eyes conditions, respectively. Three rows could be seen in 50% (9) of 18 recordings for the eyes-only condition and in 44% (8) for the head-and-eyes condition. For both methods, when it was possible to distinguish between three columns of fixations, it was also possible to distinguish between three rows of fixations, meaning that the complete grid of stimulus points was clearly represented. These results are illustrated in Fig. 3.

Fig. 3

The three rows and columns of fixations as observed in the gaze estimates. The blue bars stand for the head-and-eyes condition while the red bars stand for the eyes-only condition. Values on the horizontal axis represent the percentage of recordings per condition in which the three rows or three columns of fixations could clearly be distinguished in the gaze estimates by the two methods

Next, we examined the accuracy of the methods in terms of scale. For the scale of the estimates to be accurate, we would expect the distances between fixations to resemble the distances between points on the stimulus grid. Examining the scale was possible for OpenFace estimates, as we could directly compare angular separation between fixations to angular separation between stimulus points. Points on the stimulus grid were spaced 11 degrees horizontally and 7 degrees vertically with respect to the participant’s eyes. Therefore, in an ideal recording (Fig. 2B), we would expect to see an 11-degree difference in gaze angle between adjacent horizontal fixations and a 7-degree difference in gaze angle between adjacent vertical fixations. To determine whether this was the case, we first computed the median horizontal and vertical OpenFace gaze angle for each fixation. We then calculated the median angular distance between all adjacent fixations for each axis and divided it by the corresponding angular distance between the points on the stimulus grid (either 11 or 7 degrees, depending on the axis). The ratio obtained represents the accuracy of the estimates provided by OpenFace in terms of scale. A ratio of exactly 1 indicates an accurate representation of the stimulus grid, while ratios greater than 1 indicate overestimation and ratios less than 1 indicate underestimation. We found that when fixating with only the eyes, the distance between adjacent fixations was less than the distance between points on the stimulus grid (see Fig. 4A). The ratios for both axes ranged from roughly 0.3 to 0.7, and the median ratio was lower for the vertical axis. When fixating with both the head and eyes, the ratios varied over a larger range, from roughly 0.5 to 1 for the horizontal axis and from roughly 0.3 to 1 for the vertical axis, and the difference in median ratio between the axes was greater. Similar comparisons were not possible for OpenGaze estimates, as they were in normalized coordinates and could not be directly related to the angular separation of the stimulus grid. We elaborate on this in more detail in the general discussion. Nevertheless, it was still possible to examine whether there was a difference in the accuracy of the estimates in terms of scale between the two axes and two conditions. To this end, we performed the same calculations with the OpenGaze estimates. These results are shown in Fig. 4B. Based on the plotted data, it appears that estimates by OpenGaze were more consistent across axes and across conditions than their OpenFace counterparts; OpenGaze seems to report gaze estimates on a similar scale regardless of whether one fixates with only the eyes or with both the head and the eyes for both the horizontal and vertical axis.
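For concreteness, the scale ratio described above can be computed as in the following sketch, which assumes the nine per-fixation median gaze angles have already been arranged in a 3 × 3 array matching the stimulus grid (the example values are hypothetical):

```python
import numpy as np

def scale_ratio(median_angles: np.ndarray, axis: int, grid_spacing_deg: float) -> float:
    """Accuracy in terms of scale for one axis of one recording.

    `median_angles` is a 3x3 array of per-fixation median gaze angles
    (degrees), arranged as on the stimulus grid (rows x columns).
    Use axis=1 with the horizontal gaze component and the 11-degree spacing,
    and axis=0 with the vertical component and the 7-degree spacing.
    A ratio of 1 means the fixations reproduce the grid spacing exactly.
    """
    adjacent_diffs = np.abs(np.diff(median_angles, axis=axis))
    return float(np.median(adjacent_diffs) / grid_spacing_deg)

# Hypothetical per-fixation median horizontal gaze angles (degrees):
horizontal_medians = np.array([[-7.0, 0.1, 6.8],
                               [-6.5, 0.0, 6.6],
                               [-7.2, -0.2, 6.9]])
print(scale_ratio(horizontal_medians, axis=1, grid_spacing_deg=11.0))  # ~0.62
```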

Fig. 4

Distance between adjacent fixations divided by the physical distance between stimulus points on the stimulus grid for A OpenFace and B OpenGaze estimates. For panel A, the vertical axis of the plot indicates the ratio of the median angular difference between adjacent fixations to the angular difference between adjacent points on the stimulus grid. The blue circles represent values for the horizontal axis, while the red circles indicate the same for the vertical axis. The horizontal white lines represent the median ratios. The gray connecting lines indicate that the values belong to the same recording. A ratio of 1 indicates that the fixations accurately represented the stimulus grid in terms of scale, while a ratio less than or greater than 1 indicates that the distance between adjacent fixations differed from the distance between points on the stimulus grid. Panel B represents the same for estimates provided by OpenGaze. As the OpenGaze estimates could not be related to the stimulus grid for reasons discussed in the text, values on the vertical axis of the plot are not meaningful and have been omitted. Panel B should only be examined in terms of the differences between the two axes and two conditions

Taken together, the estimates for both OpenFace and OpenGaze were more accurate in the horizontal direction than in the vertical direction. It was possible to distinguish between three columns of fixations for over half of the recordings for OpenFace and for nearly all recordings for OpenGaze. For the vertical direction, there was more overlap between fixations for both OpenFace and OpenGaze, hence making it more difficult to distinguish three rows of fixations in the plotted data. This can be at least partly explained by the smaller difference in gaze angle between the stimulus points in the vertical direction. Importantly, there was high variability in the accuracy of the estimates in terms of scale between the conditions for OpenFace but not for OpenGaze, indicating that for OpenFace, the gaze estimates were affected by head rotation, while OpenGaze produced similar estimates regardless of whether one fixated with the eyes only or with both the head and the eyes. OpenFace underestimated gaze angles for both conditions, while for OpenGaze it was not possible to determine whether there was underestimation or overestimation. On average, the estimates were accurate enough to distinguish between fixations to the nine stimulus points in the horizontal direction for OpenGaze but not for OpenFace. As variability between the two axes and conditions for OpenGaze was low, we conclude it to also be sufficiently accurate on the vertical axis, as long as there is at least a difference of 11 degrees of gaze angle between stimulus locations, as was the case for the horizontal axis. Estimates by OpenFace were much more variable and not accurate enough in the vertical direction.

Were the gaze estimates sufficiently precise?

Precision refers to how consistent the reported gaze estimates are. If the estimates by the methods are precise, we would expect that when a participant fixates on a point presented on a screen, the reported gaze estimates over the duration of the fixation should be close to each other. In eye-tracking research, precision is typically operationalized by either the standard deviation or the sample-to-sample root mean square of the gaze signal (Holmqvist et al., 2012). A reasonably precise recording would look like the ideal recording presented in Fig. 2B, where there is little variation in the gaze estimates within fixations. As Fig. 2D and F illustrate, precision was high for the best recordings for both conditions, as variation within fixations was low. In the worst recordings (Fig. 2C and E), there was considerable variation within fixations. As the worst recordings illustrate, low accuracy combined with low precision can lead to major overlap between fixations.

To examine precision relative to the estimated grid of fixations, we divided the median spread (i.e., the standard deviation of the gaze signal) within fixations on each axis by the median distance between fixations on that axis for each recording. A resulting ratio close to zero indicates that distinguishing between fixations on that axis was generally possible. Conversely, a higher ratio means that it is difficult to distinguish between fixations on that axis and overlap is likely. For the horizontal axis (Fig. 5A), the ratios for both methods and both conditions were similarly low (except for two OpenFace recordings, both belonging to the same participant), suggesting that there was little overlap between adjacent fixations. Ratios for the vertical axis were higher (Fig. 5B), indicating more overlap between adjacent fixations. This was expected because the stimulus points on the vertical axis were closer to each other than the stimulus points on the horizontal axis. There was substantially more variability in the ratios between the methods for the vertical axis; ratios for OpenFace were much higher than those for OpenGaze.
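The sketch below shows one way to compute this spread-to-separation ratio for a single axis of a single recording. It assumes the gaze samples have already been split per fixation and, as a simplification, identifies adjacent fixations by sorting the per-fixation medians along the axis:

```python
import numpy as np

def spread_to_separation(gaze_per_fixation: list[np.ndarray]) -> float:
    """Median within-fixation spread (SD) divided by the median separation
    between adjacent fixations, for one axis of one recording.

    `gaze_per_fixation` holds one array of gaze samples per fixation.
    Adjacent fixations are identified here by sorting the per-fixation
    medians along the axis (a simplifying assumption of this sketch).
    Values near zero mean fixations are easy to tell apart; larger values
    mean neighboring fixations are likely to overlap.
    """
    spreads = [np.std(samples) for samples in gaze_per_fixation]
    medians = np.sort([np.median(samples) for samples in gaze_per_fixation])
    separations = np.diff(medians)
    return float(np.median(spreads) / np.median(separations))
```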

Fig. 5

Spread within fixations divided by separation between fixations for A the horizontal axis and B vertical axis. The blue and red circles stand for individual recordings in the head-and-eyes and eyes-only conditions, respectively. The connecting lines indicate that the ratio was computed using estimates derived from the same recording. The horizontal axis of the plot indicates the method used to process the recording, while the vertical axis represents the ratio of spread to separation for the fixations. Values close to zero indicate that it is possible to distinguish that the observer is fixating on a specific point on the grid, while higher values indicate the opposite

Based on visual examination of the spread within fixations as well as the low variation in the ratios of spread to separation between the two conditions, we conclude that average precision in the horizontal direction was sufficient for both methods. For OpenGaze, precision in the vertical direction was similar to precision in the horizontal direction, and variability between conditions was low for precision on both axes. For OpenFace, precision in the vertical direction was substantially lower than precision in the horizontal direction, and variability in the ratios of spread to separation between conditions was high. Thus, for estimates in the vertical direction, precision for OpenGaze was sufficient but the precision for OpenFace was not.

Were there any lost data?

Another important aspect of data quality in eye-tracking research is data loss. Data loss refers to how much missing data there were for a given recording. Data loss could be due to error in the recording tool (e.g., when the participant blinks or when an eye tracker is not able to find the eyes even when they are within the area the eye tracker is capable of measuring), or due to the behavior of the participant (e.g., when the eyes or the face of the participant has moved outside the area the eye tracker is able to measure). The participants in our experiment were always facing the camera and a video of their face was only being recorded when they were fixating on the points, meaning that the eyes and face were always visible to the camera. Any data loss should therefore be due solely to recording error or to the eyes of the participant not being visible to the camera for some other reason (e.g., when the participant blinks). We computed the percentage of data loss, defined as the proportion of samples with a missing gaze estimate, for all 36 recordings (18 per condition). There were no lost data in any of the recordings for either method. Upon examining the recorded videos, however, we found that a few participants blinked on a few occasions when fixating on the points. Notably, both OpenFace and OpenGaze reported gaze estimates even for periods when the eyes of the participant were fully closed in the video recording. When the eyes were closed, both methods showed a sharp change in the gaze estimate on the vertical axis (see Fig. 6). It thus appears that with a camera filming the face from below under optimal conditions, data loss is not a problem for either method. Blinks can potentially be identified by sharp changes in the velocity of the vertical component of the gaze signal.
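For illustration, the sketch below quantifies data loss and flags candidate blink periods from the vertical gaze signal. The velocity measure and threshold are illustrative choices of ours, not a procedure used in this paper:

```python
import numpy as np

def data_loss_percentage(gaze: np.ndarray) -> float:
    """Percentage of samples with a missing (NaN) gaze estimate."""
    return 100.0 * np.mean(np.isnan(gaze))

def blink_candidates(vertical_gaze: np.ndarray, fs: float, threshold: float = 100.0) -> np.ndarray:
    """Indices where the sample-to-sample vertical velocity exceeds `threshold`
    (in signal units per second). Both the velocity measure and the default
    threshold are illustrative choices, not the procedure used in this paper.
    """
    velocity = np.abs(np.diff(vertical_gaze)) * fs
    return np.where(velocity > threshold)[0]
```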

Fig. 6

An example of a blink represented in A OpenFace and B OpenGaze estimates. The blue and red lines represent the horizontal and vertical gaze signals, respectively. The horizontal and vertical signals were repositioned to be centered around roughly zero for plotting purposes. The horizontal axis of the plot denotes time in seconds, and the vertical axis denotes estimated gaze angle in either degrees or normalized coordinates (NC). Note that the sign of the gaze estimates for OpenFace has been flipped for increased interpretability. Looks from left to right and down to up result in a change in gaze angle from negative to positive

Interim summary

Participants fixated on nine stimulus points presented on a computer screen while a camera recorded a video of their face. The videos were processed using the OpenFace and OpenGaze toolkits. We investigated the accuracy, precision, and percentage of data loss of the gaze estimates. In summary, the accuracy and precision of the estimates by OpenGaze were sufficient to distinguish between fixations on the horizontal axis for nearly all recordings. The estimates by OpenFace, on the other hand, were only accurate enough to distinguish between the points for a bit over half of the recordings. Importantly, OpenGaze estimates were resilient to changes in looking behavior (i.e., fixating with only the eyes vs. fixating with both the head and eyes) while OpenFace was not. Data loss was sufficiently low for both methods. We conclude that OpenFace and OpenGaze can potentially be used in sparse environments (e.g., three relatively large areas of interest [AOIs]) with horizontally separated stimuli. OpenGaze can possibly also be used in less sparse environments and for both horizontally and vertically separated stimuli, with stimuli separated by at least 11 degrees of gaze angle with respect to the participant’s eyes.

Experiment #2: Infant participants

In the second experiment, we examined whether computer-vision-based gaze estimation can be used to derive meaningful dwell-based measures from the data concerning infant participants. For this experiment, only OpenFace gaze estimates were evaluated. We address the reasons for this in the general discussion.

Method

Participants

The sample consisted of 44 approximately 10-month-old infants, who were part of a larger longitudinal study on infant siblings of children with autism (Early Autism Sweden, EASE, http://www.smasyskon.se). Participants were recruited mainly from the greater Stockholm area, either through the project’s website, advertisements, and clinical units (infants with an older sibling with autism) or through birth records and advertisements (infants with a typically developing older sibling). Participants are typically seen at 5, 10, 14, 18, 24, and 36 months, but the data analyzed here were from the 10-month time point only. The infants included in the study were reported by their parents to have no known medical conditions such as epilepsy, no known genetic syndrome associated with autism or any other known medical condition affecting brain development, and no visual or auditory impairment, and were born full-term (after week 36). Although the majority were at elevated likelihood for autism, which implies greater expected heterogeneity in later developmental trajectories, we reasoned that the sample would work well for the present methodological purposes. Although some of these children are likely to later develop symptoms of autism or other neurodevelopmental conditions, the group is not very atypical in terms of basic eye movements, at least not in any sense that would prevent the current investigation.

Setup and stimuli

The setup consisted of two Logitech Brio web cameras connected to one operating computer running Windows 10. Infants sat in a baby chair across the table from their parent. The cameras were placed on the corners of the table on the side of the parent, approximately 120 cm from the infant’s face, angled in such a way that the face of the infant was centered in both camera images. The cameras were operated using a Python script utilizing OpenCV. For each camera, a text file denoting the system timestamp for each frame of the recording was saved. Each camera recorded at approximately 30 Hz. A lamp that lit up and rotated when switched on was placed on each side of the parent. Importantly, this was a sparse environment with stimuli that were attractive from the infant’s perspective, making it well suited for manual coding of gaze direction.

Pilots conducted prior to the experiment revealed that the best OpenFace estimates could be obtained when placing the camera directly in front of the participant, filming their face from below, as is typically done with eye trackers. However, this was not possible, as this experiment was part of a longer series of experiments that involved the parent manipulating objects directly in front of them. Placing the camera in front of and below the face of the infant would have interfered with those experiments. After examining the quality of the estimates using different camera positions, we opted for a setup with two web cameras filming the infant from both sides of the parent, with the intention of averaging the gaze estimates from both cameras, as this produced the best estimates given the limitations regarding camera placement. The setup is illustrated in Fig. 7.

Fig. 7

Illustration of the setup for the second experiment. The infant was seated in a baby chair on one side of a table, with the parent on the other side. One lamp was placed on each side of the parent. Two Logitech Brio web cameras placed on the outer side of each lamp were used to record a video of the face of the infant at approximately 120 cm from the infant’s face

Procedure

Upon arriving in the experiment room, the infant was first seated on one side of the table and the parent on the other. The parent was provided with a box with instructions on what to do in the experiment. For this part, the parent read a note saying “Something will now happen on the table that you and the child can look at.” After that, one of the lamps lit up and started rotating for five seconds. The lamp was then shut off, and nothing happened for ten seconds. Thereafter, the other lamp lit up and rotated for five seconds, after which nothing happened for ten seconds. This pattern was repeated for 90 seconds.

Gaze data

Infant gaze data were sourced using two methods: manual coding of gaze direction and OpenFace. Manual coding was done by author NVV based on the recorded videos from both cameras using a Python script employing OpenCV. The scene in front of the infant contained three major areas of interest (AOIs): the left lamp, the parent, and the right lamp. A dwell at an AOI was defined to start at the first frame where the eyes of the infant (based on pupil direction) were oriented towards the AOI. Dwells were defined to end at the first frame where the eyes of the infant were oriented toward a new AOI. If the gaze was not on any of the three AOIs, it was assigned to a “none” AOI. This resulted in a data file with the start and end time of each dwell for each participant.

To extract OpenFace gaze estimates, we initially processed the recordings from both cameras with the intention of averaging the gaze signal from them. Upon inspection of the data, we found that averaging would not be possible for many participants, because the recording from the camera on the right side of the parent was often blurry or the cameras were angled differently. OpenFace gaze estimates based on the recordings from the camera on the left side of the parent were better on average. Therefore, we restricted all our analyses to estimates based on the recordings of that camera only.

Assigning OpenFace gaze estimates to AOIs

To calculate dwells to the different AOIs for the OpenFace estimates, we first had to define the boundaries of the AOIs in terms of OpenFace gaze angles. As the OpenFace coordinate system was not sufficiently clear to us, we adopted a similar technique to that used in Nyström et al. (2019), namely, defining AOI boundaries based on the spatial distribution of the estimates. To achieve this, we first plotted a histogram of all the gaze estimates over the whole experiment separately for each participant (see Fig. 8A). This was done using only the horizontal gaze angles, as the AOIs were horizontally separated and OpenFace had been shown to perform worse in the vertical direction in the first experiment. With three horizontally separated AOIs, we expected to observe three peaks in the gaze estimates of each infant: one for the left lamp, one for the parent, and one for the right lamp. The peak for the parent AOI was assumed to be located between the peaks for the left and right lamp, as was the case in the configuration of the setup (Fig. 7). With this assumption, we examined the histogram of the horizontal gaze estimates for each participant and divided it into three sections that most closely resembled three peaks, as illustrated in Fig. 8B. The left, center, and right peaks were assumed to represent the “left lamp,” “parent,” and “right lamp” AOIs, respectively. Given that the lamps were designed to be engaging stimuli to infants, we assumed that they would be looked at the most. Therefore, if only two peaks were present in the histogram, the space between the peaks was assumed to belong to the parent AOI. If there were fewer than two discernible peaks in the histogram, the data for the participant were excluded from all further analyses. As shown in Fig. 8B, the boundaries of the divided sections were then used to assign each sample of gaze data to its respective AOI. If a gaze sample was not within the boundaries of any section, it was assigned to a “none” AOI. We then calculated dwells to the AOIs for each participant. Dwells based on OpenFace estimates were defined to start at the first frame where the gaze was on an AOI and to end when the gaze was on another AOI for at least three consecutive frames. The purpose of this rule was to prevent dwells from being broken prematurely.
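The sketch below illustrates this histogram-based AOI assignment and the dwell rule. The boundary values are hypothetical and would normally be read off each participant's histogram, and the exact frame at which a new dwell is taken to start reflects our reading of the rule described above:

```python
import numpy as np

# Hypothetical AOI boundaries (degrees of horizontal gaze angle), read off
# the histogram of one participant as in Fig. 8B.
AOI_BOUNDS = {"left lamp": (-30.0, -10.0),
              "parent": (-10.0, 10.0),
              "right lamp": (10.0, 30.0)}

def assign_aoi(sample: float) -> str:
    """Map one horizontal gaze sample to an AOI label ('none' if outside all)."""
    for aoi, (lo, hi) in AOI_BOUNDS.items():
        if lo <= sample < hi:
            return aoi
    return "none"

def dwells(samples: np.ndarray, min_break: int = 3) -> list[tuple[str, int, int]]:
    """Dwells as (AOI, start frame, end frame), with inclusive frame indices.

    A dwell ends only once the gaze has been on one other AOI for at least
    `min_break` consecutive frames; the next dwell then starts at the first
    of those frames. This prevents dwells from breaking prematurely.
    """
    labels = [assign_aoi(s) for s in samples]
    out = []
    current, start = labels[0], 0
    candidate, cand_start = None, None
    for i, lab in enumerate(labels):
        if lab == current:
            candidate = None
        elif lab == candidate:
            if i - cand_start + 1 >= min_break:
                out.append((current, start, cand_start - 1))
                current, start = candidate, cand_start
                candidate = None
        else:
            candidate, cand_start = lab, i
    out.append((current, start, len(labels) - 1))
    return out
```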

Fig. 8

An example of A plotting a histogram based on the horizontal component of all gaze samples for one participant and B AOI assignment based on the peaks observed in the histogram. In both panels, values on the horizontal axis represent the horizontal gaze angle in degrees, while values on the vertical axis represent the number of samples within one bin. Panel A depicts a histogram derived from the horizontal component of all gaze samples from one participant with three peaks clearly visible. Panel B depicts the AOI assignment based on the boundaries of the peaks. Values falling within the green color were assigned to the “left lamp” AOI, values within the gray color were assigned to the “parent” AOI, and values within the red color were assigned to the “right lamp” AOI. Values not falling under any of these AOIs were assigned to a “none” AOI. Note that the sign of the gaze estimates has been flipped for increased interpretability. Looks from left to right result in a change in gaze angle from negative to positive

Results

The purpose of the second experiment was to determine whether computer-vision-based gaze estimation could be used to derive meaningful gaze-based measures in a sparse environment with infant participants and horizontally separated stimuli. The original sample consisted of 44 infants. Fourteen infants were excluded because their videos could not be manually coded for gaze direction due to, e.g., missing data, camera error, or the experiment ending prematurely. Five further participants were excluded because there were fewer than two discernible peaks in the histograms of their horizontal gaze estimates, leading to a final sample of 25 infants.

Were OpenFace-based measures similar to manual coding?

To compare OpenFace estimates with manual coding, we first computed the relative total dwell time on each AOI for each participant. These results are plotted in Fig. 9A. As can be seen in the figure, median relative total dwell times on the “left lamp,” “parent,” “right lamp,” and “none” AOIs were quite similar for both OpenFace and manual coding. When looking at the individual level by examining the gray lines connecting the relative total dwell times derived using the same recording, the difference was quite large for some participants and very small for others. This was more pronounced for the “parent” and “none” AOIs. To examine the differences further, we calculated the intraclass correlation coefficients (ICC(A,1); see McGraw & Wong, 1996; Weir, 2005) for absolute agreement between OpenFace and manual coding for each AOI separately. These calculations revealed a significant correlation for all four AOIs. The ICC value for the “left lamp” AOI was .83 with a 95% confidence interval (CI) of .60 to .92 (df1 = 23, df2 = 24, p < .001), indicating moderate to excellent reliability according to the guidelines by Koo and Li (2016). For the “parent” AOI, ICC was .67 with a 95% CI of .25 to .86 (df1 = 23, df2 = 24, p = .005), indicating poor to good reliability, and for the “right lamp” AOI ICC was .83 with a 95% CI of .61 to .92 (df1 = 23, df2 = 24, p < .001), indicating moderate to excellent reliability. ICC for the “none” AOI was .58 with a 95% CI of .03 to .82 (df1 = 23, df2 = 24, p = .02), indicating poor to good reliability.
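For reference, ICC(A,1) with its confidence interval and p value can be reproduced with, for example, the pingouin package, in which the "ICC2" (single random raters) row corresponds to ICC(A,1) of McGraw and Wong. The data values below are hypothetical:

```python
import pandas as pd
import pingouin as pg

# Long-format table: one row per participant x method, holding the relative
# total dwell time on one AOI (all values here are hypothetical).
df = pd.DataFrame({
    "participant": ["p01", "p01", "p02", "p02", "p03", "p03", "p04", "p04", "p05", "p05"],
    "method": ["OpenFace", "manual"] * 5,
    "rel_dwell": [0.42, 0.45, 0.31, 0.28, 0.55, 0.60, 0.38, 0.36, 0.47, 0.52],
})

icc = pg.intraclass_corr(data=df, targets="participant",
                         raters="method", ratings="rel_dwell")
# The "ICC2" row (single random raters) corresponds to ICC(A,1) of
# McGraw & Wong, i.e., absolute agreement between single measurements.
print(icc[icc["Type"] == "ICC2"])
```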

Fig. 9

Comparison of OpenFace-based estimates and manual coding of gaze direction for A total dwell time and B dwell duration. The red circles represent values for individual participants based on OpenFace gaze estimates, while the blue circles represent values based on manual coding of gaze direction. The gray lines indicate that the measures were derived from the same recording. The horizontal white lines represent the median values. The horizontal axis for panel A represents the four different AOIs, and the vertical axis represents the relative total dwell time on them. ICC values are displayed when significant, with the number of asterisks representing level of significance (* p < .05, ** p < .01, *** p < .001). The horizontal axis for panel B represents the four different AOIs, while the vertical axis represents the mean dwell duration on them

To further examine how OpenFace estimates compared with manual coding of gaze direction, we calculated the mean dwell duration for each AOI for each participant. Mean dwell duration gives an indication of how long participants looked at a specific AOI on average. As Fig. 9B shows, mean dwell duration based on manual coding was greater than mean dwell duration based on OpenFace estimates for all AOIs. Given that the relative total dwell times were similar, this means that OpenFace-based dwells were shorter and that there were more of them on average. We also calculated ICC values for the mean dwell durations, but no significant correlations were found, indicating poor reliability on average. This can potentially be explained by the difference in how dwells were determined between the two methods. For manual coding, the coder could determine the start and end of a dwell directly based on when the gaze of the infant was observed to be on an AOI and when it shifted to a new AOI. With the context before and after dwells available, the coder did not need to break dwells when, e.g., blinks occurred. For OpenFace estimates, the AOI boundaries were first defined based on the spatial distributions of the horizontal gaze data. For estimates close to the borders of two AOIs, even small variability could potentially break a dwell. As demonstrated in the results of the first experiment, in many cases the accuracy and precision of OpenFace were not sufficient to distinguish between fixations on the nine stimulus points displayed on a screen. The consistently shorter mean dwell durations seen in the results of this experiment suggest that the data quality of OpenFace gaze estimates concerning infant participants is too low to reliably determine the starts and ends of dwells even in situations with sparse, horizontally separated AOIs.
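A minimal sketch of how the two measures can be derived from a list of dwells, for instance the (AOI, start frame, end frame) tuples produced by the dwell sketch above (the function name is ours):

```python
import numpy as np

def dwell_measures(dwell_list: list[tuple[str, int, int]], fs: float):
    """Relative total dwell time and mean dwell duration (seconds) per AOI.

    `dwell_list` holds (AOI, start frame, end frame) tuples with inclusive
    frame indices; `fs` is the frame rate in Hz.
    """
    durations = {}
    for aoi, start, end in dwell_list:
        durations.setdefault(aoi, []).append((end - start + 1) / fs)
    total = sum(sum(d) for d in durations.values())
    rel_total = {aoi: sum(d) / total for aoi, d in durations.items()}
    mean_dur = {aoi: float(np.mean(d)) for aoi, d in durations.items()}
    return rel_total, mean_dur
```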

Interim summary

In the second experiment, we showed that when looking at the relative total dwell time to the AOIs for the whole group of infants, similar conclusions could be drawn using OpenFace gaze estimates and manual coding. When examining the results on the individual level, OpenFace-based relative total dwell time was on average a reliable measure for the AOIs that were looked at the most (the “left lamp” and “right lamp” AOIs), but less so for the AOIs looked at the least (the “parent” and the “none” AOIs). OpenFace-based mean dwell duration, on the other hand, was highly unreliable for all the AOIs.

General discussion

For computer-vision-based gaze estimation to see widespread use in psychological research, the average researcher needs to know, at a minimum, (1) which methods can be used and (2) how they perform within the contexts of their experimental setups. In the following sections, we address both points.

Which methods are usable for the experimental psychologist?

We consider a method to be usable by the experimental psychologist if there exists sufficient documentation for download, installation, and use of the method. In essence, this means that the available documentation should allow the user to go from the point of not having anything installed to the point where it is possible to feed a video to the method and read and understand the gaze prediction output it provides. From the long list of potential computer-vision-based gaze estimation methods presented in the introduction, just this requirement already narrowed the list down to only two toolkits: OpenFace and OpenGaze. As it currently stands, the number of methods available to the experimental psychologist is highly limited.

The usability of OpenFace

OpenFace, a software toolkit employing a model-based gaze estimation approach, is relatively straightforward to install and use. There are no specific hardware dependencies, only a few software dependencies, and the wiki (found at https://github.com/TadasBaltrusaitis/OpenFace/wiki) provides step-by-step instructions for downloading and using the toolkit. The output format for gaze estimation is explained, although the origin of the coordinate system appears to be omitted. Based on their explanation of how gaze direction is computed, we assume the origin to be the camera (see Baltrušaitis et al., 2018). The experimental psychologist interested in applying computer-vision-based gaze estimation can thus expect to be able to install and use the OpenFace toolkit for gaze estimation relatively easily.
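As an illustration, processing a single video can then look roughly as follows. The executable and flag names reflect how we called the OpenFace command-line interface from Python and should be checked against the wiki for the installed version; the file and folder names are hypothetical:

```python
import subprocess

# Run the OpenFace FeatureExtraction executable on one video: -f gives the
# input video, -out_dir the output folder, and -gaze requests gaze output
# (check these flags against the OpenFace wiki for your version).
subprocess.run(["FeatureExtraction",
                "-f", "participant01.mp4",
                "-out_dir", "processed",
                "-gaze"],
               check=True)
# The resulting CSV in "processed" contains per-frame gaze angle estimates.
```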

The usability of OpenGaze

The usability of OpenGaze is a different story. OpenGaze is claimed by the developers to be “the first software toolkit for appearance-based gaze estimation and interaction” (Zhang et al., 2019, p. 1). The OpenGaze wiki (found at https://git.hcics.simtech.uni-stuttgart.de/public-projects/opengaze/wiki) provides step-by-step instructions on how to install OpenGaze. The developers have successfully tested and used OpenGaze on Ubuntu Linux 16.04 LTS. Multiple software dependencies are listed. The software has been tested by the developers on specific versions of the NVIDIA CUDA Toolkit, the NVIDIA CUDA Deep Neural Network library (cuDNN), and the deep learning framework Caffe. Installation, even when following the step-by-step instructions, was an arduous process that required us to test various graphics processing units (GPUs) and different versions of CUDA and cuDNN, and to compile dependencies such as Caffe, OpenCV, and Dlib from source before we finally arrived at a working installation. This was only achieved using an older-generation NVIDIA GeForce GTX 1080 Ti GPU, which is now largely unavailable. In sum, downloading and installing OpenGaze was possible, but required an extensive period of trial and error, specific hardware, previous experience using Linux, and dealing with various problems related to compiling programs from source and resolving errors. One might argue that these are not skills an experimental psychologist typically possesses.

Interpreting the data provided by OpenGaze was also problematic. The OpenGaze wiki states that a 3D gaze direction vector should be included in the output. This would have been desirable, as 3D gaze direction vectors could have been transformed into gaze angles and our aim was to examine how the gaze predictions of the method related to the world. Using OpenGaze, we were able to produce an output text file that included 14 columns of comma-separated values. This text file, however, did not include any labels for those values. In fact, to make sense of any of the provided data, we had to examine the OpenGaze code to know what data each column represented. In the output files, we could not find any indication of a 3D gaze direction vector. Instead, we could only find 2D gaze estimates in normalized coordinates. Normalized coordinates represent positions on an assumed screen of a specified size. To relate them to the world, however, one needs to know where in the world the assumed screen is and what its dimensions are. The wiki page states that a custom configuration file for the camera and screen can be used and points to the files containing the default values. However, the values in these files are given in a form that is unusable by someone without the required expertise in computer science, and no documentation on how to understand or change them is provided. Thus, we were able to produce gaze estimates but could not directly relate them to the world. The usability of OpenGaze in terms of interpreting the output was highly lacking. We emailed the developers regarding the problems we faced but never received a response. These issues, as well as the ones we faced when trying to get Gaze360 (Kellnhofer et al., 2019) to run, reflect the fact that the developers of these methods have a different target group in mind than applied researchers. It seems that the field is advancing at such a pace that even the projects that do put effort into making their methods available quickly fall by the wayside and become forgotten.

This brings us to the reasons why we could not use OpenGaze to process the recordings of the second experiment. Our main problem was that the OpenGaze setup that we had, with great difficulty, managed to get working was located at Utrecht University (the Netherlands), while the infant data for the second experiment were stored at Uppsala University (Sweden). We could either (1) transfer the video data from Uppsala to Utrecht, (2) ship the OpenGaze computer to Uppsala, or (3) install OpenGaze on a separate computer in Uppsala. Given the issues concerning the usability of the software, not wanting to risk our only working setup, and not having permission for data sharing of infant videos, none of these options was feasible. These problems illustrate the importance of a well-maintained package that is easy to install and use, of which the OpenFace toolkit is a great example. Collaborative research projects often require one to process data in different locations. If the methods are difficult to install, researchers lacking the required expertise will not be able to make use of them.

What to expect from the methods in terms of performance?

The first experiment was designed to evaluate the gaze prediction performance of OpenFace and OpenGaze under optimal conditions. With the results, we showed that the gaze predictions by OpenGaze can potentially be used to make distinctions between gaze to stimuli separated by at least 11 degrees with respect to the participant’s eyes, with both horizontally and vertically separated stimuli at a distance of 85 and 95 cm from the camera and screen, respectively. When concerned with stimuli separated by 7 degrees in the vertical direction, the method did not perform as well. These results are in line with Zhang et al. (2019), who showed that when uncalibrated and viewing at a distance of 75 cm from the camera, the gaze estimation method in OpenGaze produced a mean error of 6.4 degrees. Importantly, Zhang et al. (2019) also showed that mean gaze estimation error was largely invariant to changes in viewing distance (tested at 30, 50, 75, 110, 140, and 180 cm) and remained stable across different settings (i.e., indoors vs. outdoors), although these results were based on gaze predictions involving a person-specific calibration of 60 samples.

In contrast, the gaze estimation performance of OpenFace was highly variable. Its estimates were not reliable enough to make distinctions between gaze to stimuli separated by 11 degrees of gaze angle. This is further corroborated by results showing that even the state-of-the-art model-based gaze estimation algorithm GazeML (Park et al., 2018), which reportedly performs better in terms of mean error than the algorithm OpenFace uses, had an error of 12.1 degrees when operating uncalibrated and at a distance of 75 cm (Zhang et al., 2019). Researchers looking to use OpenFace should take care when drawing conclusions based on OpenFace estimates, as they can vary greatly depending on whether one fixates with only the eyes or with the eyes and head together. We conclude that under optimal conditions and at the distances and horizontal stimulus angles of our setup (see Fig. 1), OpenGaze is viable but OpenFace is not.

In the second experiment, we compared the gaze estimation performance of OpenFace to gaze estimates manually coded directly from the video. A previous study by Eschman et al. (2022) has shown that when coupled with further processing using an artificial neural network trained on OpenFace data with infants, high agreement between automatic and manual coding can be obtained (to note, in addition to gaze estimates, landmark data, head pose data, shape parameters, and facial action units were further utilized to train the network). The researchers collected video recordings of faces of infants and toddlers of different ages both in the lab and remotely. Their results showed high agreement between automatic and manual coding for both the offline (agreement rates of 89.9% and 85.83% for 6- and 36-month-old infants, respectively) and online conditions (mean agreement rate of 90.7% for 48-, 60-, and 72-month-old infants), indicating that with young children and under specific conditions (i.e., three horizontally separated AOIs and further utilizing an artificial neural network), OpenFace can potentially be used to reliably derive relevant gaze measures. In our study, we did not employ further processing with a neural network. Rather, we directly examined dwell-based gaze measures based on OpenFace estimates and manual coding. We showed that OpenFace-based relative total dwell time was highly similar to that obtained by manual coding for looks to the major attention grabbers (i.e., the left and right lamp). However, when determining when infants looked at the face of their parent and when determining which looks did not belong to any AOI, the results were less reliable. This can potentially be explained by how infants looked at the AOIs. When manually coding gaze direction from videos, we noticed that infants often looked at a lamp and then back at the face of their parent using only the eyes while maintaining the head oriented toward the lamp. When looking away, they often displayed similar behavior, such as looking toward the ceiling with only the eyes while the head was oriented forward. As we have shown, OpenFace produces different estimates when looking with only the eyes compared to when looking with both the head and the eyes. It may be that OpenFace had the most problems with this type of gaze behavior. Furthermore, measures for mean dwell duration based on OpenFace estimates were found to be highly unreliable for all the AOIs. It appears that when dealing with uncalibrated OpenFace output and with infant participants at the distances specified in our setup (Fig. 7), relative total dwell time to sparse horizontally separated AOIs is somewhat reliable, but measures examining dwells more closely, such as mean dwell duration or total number of dwells, are not. Fewer AOIs, more separation between AOIs, closer camera placement, or different methods for defining AOIs may lead to better outcomes.

Advantages, limitations, and future considerations

As we see it, one major advantage of computer-vision-based gaze estimation is the freedom in positioning of the recording equipment. It is often easier to position a camera than, for example, a remote eye tracker, which requires one to make sure that it is positioned correctly in front of the participant and that the head is within a predefined area (to note, some remote eye trackers allow for more freedom in positioning than others but may require custom illumination setups). Cameras can be equipped with various lenses, positioned further away and at different angles, and can be used under normal room lighting. Moreover, as Zhang et al. (2019) have shown, appearance-based approaches can potentially be highly stable across various distances and settings. Whether filming the face from different angles affects gaze estimation performance is also important to assess in future evaluations. Guidelines for required camera positioning to distinguish between specific AOI configurations could be highly useful for researchers interested in the methods. A second important advantage is the absence of a mandatory calibration. Many commercial eye trackers need to first be calibrated with the person being recorded. Experimental situations where time is of concern (e.g., when dealing with infant participants or multiple recordings in quick succession) or calibration is difficult (e.g., with infant participants) could greatly benefit from not having to calibrate before each recording.

The major disadvantages of computer-vision-based gaze estimation relate to the usability of the methods. First, as we have shown, only two methods made it through our relatively liberal criteria. For computer-vision-based gaze estimation to become available for a wider audience, the developers of these methods need to publish them for more general purposes. Second, there must be clear instructions on how to interpret the output data. Currently, this is not done sufficiently even for the OpenFace toolkit, as the origin of the coordinate system is not clearly stated. For most use cases in experimental psychology and related fields, the main concern of researchers is when and for how long one looks at a particular AOI. If the methods do not allow the user to relate the output data directly to the world, examining looks to AOIs is not a straightforward process and requires one to make further assumptions (e.g., assuming a cluster of gaze samples to represent looks to the same AOI).

Another notable disadvantage is the gaze estimation performance of the available methods. We have shown that when operating uncalibrated, OpenGaze can be used reliably under optimal conditions in situations where stimuli are separated by at least 11 degrees, while OpenFace cannot. When considering research questions such as where on the face of another person one looks during face-to-face interaction, the distances between relevant AOIs tend to be much smaller. If computer-vision-based gaze estimation is to be applied in such situations, the data quality of the estimates they produce needs to improve.

With the recent increase in awareness and development of AI tools that solve complicated tasks in user-friendly ways (e.g., ChatGPT, Bing), initiatives such as OpenFace and OpenGaze may be improved or bypassed by similar efforts in the near future. In light of this, we emphasize that computer-vision-based gaze estimation could be made more accessible. We believe that widespread use is the most efficient way to make progress within the field.

Conclusion

Computer-vision-based gaze estimation can potentially be used in screen-based experiments under optimal conditions with adult participants as well as in experiments with more sparsely separated stimuli and infant participants. With the results of our two experiments, we have shown that to reliably employ the available computer-vision-based gaze estimation methods, one needs to pay close attention to which methods work in which situations and for which types of measures. Moreover, as it currently stands, documentation is lacking, and the number of methods usable by the average researcher without experience in computer science is far too limited. For the methods to see more widespread use and further validation in the fields of, e.g., psychology and education, we stress the importance of providing proper documentation for installation, use, and interpretation of the data. We hope that future researchers and developers can make use of our results and suggestions.