Keywords
face, emotion, saliency, spatial heterogeneity, nonlocal contrast, second-order visual mechanisms
This article is included in the Software and Hardware Engineering gateway.
In the Introduction, some statements have been changed and several references have been corrected. Statements that seemed too categorical have been softened: the term "mechanism" was replaced by "filter", and a more balanced view of the preattentive nature of face recognition was given. A more detailed description of the second-order visual filter model has been provided as well.
The "Methods" section has been structured and expanded to make the design of the study more understandable to the reader. Details about the observer's task and the stimuli used were added. In the "Results" section, confidence boundaries have been added to Figure 2. The conclusions have been expanded.
The content of the OSF repository linked to the article has been expanded. All raw results obtained by the authors, as well as the stimuli created by processing photographs from the FERET collection, have been made freely available. The names of the original face images from this collection have been added to the readme.txt file, along with additional information about the study design and descriptions of some of the scripts used.
See the authors' detailed response to the review by Yuri E. Shelepin
See the authors' detailed response to the review by Tina Tong Liu
Experiments involving a saccadic task (Crouzet et al., 2010; Kirchner & Thorpe, 2006), registration of MEG (J. Liu et al., 2002) and ERP (Cauchoix et al., 2014; Dering et al., 2011; Herrmann et al., 2005), and measurement of intracranial field potentials (H. Liu et al., 2009) showed that face detection and identification are so fast that we can most probably speak of feedforward processing (Crouzet & Thorpe, 2011; Muukkonen et al., 2020; VanRullen, 2006; Vuilleumier, 2000; Vuilleumier, 2002; but see T. Liu et al., 2022), that is, processing without any involvement of attention (Entzmann et al., 2021; Kovarski et al., 2017; Reddy et al., 2004; Reddy et al., 2006). However, there is also an opposite point of view (Pessoa et al., 2002; Schindler & Bublatzky, 2020; Tomasik et al., 2009). This may mean that low-level information is used to distinguish a face from the background and to define its characteristics.
Many researchers believe that faces are holistically coded within the low-frequency range, and that this description is sufficient not just to detect a face but also to determine its emotional expression (Calder et al., 2000; Schyns & Oliva, 1999; Tanaka et al., 2012; White, 2000). Meanwhile, the classical work by A.L. Yarbus (1967) clearly demonstrated that while viewing a face we fix our eyes on quite definite details. Further eye-tracking experiments and experiments with the "bubbles" method showed that not all areas of the face are equally useful for emotion recognition (Blais et al., 2017; Duncan et al., 2017). Different facial features are significant for the discrimination of different emotions (Atkinson & Smithson, 2020; Calvo et al., 2014; Eisenbarth & Alpers, 2011; Fiset et al., 2017; Jack et al., 2014; Smith & Merlusca, 2014; Smith & Schyns, 2009; Smith et al., 2005; Wang et al., 2011), and these emotions are probably also processed at different rates (Ruiz-Soler & Beltran, 2012).
The problem is that the lower levels of the human visual system, which are classified as preattentive stages of processing, lack neurons selective for particular facial features. While recent evidence suggests that V1 activity is modulated by the amygdala in the perception of emotional faces (T. Liu et al., 2022), this feedback is unlikely to be involved in feedforward processing. Nevertheless, there should exist a mechanism that permits automatic face detection and rapid extraction of significant information. The aim of this investigation was to identify a possible candidate for such a mechanism.
Recognition of the importance of defining those areas of interest in an image that attract visual attention gave impetus to research aimed at finding algorithms for the formation of saliency maps (Borji et al., 2013; Judd et al., 2012; Rahman et al., 2014). At the same time, the choice of the attention goal should be based on the principle of information maximization (Bruce & Tsotsos, 2005).
With respect to the human visual system, one can speak only of the preattentive processes realized within low-level vision and capable of "bottom-up" attention control. It is clear that attention is attracted to what changes in time (on- and off-reactions) and in space (changes in luminance). For the saliency problem, the latter is the most important. Indeed, the visual system has specialized cells for finding brightness gradients, namely striate neurons (Hubel & Wiesel, 1962). However, these can only find local heterogeneities. To find areas of interest, there should exist mechanisms beyond local operations. Yet we first have to answer the question about the characteristics of these nonlocal areas of interest. In recent years, a viewpoint has emerged that the image areas whose information content differs from the surroundings are of the greatest interest for the visual system (Baldi & Itti, 2010; Hou et al., 2013; Itti & Baldi, 2009). This refers to differences in low-level feature distribution in the field of view (Itti et al., 1998), with salience in this case determined by the degree of total difference between features within the analyzed area and features in the surrounding area (Bruce & Tsotsos, 2009; Gao & Vasconcelos, 2007; Perazzi et al., 2012).
Importantly, the human visual system can find spatial heterogeneities of brightness gradients (see the review by Graham, 2011). This operation is implemented by so-called second-order filters. They are localized mainly in the ventral extrastriate regions (Larsson et al., 2006) and unite the outputs of simple striate neurons according to a certain rule: two successive stages of linear filtering are separated by a rectifying nonlinearity (for a more detailed introduction to the model of second-order filters, see Kingdom et al., 2003). In this scheme, the description of the carrier by the first-order filters is transformed by the second-order filters into a description of the envelope. The receptive fields of the second-order filters are organized in such a way that these elements do not respond to homogeneous textures but are activated when the texture has modulations of contrast, orientation, or spatial frequency of brightness gradients.
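The filter-rectify-filter cascade described above can be sketched in code. The following is a minimal 1-D illustration (not the authors' implementation; the difference-of-Gaussians kernels and all sizes are arbitrary assumptions): the first linear stage is tuned to the carrier, a rectification follows, and the second linear stage, tuned to a much lower frequency, recovers the contrast envelope.

```python
import numpy as np

def gaussian_kernel(sigma, width):
    x = np.arange(-width, width + 1, dtype=float)
    g = np.exp(-x**2 / (2 * sigma**2))
    return g / g.sum()

def bandpass(signal, sigma_narrow, sigma_wide, width):
    # Difference-of-Gaussians approximates a linear bandpass (first stage).
    dog = gaussian_kernel(sigma_narrow, width) - gaussian_kernel(sigma_wide, width)
    return np.convolve(signal, dog, mode="same")

def second_order_response(signal):
    # Stage 1: linear filtering tuned to the carrier (high frequency).
    carrier_resp = bandpass(signal, sigma_narrow=1.0, sigma_wide=2.0, width=8)
    # Rectifying nonlinearity between the two linear stages.
    rectified = np.abs(carrier_resp)
    # Stage 2: linear filtering tuned to the envelope (much lower frequency).
    return np.convolve(rectified, gaussian_kernel(8.0, 24), mode="same")

# A contrast-modulated texture: constant mean luminance, varying contrast.
x = np.arange(512)
carrier = np.sin(2 * np.pi * x / 8)
envelope = 0.5 * (1 + np.sin(2 * np.pi * x / 256))  # slow contrast modulation
stimulus = envelope * carrier

response = second_order_response(stimulus)
# The second-order response tracks the contrast envelope, not the carrier:
# it is large where the envelope peaks and small where contrast vanishes.
```

The key property, as in the model, is that a homogeneous texture (flat envelope) yields a flat response, while contrast modulations are made explicit.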
So far these processes have been studied predominantly as an instrument of texture segmentation (e.g. Graham & Sutter, 2000; Graham & Wolfson, 2004; Kingdom et al., 2003; Schofield & Yates, 2005). Here we raise the question of whether the second-order visual filters can be of use in segmenting natural images and finding in them the salient areas that are used for categorization. We expected to obtain the answer through the task of detecting faces in a series of successively presented objects and determining their emotional expression.
It was shown earlier that the second-order filters are specific to the modulated visual feature, i.e. to contrast, orientation or spatial frequency of brightness gradients (Babenko & Ermakov, 2015; Kingdom et al., 2003). It was then revealed that modulations of contrast take priority in the competition for attention (Babenko et al., 2020). All this enabled us to formulate a hypothesis stating that areas of maximum modulation of nonlocal contrast contain information helpful in identifying emotional facial expressions. To test this hypothesis, we developed a software program (a gradient operator of nonlocal contrast) imitating the operation of the second-order visual filters and calculating the spatial distribution of contrast modulation amplitude in the input image.
Participants. A total of 38 students between the ages of 19 and 21 took part in this investigation. All the participants had normal or corrected-to-normal vision and reported no history of neurological or psychiatric disorders. All the research participants were informed about the purpose and procedures of the experiment; they all signed a consent form that outlined the risks and benefits of participating in the study and indicated their confidence in the safety of the investigation. The study was conducted in accordance with ethical standards consistent with The Code of Ethics of the World Medical Association (Declaration of Helsinki) and approved by the local ethics committee. The design of the experiment, the methodological approach, and the conditions of confidentiality and use of the participants' consent conformed to the Code of Ethics of Southern Federal University (SFU; Rostov-on-Don, Russia) and were ethically approved by the Academic Council of the Academy of Psychology and Pedagogy of SFU on 25 March 2020.
Stimuli. Initial digitized photos of faces and objects, brought to a single size (8 deg of visual angle), mean luminance (35 cd/m2) and RMS contrast (0.45), were processed by the nonlocal contrast gradient operator. In the central concentric area of the operator, the total energy of the image filtered at a frequency of 4 cycles per diameter of this central area, with a 1-octave bandwidth, was calculated. In the peripheral part of the operator (a ring whose width equaled the central area diameter), the spectral power over the entire range of spatial frequencies perceived by a person was calculated, averaged per 1 octave.
The contrast modulation amplitude was the difference between the power-spectrum values obtained in the operator's central and peripheral areas. Operators of various diameters were used, and for each operator we defined those areas where the total contrast was maximally different from the surroundings, i.e. had the highest modulation amplitude.
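As an illustration of this center-surround computation, here is a rough numpy sketch (our own simplification, not the authors' released code: the surround ring is approximated by a square patch three center-diameters wide, and energies are normalized per pixel):

```python
import numpy as np

def band_energy(patch, f_lo, f_hi):
    # Per-pixel spectral power of a square patch between f_lo and f_hi,
    # where frequency is radial, in cycles per patch width.
    n = patch.shape[0]
    spec = np.abs(np.fft.fftshift(np.fft.fft2(patch - patch.mean()))) ** 2
    fy, fx = np.indices((n, n)) - n // 2
    radius = np.hypot(fx, fy)
    return spec[(radius >= f_lo) & (radius < f_hi)].sum() / patch.size

def modulation_amplitude(image, cy, cx, d):
    # Center: energy in a 1-octave band around 4 cycles per center diameter.
    r = d // 2
    center = image[cy - r:cy + r, cx - r:cx + r]
    c = band_energy(center, 4 / 2 ** 0.5, 4 * 2 ** 0.5)
    # Surround: mean per-octave energy over all resolvable octave bands.
    surround = image[cy - 3 * r:cy + 3 * r, cx - 3 * r:cx + 3 * r]
    edges = [1.0]
    while edges[-1] * 2 <= surround.shape[0] / 2:
        edges.append(edges[-1] * 2)
    bands = [band_energy(surround, lo, hi) for lo, hi in zip(edges, edges[1:])]
    return c - np.mean(bands)

# Demo: a high-contrast texture patch on a uniform background.
img = np.zeros((128, 128))
xx = np.indices((16, 16))[1]
img[32:48, 32:48] = np.sin(2 * np.pi * xx / 4)
m_texture = modulation_amplitude(img, 40, 40, 16)  # centered on the texture
m_flat = modulation_amplitude(img, 90, 90, 16)     # uniform region
```

The operator's output is positive where the center's band-limited contrast exceeds the surround's average energy, and zero over homogeneous regions, which is the intended analogue of second-order filter behavior.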
The algorithm of stimulus formation is shown in Figure 1. An example of an initial image can be seen on the left. Next are the spatial frequencies, in cycles per image (cpi), for which the spatial distribution of the total nonlocal contrast was defined. On the right are 3D maps of the spatial distribution of contrast modulation amplitude for operators of various diameters. The next column demonstrates the same maps in 2D format; red dots mark the apexes of local maxima. When processing the image with the gradient operator of the largest size, whose central area diameter made up one half of the image size, we selected 2 maxima; with each subsequent two-fold reduction of the operator diameter, 4, 8 and 16 maxima were selected, respectively. A round aperture with a Gaussian transfer function transmitting four image cycles (hereinafter referred to as a "window") was placed at the positions found in this way. The areas of maximum contrast modulation amplitude were combined into a new image (the right column). The total diameter of the areas found at each spatial frequency equaled the diameter of a conventional circle circumscribing the initial image. The stimuli were faces synthesized from the areas extracted at one spatial frequency (examples can be seen in the right column of Figure 1), as well as faces resulting from the combination of these images into one aggregate image (i.e. containing all the previously used spatial frequency ranges).
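To make the selection step concrete, here is a hypothetical sketch of how local-maximum apexes could be picked from an amplitude map and combined through Gaussian "windows" (the greedy peak picking, the suppression radius and the window sigma are our illustrative choices, not taken from the article):

```python
import numpy as np

def top_maxima(amp_map, k, min_dist):
    # Greedy pick of the k highest peaks, a stand-in for the selection of
    # local-maximum apexes marked by the red dots in Figure 1.
    amp = amp_map.astype(float).copy()
    peaks = []
    y, x = np.indices(amp.shape)
    for _ in range(k):
        cy, cx = np.unravel_index(np.argmax(amp), amp.shape)
        peaks.append((cy, cx))
        amp[np.hypot(y - cy, x - cx) < min_dist] = -np.inf  # suppress neighborhood
    return peaks

def gaussian_window(d):
    # Round aperture with a Gaussian transfer function (a "window").
    y, x = np.indices((d, d)) - (d - 1) / 2
    return np.exp(-(x**2 + y**2) / (2 * (d / 4) ** 2))

def compose(image, amp_map, k, d):
    # Combine the k windowed areas of maximum modulation into a new image.
    out = np.zeros_like(image, dtype=float)
    win = gaussian_window(d)
    r = d // 2
    for cy, cx in top_maxima(amp_map, k, min_dist=d):
        ys, xs = slice(cy - r, cy - r + d), slice(cx - r, cx - r + d)
        patch = image[ys, xs]
        if patch.shape == (d, d):  # skip windows clipped at the border
            out[ys, xs] += win * patch
    return out

# Demo: an amplitude map with two clear peaks.
amp = np.zeros((64, 64))
amp[20, 20], amp[40, 45] = 5.0, 4.0
out = compose(np.ones((64, 64)), amp, k=2, d=8)
```

In the article's scheme, k doubles (2, 4, 8, 16) as the window diameter halves; the sketch simply takes k and d as parameters.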
To create stimuli, we used 56 initial images of faces and 235 initial images of natural objects. Two sets of object images and one set of face stimuli, 120 images each, were formed. Objects and faces repeated in different sets contained different spatial frequencies. Photos of faces in frontal view (actually the angle differs slightly) were taken from the FERET Database collected under the FERET program, sponsored by the DOD Counterdrug Technology Development Program Office (Phillips et al., 1998; Phillips et al., 2000). This database was created with the consent of participants and contains photographs of men and women of different races with different emotional facial expressions. We used part of the images from the database, provided to us in full accordance with the Color FERET Database Release Agreement. In effect, we used the "bubbles" method (Gosselin & Schyns, 2001), yet unlike the traditional approach, in which the aperture is located at random, in our study the aperture was placed at definite, previously estimated positions corresponding to the areas with a definite modulation value of the total nonlocal contrast.
In the same way, we then formed stimuli consisting of areas with the minimum contrast modulation amplitude, as well as images consisting of areas with modulation amplitudes intermediate between the closest minima and maxima.
Study Design. We employed a one-way design for independent samples with a three-level factor "Amplitude of modulation" (min, med, max). The percentage of correct identification of facial expressions was the dependent variable. The sample size was determined based on an ANOVA power of 0.8 and an expected effect size of Cohen's f > 0.5. The minimum expected effect size was determined from the researchers' own preview of the prepared images.
The first group of observers (13 people) saw faces composed of areas of minima, objects of the first set composed of areas of maxima, and objects of the second set composed of intermediate areas. The second group of observers (12 people) saw faces composed of maxima, objects of the first set composed of intermediate areas, and objects of the second set composed of areas of minima. The third group (13 people) was presented with faces composed of intermediate areas, objects of the first set composed of areas of minima, and objects of the second set composed of areas of maxima. Faces and objects were shown intermixed, in random order of presentation.
Procedure. The observers were shown synthesized images of Caucasian and Asian faces (male and female) with neutral and happy facial expressions. These randomly alternated with synthesized images of objects of different categories, faces appearing with a probability of 33% within the sequences of stimuli. The observers' task was to categorize any presented image as accurately as possible. The observer had to report the appearance of a face and, if possible, identify its expression (the answer "I don't know" was allowed). Exposure time was not limited. The percentage of correct recognitions of facial expressions was calculated for the images formed from areas of different contrast modulation amplitudes.
In order to anonymize the identity of the observers, all names were hashed with the MD5 algorithm, and the initial raw data files were saved on local disk storage with limited access.
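A minimal sketch of such MD5-based anonymization (the repository scripts may implement it differently; the observer name below is a placeholder):

```python
import hashlib

def anonymize(name: str) -> str:
    # One-way MD5 digest of an observer's name; the 32-character hex
    # string replaces the name in the raw data files.
    return hashlib.md5(name.encode("utf-8")).hexdigest()

code = anonymize("observer_01")  # hypothetical observer name
```

Note that MD5 is a one-way hash rather than encryption: the mapping is deterministic, so the same name always yields the same identifier, but the name cannot be recovered from it.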
First, we compared task performance where the face images had been formed from maximum nonlocal contrast areas belonging to a narrow spatial frequency range. It is worth recalling that the smaller the diameter of the areas, the higher the spatial frequency (cpi) contained in them and the greater the total number of areas found. When synthesized face images contained spatial frequencies from just one 1-octave range, the overall result of facial expression recognition was low (Figure 2). Performance was highest for the stimuli formed from the areas with the maximum increase in contrast at a central spatial frequency of 16 cpi. The values were somewhat lower at 32 cpi, and much lower for the lowest and the highest frequency ranges. The obtained distribution generally agrees with data suggesting that the medium spatial frequency range, expressed in cycles per face, is more important in face recognition (Boutet et al., 2003; Collin et al., 2006; Näsänen, 1999; Parker & Costen, 1999; Tanskanen et al., 2005; see also the review by Ruiz-Soler & Beltran, 2006).
However, our main purpose was to test, using faces with different emotional expressions as an example, the hypothesis that the most informative image areas are those with the greatest increase in nonlocal contrast.
To answer this question, we compared task performance for the faces formed from areas of different contrast modulation amplitudes: maximum, minimum and medium (Figure 3). The stimuli were combined from the areas found in all the applied spatial frequency ranges.
We found that in the task of identifying facial emotional expression the result improves from approximately 5% to 61% with the increase in the modulation amplitude of the total contrast in the fragments from which the stimulus is formed (see Figure 4).
ANOVA (JASP software, RRID:SCR_015823) revealed the statistical significance of the obtained dependence (see Table 1). Levene's test showed the need to use homogeneity corrections.
The obtained effect is very large (Cohen's f = 1.3). Post hoc analysis using Tukey's test with Bonferroni and Holm corrections (see Table 2) also showed that the accuracy with which observers recognize emotions in faces formed from areas of different contrast modulation amplitudes grows significantly with increasing amplitude.
Comparison | Mean Difference | SE | t | p_tukey | p_bonf | p_holm
---|---|---|---|---|---|---
max vs med | 22.035 | 7.359 | 2.995 | 0.014 | 0.015 | 0.005
max vs min | 55.769 | 7.210 | 7.735 | < .001 | < .001 | < .001
med vs min | 33.734 | 7.359 | 4.584 | < .001 | < .001 | < .001
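As a quick consistency check on Table 2, each t statistic should equal the mean difference divided by its standard error; the following sketch reproduces the tabled values to within rounding:

```python
# t = mean difference / SE for each post hoc contrast in Table 2.
table2 = {
    ("max", "med"): (22.035, 7.359, 2.995),
    ("max", "min"): (55.769, 7.210, 7.735),
    ("med", "min"): (33.734, 7.359, 4.584),
}
for pair, (diff, se, t_tabled) in table2.items():
    t = diff / se
    # Recomputed t agrees with the tabled value to within rounding error.
    assert abs(t - t_tabled) < 0.01, (pair, t)
```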
Thus, the obtained results support our hypothesis that the face image areas with the greatest increase in total nonlocal contrast contain information which can be used by the visual system in recognizing emotional expressions.
We used the task of recognizing facial emotional expressions in order to demonstrate that the areas of the greatest nonlocal contrast modulation amplitude may be the most informative ones and hence may be used in categorizing facial expressions. At the same time, these areas can be revealed by the second-order visual filters.
It should be noted that in recent years a number of modeling studies have been published in which the assessment of the aggregate energy of image areas forms the basis of algorithms for segmenting scenes and separating objects from the background (Cheng et al., 2011; Fang et al., 2012; Perazzi et al., 2012). These computational operations demonstrate good effectiveness, yet they generally have little in common with the actual filters of the human visual system.
In our study we too proceeded from the assumption that spatial heterogeneities of the image energy might contain helpful information. Yet the most important aspect of our work is that we propose a definite physiological process capable of detecting these areas of interest in the image. The developed gradient operator calculating the nonlocal contrast modulation amplitude imitates the functioning of the second-order visual filters with different spatial-frequency tunings. Moreover, we tried to bring these operators' parameters as close as possible to the known characteristics of the second-order filters. Thus, for example, the spatial frequency (in cycles per "window") passed from the extracted areas is constant for "windows" of all the sizes used. This reflects the fixed ratio of the frequency tunings of the first- and second-order filters (Dakin & Mareschal, 2000; Sutter et al., 1995) and thus ensures the invariance of the description under changes of scale. We also used a "window" resizing step of 1 octave, which determines the step in the spatial frequency passed by the "windows" and roughly corresponds to the step in the change of the spatial-frequency tuning of pathways in the human visual system (Wilson & Gelb, 1984). The bandwidth of our operator, 1 octave, likewise corresponds to the bandpass of the second-order filters (Landy & Henry, 2007). We used a Gaussian envelope in passing the extracted image area, thus imitating the spatial characteristics of the filters at the input of the human visual system. Finally, we specified that a "window" transmits exactly four cycles of the input image; this value is also based on previously obtained results (Babenko et al., 2010).
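The scale bookkeeping described above can be illustrated numerically (the 512-px image size is an arbitrary assumption, not the stimulus resolution): with a constant 4 cycles per "window", halving the window diameter doubles the passed frequency in cycles per image, i.e. a 1-octave step.

```python
image_size = 512          # px; illustrative only
cycles_per_window = 4     # constant for all window sizes
window = image_size // 2  # largest window: half the image diameter
scales = []
while window >= image_size // 16:
    cpi = cycles_per_window * image_size / window  # passed frequency, cycles per image
    scales.append((window, cpi))
    window //= 2
# Each halving of the window raises the passed frequency by exactly
# one octave: [(256, 8.0), (128, 16.0), (64, 32.0), (32, 64.0)]
```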
At the same time, there were parameters whose optimality remains doubtful to us. For example, the number of identified areas grows exponentially as the operator's size decreases, starting from two "windows". We proceeded from the requirement that the total diameter of the identified areas should equal the diameter of the whole image; in this case the spatial frequency of the synthesized face can easily be calculated in cycles per image. However, in reality some other number of areas identified at each frequency might be optimal, and an increase in their number would likely improve the result. Neither did we introduce an eccentricity correction, since we assumed that in natural conditions saliency maps may also be formed by the human visual system with the use of eye movements. However, data concerning the time of facial expression perception indicate that one fixation may be sufficient for this (L. Liu & Ioannides, 2010; Pourtois et al., 2010; see also the reviews by George, 2013; Vuilleumier & Pourtois, 2007), although another opinion also exists (Duncan et al., 2019; Eimer & Holmes, 2007; Eimer et al., 2003; Erthal et al., 2005; Okon-Singer et al., 2007; Pessoa et al., 2002).
Nevertheless, although it is impossible to take into account every parameter of the processes that search for areas of interest in an image, this can hardly call into question the conclusion that the information content of a facial image reflecting its emotional expression increases with the growth of the nonlocal contrast amplitude of the areas that form this image.
It is also noteworthy that the areas of maximum nonlocal contrast amplitude are generally found around the eyes and the mouth (see Figure 1 and Figure 3), i.e. those parts of the face that are considered most informative in conveying emotional signals (Bombari et al., 2013; Eisenbarth & Alpers, 2011; Yu et al., 2018).
However, another question arises. Nonlocal luminance contrast has proved effective in the task of discriminating facial expressions, but will it be a salient feature in other tasks, such as gender or race determination? And what about recognition of non-facial images? The answer should be given by future experiments. However, since we consider the process of finding areas with the largest increase in nonlocal contrast to be preattentive, its result should not depend on the visual task. Preattentive processing only offers a set of image areas, information from which can be used by higher processing levels with the help of attention. The strongest saliencies automatically attract the 2-3 initial fixations. These few hundred milliseconds are enough to recognize an emotional facial expression (Du & Martinez, 2013). Subsequent top-down attention allows more information to be extracted.
The obtained experimental results support the hypothesis that the image areas with the greatest increase in nonlocal contrast contain information that contributes to the identification of emotional facial expressions. The second-order visual filters are able to find such information.
We also suppose that the second-order visual filters that highlight the image areas with the highest modulation amplitude of nonlocal contrast are able to attract visual spatial attention; these filters are the windows through which subsequent processing levels receive significant information.
Open Science Framework: Nonlocal contrast calculated by the second order visual mechanisms and its significance in identifying facial emotions, https://doi.org/10.17605/OSF.IO/KGRWA (Yavna, 2021).
This project contains the following underlying data:
emotions.csv contains the main data;
emotions.jasp contains the main statistics;
raw_results folder contains raw anonymized response logs;
calc_result2.py and give_me_res.sh from the raw_results folder are scripts for processing response logs (*.json) and creating the faces.csv file;
faces.jasp contains statistics of emotion recognition at all frequencies;
stimuli.tar.bz contains all the stimuli used in the study;
readme.txt contains some additional information and comments.
Data are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).
Source code available from: https://github.com/dvyavna/2ord_contrast
Archived source code as at time of publication: https://doi.org/10.17605/OSF.IO/5YZGW (Yavna, 2021).
License: MIT
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Vision, face recognition, emotional processing, primary visual cortex, neuroplasticity
Is the work clearly and accurately presented and does it cite the current literature?
No
Is the study design appropriate and is the work technically sound?
Partly
Are sufficient details of methods and analysis provided to allow replication by others?
Partly
If applicable, is the statistical analysis and its interpretation appropriate?
Partly
Are all the source data underlying the results available to ensure full reproducibility?
Partly
Are the conclusions drawn adequately supported by the results?
Partly
Is the work clearly and accurately presented and does it cite the current literature?
Partly
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Vision perception and pattern recognition
Alongside their report, reviewers assign a status to the article.
Version 2 (revision), 29 Aug 2023: read by Reviewer 1.
Version 1, 06 Apr 2021: read by Reviewers 1 and 2.