The effect of face masks on the stereotype effect in emotion perception ☆ Journal of Experimental Social Psychology

The accurate and swift decoding of emotional expressions from faces is fundamental for social communication. Yet, emotion perception is prone to error. For example, the ease with which emotions are perceived is affected by stereotypes (Bijlstra, Holland, & Wigboldus, 2010). Moreover, the introduction of face mask mandates in response to the Covid-19 pandemic additionally impedes accurate emotion perception by introducing ambiguity to the emotion perception process. Predictive coding frameworks of visual perception predict that in such situations of increased ambiguity of the sensory input (i.e., faces with masks), people increasingly rely on their prior beliefs (i.e., their stereotypes). Using specification curve analysis, we tested this prediction across two experiments, featuring different social categories (Study 1: Gender; Study 2: Ethnicity) and corresponding emotion stereotypes. We found no evidence that face masks increase reliance on prior stereotypes. In contrast, in Study 1 (but not in Study 2), we found preliminary evidence that face masks decrease reliance on prior stereotypes. We discuss these findings in relation to predictive coding frameworks and dual process models and emphasize the need for up-to-date analytic methods in social cognition research.


Introduction
The accurate and swift decoding of emotional expressions from human faces is fundamental for social communication. As such, failures and deficits in facial emotion perception can have negative consequences, such as poor social functioning, decreased quality of social interactions, and inappropriate behavioral responses (e.g., in autism: Baron-Cohen, Richler, Bisarya, Gurunathan, & Wheelwright, 2003). Successful emotion perception depends on a multitude of factors, such as the facial features of the observed (Becker, Kenrick, Neuberg, Blackwell, & Smith, 2007; Marsh, Adams Jr., & Kleck, 2005; Sacco & Hugenberg, 2009), neurological disorders of the observer (e.g., Ashwin, Chapman, Colle, & Baron-Cohen, 2006; Harms, Martin, & Wallace, 2010) or evaluative associations between social categories and emotions (Craig, Koch, & Lipp, 2017; Hugenberg, 2005; Hugenberg & Sczesny, 2006). In an example of the latter, Hugenberg (2005) presented White observers with White and Black faces displaying different emotions. He found that the observers correctly identified happiness comparatively faster on White faces, and anger and sadness comparatively faster on Black faces, and concluded that the race of the target face provides an evaluative context in which emotions are interpreted. This suggests that White observers have more positive evaluative associations with White faces compared to Black faces and more negative evaluative associations with Black faces compared to White faces.
Other studies argued that social categories do not merely influence emotion perception via general negative or positive evaluative associations, but also via specific stereotype associations (Bijlstra, Holland, Dotsch, & Wigboldus, 2019), a claim well in line with prior research differentiating both processes (Amodio & Devine, 2006). In their research, Bijlstra et al. (2010) employed a speeded categorization task comparing the time required to perceive anger or sadness on male and female faces and on White-Dutch and Moroccan-Dutch male faces. Congruent with the gender stereotype that anger is more typical of males than of females and sadness more typical of females than of males (Plant, Hyde, Keltner, & Devine, 2000), and with the stereotype association between Moroccan-Dutch males and danger that is widespread in the Netherlands (Bijlstra, Holland, Dotsch, Hugenberg, & Wigboldus, 2014; Dotsch & Wigboldus, 2008; Verkuyten & Zaremba, 2005), a stereotype effect was found. That is, this research showed that anger was more quickly perceived for male compared to female faces and for Moroccan-Dutch compared to White-Dutch faces, whereas sadness was more quickly perceived on female compared to male and White-Dutch compared to Moroccan-Dutch faces, respectively (= stereotype effects).
☆ This paper has been recommended for acceptance by Dr. Rachael Jack.
On a theoretical level, one framework that can explain these findings is provided by Bayesian models of social perception (Otten, Seth, & Pinto, 2017). Bayesian models, often also referred to as predictive coding frameworks, posit that the perception of a stimulus is influenced by two separate sources of information, namely the likelihood function and the prior. Consider, for example, a hypothetical experiment in which participants are presented with White-Dutch and Moroccan-Dutch faces displaying either anger or sadness at either 50% or 100% emotion intensity. The likelihood would represent the probability of seeing an emotional expression if there is indeed an emotion displayed, or P(visual input | emotion). The likelihood here is influenced by whether the emotion is displayed at 50% or 100% emotion intensity. The prior would represent the probability of seeing an emotional face, or P(emotion), which would be influenced by stereotypical associations related to the ethnicity of the face. Finally, the posterior would represent the probability of categorizing a face as angry or sad given the face, or P(emotion | visual input). Following Bayesian models, the perception of the stimulus is influenced by both the prior and the visual input, which means that both sources of information interact: The weaker the visual input (i.e., 50% emotion intensity) and the stronger the prior (i.e., stronger emotion-social category associations), the stronger the tendency to respond in a stereotype-congruent manner regardless of the emotion-social category combination actually being presented.
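The interaction between prior and visual input described above can be illustrated with a small numerical sketch. All probabilities below are hypothetical values chosen for illustration only; they are not estimates from any study, and the function name `posterior_anger` is ours.

```python
def posterior_anger(prior_anger, p_input_given_anger, p_input_given_sad):
    """Return P(anger | visual input) via Bayes' rule for a two-emotion task."""
    prior_sad = 1.0 - prior_anger
    evidence = p_input_given_anger * prior_anger + p_input_given_sad * prior_sad
    return p_input_given_anger * prior_anger / evidence


# Hypothetical priors: P(anger) without and with an anger stereotype.
NEUTRAL_PRIOR = 0.5
STEREOTYPE_PRIOR = 0.7

# Hypothetical likelihoods for a face that actually displays sadness:
# (P(input | anger), P(input | sad)).
INPUTS = {
    "strong": (0.1, 0.9),  # e.g., 100% intensity: input clearly favours "sad"
    "weak": (0.4, 0.6),    # e.g., 50% intensity: input only slightly favours "sad"
}

# How far does the stereotype move the posterior towards "anger"?
shifts = {}
for label, (l_anger, l_sad) in INPUTS.items():
    shifts[label] = (posterior_anger(STEREOTYPE_PRIOR, l_anger, l_sad)
                     - posterior_anger(NEUTRAL_PRIOR, l_anger, l_sad))
    print(label, round(shifts[label], 3))
```

With these illustrative numbers, the same stereotype-based prior moves the posterior further towards the stereotype-congruent response when the input is weak than when it is strong, which is the pattern the framework predicts.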
In the research of Bijlstra et al. (2010), the likelihood can be described as P(face | emotion), the posterior as P(emotion | face), and the prior, which is influenced by participants' stereotypical associations between Moroccan-Dutch and anger, and White-Dutch and sadness, respectively, as P(emotion). The experimental design facilitated reliance on the prior, i.e., the stereotypes, because the emotional faces were presented only briefly. Further experiments showed that the strength of the stereotype effect on emotion perception indeed depends on participants' priors. Individuals with stronger stereotype associations showed stronger stereotype effects on the emotion perception task (Bijlstra et al., 2014). In sum, a stronger prior has a larger influence on the subsequent interpretation of the sensory input, and thus leads to a greater stereotype effect, just as generally predicted by predictive coding frameworks.
A further prediction made by predictive coding frameworks is that the influence of the prior should increase as the strength of the visual input decreases (as postulated in our hypothetical example above). Accordingly, behavioral studies have long established that people are more likely to rely on stereotypes and heuristics in ambiguous or uncertain situations (Correll, Hudson, Guillermo, & Ma, 2014; Correll, Park, Judd, & Wittenbrink, 2002; Eberhardt, Goff, Purdie, & Davies, 2004; Neth & Gigerenzer, 2015; Tversky & Kahneman, 1974). Similarly, one study on emotion perception presented participants with racially ambiguous faces and showed that participants perceived ambiguous angry faces more often and faster as Black than as White, i.e., in a stereotype-congruent manner (Hutchings & Haddock, 2008).
In the present study, we applied the predictive coding framework (Otten et al., 2017) to the effects of stereotypes on emotion perception by testing the prediction that decreasing the strength of the visual input increases reliance on the prior. That is, we expected that decreasing P(face | emotion), the probability of seeing an emotional expression if there is indeed an emotion displayed, should increase the effect of the stereotypical associations between social group and emotion on P(emotion | face), the probability of categorizing a face as angry or sad.
There are two reasons why this prediction is important to test: Firstly, it is a novel prediction of the predictive coding framework that is not made by dual process models (e.g., Strack & Deutsch, 2004). Secondly, and relatedly, testing a prediction that is unique to the predictive coding framework may help to better understand the applicability of predictive coding frameworks to social cognition in general and emotion perception in particular.
To do so, we relied on a source of uncertainty that was introduced by necessity: Face masks. To limit the spread of the Covid-19 pandemic, many governments recommended the use of surgical face masks and other face coverings to their citizens. Typically, such face masks cover the mouth area and the nose, and often also any remaining area below the eye region. Since the mouth area is fundamental to emotion perception (Blais, Roy, Fiset, Arguin, & Gosselin, 2012), face coverings make it harder to perceive emotions (Carbon, 2020; Rinck, Primbs, Verpaalen, & Bijlstra, 2022). Importantly, perceivers focus on different parts of the face, depending on the emotion displayed. For example, for anger and sadness, perceivers typically focus on the upper part of the face, which means that although important information is obscured by face masks, perceivers should still be able to perceive the displayed emotion (Smith, Cottrell, Gosselin, & Schyns, 2005). Nevertheless, Rinck and colleagues (2022) investigated the accuracy of emotion judgements for models displaying the six basic emotions and a neutral expression with or without a face mask and showed that face masks drastically reduced the accuracy with which most emotions were perceived, by almost 20% across emotions. In terms of a predictive coding model, this means that face masks decrease the strength of the visual input.
In sum, the present research investigated whether face masks affect the reliance on stereotypical associations between certain social categories and emotions, when perceiving emotions. Across two studies, employing a speeded categorization task (Hugenberg, 2005) and different social categories and stereotypes, participants were asked to categorize emotions when presented with various emotional faces with and without face masks. Study 1 investigated whether face masks increase gender-emotion stereotypes. In Study 2, we extended the findings of Study 1 to ethnicity-emotion stereotypes. Based on predictive coding frameworks, we predicted that decreasing the strength of the visual input will increase reliance on the prior. In short, we expected that introducing face masks increases the size of the stereotype effects.
The present research also demonstrates the use of specification curve analysis (Simonsohn, Simmons, & Nelson, 2020) as a viable and necessary tool for the analysis of reaction time data in social cognition research. Recent surveys of reaction time analyses have shown that researchers vary substantially in the type of reaction time data pre-processing they employ (Kerr, Hesselmann, Räling, Wartenburger, & Sterzer, 2017; Primbs, Holland, Quandt, & Bijlstra, 2022), and that differences in data pre-processing have a considerable influence on the outcomes of statistical tests and potential conclusions that can be drawn from the data. Crucially, many of these decisions are arbitrary and only few evidence-based guidelines are available on how to make them (e.g., André, 2021; Ratcliff, 1993). Recognizing this arbitrariness, specification curve analysis allows researchers to draw inferences and obtain p-values across many different data pre-processing and analysis pathways, increasing replicability and decreasing the chance of false positives.

Study 1
In Study 1, participants categorized male and female faces as angry or sad. In line with common gender-emotion stereotypes (Plant et al., 2000), namely that anger is more typical for men and sadness is more typical for women, and replicating earlier research on stereotypes and emotion perception, we expected a two-way interaction between Emotion and Model Gender. That is, we expected that anger is perceived faster on male compared to female faces (male-anger stereotype effect), and that sadness is perceived faster on female compared to male faces (female-sadness stereotype effect). Importantly, and testing the main research question of the present study, we expected that both stereotype effects would be larger for masked compared to unmasked faces (i.e., a three-way interaction between Emotion, Model Gender, and Mask Status).

Participants
We recruited a sample of 262 adult, English-speaking participants via Prolific. After application of our pre-registered exclusion criteria, a final sample size of 155 participants remained. Please note that most excluded participants (n = 102) did not actually complete the experiment: they failed the attention check presented during the instructions and were directly forwarded to the end of the experiment, skipping all experimental trials. The other participants were removed because they were too slow (3SD from the mean reaction time; n = 3) or made too many mistakes (n = 2). The final sample consisted of 97 males and 58 females between 18 and 64 years (M = 25.58, SD = 8.75) from 25 countries. Participants were paid in accordance with Prolific guidelines on fair compensation and received at least 1.75 British pounds for 14 min of their time. The study was reviewed independently by the Ethics Committee Social Sciences (ECSS) of Radboud University, and there was no formal objection to this study (reference number: ECSW2017-3001-45).

Sensitivity power analysis
We further conducted a sensitivity power analysis for a three-way ANOVA using MorePower (version 6.0.4; Campbell & Thompson, 2012). With α = .05, our sample of 155 participants has 80% power to detect effect sizes as small as ηp² = .049.

Procedure & materials
Participants signed up on Prolific and were subsequently linked to the Qualtrics platform, where they were presented with an information letter and a consent form. After agreeing to participate, they were instructed to download the Inquisit Web Launcher (2016), which launched the actual experiment. Next, participants filled in demographic information and then completed a speeded categorization task featuring sad and angry faces. Twenty-four faces (12 male and 12 female actors) were selected from the Radboud Faces Database (Langner et al., 2010), based on average correct emotion identification rates across emotions in the validation study (see Langner et al., 2010). For each actor, the frontal view pictures (90-degree camera angle) displaying the emotions anger and sadness were selected, and a masked version of each image was created by superimposing a surgical face mask, covering the mouth-nose region of the face (see Fig. 1). We used the images of actors 01, 02, 03, 04, 07, 08, 12, 14, 20, 22, 23, 24, 25, 27, 28, 31, 32, 33, 46, 47, 49, 57, 58, and 71 of the Radboud Faces Database.
The speeded categorization task consisted of 192 trials: 2 Emotion (anger, sadness) * 2 Model Gender (male, female) * 2 Mask Status (masked, unmasked) * 12 Actors, with each unique face being presented twice. For each trial, participants were presented with a fixation cross for 1000 ms, followed by a face of 280 × 350 pixels for 200 ms. Participants were instructed to categorize the emotion displayed on the face as fast and accurately as possible, using the 'a' and the 'l' key (key mapping was counterbalanced across participants). Please note that participants always had to choose between categorizing a face as sad or as angry. Response key reminders were presented in the top left and top right corners of the computer screen. The trials were divided into two blocks of 96 trials and participants had the opportunity to take a break between blocks. The order of trials was fully randomised within blocks and each unique face was presented once per block. Please note that the visual angle (that is, the size of the image on the retina, calculated from the size of the stimuli and the distance from the stimuli) varied between participants, because they completed the task at home on their personal computers and therefore varied in distance to the stimuli.
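The trial structure described above can be sketched as follows. This is an illustrative reconstruction, not the Inquisit script used in the study; the function name `build_blocks` and the within-gender actor index (1 to 12, not the Radboud Faces Database actor numbers) are our own assumptions.

```python
import itertools
import random

EMOTIONS = ("anger", "sadness")
GENDERS = ("male", "female")
MASKS = ("masked", "unmasked")
ACTORS_PER_GENDER = 12  # hypothetical index 1..12 within each gender


def build_blocks(seed=None):
    """Return two blocks, each showing every unique face once in random order."""
    rng = random.Random(seed)
    unique_faces = [
        {"emotion": e, "gender": g, "mask": m, "actor": a}
        for e, g, m in itertools.product(EMOTIONS, GENDERS, MASKS)
        for a in range(1, ACTORS_PER_GENDER + 1)
    ]
    blocks = []
    for _ in range(2):
        block = [dict(face) for face in unique_faces]  # copy, then shuffle
        rng.shuffle(block)                             # randomised within block
        blocks.append(block)
    return blocks


blocks = build_blocks(seed=1)
n_trials = sum(len(block) for block in blocks)
print(len(blocks[0]), n_trials)
```

Each block contains the 2 × 2 × 2 × 12 = 96 unique faces once, so the two blocks together give the 192 trials reported above.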
After the experiment, participants completed a seriousness check (Aust, Diedenhofen, Ullrich, & Musch, 2013) by indicating either "I have taken part seriously" or "I have just clicked through, please throw my data away" and were informed that their answer to this question would not affect compensation. To further enhance data quality, participants were presented with an attention check as part of the experimental instructions shown at the beginning of the study. The attention check required participants to read the instructions and subsequently do nothing for 20 s, after which they were forwarded to the next page. Participants who failed the attention check were immediately forwarded to the end of the experiment, skipping all experimental trials and thus not contributing any data.

Data analysis
Confirmatory analyses. Both reaction times and accuracy data were analysed. Recognizing the large number of possible pre-processing decisions in the analysis of reaction time data, we employed specification curve analysis (Simonsohn et al., 2020).

Specification curve analysis.
The steps necessary to set up and conduct our specification curve analysis are shown in Fig. 2. First, we considered our design and our independent and dependent variables (Step 1), consulted prior research on typical ways to analyse reaction time data (Step 2), and used that information to determine a sensible statistical test, in our case a three-way ANOVA with latency as dependent variable, and Emotion (Sadness vs. Anger), Model Gender (Male vs. Female) and Mask Status (Masked vs. No Mask) as independent variables (Step 3). Next, based on our experiences with the analysis of reaction time data and prior research using our paradigm, we determined all sensible pre-processing decisions (Step 4) and used those to create a multiverse of sensible pre-processing pathways (Step 5). That is, we altered (i) the data transformation employed, (ii) the minimum threshold for reaction times to be included in the analyses, and (iii) the data-based outlier trimming technique. For data transformation, we used either no transformation, log-transformation, or latency normalisation, which is a procedure that removes between-subject variability in overall reaction times (Gayet & Stein, 2017), resulting in 3 levels. For the minimum threshold, we varied the response time cut-off from 0 ms to 300 ms in steps of 50 ms, resulting in 7 levels. For the data-based outlier trimming method, we varied the number of median absolute deviations from the median (Leys, Ley, Klein, Bernard, & Licata, 2013) from 1 to 3 in steps of 0.5 or applied no data-based trimming, resulting in 6 levels. In total, the full combination of data transformation, minimum threshold, and data-based outlier trimming gave rise to a multiverse of 3 × 7 × 6 = 126 data pre-processing pathways (Steegen, Tuerlinckx, Gelman, & Vanpaemel, 2016).
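As a minimal sketch of how such a multiverse can be enumerated, the pathways can be written as the Cartesian product of the three pre-processing decisions. The code below is not the authors' implementation (which is available on the OSF); the names `pathways` and `trim_rts` are hypothetical. It further assumes that the minimum threshold is inclusive, that trimming is applied to one list of reaction times at a time, and that the median absolute deviation is scaled by the consistency constant 1.4826 recommended by Leys et al. (2013).

```python
import itertools
import statistics

TRANSFORMS = ("none", "log", "latency_normalisation")  # 3 levels
MIN_RT_MS = tuple(range(0, 301, 50))                   # 0, 50, ..., 300: 7 levels
MAD_CUTOFFS = (None, 1.0, 1.5, 2.0, 2.5, 3.0)          # None = no trimming: 6 levels

# Every combination of the three decisions is one pre-processing pathway.
pathways = list(itertools.product(TRANSFORMS, MIN_RT_MS, MAD_CUTOFFS))
print(len(pathways))


def trim_rts(rts, min_rt_ms, mad_cutoff, scale=1.4826):
    """Apply the minimum threshold, then MAD-based trimming, to a list of RTs."""
    kept = [rt for rt in rts if rt >= min_rt_ms]  # assumption: inclusive cut-off
    if mad_cutoff is None or not kept:
        return kept
    med = statistics.median(kept)
    # Scaled median absolute deviation from the median.
    mad = scale * statistics.median(abs(rt - med) for rt in kept)
    return [rt for rt in kept if abs(rt - med) <= mad_cutoff * mad]
```

For example, `trim_rts([100, 500, 520, 540, 3000], 200, 3.0)` first drops the 100 ms response and then removes the 3000 ms response as an outlier. The transformation step is left out here because its exact implementation is not described in the text.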
Next, we considered the equivalence of the different pre-processing pathways (for details please consult Del Giudice & Gangestad, 2021; Step 6) and concluded that for our purposes, they can be considered equivalent. We want to explicitly recognize that the specified multiverse represents one of many possible multiverses of analyses. Therefore, we pre-registered the pathways and evaluation criteria used in the multiverse analysis (Step 7).
To draw statistical inferences from this multiverse, we first conducted 126 within-subjects ANOVAs; one for each of the 126 possible pre-processing pathways in the multiverse (Step 8). Subsequently, we calculated the median effect size across all pathways and the number of significant pathways for each effect of interest respectively. Then, to obtain p-values, we compared these test statistics to test statistics from datasets where the null hypothesis is true. To obtain such datasets, we employed a resampling technique called permutation testing using a four-step approach (Step 9; Simonsohn et al., 2020): First, we randomly shuffled all independent variables. By randomly shuffling the independent variables, we created a dataset where the null hypothesis is true by construction, but which maintains all other properties of the observed data, such as within-subjects or between pre-processing pathway correlations. Second, we re-ran the 126 ANOVAs based on the multiverse of pre-processing pathways. Third, we repeated this process of shuffling and re-running the analyses 500 times. Finally, we compared the median effect size and the number of significant pathways found in the observed data with the same test statistics of the permuted datasets. That is, we calculated the proportion of permutations which yielded an equally large or larger median effect size and the same or a larger number of significant pathways than the observed data. This proportion of test statistics obtained under the null hypothesis corresponds to a classical p-value (Step 10; Simonsohn et al., 2020). A p-value of <.002 thus means that not a single one of the 500 permutations yielded equally or more extreme test statistics than the observed data.
Fig. 2. Overview chart of all steps required to conduct a specification curve analysis. For users who wish to merely conduct a multiverse analysis, steps 9 and 10 can be skipped.
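The logic of the permutation step can be sketched as follows. This is a simplified illustration rather than the authors' code: the function names are hypothetical, the trial records are assumed to be dictionaries with `participant`, `condition`, and `rt` keys, and we assume that labels are shuffled within participants, which the text above does not specify.

```python
import random


def shuffle_within_participant(trials, rng):
    """Shuffle condition labels within each participant; RTs stay in place."""
    by_participant = {}
    for trial in trials:
        by_participant.setdefault(trial["participant"], []).append(trial)
    shuffled = []
    for pp_trials in by_participant.values():
        labels = [t["condition"] for t in pp_trials]
        rng.shuffle(labels)
        shuffled.extend({**t, "condition": lab}
                        for t, lab in zip(pp_trials, labels))
    return shuffled


def permutation_p(observed, permuted):
    """Proportion of permuted statistics at least as large as the observed one."""
    return sum(1 for stat in permuted if stat >= observed) / len(permuted)


def report_p(p, n_permutations):
    """With zero exceedances, p can only be reported as below 1 / n."""
    floor = 1 / n_permutations
    return f"p < {floor:.3f}" if p == 0 else f"p = {p:.3f}"


# Hypothetical median effect sizes from 500 permuted datasets.
permuted_stats = [0.01] * 492 + [0.06] * 8
p_value = permutation_p(0.05, permuted_stats)
print(report_p(p_value, len(permuted_stats)))
print(report_p(permutation_p(0.10, permuted_stats), len(permuted_stats)))
```

In a full analysis, each of the 500 shuffled datasets would be passed through all 126 pathways before the median effect size and the number of significant pathways are computed; only the final comparison is shown here.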
We followed up on the ANOVA by conducting t-tests aimed at the focal contrasts. First, to further scrutinize the Emotion * Model Gender interaction, we conducted multiverse t-tests investigating whether the difference between male and female faces is significant for both anger and sadness. The t-tests and the subsequent permutation tests followed the same procedure as the ANOVA described above, with one crucial difference: Cohen's d can distinguish directional hypotheses, whereas ηp² cannot, and thus we also calculated the number of significant pre-processing pathways with an effect in the expected direction as a third test statistic.
Second, to further investigate the Emotion * Model Gender * Mask Status interaction, we conducted multiverse t-tests to test whether the effect of masks on the difference scores between male and female faces is significant and in the respective expected direction for both anger and sadness. Here, we followed the same procedure as for the Emotion * Model Gender interaction.
Third, to see whether we replicated the results of Bijlstra et al. (2010), we investigated the Emotion * Model Gender interaction in the subset of unmasked faces. Here we used multiverse t-tests and permutation testing identical to the tests described for the Emotion * Model Gender interaction in the full sample.

Accuracy analyses.
For the accuracy data, we conducted a 3-way within-subjects ANOVA with proportion of correct responses as dependent variable, and Emotion (Sadness vs. Anger), Model Gender (Male vs. Female) and Mask Status (Masked vs. No Mask) as independent variables. To further investigate the accuracy data, we conducted follow-up t-tests for each significant interaction. All accuracy analyses were completed on the dataset with no transformation or outlier removal applied. We define accuracy as the proportion of responses in which a participant perceives the emotion as the actor intended to display it (see Langner et al., 2010). We recognize that the facial expression displayed by the actor does not relate to the internal state of the actor (see Feldman Barrett, Adolphs, Marsella, Martinez, & Pollak, 2019).

Exploratory analyses.
To further explore our data, we tested the possibility that face masks introduced a stereotype-congruent response bias. To that end, we calculated hits (defined as angry face / angry response), false alarms (sad face / angry response), misses (angry face / sad response) and correct rejections (sad face / sad response) for each participant (see footnote 3). Afterwards, we calculated the sensitivity measure d' (d prime) and the response bias indicator beta (Pallier, 2002). For each of the two measures, we conducted a 2-way within-subjects ANOVA with the respective measure as dependent variable, and Model Gender (Male vs. Female) and Mask Status (Masked vs. No Mask) as independent variables.
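The two signal detection measures can be computed from the four response counts as sketched below. The formulas for d' and beta are the standard ones; the function name `sdt_measures` is hypothetical, and the log-linear correction for hit or false alarm rates of 0 or 1 is our assumption, since the text does not state how such cases were handled.

```python
import math
from statistics import NormalDist


def sdt_measures(hits, misses, false_alarms, correct_rejections):
    """Return (d', beta) for one participant in one condition."""
    # Assumed log-linear correction: keeps both rates strictly between 0 and 1.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)

    z_hit = NormalDist().inv_cdf(hit_rate)  # z-transform of the hit rate
    z_fa = NormalDist().inv_cdf(fa_rate)    # z-transform of the false alarm rate

    d_prime = z_hit - z_fa                        # sensitivity
    beta = math.exp((z_fa ** 2 - z_hit ** 2) / 2)  # response bias
    return d_prime, beta


# Hypothetical counts: 48 angry and 48 sad faces in one condition.
d_prime, beta = sdt_measures(hits=40, misses=8,
                             false_alarms=8, correct_rejections=40)
print(round(d_prime, 2), round(beta, 2))
```

With hits defined as angry responses to angry faces, a beta below 1 indicates a bias towards responding "angry" and a beta above 1 a bias towards responding "sad"; symmetric counts, as in the example, give a beta of 1.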
Moreover, to fully explore the Model Gender by Emotion interaction, we also investigated the within-gender contrasts. That is, we conducted a specification curve analysis identical to the main analyses described above based on multiverse t-tests with reaction time as dependent variable and emotion as independent variable for each gender separately.

Transparency statement
All confirmatory analyses were pre-registered on the Open Science Framework (https://osf.io/pz4xh/?view_only=c05a128203944b9e81fdf7e285ef909e), whereas all exploratory analyses were conducted post-hoc with the aim of further understanding the data. The confirmatory analyses did not deviate from the pre-registered analyses. The data and analysis code are accessible on the Open Science Framework. The functions used to run the multiverse analysis were programmed by the authors and the respective version of the functions used in each study is also available on the OSF. An overview of all packages we used can be found in Appendix A. We report all measures, manipulations, and exclusions used in Study 1.
Gender interaction revealed that only the male-anger stereotype (126/126 specifications, p = .004), but not the female-sadness stereotype (0/126 specifications, p = 1), was observed in the data. That means, anger was perceived more quickly on male faces compared to female faces, but sadness was not perceived more quickly on female faces compared to male faces (Table 2). For the Emotion * Model Gender * Mask Status interaction, follow-up t-tests showed that the effect of mask on the stereotype effects was also only significant for the male-anger stereotype (84/126 specifications, p = .016), but not for the female-sadness stereotype (0/126 specifications, p = 1). Importantly, whereas we hypothesized that the stereotype effects would be larger for masked than unmasked faces, the data indicated that the male-anger stereotype effect was smaller in masked than unmasked faces (84/126 specifications, p = .022). Table 1 and Fig. 3 provide an overview over the full test results. Table 2 displays the summary statistics.
3 Please note that defining hits as correct identification of anger is arbitrary: We could have also defined hits as correct identification of sadness. This decision does not influence the interpretation of the results.
.813, ηp² < .001, were non-significant.

Within-gender contrasts.
Finally, we conducted a specification curve analysis identical to the main analyses described above based on multiverse t-tests with reaction time as dependent variable and emotion as independent variable for each gender separately. The multiverse t-tests and subsequent specification curve analysis for the effect of emotion in the subset of male faces revealed that the median effect size (d = 0.487, p < .002), the number of significant specifications (126/126, p < .002), and the number of significant specifications in the observed direction (126/126, p < .002) were all significantly higher than would be expected if the null hypothesis were true. Likewise, the analysis of the female faces showed that the median effect size (d = 0.21, p < .002), the number of significant specifications (126/126, p = .002), and the number of significant specifications in the observed direction (126/126, p = .002) were all significantly higher than would be expected if the null hypothesis were true. That means that for both male and female faces, anger is perceived faster than sadness.

Discussion
The goal of Study 1 was to investigate whether face masks affect the reliance on stereotypical associations of male and female faces with anger and sadness. In line with our predictions, we successfully replicated the male-anger stereotype effect, with anger being perceived more quickly on male compared to female faces. However, in contrast to our predictions, we found that face masks decreased the size of this male-anger stereotype effect. Moreover, we failed to find a female-sadness stereotype effect, with sadness not being perceived more quickly on female compared to male faces, and consequently we did not find an effect of face masks on sad faces. Notably, following Bayesian models of social cognition (Otten et al., 2017), the observed direction of the effect of face masks on angry faces is very unlikely, and presents a challenge to the suitability of Bayesian models for explaining stereotype effects in speeded categorization paradigms.

Study 2
Study 1 provided inconclusive evidence about the effect of face masks on the stereotype effect in emotion perception. Thus, with Study 2 we conceptually replicated Study 1 by extending our investigation to ethnicity-emotion stereotype associations. In line with societal stereotypes that anger is more typical of Moroccan-Dutch males than of White-Dutch males, we expected anger to be perceived faster for Moroccan-Dutch compared to White-Dutch faces (Moroccan-anger stereotype effect). Moreover, although neither sadness nor anger is more typical of Dutch faces in general, we would expect sadness to be perceived faster for White-Dutch compared to Moroccan-Dutch faces (Bijlstra et al., 2014). Following a predictive coding framework, we would expect again that these effects are larger for masked compared to unmasked faces. In contrast, following the results of Study 1, we would expect that these effects are smaller for masked faces compared to unmasked faces.

Participants
We recruited a sample of 203 adult, White, and English-speaking participants via Prolific. To increase the likelihood that participants were exposed to societal stereotypes related to Moroccan males, only participants residing in Germany, the Netherlands, France, or Belgium were eligible to participate. We applied our pre-registered exclusion criteria and excluded participants who did not complete the whole experiment (n = 19), who indicated that they did not participate seriously (n = 1), who made too many errors (3SD from the mean; n = 4) or who were too slow (3SD from the mean reaction time; n = 4). The final sample of 175 participants consisted of 97 males and 78 females between 18 and 61 years (M = 29.49, SD = 9.08) of 28 nationalities. Participants were paid in accordance with Prolific guidelines on fair

Sensitivity power analysis
We further conducted a sensitivity power analysis using MorePower (version 6.0.4; Campbell & Thompson, 2012). With α = .05, our sample of 175 participants has 80% power to detect effect sizes as small as ηp² = .044.

Procedure & materials
The procedure was mostly identical to Study 1, with two notable differences. First, as Study 2 focused on ethnicity, we replaced the sample of female faces from the Radboud Faces Database (Langner et al., 2010) with a sample of male Moroccan-Dutch faces from the same database. We used the images of actors 03, 07, 20, 23, 24, 25, 28, 29, 33, 45, 46, 47, 49, 50, 51, 52, 55, 59, 67, 69, 70, 71, 72, and 73. Second, in line with Prolific policy, we changed the attention check of Study 1, and we now instructed participants to select a specific answer on two questions ostensibly related to the research at hand.

Confirmatory analyses
The confirmatory analyses were identical to the confirmatory analyses of Study 1 (see https://osf.io/6ayd5/?view_only=bce1ec71085d40aeac7963ff7cf77b73). However, we investigated the effects of model ethnicity instead of the effects of model gender.

Exploratory analyses
The exploratory analyses were identical to the exploratory analyses of Study 1. However, we again focussed on model ethnicity instead of model gender.

Transparency statement
As in Study 1, all hypotheses and analyses were pre-registered, and the data, analysis scripts, and functions employed to implement the specification curve analysis are available on the OSF (https://osf.io/vcwbu/files?view_only=eb2f0ee86614481eb7f354168963a40e). We report all measures, manipulations, and exclusions used in Study 2.

Reaction times
The multiverse ANOVA, and subsequent specification curve analysis on the reaction time data, showed that the number of significant specifications for the Emotion * Model Ethnicity interaction was significantly higher than would be expected if the null hypothesis were true (126/126 specifications, p < .002). Follow-up tests showed that anger was perceived faster on Moroccan-Dutch compared to White-Dutch faces (126/126, p < .002) and sadness was perceived faster on White-Dutch compared to Moroccan-Dutch faces (126/126, p < .002; see Table 4). For the Emotion * Model Ethnicity * Mask Status interaction, the number of significant specifications was not significantly higher than would be expected if the null hypothesis were true (0/126, p = 1). Follow-up tests indicated that neither the Dutch-sadness association (21/126, p = .058) nor the Moroccan-anger stereotype (0/126, p = .784) was influenced by face masks. Notably, anger was detected both faster and more accurately on Moroccan-Dutch compared to White-Dutch faces, and on masked compared to unmasked faces. Table 3 and Fig. 4 provide an overview of the full test results.
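The inferential step for the count of significant specifications can be sketched in Python as follows. This is a simplified illustration, not the R functions available on the OSF: it assumes that the p-value is the share of null datasets that yield at least as many significant specifications as the observed data. The function name and the null counts are hypothetical.

```python
def spec_curve_p_value(observed_count, null_counts):
    """Share of null datasets with at least as many significant
    specifications as were observed in the real data."""
    if not null_counts:
        raise ValueError("need at least one null dataset")
    as_extreme = sum(1 for count in null_counts if count >= observed_count)
    return as_extreme / len(null_counts)


# Hypothetical counts of significant specifications (out of 126)
# in 100 datasets generated under the null hypothesis.
toy_null_counts = [0] * 90 + [30] * 10

p_partial = spec_curve_p_value(21, toy_null_counts)   # some pathways significant
p_all = spec_curve_p_value(126, toy_null_counts)      # every pathway significant
p_none = spec_curve_p_value(0, toy_null_counts)       # no pathway significant
print(p_partial, p_all, p_none)
```

In this toy example, an observed count of zero always yields p = 1, because every null dataset has at least zero significant specifications. An estimate of zero would conventionally be reported as smaller than one divided by the number of null datasets.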

Accuracy
The Emotion * Model Ethnicity * Mask Status ANOVA of the error rates showed significant main effects of Emotion, F(1, 174)

Within-ethnicity contrasts.
Finally, we conducted a specification curve analysis identical to the main analyses described above, based on multiverse t-tests with reaction time as dependent variable and emotion as independent variable, for each ethnicity separately. The multiverse t-tests and subsequent specification curve analysis for the effect of emotion in the subset of Moroccan-Dutch faces revealed that the median effect size (d = 0.397, p < .002), the number of significant specifications (126/126, p < .002), and the number of significant specifications in the observed direction (126/126, p < .002) were all significantly higher than would be expected if the null hypothesis were true. The analysis of the White-Dutch faces showed that the median effect size (d = −0.072, p = .022) and the number of significant specifications in the expected direction (49/126, p = .022) were both significantly higher than would be expected if the null hypothesis were true, whereas the absolute number of significant specifications (49/126, p = .052) was not higher than would be expected if the null hypothesis were true. These exploratory analyses support the conclusions of the main analyses, namely that there is evidence for a Moroccan-anger stereotype, and weaker but still significant evidence for the idea that sadness is more strongly associated with White-Dutch compared to Moroccan-Dutch faces.

Discussion
The goal of Study 2 was to investigate whether face masks affect the reliance on stereotypical associations between Moroccan-Dutch and White-Dutch faces and anger and sadness. Firstly, we predicted that anger would be perceived faster on Moroccan-Dutch compared to White-Dutch faces (Moroccan-anger stereotype), and that sadness would be perceived faster on White-Dutch compared to Moroccan-Dutch faces. Notably, we found evidence for both hypotheses, and thus successfully replicated prior studies (Bijlstra et al., 2014). Secondly, we compared predictions derived from predictive coding accounts (Otten et al., 2017) with predictions derived from Study 1 and found evidence for neither.

General discussion
In the current studies, we replicated existing research showing stereotype effects in emotion perception (Bijlstra et al., 2014). More precisely, we successfully replicated Male-anger and Moroccan-anger stereotypes, whereas we found no evidence for a Female-sadness stereotype. In addition, we showed that sadness is more strongly associated with White-Dutch compared to Moroccan-Dutch faces. These findings are well in line with predictive coding frameworks: The stereotypes influence the prior and interact with the emotional faces to influence the speed and accuracy with which the emotions are perceived. Notably, the effect sizes for the anger-related stereotypes were considerably larger than the effect sizes for sadness-related stereotypes. Following theoretical frameworks that argue that stereotypes are culturally ingrained knowledge (Devine, 1989; Payne, Vuletich, & Lundberg, 2017), this hints at a larger prevalence or greater importance of anger-related stereotypes in the contemporary cultural environment. Alternatively, one may argue that anger-related stimuli signal threat and are thus detected faster by specialised threat detection pathways (Tamietto & de Gelder, 2010). For example, earlier research showed that people detect animals associated with threat faster than other animals (Öhman, Flykt, & Esteves, 2001) and find threatening faces in a crowd more easily than non-threatening faces (Öhman, Lundqvist, & Esteves, 2001). Likewise, seeing Moroccan-Dutch faces may activate danger-related stereotypes and thus facilitate processing more strongly than White-Dutch faces, for which there are weaker danger-related stereotypes. Both explanations are congruent with predictive coding frameworks, as they simply describe different ways in which the prior formation may have taken place.
However, the main research question investigated in the present paper was whether decreasing the strength of the visual input increases reliance on the prior. Across two studies using face masks and gender-emotion and ethnicity-emotion stereotypes, we did not find evidence for this prediction. Importantly, predictive coding frameworks can neither explain the absence of the effect in Study 2 nor its reversal in Study 1. Theoretically, decreasing the probability of seeing an emotional expression if there is indeed an emotion displayed should increase the effect of the stereotypical associations between social group and emotion on the probability of categorizing a face as sad or angry. Even so, whereas increased reaction times for masked faces show that we indeed decreased the strength of the visual input, the relationship between prior and visual input is not necessarily linear. That is, decreasing the strength of the visual input by some unit need not always lead to an increase in reliance on the prior by some unit. As such, the absence of the hypothesized effects could be due to a lack of potency of our manipulation. That is, the face masks may not have impaired emotion perception enough to increase reliance on prior stereotypes in a detectable way (Blais et al., 2012). However, in the supplementary materials we present two additional studies which degrade the strength of the visual input by means of noise patterns superimposed on the faces. These studies reveal similar outcomes and replicate earlier stereotype effects, but again, these were not qualified by the strength of the visual input and thus provide further evidence not in line with predictive coding frameworks (Supplementary Materials S1). Importantly, an explanation arguing that our face mask manipulation was insufficient cannot account for the findings of Study 1, which showed effects opposite to those predicted by predictive coding accounts.
Together, we argue that it is unlikely that the absence of the effect is caused by our manipulation. Dual-process models, on the other hand, provide a possible explanation for this finding (Strack & Deutsch, 2004): More ambiguous stimuli need to be processed more deliberatively, decreasing the potential effects of stereotypes. As masked faces were generally processed more slowly, they might have been processed more deliberately, potentially decreasing the size of the stereotype effect. In line with this hypothesis, research on multiracial faces has shown that people are slower in categorizing multiracial (i.e., more ambiguous) faces compared to monoracial faces and that cognitive load interferes with the categorization of multiracial but not monoracial faces (Chen & Hamilton, 2012). Moreover, the response bias towards anger observed in masked faces may have further reduced the size of the stereotype effects. That is, face masks in themselves might trigger particular responses when categorizing emotions.
In addition to our theoretical goals of understanding stereotype effects in social perception, we also demonstrated the value of specification curve analysis (Simonsohn et al., 2020) as an analysis tool in social cognition research. That is, in both Studies 1 and 2 some effects were only present in a subset of the multiverse of pre-processing pathways. Hence, researchers choosing any single pathway may erroneously conclude the presence or absence of a particular effect. For example, when looking at the categorization advantage for White-Dutch sad faces over Moroccan-Dutch sad faces, we find that some pre-processing pathways were significant (21/126), indicating differences between pre-processing choices. As such, analyses based on any of these 21 pathways would have concluded that face masks increase reliance on these stereotype associations. However, for the remaining 105 pathways, no such conclusion would have been justifiable. Yet, by conducting analyses across pre-processing pathways and comparing test statistics of the observed data with test statistics of datasets where the null hypothesis is true, we found that the number of significant pre-processing pathways is not different from what would be expected if the null hypothesis were true. This means that specification curve analysis has the potential to reduce the proportion of false positive findings in the scientific literature and shorten discussions about optimal analysis strategies (e.g., Christensen & Christensen, 2014; Jung, Shavitt, Viswanathan, & Hilbe, 2014a, 2014b; Malter, 2014; Munoz & Young, 2018; Simonsohn et al., 2020). Notably, most effects in the present study were robust to differences in data pre-processing: Either all 126 pathways indicated a significant result, or no pathway did, increasing confidence in the veracity of our findings.
We argue that the field of social cognition, and other research areas in which reaction time paradigms are frequently used, would benefit considerably from adopting specification curve analysis. For reaction time data, there often is no strong justification to adopt any single pre-processing pathway over others, and surveys of the published literature (Kerr et al., 2017; Primbs et al., 2022) and many-analyst projects (Dutilh et al., 2019) indicate that even experts do not agree on what constitutes the right model, analysis, or pre-processing pathway. Importantly, studies have shown that these pre-processing decisions heavily influence possible outcomes of statistical tests (André, 2021; Ratcliff, 1993) and determine the interpretation of any given study. Applying specification curve analysis provides researchers with more confidence in their findings, which strongly benefits scientific progress.
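The idea of a multiverse of pre-processing pathways can be illustrated with a minimal Python sketch. The pre-processing choices below are hypothetical examples of common reaction time decisions. They are not the 126 pathways used in the present studies, which are defined in the analysis scripts on the OSF, and the reaction times are toy values.

```python
import itertools
import math
import statistics

# Hypothetical pre-processing choices (assumptions for illustration only).
LOWER_CUTOFFS_MS = [None, 200]          # drop anticipatory responses or not
UPPER_RULES = ["none", "2000ms", "3sd"]  # how to treat slow responses
TRANSFORMS = ["raw", "log"]             # analyse raw or log reaction times


def preprocess(rts, lower, upper, transform):
    """Apply one pre-processing pathway to a list of reaction times (ms)."""
    kept = [rt for rt in rts if lower is None or rt >= lower]
    if upper == "2000ms":
        kept = [rt for rt in kept if rt <= 2000]
    elif upper == "3sd":
        mean, sd = statistics.mean(kept), statistics.stdev(kept)
        kept = [rt for rt in kept if rt <= mean + 3 * sd]
    if transform == "log":
        kept = [math.log(rt) for rt in kept]
    return kept


def multiverse(rts):
    """Mean reaction time under every combination of pre-processing choices."""
    results = {}
    for spec in itertools.product(LOWER_CUTOFFS_MS, UPPER_RULES, TRANSFORMS):
        results[spec] = statistics.mean(preprocess(rts, *spec))
    return results


toy_rts = [150, 480, 510, 530, 560, 600, 640, 700, 2500]
summary = multiverse(toy_rts)
for spec, mean_rt in summary.items():
    print(spec, round(mean_rt, 2))
```

Even in this toy example the summary statistic depends on the pathway chosen, which is why a test run on a single pathway can differ from a test run on another. A specification curve analysis would run the statistical test of interest on every pathway instead of computing a simple mean.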

Limitations and future directions
Finally, our studies have at least two notable limitations. First, we investigated the effect of face masks on the stereotype effect using a speeded categorization paradigm only. While that is consistent with previous work demonstrating the stereotype effect, it could be argued that face masks affect perception differently in dynamic emotion displays (Bijlstra et al., 2014). As such, the evidence provided in the present study is limited to specific types of emotion perception processes. Future studies should therefore expand the range of paradigms used to investigate stereotype effects in emotion perception and their underlying processes. Second, the stimuli used in the present set of studies were created by adding face masks to existing images, not by taking images of actors wearing face masks, who might display emotions differently than without face masks. This may have further complicated the perception of emotions on masked faces. However, different manipulations to decrease the strength of the visual input also did not produce the expected result (see Supplementary Materials S1). Still, future research should employ other, more potent manipulations.
To summarise, the present research investigated the effects of face masks on the stereotype effect in emotion perception. We successfully replicated male-anger and ethnicity-emotion stereotypes in emotion perception and demonstrated the efficacy of specification curve analysis in social cognition research but did not find evidence that face masks increase reliance on prior stereotype associations. Our findings challenge the applicability of predictive coding frameworks to social cognition research.

Open practices
Our studies were preregistered. The preregistration for Study 1 can be found here: https://osf.io/pz4xh/?view_only=c05a128203944b9e81fdf7e285ef909e. The preregistration for Study 2 is largely identical to the one for Study 1 and can be found here: https://osf.io/6ayd5?view_only=bce1ec71085d40aeac7963ff7cf77b73. The data and analysis scripts associated with Study 1 and Study 2 and the supplementary materials can be found here: https://osf.io/vcwbu/files?view_only=eb2f0ee86614481eb7f354168963a40e.

Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Declaration of Competing Interest
There are no conflicts of interest.

Appendix A. Software use statement
The present study relied on R (R Core Team, 2020) and the following software packages programmed in R: MASS (version 7.3.51.6; Venables