On the trajectory of discrimination



Introduction
How widespread is gender discrimination in hiring and selection, and have at least some human societies experienced meaningful change towards greater equality of opportunity? These intertwined questions represent two of the most theoretically rich, practically important, and politically controversial scientific issues of our time. For scholars, the answers hold implications for our understanding of the nature of gender stereotypes and the possibility of rapid cultural evolution. For practitioners, they point to different tactics for ensuring the fairness of selection processes into organizations. For citizens and leaders, they may validate or deeply challenge ideological assumptions and worldviews.
According to widely influential theories of gender roles and gender inequality, social stereotypes both reflect and buttress women's and men's traditional roles in families and communities (e.g., caregiver roles). These in turn stem in part from physical differences between women and men, in particular women's role in birthing and nursing children (Eagly & Wood, 2012; Eagly, Wood, & Diekman, 2000). In contemporary societies, the legitimacy behind women rather than men serving in caregiver roles is much diminished. Yet, traditional gender roles, and associated explicit and implicit beliefs contributing to discrimination, can still persist (Banaji & Greenwald, 2013; Glick & Fiske, 2001). The transmission of cultural values across generations, even after the historical circumstances that gave rise to them have largely faded (Nisbett & Cohen, 1996; Talhelm et al., 2014), suggests that gender biases will perpetuate themselves across time (Cortes & Pan, 2018; Levanon & Grusky, 2016).
The present investigation seeks to introduce new evidence to this discussion by conducting a preregistered meta-analysis of 44 years of field audits of gender bias in callback rates for job applications (Study 1), and an accompanying forecasting survey gauging academic and lay predictions about the likely results (Study 2). Audit studies, in which job applications from carefully matched female and male candidates are sent to real organizations, have high ecological validity and can estimate a causal effect of gender on hiring and selection decisions (Neumark, 2018; Quillian & Midtbøen, 2021; Quillian, Pager, Hexel, & Midtbøen, 2017; Rich, 2014). In contrast, observational field investigations, for example of performance evaluations, wage gaps, or job promotions, may be confounded by unmeasured differences between women and men (Card, DellaVigna, Funk, & Iriberri, 2020). Another option for studying gender bias is laboratory experiments, which typically use hypothetical hiring scenarios and non-expert participants who may exhibit biased judgments that would not emerge among experienced and accountable decision makers (Tetlock & Mitchell, 2009). Conversely, since they are aware of being studied, laboratory participants may correct their judgments for social desirability reasons (i.e., to avoid appearing prejudiced or sexist), shrouding biases that might have been observed under more naturalistic conditions (Tierney et al., 2020). In a meta-analysis of audits of real organizations that did not know they were part of a scientific study, the effects of year can be used to assess stability or change in labor market biases over time (Eagly, Makhijani, & Klonsky, 1992; Koch, D'Mello, & Sackett, 2015; Quillian & Lee, 2023; Quillian et al., 2017; Stanley & Jarrell, 1998; Williams & Tiedens, 2016). Our specific focus on field audits of the effects of candidate gender on selection decisions therefore maximizes ecological validity and causal inferences, both of which are critical for highly informative tests of cultural changes in discriminatory treatment.

Competing theories of stability and change in gender discrimination
One of our key research questions was whether there is a time trend in gender discrimination in job application outcomes. Different patterns of cultural evolution in discrimination based on applicant gender are possible. Biased selection decisions may have remained stable over time, such that there is significant discrimination against women in recent as well as older field audits. This persistence-of-bias account posits that the continuing existence of many stereotypes and sexist beliefs (Eagly et al., 2000; Glick & Fiske, 2001) means behavioral discrimination should continue largely undiminished. Indirectly relevant meta-analytic evidence suggests that in many Western societies, racial and ethnic discrimination in selection for jobs has persisted across all observed time periods (Quillian & Lee, 2023; Quillian et al., 2017). If racial discrimination in hiring remains pervasive, this increases the plausibility that gender bias in candidate selection, which theoretically derives from some of the same implicit and explicit mental processes (Greenwald & Banaji, 1995) and situational forces (Larwood, Szwajkowski, & Rose, 1988), remains widespread as well. More direct evidence is provided by recent work demonstrating that gender stereotypes remain deeply ingrained in the minds of many in the form of automatic associations (Charlesworth & Banaji, 2022) and are reflected in widely consumed cultural products such as music (Boghrati & Berger, 2023). If gender stereotypes are "in the air" in the surrounding culture and conditioned in people's minds, it is reasonable to expect that stereotype-based discrimination in selection decisions against female and male candidates is commonplace as well. Substantial preceding research thus provides a strong a priori empirical reason to expect similar robust biases in hiring against female applicants, all the way up to the present.
Alternatively, discrimination against female candidates may have faded away over time, such that recent studies will reveal little gender disparity in selection. This fading-of-bias account acknowledges that unfair discrimination was common in past generations, contributing to inequalities that have carried over into the present. For example, gender gaps in representation in senior leadership positions today are attributable in part to upstream biases in selection decades ago that limited the present-day pool of available talent just below the executive level. Yet from this perspective, today's organizational decision makers have become better at correcting for societal stereotypes when it comes to deciding who to hire (Tetlock & Mitchell, 2009), and given empirical evidence of changes in at least some gender norms and behaviors (Badura, Grijalva, Newman, Yan, & Jeon, 2018; Hora, Badura, Lemoine, & Grijalva, 2021; Koenig, Eagly, Mitchell, & Ristikari, 2011) may also be less biased in the first place. A more nuanced view posits that gender discrimination is uncommon in hiring decisions, which are publicly visible and carefully monitored for bias by individuals and organizations, but still visible in compensation decisions that occur behind a shroud of confidentiality (Ceci et al., 2023). Regardless, from this perspective, contemporary selection processes are in the aggregate no longer substantially impacted by applicant gender.
Yet another possibility is that gender preferences in hiring decisions have reversed over time. This would imply a negative time trend for discrimination against women initially, followed by a transition into a preference for female candidates in recent years as organizations have striven to overcome historical discrimination and contemporary underrepresentation. Under the "reverse" discrimination account, some individuals and organizations perceive female employees as offering diversity value that goes above and beyond their human capital value (e.g., Chang, Milkman, Chugh, & Akinola, 2019; Leslie et al., 2017). Whether a matter of genuine inclusion motives or strategic signaling, there could be a premium associated with female candidates for certain roles in contemporary organizations. This may be especially true for roles from which women were historically excluded and where they continue to be underrepresented. In such contexts, the motive to include more female candidates and achieve greater representation of women should be stronger. Note that the average gender discrimination effect contextualizes any time trend. For a society to collectively exhibit bias against women for most of its history and then show progressively more gender-balanced judgments over time is quite different from initially gender-neutral judgments turning into a preference for female candidates. The former is a case of the gradual fading-of-bias; the latter, a gradual introduction of a different bias.
Finally, it is possible that the trend across time will reveal an inflection point associated with recent social movements related to workplace sexual harassment, specifically the #MeToo movement (Luo & Zhang, 2021). The global attention to prominent harassment and assault cases, along with the powerful everyday narratives associated with #MeToo shared on social media, may have accelerated cultural change processes with regard to gender. This #MeToo hypothesis expects more favorable outcomes for female applicants in the post-#MeToo years (i.e., from 2018 onwards; see Luo & Zhang, 2021). Although the direct focus of #MeToo is on gender-based harassment, the accompanying changes in gender sensitivities and standards for appropriate behavior may have spilled over to other forms of gender discrimination, such as in selection decisions.
In testing for potential cultural changes, we consider the gender typicality of the job (stereotypically female, relatively gender balanced, or stereotypically male), since past laboratory and field studies identify this as a key moderator of hiring evaluations (Davison & Burke, 2000; Eagly et al., 1992; Glick, Zion, & Nelson, 1988; Koch et al., 2015; Riach & Rich, 2002). The theoretically predicted patterns regarding the potential persistence, fading, and reversal of bias in selection decisions do not necessarily apply to jobs that society has historically deemed the purview of women (e.g., nurse or receptionist) as contrasted with male-typed (e.g., construction worker or carpenter) and comparatively gender-balanced (e.g., sales representative) jobs. At the same time, selection decisions for female-typed jobs are of theoretical interest because they could reflect a general weakening of social stereotypes and gender norms, if employers are increasingly open to men who apply to fulfill such roles in organizations. Conversely, a reduction in discrimination against female candidates for male-typed and gender-balanced jobs without a simultaneous increase in selecting men for female-typed jobs would more likely reflect selective changes in stereotypes (Eagly et al., 2020) and employers seeking to increase the representation of women but not men (Block et al., 2019). This led to a set of research questions through which the meta-analysis aimed to help adjudicate between the competing theories. Similar to Tierney et al. (2020, 2021), who engaged in competitive theory testing in the context of gender bias, we carried out a single set of pre-registered analyses whose results could support or fail to support different theoretical accounts with contrasting hypotheses.
Research Question 1: On average, do men experience more positive job application outcomes than women?
Research Question 2: Is the effect of gender on job application outcomes moderated by the job's gender stereotypicality?
Research Question 3: Is there a time trend in gender bias in selection decisions?
Research Question 4: Are the years since the onset of the #MeToo movement (i.e., from 2018 onwards) associated with a change in discrimination against female applicants?
Although it is admittedly speculative and may or may not find empirical support, the #MeToo hypothesis can be tested with the available data and is theoretically informative regarding the nature of changes in gender norms (linear or nonlinear, consistent or fragmented). More generally, empirical investigations suggest interrelationships and spillovers between only indirectly related dimensions of cultural change (Charlesworth & Banaji, 2022; Charlesworth, Yang, Mann, Kurdi, & Banaji, 2021; Norris & Inglehart, 2004; Varnum & Grossmann, 2016, 2017). This provides at least some prior empirical and theoretical basis to expect that shifts in societal norms regarding sexual harassment could spill over to selection decisions involving female and male candidates.
Finally, as an exploratory analysis, we examined whether there was either a preference for male applicants, a preference for female applicants, or no bias toward either gender in recent years. The average effect speaks to whether organizations should focus their debiasing interventions on the selection process or further downstream such as in compensation decisions, work assignments, and promotions. It also speaks to whether societies seeking greater gender balance in the workplace should invest their energies in preventing selection-stage bias by employers or focus further upstream on access to educational opportunities and childcare. However, how precisely to parse the last half-century into different time periods on a meaningful basis is not immediately clear. For example, one could divide studies by decade, into 5-year spans, pre- and post-2000, or into quartiles based on the total number of investigations. Thus, although the theoretically predicted patterns of overall discrimination past and present were pre-registered, our statistical analyses regarding the presence or absence of bias in recent time periods were based on arbitrary time increments and are thus exploratory in nature.
The present investigation's contributions to the literature on gender are multifold. We assess the direction, severity, and stability of gender discrimination with unprecedented rigor, leveraging recent open-science best practices such as pre-registration of methodology and analyses (Wagenmakers et al., 2012) and an audit by a "red team" of external experts (Lakens, 2020) to prevent researcher bias. We prespecify the competing empirical predictions of sometimes complementary and sometimes contradictory theoretical accounts, maximizing the informational value of the investigation for theories of gender and society. Our substantial sample of 44 years of audit studies on gender, the largest ever assembled, allows for informative tests of not only the moderating role of job stereotypicality but also recent events such as the #MeToo movement. This is the first investigation to assess scientist and lay perceptions of gender discrimination over the years and map these onto objective empirical results, offering the opportunity to put clashing priors about societal change and pervasive prejudice to a rigorous empirical test (Tetlock, Mellers, Rohrbaugh, & Chen, 2014). The Trajectory of Discrimination project is part of a broader, ongoing program of research from our group that seeks to open the science of diversity and discrimination using recent open and crowd science innovations (Dreber et al., 2015; Klein et al., 2014; Lakens, 2020; Wagenmakers et al., 2012).

Study 1: A meta-analysis of stability and change in gender discrimination over time
Our empirical approach for the meta-analysis followed a multi-step strategy. Before committing ourselves to our methodology, we recruited a "red team" (Lakens, 2020) of expert critics to provide detailed feedback to the main project team (blue team) regarding the initial project plan. The revised and optimized approach was then preregistered on the Open Science Framework (https://osf.io/ha3n4). Building on meta-analytic investigations that focused on recent studies only (e.g., Lippens, Vermeiren, & Baert, 2021), or that sampled mostly laboratory experiments along with a smaller set of field audits (Koch et al., 2015), we attempted to identify all field audits from any year concerned with gender and hiring discrimination. Next, we utilized an a priori coding scheme to extract and process relevant information from the target articles and reports to create a database for our analyses. Finally, we conducted the preregistered meta-analytic analyses, as well as additional exploratory analyses.

Red team approach
The prevalence of gender bias in hiring and of other forms of group-based discrimination is among the most controversial issues in the social sciences (Arkes & Tetlock, 2004; Banaji et al., 2004; Ceci et al., 2014; Ceci et al., 2023; Heilman & Eagly, 2008; Landy, 2008). Concerns about potential researcher ideological and intellectual commitment biases on both sides are common in this space (Clark & Winegard, 2020; Cyrus-Lai et al., 2022; Duarte et al., 2015; Jost et al., 2009). In light of the strong need to enhance objectivity, increase trust, and generally maximize the informational value of our meta-analysis, we leveraged emerging best practices of open science, including pre-registration of analyses and open data (Nelson et al., 2018; Wagenmakers et al., 2012). This constrains researcher degrees of freedom, and greatly expands opportunities for re-analyses and alternative perspectives from other scholars.
To further optimize our methods, we employed the innovative "red team" approach (Lakens, 2020; Zenko, 2015). A red team is a designated team of scientific experts external to the core author group (the "blue team"). Two coordinators recruited an independent team of experts on statistics, meta-analysis, and gender research, as well as a librarian, to critique all aspects of our meta-analysis plan, point out potential issues, and suggest improvements. The goals of the red team approach were to improve the quality of the research project by identifying flaws and challenging dominant assumptions in our work, to incorporate different viewpoints, and to invite early feedback from international experts. We preregistered and carried out the optimized study methodology and analysis plan, followed by another round of feedback from the red team.
Unlike traditional peer reviewers, red team members are financially compensated for their work and provide feedback throughout the project, when it is still possible to correct errors or methodological weaknesses. The logic of the red team approach is comparable to a registered report publication system, in which research protocols are reviewed by the journal before the results are known (Chambers et al., 2015). However, the criticism is not invited by the journal but by the authors (blue team). Such an approach allows for an exchange between researchers and a "devil's advocate" that aims to produce a higher quality research plan before submission to a journal by identifying oversights, soliciting feedback from experts, and preventing groupthink (Lakens, 2020). A red team is also similar in some respects to an adversarial collaboration, where researchers with directly opposing predictions work to design a study together (Clark & Tetlock, 2022; Mellers, Hertwig, & Kahneman, 2001), except that red team members are recruited for expertise alone rather than their intellectual commitments.
Because our goal was to generate critical feedback on our bibliographic search, data coding, analysis, as well as our theorizing and inferences, we recruited five red team members (four female, one male) with expertise in one or more of these domains (see Supplementary Online Materials for anonymized brief profiles). Three of the red team members were scholars with training and publishing experience in the domain of gender research, some with additional expertise in field audit methods, and included one qualitative gender studies expert. This was complemented by a scholar with expertise in meta-analytic methods and statistics, as well as a senior librarian who advised us on our bibliographic search approach. With the exception of one scholar who was in advanced doctoral training and the librarian, the remaining red team members had doctoral degrees in their respective areas and were either post-doctoral fellows or tenure-track faculty. Four red team members received financial compensation for their feedback and one red team member declined payment.
We solicited feedback from the red team at two stages of the project. An initial round of feedback was requested after we had conducted a preliminary bibliographic search, developed a preliminary coding scheme, and extracted data from a portion of the studies. Five red team members participated in the first stage. No analyses had been conducted at this time. A second round of feedback was requested after completion of the revised search, data analyses, and draft manuscript. Three red team members participated in the second round (one gender expert, one statistician, and one librarian). In both rounds, the red team was given approximately two to four weeks to provide the blue team with written feedback. After receiving the first round of feedback on the planned methods, we made extensive revisions to our approach and responded to each suggestion by the red team, explaining what changes were made to address their concerns or why we decided not to incorporate a particular suggestion. For example, based on the first round of feedback from the red team, we revised our search terms and preregistered our revised search and coding approach in detail. After receiving the second round of feedback, we made changes to the manuscript (e.g., clarifying arguments, extending the discussion) and Supplementary Online Materials (e.g., providing more methodological detail, conducting additional analyses). For instance, we conducted additional publication bias analyses and added more material on potential limitations of the current set of studies. The red team also identified a coding error and a rounding error, which we subsequently corrected. The full-length, anonymized feedback by the red team and the blue team's respective responses are available on the Open Science Framework (https://osf.io/pt4gn). Table S3 in the Supplementary Online Materials provides an overview of the most important feedback exchanges for each aspect of the meta-analysis.

Identification of relevant studies
Once the blue team and red team had settled on the meta-analysis methodology and planned statistical analyses, we worked to identify all published and unpublished field audits examining a contrast in hiring-related outcomes between female and male job applicants. This includes all in-person audit studies and resume correspondence studies that manipulated gender either "within" employer (i.e., an employer received applications from both female and male candidates) or "between" employer (i.e., an employer received applications from either a female or male candidate) and that kept all other candidate characteristics equivalent either through randomization or creating matched pairs.
We employed multiple search strategies during April 2021, including searches in academic databases, citation searches, email requests to corresponding authors of gender-related field experiments, and public calls for unpublished work. First, we conducted a systematic search of primary academic databases, including Web of Science Core Collection (A&HCI, BKCI-SSH, BKCI-S, ESCI, SCI-EXPANDED, SSCI), Business Source Ultimate (via EBSCO), EconLit (via EBSCO), Humanities International Complete (via EBSCO), APA PsycArticles (via EBSCO), APA PsycInfo (via EBSCO), SocINDEX (via EBSCO), and Google Scholar (first 1,000 results). Our search string, expanded substantially after feedback from the red team, consisted of a combination of keywords related to gender (e.g., gender, sex*, female*), discrimination (e.g., bias, stereotyp*, discriminat*), and field experimental methodology (e.g., audit stud*, field experiment*, randomized trial*) with some variation depending on the search functions of the respective database. See the meta-analysis preregistration on the Open Science Framework (https://osf.io/ha3n4) for the exact search strings for each database and Table S1 in the Supplementary Online Materials for deviations from the preregistered protocol.
Third, we took additional steps to identify unpublished and "in press" studies. We searched for unpublished dissertations related to our topic of interest on ProQuest Dissertations and Theses Global. We also issued public calls via listservs, discussion forums, and social media pages of relevant academic communities (e.g., Academy of Management Organizational Behavior Division and the Gender and Diversity in Organizations Division, American Sociological Association, PsychMap, PsychMethods). Finally, we contacted the corresponding authors of studies identified via the systematic search of academic databases and citation tracing described above to directly request information about any unpublished studies.
Our search produced a total of 6,754 search results. Using bibliographic management software (Zotero), we excluded 709 duplicate articles and three retracted articles. One blue team author subsequently assessed each of the remaining 6,042 results for relevance ("yes", "no", "maybe") based on title and abstract using a web-based collaborative screening platform (Rayyan), which helps organize and manage systematic literature reviews. Those coded as "maybe" were assessed by a second author. For the resulting 456 records, we subsequently retrieved the full-text articles for more careful examination. Twelve articles could not be retrieved, leaving 444 articles for full-text examination. Following our preregistered inclusion criteria, we excluded additional articles because they did not contain field experimental data (n = 193), gender was not investigated or properly randomized (n = 99), no hiring outcomes were measured or reported (n = 30), relevant statistics were missing and could not be provided by the authors (n = 19), or the underlying data were the same as in another article (n = 18). The final sample included 85 usable field studies and is thus more comprehensive and up-to-date than earlier meta-analyses on gender discrimination (Koch et al., 2015; Lippens et al., 2021), although of course also building on this important prior work. Fig. 1 contains the PRISMA flow diagram (Page et al., 2021), which summarizes the overall search process. Fig. 2 depicts the number of audit studies across geographic regions and time.
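The screening counts reported above can be checked with simple arithmetic. The short sketch below only restates the numbers given in the text; the variable names and dictionary labels are our own shorthand for illustration and do not come from the PRISMA diagram itself.

```python
# Counts as reported in the text for each stage of the screening process.
total_hits = 6754
duplicates = 709
retracted = 3
screened = total_hits - duplicates - retracted  # title/abstract screening

full_text_sought = 456
not_retrieved = 12
full_text_assessed = full_text_sought - not_retrieved

# Full-text exclusions, keyed by an abbreviated (illustrative) reason label.
exclusions = {
    "no field experimental data": 193,
    "gender not investigated or properly randomized": 99,
    "no hiring outcomes measured or reported": 30,
    "relevant statistics missing": 19,
    "same underlying data as another article": 18,
}
included = full_text_assessed - sum(exclusions.values())

print(screened, full_text_assessed, included)  # 6042 444 85
```

The three printed values match the 6,042 screened records, 444 full-text articles, and 85 included studies stated in the paragraph.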

Data extraction
We coded key characteristics of each study according to a preregistered coding rubric (see Open Science Framework: https://osf.io/ha3n4). As the coding progressed, we further refined the coding scheme where necessary. Deviations from the preregistered protocol are reported in Table S1 (see Supplementary Online Materials). For example, rather than coding whether a study used a matched pairs design or not, we decided that it was more meaningful to separately code a) whether the study manipulated gender within or between employers and b) whether the female and male applications were real, manually matched pairs (e.g., two trained actors or two real resumes of similar quality) or equivalent, fictitious pairs (e.g., the same resume manipulated to have either a female or male candidate name). The final coding rubric is reported in Table S2.
The coding involved information at both the study and effect level. Study level characteristics are constant for the entire study, such as gender ratio of the author team or year of data collection. For some studies, multiple characteristics were coded at the effects level. For example, a study that separately reported callback data across three countries would produce three effect sizes, and a study that separately reported callback data for eight professional groups (e.g., cleaner, clerk, gardener) would result in eight effect sizes.
Following the preregistered protocol, all objective variables (e.g., data collection year, applications sent, callbacks) were coded by one author and subsequently verified for accuracy by a research assistant. In case of disagreement, further investigations were conducted to verify that the extracted information was accurate. Another author was consulted to resolve ambiguities. For the gender variable, we followed the established approach of other meta-analyses on gender and ethnicity to extract the main effect comparing overall discrimination between cisgender[3] female and male applicants (e.g., Flage, 2018; Lippens et al., 2023; Koch et al., 2015; Quillian et al., 2017; Quillian & Lee, 2023; Zschirnt & Ruedin, 2016). For example, if a study orthogonally manipulated gender (female vs. male) and age (younger vs. older; e.g., see Baert et al., 2016), we examined gender differences across both younger and older applicants combined, as these characteristics naturally vary in labor markets. For our time variable, we extracted the year in which the data were collected. If data collection spanned multiple years, we extracted the year in which most of the applications were sent out.
For our subjective variable, gender typicality of job according to broader cultural stereotypes, we used a preregistered approach employing human coders for the main analysis. We complemented this with an exploratory approach based on objective country-level demographic data on gender representation in particular jobs. The human coder approach allows for a deeper and more consistent level of granularity, as country-level data may not be consistent and equally granular across countries. The objective country-level data may remove potential coder bias, more accurately capture cross-national differences in gender representation for the same type of job, and account for within-job shifts over time. For the preregistered human coder approach (Derous & Ryan, 2012), four authors independently coded studies (intercoder agreement was substantial, Fleiss kappa = 0.77, p < .001; see Table 1 for example jobs for each gender category) and discrepancies were resolved through a majority vote approach or discussion (if there was no majority). The coders who categorized jobs based on their stereotypicality were from the following nations: the United States and Chile; Switzerland; Vietnam; and Australia and South Africa. For the objective data approach, we retrieved country-level gender representation data via each country's official labor statistics reports (if available), the United Nations website, or other governmental/non-profit reports. Following past research (e.g., Hora et al., 2021; Koch et al., 2015), a job was coded as female-typed (male-typed) if the representation of women (men) was 65% or higher, and coded as gender-balanced otherwise (see Table S5 in the Supplementary Online Materials for sensitivity analyses using alternative cutoffs of 60% and 70%). The association between the subjective and objective coding approaches for gender typicality of jobs was strong (Cramer's V = 0.71, p < .001).
[3] In our investigation, we focused on the comparison between cisgender females and males as this has been the primary comparison in past research to date. One study by Granberg et al. (2020) also included transgender conditions in addition to the female and male cisgender conditions. For the present research, we focused on the latter two conditions as there were not enough studies systematically examining transgender candidates to support analysis at a meta-analytic level. We encourage future research to conduct more systematic investigations on this important topic.
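The two coding rules described above (a representation cutoff for the objective approach and a majority vote for the human coder approach) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and treating "majority" as a strict majority of the coders is our assumption.

```python
from collections import Counter
from typing import Optional, Sequence


def classify_by_representation(share_women: float, cutoff: float = 0.65) -> str:
    """Code a job from the share of women in it (0..1), using the 65% rule.

    The cutoff can be changed (e.g., 0.60 or 0.70) for sensitivity checks.
    """
    if not 0.0 <= share_women <= 1.0:
        raise ValueError("share_women must be a proportion between 0 and 1")
    if share_women >= cutoff:
        return "female-typed"
    if 1.0 - share_women >= cutoff:
        return "male-typed"
    return "gender-balanced"


def majority_vote(codes: Sequence[str]) -> Optional[str]:
    """Return the category chosen by a strict majority of coders, else None.

    None signals that the discrepancy has to be resolved through discussion.
    Interpreting "majority" as more than half of the coders is an assumption.
    """
    category, votes = Counter(codes).most_common(1)[0]
    return category if votes > len(codes) / 2 else None


print(classify_by_representation(0.90))                       # female-typed
print(classify_by_representation(0.62, cutoff=0.60))          # female-typed
print(majority_vote(["male-typed"] * 3 + ["gender-balanced"]))  # male-typed
```

With four coders, a 3-1 split yields a majority code, whereas a 2-2 split returns `None` and would go to discussion.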
To assess country-level gender inequality, an additional variable (Gender Inequality Index, or GII) was retrieved from the United Nations Human Development Report (United Nations Development Programme, 2020). The GII is a composite measure of gender inequality using data on reproductive health (e.g., maternal mortality), empowerment (e.g., women with higher education degrees), and the labor market (participation of women in the labor force). A low (high) GII value indicates low (high) inequality between women and men. The GII was published every five years between 1995 and 2010 and annually between 2010 and 2019. Thus, for studies published before 2010, we took the GII value with the smallest temporal distance to the data collection year (e.g., for a study from 1999, we took the GII from the 2000 report).
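The nearest-report rule for matching a study to a GII value can be illustrated with a short sketch. The list of report years follows the publication schedule described above; the function name and the tie-breaking rule (preferring the earlier report when two are equally close) are assumptions made for illustration, since the text does not specify how ties were handled.

```python
# GII report years implied by the text: every five years from 1995 to 2010,
# then annually through 2019.
GII_REPORT_YEARS = [1995, 2000, 2005] + list(range(2010, 2020))


def nearest_gii_report_year(data_year: int, report_years=GII_REPORT_YEARS) -> int:
    """Pick the report year with the smallest temporal distance to data_year.

    Ties are broken in favor of the earlier report (an assumption).
    """
    return min(report_years, key=lambda year: (abs(year - data_year), year))


print(nearest_gii_report_year(1999))  # 2000, as in the example in the text
print(nearest_gii_report_year(2015))  # 2015
```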
In cases of missing data, we followed the preregistered protocol and reached out to the corresponding author of the respective study. In most cases, the authors were able to provide us with the missing data (e.g., year of data collection, callback rates). When missing data could not be obtained from the authors, the study was either excluded (e.g., when we could not compute an effect size for the study) or we made reasonable assumptions (i.e., for one study, we inferred callback rates from figures).

Statistical analyses
Data from each study were a 2x2 frequency table crossing applicant gender with the application outcome.

[Table not recovered: 2x2 frequency table by applicant gender.]
The effect size measure of interest in the meta-analyses was the log odds ratio. Before computing the log odds ratios, we added 0.5 to all cells of the 2x2 frequency table to decrease bias in the estimator of the log odds ratio and to avoid division by zero when computing the log odds ratio and its sampling variance in cases where some of the cells equaled zero (Walter & Cook, 1991). The log odds ratio of each study was computed as the natural logarithm (ln) of the odds ratio (equations 11.57 and 11.58 in Borenstein & Hedges, 2019). The corresponding sampling variance of the log odds ratio was computed using equation 11.59 in Borenstein and Hedges (2019).
We preregistered univariate random-effects meta-analyses for testing each hypothesis. During data collection, we realized that many studies contributed more than one effect size based on an independent sample to the meta-analysis; the minimum, maximum, median, and average number of effect sizes a study contributed were 1, 42, 1, and 2.9, respectively. Hence, we decided to deviate from the preregistered analyses and take the nesting of effect sizes within studies into account by conducting three-level multilevel meta-analyses (Konstantopoulos, 2011; Van Den Noortgate & Onghena, 2003). A three-level multilevel meta-analysis adds an extra level to the meta-analysis model to take the nesting structure into account. We report the results of the multilevel meta-analyses in the main report, but also present the results of the preregistered univariate random-effects meta-analyses in the Supplementary Materials. Parameter estimation in the multilevel meta-analyses was done using restricted maximum likelihood estimation. For each multilevel meta-analysis, the I²-statistic (i.e., the proportion of total variance that can be attributed to heterogeneity; Higgins & Thompson, 2002) was computed analogously to what is described in Nakagawa and Santos (2012). We evaluate whether the normality assumptions of the multilevel meta-analysis model hold in the Supplementary Online Materials (see Section S7 and Fig. S1).
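The effect size computation described above can be sketched in a few lines. The cell names and the example counts are assumptions for illustration, and the formulas are the standard ones for a log odds ratio with a 0.5 continuity correction, not code taken from the authors' analysis scripts. The odds ratio is oriented as male over female odds, so that values above 1 indicate a preference for male applicants, in line with the figure captions.

```python
import math


def log_odds_ratio(male_cb, male_no_cb, female_cb, female_no_cb, correction=0.5):
    """Log odds ratio (male vs. female) and its sampling variance.

    Hypothetical helper: adds `correction` to every cell of the 2x2 table,
    then applies the standard log odds ratio and variance formulas.
    """
    a = male_cb + correction        # male applicants with a callback
    b = male_no_cb + correction     # male applicants without a callback
    c = female_cb + correction      # female applicants with a callback
    d = female_no_cb + correction   # female applicants without a callback
    log_or = math.log((a * d) / (b * c))
    variance = 1 / a + 1 / b + 1 / c + 1 / d
    return log_or, variance


# Made-up counts: 90 of 1000 male and 100 of 1000 female applications got a callback.
log_or, var = log_odds_ratio(90, 910, 100, 900)
print(round(math.exp(log_or), 3), round(math.sqrt(var), 3))

# A zero cell no longer causes a division by zero.
print(log_odds_ratio(0, 50, 5, 45))
```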
All analyses were conducted in the statistical software R (R Core Team, 2021, version 4.1.1). The R package "metafor" (Viechtbauer, 2010, version 3.0.2) was used for conducting the multilevel meta-analyses and some of the data visualizations, and the R packages "weightr" (Coburn & Vevea, 2016, version 2.0.2) and "puniform" (van Aert, 2021, version 0.2.4) were used for publication bias analyses. Two-tailed hypothesis tests were conducted using α = 0.05 and 95% confidence intervals were computed. R code and analysis output are available on the Open Science Framework (https://osf.io/pt4gn/).
To account for potential confounding factors, we included several control variables in our analyses. First, to account for the possibility that more recent studies have been conducted in regions with more egalitarian gender norms, we controlled for national scores on the GII. Second, it is possible that the experimental designs of audit studies have become more complex, such as the increased use of multifactorial designs manipulating gender and an intervention designed to reduce discrimination (Byrd, 2019). Collapsing across baseline and intervention conditions could depress discrimination effect sizes over time. To take this into account, we controlled for whether studies manipulated a single or multiple experimental factors. Third, we accounted for the gender composition of the author teams over time. Note that we preregistered to also include application method as a control variable, but submitting a resume was used for nearly all effect sizes as the application method (98.4%). Due to a lack of variance, we did not include this variable as a control variable in the meta-analyses.
The results of the multilevel meta-analyses and univariate random-effects meta-analyses are reported in Table 2 (multilevel meta-analyses) and S4 (univariate random-effects meta-analyses; see Supplementary Online Materials), respectively. Research Question 1 regarding overall preference for female vs. male candidates across the years was tested by conducting a meta-analysis without any predictor variables as an intercept-only model (see Models 1a/b). Research Question 2 examining moderation by job type was tested by including a dummy variable for non-female-typed jobs (a score of 0 implies that a job was a stereotypically female-typed job, and a score of 1 indicates either a stereotypically male-typed job, or a gender-balanced job or set of jobs; see Models 2a-d). For our main analyses, we followed the preregistered protocol to collapse male-typed and gender-balanced jobs into a single category to maximize the statistical power of tests of discrimination against women across time. Grouping gender-balanced occupations (e.g., accountant) with male-typed rather than female-typed jobs was further supported by research on gender as a diffuse status cue (Ridgeway, 1991), traditional cultural stereotypes of women as less competent than men (Fiske, Cuddy, Glick, & Xu, 2002), and pervasive implicit and linguistic stereotypes regarding career versus family roles (Charlesworth & Banaji, 2022; Charlesworth et al., 2021), all of which predict biases in favor of men and against women even for comparatively neutral professional settings and work tasks (Jost, 1997; Pelham & Hetts, 2001). However, we also present results separately by male-typed, gender-balanced, and female-typed jobs. The competing theoretical predictions regarding Research Question 3 (persistence vs. fading-of-bias) were tested by including the year of application in the meta-analysis. This variable was first centered by subtracting the year of the oldest application (i.e., 1976) to avoid convergence issues. The centered variable was included as a predictor in the meta-analyses (see Models 3a-d). Research Question 4 regarding #MeToo was tested by comparing the years 2018-2021 to 2014-2016,4 and our analyses regarding the extent of gender bias in recent versus distant time periods were carried out by breaking the studies into different yearly intervals on an exploratory basis (see Fig. 5).
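For reference, a three-level meta-regression with the centered year as predictor is commonly written as below. This is the standard textbook formulation (cf. Konstantopoulos, 2011), given here in notation assumed for illustration rather than reproduced from the analysis scripts: y_ij is the log odds ratio of effect size i in study j, s²_ij is its known sampling variance, and σ²_1 and σ²_2 are the between-study and within-study variance components reported with Table 2.

```latex
\begin{align}
y_{ij} &= \beta_0 + \beta_1\,(\mathrm{year}_{ij} - 1976) + u_j + v_{ij} + e_{ij},\\
u_j &\sim N(0, \sigma^2_1), \quad v_{ij} \sim N(0, \sigma^2_2), \quad e_{ij} \sim N(0, s^2_{ij}).
\end{align}
```

With this centering, β_0 is the expected log odds ratio for applications sent in 1976 and β_1 is the change in the log odds ratio per year. The intercept-only models for Research Question 1 drop the year term, and the job-type models replace it with the dummy variable for non-female-typed jobs.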

Is there evidence of gender discrimination favoring male or female applicants?
We first examined whether female or male applicants experience more positive job application outcomes overall (i.e., across all job types and years). Across all 85 studies from 1976 to 2020, the average odds of male applicants to receive a callback was 0.91 times the odds of equally qualified female applicants (95% confidence interval ranged from 0.86 to 0.97, z = -3.00, p = .003; see Table 2, Model 1a). Importantly, heterogeneity of the true effects was large (I²-statistic = 82.8%), implying that 82.8% of the total variance can be attributed to heterogeneity. This is also apparent in the wide 95% prediction interval (0.49 to 1.70), which indicates the range in which the effect size of a future study from the same distribution of true effects is expected to fall. The heterogeneity could not be explained by including the control variables (inequality index, presence of moderators in study, proportion of female authors), because the heterogeneity of the true effects remained large (I²-statistic = 83%). However, the effect of candidate gender was no longer significant (z = -0.15, p = .883; see Table 2, Model 1b) when we included the control variables.

Does gender discrimination depend on the gender-typicality of the job?
To further examine whether job application outcomes for female and male applicants varied as a function of the gender typicality of the job category that was applied for, we tested whether job application outcomes are less favorable for women for the combined categories of male-typed and gender-balanced jobs, relative to female-typed jobs. Using the human coded gender-typicality variable, job type significantly moderated the effect of gender on callbacks (0.26, z = 4.91, p < .001; see Table 2, Model 2a). Specifically, the average odds of a male (vs. female) applicant to receive a callback was significantly lower for female-typed jobs (odds ratio: 0.75, 95% confidence interval: 0.68, 0.83) compared to male-typed and gender-balanced jobs combined (odds ratio: 0.97, 95% confidence interval: 0.91, 1.03). The moderator remained significant when we included the control variables (0.25, z = 4.85, p < .001; see Table 2, Model 2b). The results were consistent when we used the objective country-level data as gender-typicality moderator: job type significantly moderated the effect of gender on callbacks, excluding (2.68, z = 4.96, p < .001; see Table 2, Model 2c) and including (2.68, z = 4.89, p < .001; see Table 2, Model 2d) the control variables. Thus, female applicants were on average more likely than male applicants to receive callbacks for female-typed jobs, while there was no significant effect of candidate gender for male-typed and gender-balanced jobs overall. Including this variable still left a large amount of heterogeneity (residual I²-statistics ranged from 80.8% to 81.3%).
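Because the moderator estimate is on the log odds ratio scale, the two reported odds ratios for Model 2a can be linked with a short calculation. The sketch below uses only the rounded values reported above, so it is an approximate consistency check rather than a reproduction of the fitted model; the variable names are illustrative.

```python
import math

or_female_typed = 0.75    # reported odds ratio for female-typed jobs (dummy = 0)
moderator_log_or = 0.26   # reported moderator estimate for non-female-typed jobs (dummy = 1)

# Add the moderator on the log scale, then transform back to an odds ratio.
log_or_non_female_typed = math.log(or_female_typed) + moderator_log_or
or_non_female_typed = math.exp(log_or_non_female_typed)

print(round(or_non_female_typed, 2))  # 0.97, matching the reported value
```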

Has gender discrimination changed over time?
One of our key research questions for the meta-analytic investigation was whether there has been stability or change in gender discrimination over time for male-typed and gender-balanced jobs considered together. To test this, we fitted a multilevel meta-regression model with the year in which the applications were sent out as a predictor. Using the preregistered human coded gender-typicality variable, there was a significant, albeit small, decreasing time trend of the average log odds ratio (-0.010, z = -2.56, p = .011, residual I²-statistic = 81.2%; see Table 2, Model 3a), suggesting that job application outcomes for female candidates improved over time relative to male candidates. Including the control variables in the meta-regression model did not change the direction or significance of the time trend (-0.015, z = -3.01, p = .003, residual I²-statistic = 80.7%; see Table 2, Model 3b). Fig. 3 shows that female applicants were at a disadvantage relative to male applicants before 2009 and that this difference was no longer noticeable or, if anything, slightly reversed in direction starting in 2009. The trend also remained significant when we used the objective country-level data to identify non-female-typed jobs, both excluding (-0.010, z = -2.49, p = .013; see Table 2, Model 3c) and including (-0.013, z = -2.61, p = .009; see Table 2, Model 3d) the control variables. Jointly, these results demonstrate that the increasingly positive outcomes for female applicants over time are not likely attributable to the subjective vs. objective measurement of the gender typicality moderator, shifts in location, changes in study designs, or gender composition of research teams.
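To give a sense of scale for these slopes, a yearly change in the log odds ratio can be converted into a multiplicative change in the odds ratio over a span of years. The helper below is a hypothetical illustration using the rounded slopes reported above; it describes relative change only, since the model intercepts are not restated here.

```python
import math


def odds_ratio_change(slope_per_year: float, years: int) -> float:
    """Multiplicative change in the odds ratio implied by a log-scale slope."""
    return math.exp(slope_per_year * years)


# Slopes from Models 3a (-0.010) and 3b (-0.015), over one decade and over 1976-2020.
for slope in (-0.010, -0.015):
    print(slope, round(odds_ratio_change(slope, 10), 3), round(odds_ratio_change(slope, 44), 3))
```

A slope of -0.010 thus corresponds to the male-versus-female odds ratio shrinking by roughly 10% per decade, which is small per year but accumulates over the 44 years covered by the audits.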
Although the time trend appears visually to be most pronounced for relatively gender-balanced jobs (Fig. 4), an exploratory analysis revealed that there was no significant interaction between time trend and job type (-0.01, z = -1.19, p = .236), meaning that we could not reject the hypothesis that the trend was the same across the three job types. In an exploratory analysis, we further broke down the outcomes for each job category by different time periods (Fig. 5A/B). Prior to 1991, we observed a preference for male applicants for male-typed and gender-balanced jobs, although these early intervals are based on a small number of studies and the preference was not significant. In more recent time periods (post 2009), we observed a preference for female candidates for gender-balanced jobs whose significance depended on the specific years in question, a significant preference for female applicants for female-typed jobs, and no significant gender-of-candidate preference for male-typed jobs.
In addition to the seemingly gradual shift over time (Fig. 3), an exploratory comparison of 2018-2020 relative to the preceding years 2014-2016 did not reveal a significant reduction in discrimination between those two time periods (z = 0.379, p = .704). This indicates that increasing support for female applicants is a longstanding trend and cannot be attributed to a sudden spike in support for female applicants due to the #MeToo movement that became a global phenomenon in 2017.
Overall, the fading-of-bias account's predicted decline in discrimination against women over time was supported (Research Question 3), as was moderation of gender discrimination by the stereotypicality of the job (Research Question 2). In contrast, the speculative possibility that the #MeToo years would be associated with accelerated cultural change (Research Question 4) did not find empirical support. Overall anti-female bias in selection decisions was not observed, and although some suggestion emerged of an aggregate anti-male bias, this was not robust to covariates (Research Question 1). For our exploratory analyses regarding the presence of bias in recent years, the results are contingent on arbitrary and post hoc decisions regarding how intervals of years are divided and thus provide no robust evidence of contemporary gender discrimination for most jobs. Taken together, the results support the fading-of-bias account for male-typed and gender-balanced jobs (i.e., non-female-typed jobs), and the persistence-of-bias account for female-typed occupations.

Table 2
Results of hypothesis tests using multilevel meta-analyses. For each model and variable, the parameter estimate, standard error (in parentheses), 95% confidence interval (CI; in brackets), z-value, and corresponding two-tailed p-value of the multilevel meta-analyses are displayed. All models were fitted using restricted maximum likelihood estimation. σ²₁ = the between-study variance in true effect size; σ²₂ = the variance in true effect size of effect sizes nested in studies.

Robustness tests
We carried out several robustness and sensitivity analyses (detailed model statistics for the analyses below are available in the R output document on the Open Science Framework at https://osf.io/ha3n4). First, three studies reported two outcomes (e.g., callbacks and interview invites) based on the same sample. This violated the assumption of independent sampling errors in the meta-analysis models (e.g., Hedges, Tipton, & Johnson, 2010). Since only six of the 244 effect sizes came from the same sample, we decided to conduct a sensitivity analysis where we selected for each of the three studies one of the two effect sizes based on which of the two measures was more inclusive (e.g., if a study reported both callbacks and interview invites, we selected callbacks). The results of these multilevel meta-analyses did not differ substantively from those of the multilevel meta-analyses based on all data. Specifically, the overall average odds of male applicants to receive a callback remained significantly lower than the average odds of equally qualified female applicants (-0.911, z = -3.08, p = .002). The moderation of the effect of gender on callbacks by job type remained significant as well (0.253, z = 4.82, p < .001). Finally, the time trend suggesting a decrease in pro-male gender discrimination over time remained significant excluding (-0.010, z = -2.45, p = .014) or including (-0.015, z = -3.01, p = .003) control variables.
Second, although we preregistered to use publication year as the time variable, we noticed during the coding that the time between data collection and publication of an audit study varied substantially across studies (ranging from 0 to 11 years). Because the year in which applications were sent out more accurately reflects gender discrimination at any given point in time, we used data collection year as the time variable in the primary analyses. However, we conducted supplemental analyses with publication year as the time variable and found comparable results. Replicating the main analyses, the time trend suggesting a decrease in pro-male gender discrimination over time remained significant excluding (-0.011, z = -2.67, p = .008) or including (-0.016, z = -3.17, p = .002) the control variables.
Third, we examined whether any one study had a particularly large effect on our results (see Section S8 in Supplementary Online Materials for detailed analyses). A leave-one-out analysis (Viechtbauer, 2010) indicated that the overall discrimination patterns and the moderation by job type remained for the most part robust. The decrease in discrimination over time was robust to the exclusion of most studies, except for one large sample study conducted in 1978 and published four years later (Firth, 1982). Exclusion of this study caused the time trend effect to be closer to zero for the models without (-0.005, SE = 0.004) and with control variables (-0.008, SE = 0.006) and made the effect nonsignificant. This is not surprising given the number of audit studies before 1990 was relatively small, such that removing the field audit with the largest sample size from the earliest time period can affect the overall estimate of the time trend (for a similar conclusion for race studies, see Quillian et al., 2017). We return to this issue in the General Discussion.
Finally, field audits aim to occlude from evaluators that they are involved in a research study by sending ostensibly real job applications to actual businesses. However, this does not completely rule out the possibility of some evaluators realizing their judgments are under scrutiny by researchers. Further, the chances of this occurring are not necessarily constant across all types of field audits. Specifically, paired audit designs may entail the greatest risk of experiment discovery among employers since they receive highly similar applications from members of both historically advantaged and underrepresented groups (e.g., men and women). However, the present meta-analysis finds no significant difference in results for audits that sent female and male applications to the same versus different employers (0.12, SE = 0.068, z = 1.74, p = .08).

Assessments of publication bias
One potential concern in meta-analyses is the presence of publication bias (Rothstein, Sutton, & Borenstein, 2006). It is possible that audit studies that document significant effects and theoretically or ideologically consistent outcomes were more likely to be published. Note that many of the currently available publication bias methods are primarily designed for univariate meta-analyses. However, Egger's regression test (Egger, Smith, Schneider, & Minder, 1997) and PET-PEESE (Stanley & Doucouliagos, 2014) are bias correcting methods that can be readily extended to multilevel meta-analysis by including the standard error of the log odds ratio as predictor in the multilevel meta-analysis with no other predictors. The other publication bias methods were applied to the univariate random-effects model where no predictors were included in the model. The included publication bias methods were: contour-enhanced funnel plots (Peters, Sutton, Jones, Abrams, & Rushton, 2008), three-parameter selection model (3PSM, Hedges & Vevea, 2005), and p-uniform* (van Aert, 2021). Publication bias was assessed in a meta-analysis based on all studies and based on only the studies that were published. Both assessments yielded highly comparable results, and we only report the results based on the studies that were published. The results of publication bias methods applied to all studies are available in an R output document on the Open Science Framework (https://osf.io/pt4gn).
Fig. 3. Time trend of gender preferences for non-female-typed jobs. The results are based on a multilevel meta-regression model of male-typed and gender-balanced jobs combined, including gender inequality, study design complexity, and author gender ratio as control variables. Odds ratios above 1 indicate a greater preference for male applicants and odds ratios below 1 indicate greater preference for female applicants. The size of the circles is proportional to the number of applications represented by the respective data point.
The contour-enhanced funnel plot in Fig. 6 did not provide strong evidence for small-study effects in the meta-analysis. Further, Egger's regression test that was extended to multilevel meta-analysis was not statistically significant (-0.244, z = -0.984, p = .325). The results of the methods that correct the average log odds ratio for publication bias are presented in Table 3. The estimate of PET-PEESE that was extended to multilevel meta-analysis was slightly closer to zero (-0.032). Three different variants of the 3PSM were fitted assuming that the studies in the meta-analysis used (1) a right-tailed, (2) a left-tailed, and (3) a two-tailed hypothesis test for testing the null hypothesis of no effect. We assumed that α = 0.025 and α = 0.05 were used when a one-tailed and two-tailed test was conducted, respectively. The average log odds ratio was always estimated as closer to zero with 3PSM compared to the multilevel meta-analysis, and was only statistically significant when a left-tailed hypothesis test was assumed to have been conducted in the studies. When applying p-uniform*, we assumed that either a right-tailed or left-tailed hypothesis test with α = 0.025 was conducted in the studies. The average log odds ratio in both implementations of p-uniform* was closer to zero and only statistically significant when a left-tailed hypothesis test was assumed to have been conducted in the studies.
Overall, the estimated average log odds ratio corrected for publication bias was closer to zero compared to the estimate of the multilevel meta-analysis. However, the combination of the non-significant Egger's regression test with a small proportion of statistically significant results (29.9%) suggests that there is no strong evidence for the presence of bias. This implies that if there is any publication bias in this meta-analysis, it is small, providing additional confidence in the conclusions drawn.
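The logic of regressing effect sizes on their standard errors can be sketched with a simple weighted least-squares fit. Note the swapped-in simplification: this is a univariate, inverse-variance weighted version written for illustration, not the multilevel extension fitted with metafor in the analyses above, and the function name and data are hypothetical. The slope captures the association between effect size and standard error (small-study effects), and the intercept is the PET-style estimate of the effect for a hypothetical study with a standard error of zero.

```python
def egger_type_regression(effects, std_errors):
    """Weighted least squares of effect sizes on standard errors.

    Weights are inverse sampling variances (1 / se**2).
    Returns (intercept, slope). Simplified univariate sketch.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    w_sum = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, std_errors)) / w_sum
    y_bar = sum(w * y for w, y in zip(weights, effects)) / w_sum
    s_xy = sum(w * (x - x_bar) * (y - y_bar)
               for w, x, y in zip(weights, std_errors, effects))
    s_xx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, std_errors))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up log odds ratios that grow with their standard errors (funnel asymmetry).
ses = [0.05, 0.10, 0.20, 0.30, 0.40]
log_ors = [-0.08, -0.10, -0.17, -0.22, -0.30]
intercept, slope = egger_type_regression(log_ors, ses)
print(round(intercept, 3), round(slope, 3))
```

In an actual test the slope would be compared against its standard error; a slope near zero, as reported above, gives no indication that smaller studies report systematically different effects.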

Additional exploratory analyses
In addition to controlling for author gender in our meta-analytic models (see above), we also examined whether author gender would moderate the extent to which a study would report gender bias on the part of prospective employers. However, author gender did not influence the amount of gender discrimination reported across all studies and years (-0.06, z = 0.60, p = .549).
In addition to categorizing jobs into female-typed, male-typed, and gender-balanced jobs, we also examined additional job grouping variables. First, we examined whether gender discrimination would vary as a function of whether a job requires physical strength (0 = no, 1 = yes; rated by four human coders; Fleiss kappa = 0.76, p < .001). Results suggest that job physicality significantly moderated the effect of gender on callbacks (0.26, z = 2.08, p = .038), such that the average odds of a male (vs. female) applicant to receive a callback was significantly higher for physical jobs (odds ratio: 1.17, 95% confidence interval: 0.92, 1.49) compared to non-physical jobs (odds ratio: 0.91, 95% confidence interval: 0.85, 0.96). However, note that the proportion of jobs that require physical strength (4.92%) is small in the present sample, so this should be seen as a tentative result requiring confirmatory tests involving larger samples of jobs. Second, we explored whether gender discrimination varied as a function of whether a job required nurturance (0 = no, 1 = yes; rated by four human coders; Fleiss kappa = 0.74, p < .001). We found that job nurturance did not moderate the effect of gender on callbacks (0.09, z = 0.85, p = .394). Similar to job physicality, the proportion of jobs that require nurturance (6.97%) is small, rendering any conclusions tentative.
Fig. 5. Effect sizes are grouped together depending on the year the applications were sent out and were combined using a univariate random-effects model. The top panel (Fig. 5A) shows the average odds ratio before 2009 and 2009 and thereafter, which corresponds to the theoretical crossover point of the time trend in Fig. 3. The bottom panel (Fig. 5B) compares the odds ratios using more granular time periods, including the post-#MeToo years of 2018-2020. Odds ratios above 1 indicate a greater preference for male applicants and odds ratios below 1 indicate greater preference for female applicants. Error bars indicate the 95% confidence interval around the average odds ratio that is based on a normal distribution. The size of the symbols is proportional to the number of effect sizes in the respective bin.
Finally, we examined the potential influence of several country-level factors that could affect gender discrimination. First, in addition to including a country's gender inequality as a control variable, we tested whether the GII (described earlier) would moderate the present results; however, this was not the case (-0.713, z = 1.38, p = .168). Second, we examined whether a country's education level would affect the results, using the United Nations' Education Index (United Nations Development Programme, 2020). Similar to the GII, we matched the closest available Education Index value to each country and study year. However, education did not moderate the effect of applicant gender on discrimination (-0.291, z = 0.430, p = .667). Third, we tested whether gender discrimination may be influenced by economic prosperity, using GDP per capita data (log) retrieved from The World Bank (https://data.worldbank.org/), but found no significant effect (-0.065, z = 0.716, p = .474). Fourth, we examined whether gender discrimination is influenced by the Human Development Index (HDI), a summary measure of average achievement in human development, including longevity and standard of living among other factors (United Nations Development Programme, 2020). However, we found no significant moderating effect of HDI (0.036, z = 0.077, p = .939). Lastly, we examined whether culture would influence gender discrimination using the WEIRD index developed by Muthukrishna et al. (2020). However, we did not find any moderating effect (-1.465, z = 1.417, p = .157).

Discussion
In sum, we found no overall pattern of gender discrimination in hiring outcomes in favor of male applicants (Research Question 1). Based on our moderator analyses, reasons for this include that a large share of field audits in our sample were conducted in the period of 2005-2020 (see Fig. 2) and our tests of RQ1 aggregated applications for stereotypically male, gender-balanced, and female-typed jobs. Parsing the results by job type and time period indicates more favorable results for female applicants for stereotypically female jobs (Research Question 2). Further, discrimination against women for male-typed and gender-balanced jobs has diminished significantly over time (Research Question 3), although not more so in the #MeToo era than in the preceding time period (Research Question 4). At the same time, in recent years we continue to observe massive heterogeneity in discrimination-related effect size estimates across studies and settings. This suggests that there exists wide variability in current hiring practices such that discrimination against women is present in some contexts and organizations, and discrimination against men in others. Alongside this variability, there is a reliable discriminatory bias such that male applicants for traditionally female-typed jobs (e.g., receptionist, nurse, elementary school teacher) are at a persistent disadvantage in selection decisions.

Study 2: Forecasting challenge
The (to us) rather surprising meta-analytic findings give rise to the related question of whether empirical patterns of gender discrimination map on to the beliefs of laypeople and academics. Accuracies and inaccuracies in perceptions of group inequalities hold important implications for the efficient allocation of limited resources to combat them (Byrd & Thompson, 2022; Ceci et al., 2023; Kraus, Hudson, & Richeson, 2022). Consider for example that gender biases in hiring may be systematically overestimated by scientists, the general public, or both. If so, workplace interventions will tend to focus on making selection contexts fairer, rather than conducting systematic audits for wage inequalities between women and men or reforming the promotion processes in organizations.
To complement the meta-analytic investigation (Study 1), we carried out an accompanying forecasting survey examining whether scientists and laypeople could accurately estimate both time-trends and the current pervasiveness of gender biases in selection settings. Previous research has demonstrated that academics sometimes perform well in anticipating the results of scientific studies, based on limited information such as article abstracts and study materials (Camerer et al., 2016; Dreber et al., 2015; Forsell et al., 2019). Accuracy on the part of scientific forecasters has been observed even for fairly complex results such as different conceptual replications testing the same research question (Landy et al., 2020), experimental designs involving complex interactions (Tierney et al., 2020, 2022), and cross-cultural similarities and differences (Tierney et al., 2021). Based on this earlier work, one straightforward prediction is that at least for academics, forecasts and realized results for gender discrimination over the years will be closely aligned.
And yet, there are also theoretical and empirical reasons to anticipate systematic inaccuracies in academics' forecasts about gender discrimination. The famous wisdom of the crowd effect (Larrick, Mannes, Soll, & Krueger, 2012; Surowiecki, 2005) relies on the removal of random noise from estimates: errors that are randomly distributed across different independent forecasters cancel each other out in the aggregate. Select and expert crowds, for example scientists relative to laypeople, should be especially accurate because their superior knowledge, skill, and strategies lead to more accurate central tendency estimates and fewer random errors (Budescu & Chen, 2015; Mannes, Soll, & Larrick, 2014). However, if different members of the crowd are systematically biased in the same direction for any reason, aggregation will fail to remove such shared error from the forecasts (Lorenz, Rauhut, Schweitzer, & Helbing, 2011).
The lack of political diversity among academics could represent one major source of shared systematic bias (Clark & Winegard, 2020;Duarte et al., 2015), leading scientists to overestimate the pervasiveness of gender discrimination in hiring despite their knowledge and expertise.
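The distinction between random error that averages out and shared bias that does not can be made concrete with a small numerical sketch. All values below are made up for illustration and are not forecasts or results from the present studies.

```python
from statistics import mean

truth = 1.00                                      # hypothetical true odds ratio
independent_errors = [-0.2, -0.1, 0.0, 0.1, 0.2]  # idiosyncratic errors of five forecasters
shared_bias = 0.3                                 # error common to every forecaster

# Crowd with only idiosyncratic error: errors cancel in the average.
unbiased_forecasts = [truth + e for e in independent_errors]
crowd_unbiased = mean(unbiased_forecasts)

# Crowd with the same idiosyncratic error plus a shared bias: the bias survives averaging.
biased_forecasts = [truth + shared_bias + e for e in independent_errors]
crowd_biased = mean(biased_forecasts)

print(round(crowd_unbiased, 2), round(crowd_biased, 2))
```

Averaging recovers the true value in the first crowd, whereas the second crowd's average remains off by exactly the shared bias no matter how many forecasters are added.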
Another account incorporates elements of both the wisdom of the crowd and bias of the crowd predictions. It is also directly inspired by recent challenges in which scientists attempted to forecast the replicability of experimental laboratory demonstrations of gender discrimination (Tierney et al., 2022; Tierney et al., 2020). Academic forecasters were adept at anticipating not only simple condition differences, but even the results of complex designs capturing potential interactions between variables (e.g., expressions of anger or sadness by female or male targets perceived by evaluators of different genders; Tierney et al., 2022). But while the overall pattern of anticipated results mapped onto (i.e., correlated with) the replication effect sizes, in absolute terms scientists' expectations regarding overall discrimination were way off the mark. In Tierney et al. (2020), scientists predicted that Uhlmann and Cohen's (2005) findings of bias against female job candidates would emerge again in 2019, but the replication results were in the reverse direction (i.e., anti-male discrimination). Similarly, in Tierney et al. (2022), scientists expected that backlash against women who express anger in workplace settings (Brescoll & Uhlmann, 2008) would replicate, but in the new data collections the consequences of anger for perceived status, competence, likability, dominance, and assertiveness were the same for female and male targets. This leads to the prediction that academic forecasts and the realized results of field audits on gender discrimination should likewise be calibrated at a correlational level (wisdom of the crowd due to canceling out random error), but that discrimination will be overestimated in absolute terms (bias of the crowd due to systematic shared errors).
Of further interest were potential accuracies and inaccuracies among lay forecasters in this space. Even among non-academics, the widespread dissemination of classic academic studies on gender bias, some of them conducted decades ago, along with media coverage of high-profile cases of real-world discrimination, could contribute to similar systematic errors. At the same time, evidence that even laypeople can predict the results of some scientific studies (DellaVigna & Pope, 2018), together with greater political diversity in the general U.S. population than among academics (Duarte et al., 2015), gives some reason to anticipate that an inexpert crowd of laypeople could be more collectively unbiased and accurate in this space than scientific experts. Finally, considerable evidence indicates that U.S. laypeople chronically underestimate race-based wealth inequalities (Kraus et al., 2022; Kraus, Onyeador, Daumeyer, Rucker, & Richeson, 2019). Thus, forecasts by laypeople could reflect system justifying motives (Jost & Banaji, 1994) or mere ignorance of group-based inequalities, leading to underestimations of gender gaps in hiring outcomes, especially for earlier decades where the differences are larger (see Study 1).
To adjudicate between these competing possibilities, academic and lay forecasters were asked to predict the meta-analytic findings, separately for male-typed/gender-balanced and female-typed jobs, for successive spans of years. This enabled us to examine the extent to which lay and expert beliefs about the temporal trajectory and overall severity of gender discrimination map onto the observed empirical results. It further allowed us to assess correlational accuracy, absolute differences in estimated and observed effect sizes, and the potential moderating roles of forecaster characteristics. These were treated as empirical questions with multiple plausible outcomes. In other words, there were theoretical reasons to expect forecasted and realized effect sizes to correlate highly, but also weakly. Similarly, forecasts regarding absolute levels of gender discrimination might be close to the meta-analytic effect sizes or far off the mark. Moreover, either scientists or laypeople, and gender egalitarians or inegalitarians, could plausibly hold the advantage in predicting the empirical outcomes of the project.
We collected forecasts from two groups: 1) scientists primarily from the social and behavioural sciences, and 2) a nationally representative sample of laypersons from the United States. For both groups, we assessed demographic information such as their gender, political orientation on both social and economic issues, and individual differences in system-justifying vs. egalitarian beliefs (Jost & Kay, 2005; Kay & Jost, 2003). For academics, we further gathered potentially relevant disciplinary and topic expertise, such as whether they had previously published peer-reviewed research articles on gender. Greater topic expertise could enhance predictive accuracy (Budescu & Chen, 2015; Mannes et al., 2014), be associated with social-political values that increase systematic error thereby reducing accuracy (Duarte et al., 2015), or make no significant difference.

Forecasters
The nationally representative sample of laypeople was recruited through Prolific Academic and included 499 participants with ages between 18 and 78 (mean 35). When asked for their gender, 248 selected 'female', 244 selected 'male', 6 selected 'other', and 1 did not respond. In terms of overall political views, 85 participants reported being at least somewhat conservative, 95 reported being in the 'middle of the road', and 318 reported being at least somewhat liberal. The sample was designed to be as representative as possible of the U.S. population on the dimensions of age, sex, and ethnicity using census data from the U.S. Census Bureau. Although the Prolific sample reflects the general population of the United States on these dimensions, it is not nearly as ideologically diverse as would be ideal. A sample with more left-leaning than right-leaning Americans is typical of such online data collection sites (Levay, Freese, & Druckman, 2016).
Forecasters from the academic sample were recruited through social media, professional listservs, direct email, and doctoral seminars. In the academic sample (N = 312), the age of the participants ranged from 21 to 76 (mean 38). When asked for their gender, 116 participants selected 'female', 195 selected 'male', and 1 selected 'other'. Most academics reported being at least somewhat liberal in their overall political views (247), while 38 chose 'middle of the road' and 27 reported being at least somewhat conservative. The largest subgroups of academic forecasters were from the fields of psychology (139, including subfields such as social and clinical psychology), economics (64, including subfields such as behavioural economics), and management (41, including subfields such as organizational behaviour and marketing). Of the remaining 68 participants, 35 were distributed over 16 different fields, and 33 did not provide an academic field or responded with 'N/A'. Career stages included Assistant Professor (69), Associate Professor (57), Professor (63), Graduate Student (64), Postdoctoral Scholar (27), Teaching Faculty (12), Research Assistant (11), other academic position (6), and Professor Emeritus (1); two participants did not respond to this item. Forecasters were provided a copy of the draft empirical report in advance and asked if they would like to opt in to consortium authorship and, if so, to provide their names and affiliations. Colleagues listed as members of the Gender Audits Forecasting Collaboration in Appendix A both made forecasts and indicated they would like to be part of the consortium credit. Not all forecasters elected to be listed as consortium authors, thus the number of names in Appendix A differs from the sample size for Study 2.

Materials and procedures
Instructions. Forecasters were told that a forthcoming meta-analytic investigation tested for gender biases in hiring decisions, analyzing all available studies from 1976 to 2020 in which nearly identical applications were submitted to employers by either a female candidate or a male candidate and callbacks were recorded (e.g., interview invitations, job offers). Their goal in the present survey was to try to predict the results of the meta-analytic investigation. Forecasters were provided with a link to the Study 1 methods, with results redacted.
Prediction task. The forecasters predicted the callback rates for female and male candidates, separately for female-typed jobs and for male-typed/gender-balanced jobs. Within each category, predictions were divided into four successive spans of years: 1976-1986, 1987-1997, 1998-2008, and 2009-2020. They were also asked to make an overall prediction collapsing across time periods (i.e., from 1976 to 2020). For each span of years, forecasters were presented with a column asking for "Percentage of women who received callbacks" and "Percentage of men who received callbacks". Their predictions were then converted to log odds ratios and compared to the observed log odds ratios from Study 1's meta-analysis.
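As an illustration of the conversion step, the following minimal sketch turns a pair of forecasted callback percentages into a log odds ratio. The sign convention (positive values denote higher callback rates for men) follows the Fig. 7 caption. The function name and the clipping constant `eps`, used here to keep forecasts of exactly 0% or 100% finite, are assumptions made for illustration; the source does not describe how such boundary forecasts were handled.

```python
import math


def log_odds_ratio(pct_men, pct_women, eps=0.5):
    """Log odds ratio of a callback for men relative to women.

    pct_men and pct_women are forecasted callback percentages (0-100).
    Positive values mean a higher callback rate for male candidates.
    eps is a hypothetical clipping value (in percentage points) that
    keeps forecasts of 0% or 100% from producing infinite odds.
    """
    def odds(pct):
        p = min(max(pct, eps), 100.0 - eps) / 100.0
        return p / (1.0 - p)

    return math.log(odds(pct_men) / odds(pct_women))


# Hypothetical forecast: 20% of men and 10% of women receive callbacks.
example = log_odds_ratio(20, 10)
```

Equal forecasted percentages yield a log odds ratio of zero, and swapping the two percentages flips the sign, which is the property needed for the comparison with the observed meta-analytic log odds ratios.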
System-justifying beliefs. Next, forecasters completed the general system justification scale (Kay & Jost, 2003), where high overall scores reflect a tendency to justify the existing social order ("In general, society is fair"; 1 = strongly disagree to 7 = strongly agree), and low scores reflect a rejection of social hierarchy and commitment to egalitarianism. They further completed the gender system justification scale (Jost & Kay, 2005), which features similar items specifically adapted to refer to gender inequality ("In general, relations between men and women are fair").
Demographics. Finally, all forecasters reported their political orientation, both overall and separately for economic and social issues (1 = very liberal to 7 = very conservative), gender (female, male, other), age, and education level. Academic forecasters further indicated their academic career stage (e.g., Graduate Student, Postdoctoral Scholar, Teaching Faculty, Assistant Professor, Associate Professor, Professor), the year they received or expected to receive their PhD, their field of specialization, whether or not they currently held a tenured position, their number of publications on relevant topics (e.g., prejudice and discrimination, gender, race, and implicit bias), their total number of peer-reviewed publications, and the number of times they had taught a graduate level statistics or methods course. They further subjectively rated their proficiency in statistics relative to other academics (1 = much lower than average to 9 = much higher than average), and familiarity with research on gender discrimination (1 = not at all familiar to 9 = extremely familiar).
See the Open Science Framework (https://osf.io/ds6r2/) and the Supplementary Online Materials for the complete survey materials and pre-registered analysis plan for Study 2. The analyses below were preregistered, unless explicitly otherwise noted. In contrast to Study 1's meta-analysis, for Study 2's forecasting survey we pre-registered both the traditional significance threshold of p < .05 and the more conservative p < .005 advocated by Benjamin et al. (2018). Some members of the forecasting team, and none of the meta-analysis team, are signatories to Benjamin et al. (2018), thus this was a compromise between different sub-teams of the larger project.

Absolute levels of accuracy
Forecasted results (Study 2) are shown in Fig. 7 alongside the realized effect sizes from the meta-analysis of hiring audits (Study 1). For all forecasted log odds ratios, the mean is statistically significantly different from zero (one-sample t-tests) and is significantly different from the observed effects (two-sample z-tests). These p-values are summarized in Table S6-1 and Table S6-2 in the Supplementary Online Materials.
As seen in Fig. 7, forecasters correctly anticipated the moderating role of job stereotypicality, such that discrimination against women relative to men is comparatively greater in male-typed plus gender-balanced jobs than in female-typed jobs. A paired t-test was used to compare the forecasters' log odds ratios for male-typed/gender-balanced jobs for the entire time period with the log odds ratios for female-typed jobs for the entire time period. We find a statistically significant difference in both the academic sample (mean of differences: 2.16, t(311) = 21.2, p < .001, d = 1.97) and the layperson sample (mean of differences: 3.31, t(498) = 29.1, p < .001, d = 2.04).
Forecasters correctly believed that discrimination against women relative to men has decreased over time for male-typed plus gender-balanced jobs. A paired t-test was used to compare the forecasters' log odds ratios for male-typed/gender-balanced jobs for the first time period with the log odds ratios for the last time period. We find a statistically significant decrease in both samples (mean of differences in the academic sample: 1.38, t(311) = 19.0, p < .001, d = 1.04; mean of differences in the layperson sample: 2.41, t(498) = 27.1, p < .001, d = 1.20). In addition, they incorrectly believed that discrimination against male candidates for stereotypically female-typed jobs has diminished substantially over time (mean of differences in the academic sample: -0.95, t(311) = -11.5, p < .001, d = 0.62; mean of differences in the layperson sample: -1.34, t(498) = -11.6, p < .001, d = -0.57).
At the same time, forecasters overestimated the overall degree of stereotype-consistent gender discrimination that would be observed in Study 1's meta-analysis. Testing the forecasted log odds ratios for male-typed/gender-balanced jobs against zero in a one-sample t-test reveals that forecasters believed that men experience more positive job application outcomes than women for male-typed plus neutral-typed jobs (academic sample: mean forecasted log odds = 1.15, SE = 0.06, p < .001; laypeople sample: mean forecasted log odds = 1.79, SE = 0.07, p < .001). A z-test comparing the mean of the forecasted log odds ratios for male/neutral-typed jobs against the discrimination effect sizes from the meta-analysis further shows that forecasters overestimate the extent of discrimination for such jobs (academic sample: z = -18.98, p < .001; laypeople sample: z = -24.39, p < .001). In addition, forecasters correctly believed that women experience more positive job application outcomes than men for female-typed jobs (academic sample: mean forecasted log odds = -1.01, SE = 0.07, p < .001; laypeople sample: mean forecasted log odds = -1.52, SE = 0.08, p < .001), yet anticipated relatively greater discrimination against male candidates for such jobs than was actually observed (academic sample: z = 8.68, p < .001; laypeople sample: z = 13.66, p < .001).
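The z-tests above compare a mean forecast with a meta-analytic estimate. A minimal sketch of one common form of such a test is given below, assuming the two estimates are independent and that the statistic is computed as observed minus forecasted (an assumption that is at least consistent with the negative z-values reported for male-typed/gender-balanced jobs, where forecasts exceeded the observed effect). The function names and the numeric inputs are hypothetical and are not taken from Study 1 or Study 2.

```python
import math


def z_observed_vs_forecast(observed, se_observed, forecast_mean, se_forecast):
    """z statistic for the difference between two independent estimates.

    Assumed convention: observed minus forecasted, so a negative z means
    the forecast was larger than the meta-analytic estimate.
    """
    pooled_se = math.sqrt(se_observed ** 2 + se_forecast ** 2)
    return (observed - forecast_mean) / pooled_se


def two_sided_p(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical inputs chosen only to illustrate the arithmetic.
z = z_observed_vs_forecast(observed=0.0, se_observed=0.03,
                           forecast_mean=1.0, se_forecast=0.04)
p = two_sided_p(z)
```

With these illustrative inputs the pooled standard error is 0.05, so a one-unit gap between forecast and observation corresponds to a z statistic of about -20.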

Correlational accuracy
Distinct from perceptions of absolute levels of discrimination, we examined whether there is a positive overall association between the predictions of forecasters and the meta-analytic results. We test this hypothesis in an OLS regression where the individual forecast is included as an independent variable and the estimated meta-analytic gender discrimination in the forecasted time period and job type is the dependent variable. For the individual forecasts, we include the three time period predictions for female-typed jobs and the four time period predictions for male-typed/gender-balanced jobs. The forecast for the second time period for female-typed jobs is not used, because the corresponding meta-analytic effect size is missing due to a lack of audit studies during that specific span of years (see Fig. 7). We therefore have seven observations per forecaster. We include individual fixed effects in the OLS regression and cluster standard errors at the forecaster level (with the number of clusters equal to the number of forecasters) to take into account that each forecaster makes several predictions, and these predictions might be correlated. We observe a statistically significant positive correlational relationship between forecasts and observed meta-analytic outcomes for both the sample of academics (coefficient = 0.09, t = 17.5, p < .001) and the layperson sample (coefficient = 0.06, t = 34.6, p < .001). Thus, while forecasters expected much larger effects in absolute terms than emerged in the meta-analysis, there is a positive correlational relationship between their predictions and the realized results.
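To make the fixed-effects specification concrete, the sketch below computes only the point estimate of the slope, using the within (demeaning) transformation that is numerically equivalent to including one dummy per forecaster in OLS. It is an illustration under stated assumptions rather than the analysis code: it omits the clustered standard errors, assumes every forecaster supplied all predictions, and uses hypothetical data and a hypothetical function name.

```python
def within_slope(forecasts_by_person, observed):
    """Slope from regressing observed effects on forecasts with
    individual fixed effects (within estimator).

    forecasts_by_person: list of per-forecaster lists, one forecast per
    time period/job type cell, aligned with `observed`.
    observed: meta-analytic log odds ratios for the same cells.
    """
    y_bar = sum(observed) / len(observed)
    num = 0.0
    den = 0.0
    for forecasts in forecasts_by_person:
        # Demeaning within forecaster removes that forecaster's intercept.
        x_bar = sum(forecasts) / len(forecasts)
        for x, y in zip(forecasts, observed):
            num += (x - x_bar) * (y - y_bar)
            den += (x - x_bar) ** 2
    return num / den


# Hypothetical data: both forecasters track the pattern of observed
# effects but predict changes twice as large as those realized.
observed = [0.0, 1.0, 2.0]
forecasts = [[1.0, 3.0, 5.0], [0.0, 2.0, 4.0]]
slope = within_slope(forecasts, observed)
```

In this toy example the slope is 0.5: forecasts rise and fall with the observed effects (a positive association) even though their magnitude is exaggerated, which mirrors the distinction between correlational and absolute accuracy drawn in the text.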

Individual differences in accuracy
Further analyses examined whether individual forecaster characteristics moderate the accuracy of their predictions. These included whether they were a trained scientist or layperson, their political orientation, and their endorsement of system justifying beliefs, among other potential moderators. Of particular interest was whether ideological beliefs, on either side of the spectrum, introduce systematic error that undermines the wisdom of the crowd effect typically observed in forecasting settings (Dreber et al., 2015).

Scientists versus laypeople.
For each survey-taker, the accuracy of each forecasting question is quantified as the squared difference between the prediction and the observed estimate in the meta-analysis. We estimate the mean squared prediction error of each forecaster for the nine verifiable predictions and then test if this differs between scientists and laypeople using an independent samples t-test. We find that the mean squared prediction error is significantly smaller for the academic forecasters compared to the layperson sample (means = 2.86 vs. 7.27, t(803) = -10.3, p < .001). This is because laypeople gave more extreme and therefore less accurate estimates than the academics (see Fig. 7).
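The accuracy measure can be sketched as follows. The first function computes a forecaster's mean squared prediction error across the verifiable predictions; the second computes a pooled-variance (Student) t statistic for comparing two groups of such errors. Whether the reported test pooled variances or used a Welch correction is not stated in the text, so the pooled form is an assumption, as are the function names and the example values.

```python
import math
from statistics import mean, variance


def mean_squared_prediction_error(forecasts, observed):
    """Mean of squared differences between forecasted and observed
    log odds ratios, skipping cells where either value is missing."""
    pairs = [(f, o) for f, o in zip(forecasts, observed)
             if f is not None and o is not None]
    return mean((f - o) ** 2 for f, o in pairs)


def pooled_t(group_a, group_b):
    """Student's independent-samples t statistic and degrees of freedom
    (assumes equal variances; an assumption, not the reported method)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    t = (mean(group_a) - mean(group_b)) / math.sqrt(
        pooled_var * (1.0 / n_a + 1.0 / n_b))
    return t, n_a + n_b - 2


# Hypothetical forecaster who overshoots two observed effects.
error = mean_squared_prediction_error([1.0, 2.0], [0.0, 0.0])

# Hypothetical per-forecaster errors for two small groups.
t_stat, df = pooled_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

Because errors are squared, a forecaster who gives extreme estimates is penalized heavily, which is consistent with the explanation that laypeople's more extreme forecasts produced their larger mean error.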
Political orientation. To assess the political orientation of each forecaster, we averaged their responses to the three questions about their overall, social, and economic political orientation. We estimate an OLS regression with the mean squared prediction error of each forecaster as the dependent variable and the political orientation variable as the independent variable. The OLS regression is estimated with clustered standard errors, and the test is carried out as a t-test on the coefficient of the political orientation variable in the OLS regression. We find no statistically significant effect of political orientation on forecasting accuracy in the academic sample (coefficient = 0.048, t = 0.25, p = .80) or in the layperson sample (coefficient = -0.33, t = -1.52, p = .13).
General system justification. For the academic sample, we find that individual differences in system justification are associated with a reduction in error (coefficient = -0.46, t = -2.48, p = .014). For the U.S. layperson sample, with increasing endorsement of the items on the general system justification scale, the error likewise decreases (coefficient = -0.83, t = -2.65, p = .008). Since a negative coefficient reflects fewer errors, this means high system justification scores are associated with greater accuracy in forecasting. Note however that these associations are only statistically significant according to the conventional p < .05 threshold, not under the stricter p < .005 threshold (Benjamin et al., 2018) we also pre-registered for the forecasting analyses (see S10 in the Supplementary Online Materials).
Fig. 7. Observed versus forecasted results of the gender audits meta-analysis. Observed and mean forecasted log odds ratios from the academic and U.S. nationally representative samples. Positive log odds ratios denote higher callback rates for male candidates than for female candidates, and negative log odds ratios denote higher callback rates for female candidates than for male candidates. Note that there is no observed meta-analytic result for female-typed jobs for the period 1987-1997 due to a lack of relevant field audits during that span of years.
Gender system justification. For the academic sample, we find a statistically significant relation between individual differences in gender system justification and the accuracy of predictions (coefficient = -0.45, t = -1.98, p = .049) when the traditional p < .05 threshold is used, but not when the more conservative p < .005 threshold is employed (Benjamin et al., 2018). For the U.S. layperson sample, we observed that with increasing endorsement of the items on the gender system justification scale, the error decreases significantly (coefficient = -1.02, t = -2.94, p = .003), regardless of which threshold is used.
Notably, however, for the academic sample the association between both general and gender system justification and forecasting accuracy did not survive robustness tests (see S12 in the Supplementary Online Materials). In contrast, there is more consistent evidence that lay forecasters who were less gender egalitarian made more accurate forecasts about gender discrimination in hiring.
Topic expertise (not preregistered). We categorized forecasters who had published at least one paper on gender as gender researchers (N = 132). A comparison group of 168 forecasters had no published work on the topic. Using an independent samples t-test, we find no statistically significant difference in the mean forecasting error between the two groups (mean of 2.70 for non-gender researchers vs. a mean of 2.90 for gender researchers, t(250) = 0.037, p = .71).
Forecaster gender (not preregistered). We find that accuracy differs between male and female forecasters in the U.S. representative sample, with women exhibiting a significantly higher forecasting error: 8.36 for women vs. 6.18 for men, t(459) = 3.14, p = .001. In the academic sample, accuracy did not vary statistically significantly with forecaster gender.

Discussion
In sum, the average forecasts from both the academic and lay samples show that they expected higher callback rates for male candidates (relative to female candidates) for male-typed/gender-balanced jobs, and higher callback rates for female candidates (relative to male candidates) for female-typed jobs. The strength of this stereotype-consistent discrimination was expected to decline from the earliest to the most recent time period, yet remain robust in recent years. Laypeople expected stronger effects compared to academics, yet both groups overestimated the severity of biases in hiring relative to the meta-analytic estimates from Study 1, most notably for the most recent time period of 2009-2020. Despite some errors in anticipating absolute levels of discrimination, a significant correlation between forecasted and realized results was observed. Scientists with a track record of publishing research on gender were not more (or less) accurate in their predictions than other academics. Some evidence emerged that less gender egalitarian laypersons were more accurate in their beliefs regarding gender biases in selection, but this effect was not robust in the academic sample and more research is needed before drawing strong conclusions. Further analyses of the forecasting results, including robustness tests, are provided in S12 in the Supplementary Online Materials.

General discussion
The results of Study 1's meta-analysis of 244 effect sizes based on 85 field audits and 361,645 individual job applications across 44 years and 26 countries and territories indicate that outcomes for female candidates have become more positive over time. Until relatively recently, we observe directional preferences for men in hiring and selection for many roles. However, such discrimination against female applicants has diminished over the years in some developed societies, culminating in either no gender bias or a slight reversal in favor of female job candidates depending on the type of job and specific span of years examined. It is important to emphasize that the directional preference for female candidates that we observe in some recent time intervals is based on exploratory analyses, and was absent for stereotypically male-typed and gender-balanced jobs, where no gender bias in either direction was found.
The lack of an inflection point or sudden change in selection decisions associated with the advent of the #MeToo movement indicates that the observed decline in discrimination against women is the product of longstanding social forces rather than recent events. Returning to the question with which we opened this article, although it has been a long process, at least some societies have truly experienced meaningful change. Tellingly, however, male candidates for stereotypically female-typed jobs (e.g., secretary or elementary school teacher) did not receive more favorable outcomes in recent years relative to past decades. Thus, the results of the meta-analysis provide evidence of cultural stability as well as plasticity and speak to the continuing importance of gender in organizational selection decisions.
As in prior research (e.g., DellaVigna & Pope, 2018; Dreber et al., 2015), Study 2's forecasting survey revealed a significant positive correlation between predicted and realized effect sizes for academics and a nationally representative sample of U.S. laypeople. Forecasters correctly anticipated the moderating role of the gender stereotypicality of the job (i.e., male-typed, gender-balanced, or female-typed occupations). At the same time, both groups of forecasters overestimated absolute levels of gender biases in selection decisions. Scientists predicted smaller effect sizes and were for this reason comparatively more accurate than laypeople in this regard. The forecasters correctly anticipated a decline in stereotype-consistent discrimination against female candidates since the late 1970s, but incorrectly expected that bias against male candidates for female-typed jobs would progressively diminish as well. Consistent with cultural and intellectual narratives of pervasive prejudice, both laypeople and academics believed that significant discrimination against female candidates for male-typed and gender-balanced jobs would be observed across the most recent time period (2009-2020). Scientists with higher levels of expertise in gender stereotyping, as evidenced by research publications on the topic, forecasted results for 2009-2020 along the same lines. This and other recent cases of misprediction regarding the outcomes of pre-registered tests of gender bias (Tierney et al., 2020; 2021) could result from ideological blind spots reducing forecasting accuracy in this domain. Consistent with this idea, lay forecasters who strongly rejected system justifying statements regarding gender (i.e., scored especially high in gender egalitarianism) were the least accurate at predicting the meta-analytic findings. This effect of gender system justification was conventionally statistically significant (p < .05) yet not robust to alternative analyses (see S12 in the Supplementary Online Materials) and more conservative significance cutoffs (Benjamin et al., 2018) in the academic sample. Regardless of the underlying contributors to predictive (in)accuracy, the forecasting survey indicates the meta-analytic results for recent years are profoundly counter-intuitive, even to experts, and not at all obvious based on common scientific knowledge regarding contemporary gender biases.

Mechanisms of change
Field audits are better suited to documenting the prevalence of discrimination than to elucidating process. At the same time, that discrimination against male applicants for female-typed jobs has remained constant over the last 44 years indicates gender has not become irrelevant in contemporary workplaces. Indeed, diversity and inclusion goals, which aim to build awareness and make gender top-of-mind, may contribute to the observed cultural changes with regard to treatment of female applicants for male-typed and gender-balanced jobs. If decision makers were factoring in candidate gender less across-the-board, discrimination would have faded away across the years regardless of job type (i.e., stereotypically male, relatively gender balanced, or stereotypically female occupations and roles).
Instead, contemporary evaluators appear to be making efforts to specifically increase female representation in the organization, rather than seeking to challenge stereotypes and traditional roles for both genders. Supporting this idea with independent evidence, recent research reveals muted moral concerns about male underrepresentation in traditionally female jobs, due to the perceptions that such roles are low in status and that men are not motivated to obtain them (Block et al., 2019; Reynolds et al., 2020; Stewart-Williams, Wong, Chang, & Thomas, 2022). Thus, some organizational decision makers may seek to redress a long history of discrimination and ongoing underrepresentation by supporting female candidates (Block et al., 2019; Leslie et al., 2017), yet fail to extend the same support to men whose professional interests challenge traditional gender roles.
Another likely contributor is selective shifts in stereotypes. The previously widespread belief that women are less competent than men is no longer observable in representative surveys (Eagly et al., 2020). Reductions, and in some nations full reversals, of gender gaps in education levels occurred during the period studied, eroding the motivation to engage in statistical discrimination based on perceived group differences in skills and human capital. Yet at the same time, the belief that men are less communal than women has not only failed to fade away over the years, it has in fact intensified (Eagly et al., 2020). Many female-typed jobs (e.g., elementary school teacher) are perceived as communally demanding, likely contributing to ongoing discrimination against male applicants for such positions. A full elucidation of underlying mechanisms awaits more controlled laboratory investigations, for example via contemporary replications of classic gender bias experiments featuring rigorous tests of potential moderators and mediators.

Limitations and non-implications
Our most important data limitation is the comparatively smaller number of audits before 2000, and especially before 1980, compared to more recent years where more precise estimates are possible (see Quillian et al., 2017, for a similar temporal distribution of audits of racial bias). Of particular concern, Study 1's leave-one-out analysis finds that omitting a single large early study renders the overall time trend nonsignificant. Although readers can judge for themselves, we believe a large effect size for discrimination against female job applicants in a rare well-powered study from 1978 is highly representative of the widespread discrimination against women during that period (Avent-Holt & Tomaskovic-Devey, 2012; Blau & Kahn, 1997; Eversley & Habell-Pallán, 2015; Snipp & Cheung, 2016; Stanley & Jarrell, 1998) and important data to include in the meta-analysis. The study in question features by far the largest sample from before 1980, to the point that arbitrarily deleting it from the meta-analysis excludes 93% of pre-1980 applications without any real justification. An argument must also be followed where it leads. Deleting the large 1978 study and rejecting the conclusion of a time trend necessitates also concluding little to no discrimination against female applicants for stereotypically male-typed and gender-balanced jobs prior to 1980. Rejecting the time trend also does not question another key finding: contrary to popular and scientific beliefs, there is no evidence of recent discrimination in callback rates against female job candidates in the nations sampled. If there is no time trend, both scientists and laypeople are even more inaccurate in their theories of bias against female candidates, not only misestimating present day discrimination, but also past discrimination and cultural trajectories over time as well.
Although we believe the data does support a downward time trend, pinpointing exactly when anti-female discrimination in selection decisions reached zero in the societies in question may not be possible due to data limitations. Our sample of pre-1980 observations is neither large in absolute terms nor in comparison to recent large-scale audit studies. In general, we face rapidly mounting uncertainty in meta-analyzing the literature the further we go back in time. The available set of field audits suggests that selection bias against women for male-typed and gender-balanced jobs faded away in 2009, but this conclusion may be unduly affected by one older study. The actual year could be earlier, or later, and likely differs across societies based on unmeasured moderators we are unable to capture or test due to inadequate sample sizes of older audit studies within each nation. Although drawing strong inferences about past discrimination is challenging, as discussed in greater depth below, the scientific community is well positioned to do rigorous new work testing the robustness and direction of current gender biases in selection decisions.
At the same time, we warn against interpreting our meta-analytic results to conclude equality of treatment of female applicants has been achieved with regard to historically male-typed and gender-balanced jobs, and that current efforts to increase the proportion of female employees in such roles are no longer needed. Our data did not examine the consequences of abandoning current policies, and doing so risks increasing gender bias in the future. If organizations decide to discontinue their diversity, equity, and inclusion (DEI) efforts with regard to gender, or individuals stop making the effort to override their own sexist biases, one potential result is a slide back to discrimination against qualified female applicants. A point estimate for gender discrimination close to zero in some contemporary societies also does not mean that all the industries and organizations within those societies are free of bias. It is not possible to make generic recommendations given the large heterogeneity observed in the effect sizes, and the decision to pursue inclusive hiring needs to be made on an organization-by-organization basis. Firms that experience an upward trend in hiring women may experience backlash against this increased diversity among members of historically privileged groups (Craig & Richeson, 2014; Danbold & Huo, 2015; Dover, Major, & Kaiser, 2016). Further, given the continuing discrimination against male applicants for female-typed occupations, it is important to work to improve the social acceptability and presence of men in jobs such as social worker, nurse, preschool educator, and receptionist. Even without gender bias in selection into jobs, implicit barriers remain in place that could reduce female representation in male-typed jobs and male representation in female-typed jobs. For example, if male nurses and secretaries are perceived to violate prescriptive gender norms and suffer backlash effects, then there should be relatively fewer male applicants for such roles even in those organizations that would not have been averse to hiring them.
We find evidence of an improvement in entry-level job application outcomes for female candidates over time, as well as no overall bias against women in callback rates over the last decade. However, gender gaps may persist for other outcomes besides employer responses to initial first-round job applications. Organizations may balance their shortlists of candidates, perhaps due to DEI initiatives, and then proceed to make biased final selection decisions. Gender bias may also persist for high-level, lucrative roles, such as executive positions or elite jobs requiring specialized experience and background, for which audit studies with bogus applicants are not feasible at scale. Unfair gaps between women and men also occur across further dimensions such as wages (Auspurg, Hinz, & Sauer, 2017; Joshi, Son, & Roh, 2015; Bar-Haim, Chauvel, Gornick, & Hartung, 2018; Ceci et al., 2023; Obloj & Zenger, 2022), advancement within firms (Goldin, Kerr, Olivetti, & Barth, 2017), career penalties for parenthood (Dias, Chance, & Buchanan, 2020), and becoming the target of sexual harassment (Quick & McFadyen, 2017), among others. Even superficially gender-neutral performance criteria can create unfair gender disparities if they leave parents and caregivers at a competitive disadvantage (Cheryan & Markus, 2020). Further, the studies included in our meta-analysis examined discrimination against cisgender individuals, and transgender applicants may experience far more mistreatment on various fronts in employment settings (e.g., James et al., 2016).
Contemporary selection-stage biases against women are also probable in nations higher in gender inequality or on a different cultural trajectory (Norris & Inglehart, 2004) than those captured in the audit studies conducted to date and included in this meta-analysis. The present set of audit studies oversampled nations with relatively low levels of gender inequality by global standards (i.e., North America, Western Europe, developed regions of Asia Pacific). The median gender inequality index of the nations included in the meta-analysis was 0.15 (first quartile: 0.08; third quartile: 0.24), placing them toward the less unequal end of the distribution with regard to leadership representation, wages, and educational attainment (i.e., the gender inequality index of the 162 tracked countries ranged from 0.03 to 0.82 between 1995 and 2019; United Nations Development Programme, 2020). Thus, the nations included in the meta-analysis were disproportionately WEIRD (Western, Educated, Industrialized, Rich, and Democratic; Henrich, Heine, & Norenzayan, 2010; Pitesa & Gelfand, 2023), because these were the places where audit studies were conducted. Although national-level inequality, development, and culture variables did not moderate the effect in our sample, we would expect to see more overall discrimination against women for stereotypically male-typed and gender-balanced jobs, fewer total female-typed jobs, and either less change or no change over time in societies with persistently strong gender roles and norms.
Even in societies where the goal to be inclusive towards women plays a major role in deliberative selection processes, concurrently operating culturally socialized stereotypes can influence judgments when the motivation or opportunity to control prejudice is weak (Banaji & Greenwald, 2013; Crandall & Eshleman, 2003; Fazio, 1990; Gaertner & Dovidio, 1986). The high observed heterogeneity in estimates across audit studies indicates that the presence and direction of gender discrimination are likely contingent on other unobserved factors. Such moderators may include evaluator motivations, candidate qualifications, job characteristics, and organizational and national culture, among others. In light of the present findings regarding moderation by job stereotypicality and related characteristics, discrimination against female candidates may persist in very strongly male-typed occupations that require physical strength, such as certain roles in construction work and manufacturing.
Another noteworthy limitation of audit studies stems from the random assignment of candidates to different professional characteristics (e.g., strength of qualifications, type of training, employment status) and demographics (e.g., gender, race, age, physical attractiveness, social class, parental status). Although this allows for tests of causality using richly detailed materials, it can render the sample of applicants nonrepresentative of a particular labour market's actual pool of candidates. In addition, because many audit studies manipulate multiple candidate characteristics at once without clear neutral (no-information) conditions, testing simple effects of target gender in the absence of other manipulated characteristics is often not feasible. Following previous meta-analyses of audit studies (Flage, 2018; Lippens, Vermeiren, & Baert, 2023; Koch, D'Mello, & Sackett, 2015; Quillian et al., 2017; Quillian & Lee, 2023; Zschirnt & Ruedin, 2016), we therefore calculated the main effects of target gender across all other candidate characteristics. This allowed us to preserve the full sample of studies and carry out informative tests of job type and trends over time.
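As a rough illustration of what a main effect of target gender across other manipulated characteristics can look like in practice, the following Python sketch pools callback counts over the conditions of a single hypothetical audit and computes an odds ratio with a normal-approximation 95% confidence interval. The function name, field names, and counts are invented for illustration only. Simple pooling of counts is a simplifying assumption that ignores any matched-pair structure, and it is not necessarily the procedure used in the meta-analysis, which relied on multilevel models. Consistent with the figures, odds ratios above 1 indicate a preference for male applicants.

```python
import math

def gender_odds_ratio(cells):
    """Pool callback counts across conditions and return (OR, CI low, CI high).

    `cells` is a list of dicts, one per manipulated condition (hypothetical
    field names). OR > 1 means male applicants received relatively more callbacks.
    """
    male_cb = sum(c["male_callbacks"] for c in cells)
    male_n = sum(c["male_applications"] for c in cells)
    female_cb = sum(c["female_callbacks"] for c in cells)
    female_n = sum(c["female_applications"] for c in cells)

    # 2x2 table: callbacks vs. no callbacks, by applicant gender
    a, b = male_cb, male_n - male_cb
    c, d = female_cb, female_n - female_cb

    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of the log odds ratio
    z = 1.96  # normal quantile for a 95% interval
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)

# Made-up counts for an audit that also varied parental status
example_cells = [
    {"male_callbacks": 30, "male_applications": 100,
     "female_callbacks": 20, "female_applications": 100},  # parents
    {"male_callbacks": 30, "male_applications": 100,
     "female_callbacks": 20, "female_applications": 100},  # non-parents
]

odds_ratio, ci_low, ci_high = gender_odds_ratio(example_cells)
print(f"OR = {odds_ratio:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

Working on the log scale keeps the interval symmetric around the log odds ratio before it is transformed back, which is the usual convention when odds ratios are later combined across studies.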

Why shifts in gender discrimination over time but not race discrimination?
One puzzling question is why a change in bias appears to have occurred for gender and selection for jobs, but not for race (e.g., Quillian et al., 2019; Quillian & Lee, 2023; Quillian et al., 2017; Rich, 2014). In some organizations a hierarchy of diversities may have emerged, such that gender is emphasized more strongly than other dimensions of inequality such as race and ethnicity, LGBTQ+ status, and socioeconomic background. Unlike sexual orientation and SES diversity, gender is perceived as observable and thus may be seen as having more signaling value. Especially in multinational firms, gender representation may be perceived by leaders as a "50-50 problem" and more straightforward to set numeric goals for than racial diversity, given the complex dynamics and varying demographics of race across societies (Sidanius & Pratto, 1999). However, gender and race are not separate dimensions of discrimination, and workers with multiple marginalized identities (e.g., Black women) can experience unique forms of mistreatment that arise at the intersection of these two identities (Purdie-Vaughns & Eibach, 2008).
Recent research directly demonstrates that perceived diversity value can mediate favorable judgments of female relative to male employees (Leslie et al., 2017), and provides evidence of organizations seeking to cynically accumulate just enough members of underrepresented groups in visible positions to manage public perceptions (Chang et al., 2019; Knippen, Shen, & Zhu, 2019; Naumovska et al., 2020). However, positive evaluations of women can also result from inferences about the candidates themselves. Some evaluators engage in "belief flipping," assuming that a female candidate, having overcome barriers that her male counterparts did not face, is superior on unmeasured variables such as work motivation (Fryer, 2007). The question then arises of why such favorable inferences are not made about members of other nonprototypical groups, such as racial minorities, or if made are insufficient to overcome discriminatory biases in hiring against them (Quillian & Lee, 2023; Quillian et al., 2017; Rich, 2014).

The need for pre-registered primary investigations and replications
Unlike many academic literatures, the present set of audit studies is not characterized by an overabundance of barely significant results, or implausibly large effect size estimates from small samples. Hence, publication bias likely did not have a major impact on the results of the meta-analysis, a conclusion supported when we applied multiple publication bias tests. Nevertheless, publication bias methods have their own limitations (Carter, Schönbrodt, Gervais, & Hilgard, 2019; Renkewitz & Keiner, 2019; van Aert, Wicherts, & van Assen, 2016), and although the meta-analytic approach was registered in advance, the audit studies included in our meta-analysis were generally not themselves preregistered. Thus, more strictly confirmatory experiments on group-based discrimination are needed, and eventually a meta-analysis of exclusively pre-registered investigations. Further, although the present set of field audits covered a wide array of industries, companies, and organizational roles, the positions targeted were neither sampled representatively nor systematically. Future field audits should ideally define the sample space of jobs in advance, for example positions at Fortune 500 companies that have been advertised online. Since preregistration and a well-defined sample space are far from eliminating all sources of research bias (Carter et al., 2019), future investigations would be particularly productive as adversarial collaborations between scholars endorsing opposing positions on the persistence of discrimination, who plan the methodology together (Clark, Costello, Mitchell, & Tetlock, 2022; Clark & Tetlock, 2022; Mellers et al., 2001).
Based on the results of the present meta-analysis, we speculate that many classic laboratory and field investigations documenting discrimination against women will no longer replicate (i.e., will yield aggregated effect size estimates close to zero) in cultural populations subject to positive change processes. To this end, our research group has recently launched a crowdsourced initiative (Klein et al., 2014) seeking to directly replicate influential experimental studies on situational and individual factors that trigger gender discrimination. Group-based discrimination represents a special case of replication since previously observed effects may not emerge in subsequent investigations due to progressive currents in the broader society (Eagly et al., 2020; Varnum & Grossmann, 2017), in addition to improvements in research practices and other common sources of non-replicability (Nelson et al., 2018). At the same time, the high heterogeneity of estimates in the present meta-analysis points to the moderation of gender discrimination by context, rather than the absence of bias. Thus, this crowd effort will focus on factors that may activate, counteract, and reverse gender biases. Discrimination against women may not be robust in baseline (control) conditions yet emerge when women self-promote and express ambition (Okimoto & Brescoll, 2010), promote diversity initiatives (Hekman, Johnson, Foo, & Yang, 2017; Rudman, Mescher, & Moss-Racusin, 2013), are parents (Benard & Correll, 2010) or pregnant (Bragger, Kutcher, Morgan, & Firth, 2002), or are labeled feminists (Roy, Weibust, & Miller, 2009) or affirmative action hires (Heilman, Block, & Stathatos, 1997), among other potential triggers.

Conclusion
The extent to which societies have experienced meaningful changes in how women and men are treated, and whether contemporary job candidates continue to face gender discrimination, are questions of tremendous theoretical and practical importance. The present meta-analysis finds that discrimination against female applicants for jobs historically held by men has declined significantly and was no longer observable in the last decade. In contrast, bias against male applicants for female-typed jobs has remained robust and stable over the years. These results thus demonstrate both welcome declines in and the stubborn persistence of different forms of gender discrimination. Contrary to the beliefs of laypeople and academics revealed in our forecasting survey, after years of widespread gender bias in so many aspects of professional life, at least some societies have clearly moved closer to equal treatment when it comes to applying for many jobs.

M. Schaerer et al.

Fig. 1. PRISMA flow diagram. The diagram depicts the information flow through the different phases of our systematic review, including the number of records identified, included and excluded, and the exclusion reasons.

Fig. 2. Number of audit studies across geographic regions and across time.World map visualizing the number of field audits included in our sample across different countries and territories (center) and across year of data collection (bottom left).

Fig. 4. Time trends of candidate gender preferences overall and by job type. The results are based on a multilevel meta-regression model including gender inequality, study design complexity, and author gender ratio as control variables. In all figures, odds ratios above 1 indicate a greater preference for male applicants and odds ratios below 1 indicate greater preference for female applicants. Error bands indicate the 95% confidence interval around the mean trend. The size of the circles is proportional to the number of applications represented by the respective data point.

Fig. 5. Candidate gender preferences by time period and job type. Effect sizes are grouped together depending on the year the applications were sent out and were combined using a univariate random-effects model. The top panel (Fig. 5A) shows the average odds ratio before 2009 and from 2009 onward, a split that corresponds to the theoretical crossover point of the time trend in Fig. 3. The bottom panel (Fig. 5B) compares the odds ratios using more granular time periods, including the post-#MeToo years of 2018-2020. Odds ratios above 1 indicate a greater preference for male applicants and odds ratios below 1 indicate greater preference for female applicants. Error bars indicate the 95% confidence interval around the average odds ratio that is based on a normal distribution. The size of the symbols is proportional to the number of effect sizes in the respective bin.

Fig. 6. Contour-enhanced funnel plot. This contour-enhanced funnel plot shows the relationship between the effect size estimates and their standard error. Shaded areas indicate the two-tailed p-value of a particular study. Log odds ratios above 1 indicate a greater preference for male applicants and odds ratios below 1 indicate greater preference for female applicants.

Table 1 Example Jobs by Gender Typicality.
The table provides example jobs for each category of job gender typicality (female-typed, gender-balanced, male-typed) as categorized by four human raters.Example jobs are presented in alphabetical order within each category.

Table 3 Overview of publication bias statistics.
The table reports the parameter estimates, 95% confidence intervals, and test statistics of three publication bias metrics: PET-PEESE, three-parameter selection model (3PSM), and p-uniform*. Empty cells indicate that this result was not reported by the particular method.