Computer says ‘no’: Exploring systemic bias in ChatGPT using an audit approach

Large language models offer significant potential for increasing labour productivity, such as streamlining personnel selection, but raise concerns about perpetuating systemic biases embedded into their pre-training data. This study explores the potential ethnic and gender bias of ChatGPT — a chatbot producing human-like responses to language tasks — in assessing job applicants. Using the correspondence audit approach from the social sciences, I simulated a CV screening task with 34,560 vacancy – CV combinations where the chatbot had to rate fictitious applicant profiles. Comparing ChatGPT’s ratings of Arab, Asian, Black American, Central African, Dutch, Eastern European, Hispanic, Turkish, and White American male and female applicants, I show that ethnic and gender identity influence the chatbot’s evaluations. Ethnic discrimination is more pronounced than gender discrimination and mainly occurs in jobs with favourable labour conditions or requiring greater language proficiency. In contrast, gender discrimination emerges in gender-atypical roles. These findings suggest that ChatGPT’s discriminatory output reflects a statistical mechanism echoing societal stereotypes. Policymakers and developers should address systemic bias in language model-driven applications to ensure equitable treatment across demographic groups. Practitioners should practice caution, given the adverse impact these tools can (re)produce, especially in selection decisions involving humans.


Introduction
The emergence of several large language models (LLMs), such as OpenAI's Generative Pretrained Transformer (GPT) or Google's Pathways Language Model (PaLM), has recently sparked scholarly and public debate on their implications (Nature, 2023;Teubner et al., 2023;The Economist, 2023a, 2023b;Thorp, 2023).LLMs, trained on large corpora of text data, are generative artificial intelligence (AI) applications that can mimic human-like responses to language tasks (Teubner et al., 2023).These algorithms have the potential to significantly impact labour productivity, transforming how we work by automating a myriad of tasks (Acemoglu et al., 2022;Agrawal et al., 2019;Brynjolfsson et al., 2023;Eloundou et al., 2023;Felten et al., 2023;Noy and Zhang, 2023;Teubner et al., 2023).For instance, LLMs can generate comprehensive reports or summaries from large volumes of text data.They may even replace substantial parts of human jobs.LLMs can help in customer support by providing first-level responses to customer inquiries based on user-inputted product or service manual text or responses to frequently asked questions.They can also assist in job applicant assessment by processing curricula vitae (CVs) and cover letters to match applicant qualifications with job requirements.Nevertheless, as these models continue to advance and become more deeply integrated into professional activities, it is vital to ensure that we use them responsibly and ethically, particularly when it comes to decision-making that directly impacts humans.
Recent research suggests that AI technologies, including LLMs, can assist with human resources (HR) practices such as hiring and selection (Budhwar et al., 2023;Vrontis et al., 2021).These technologies may help practitioners automate hiring decision-making, increase decision-making efficiency, or increase decision accuracy and fairness by bypassing human prejudice (Budhwar et al., 2023;Cooke et al., 2019;van Esch et al., 2019).The idea of automating part of the personnel selection process is not new; firms already use algorithms for hiring (Noble et al., 2021).Because LLMs are ideally suited to perform text processing and analysis, they can be used integrally or as part of a broader tool or platform for applicant profile screening purposes.Recently, Pisanelli (2022) examined the success of automated resume screening through a simple language model in a field setting.They found that automated screening diminished the gender interview invitation gap vis-à-vis human recruiters' manual CV screening by almost two-thirds, providing evidence that algorithms may indeed counter human-induced prejudice.
However, instead of reducing bias, LLMs could reproduce the systemic, historical biases embedded in their pre-training data (Budhwar et al., 2023;Caliskan et al., 2017;Cowgill et al., 2020;Peres et al., 2023;Rich & Gureckis, 2019;Schramowski et al., 2022).Examples include amplifying prejudiced views or extrapolating the underrepresentation of vulnerable groups based on the underlying human-created texts.Systemic discrimination induced by LLMs can further the negative cumulative impact of existing bias across domains and time, increasing group-based disparities (Bohren et al., 2022).A recent example of research examining algorithmic racial discrimination is the study of Arnold et al. (2021).They show that Black defendants received a considerably lower pretrial release rate from an advanced AI algorithm than White defendants despite identical pretrial misconduct potential.These observations raise concerns regarding the fairness and objectivity of AI-assisted selection activities and call for a deeper examination of such biases (Tambe et al., 2019).I focus on ChatGPT, a chatbot actuated by OpenAI's GPT-series models, and its ability to perform a CV screening task without exhibiting bias in its responses.There is no public information about the precise data the GPT-series models are pre-trained on besides that they rely on a comprehensive data corpus encompassing websites, books, manuals, fora, job boards, and other (online) content (Brown et al., 2020;OpenAI, 2023a;Teubner et al., 2023).Their text responses are fine-tuned through specific task prompts and feedback provided by the user.Like other LLMs, ChatGPT might perpetuate and reinforce biases about specific demographic groups it has learned from its training data's patterns, language, and concepts, resulting in discriminatory responses to said CV screening task.Economists refer to this assessment based on group characteristics as statistical discrimination.Hate speech in online fora (Bliuc et al., 2018;Castaño-Pulgarín et al., 2021;Ederer et al., 2023) or harmful stereotypes about minority groups in pre-existing job advertisements (Koçak et al., 2022;Wille and Derous, 2017), for example, can taint the models' training processes.The GPT-3 model instance has demonstrated anti-Muslim bias in word association tasks before, consistently linking Muslims with violence and terrorism (Abid et al., 2021).
The social sciences have a broad tradition of examining discrimination in hiring using correspondence audit studies (Bartkoski et al., 2018;Heath & Di Stasio, 2019;Lippens, Vermeiren, & Baert, 2023;Quillian et al., 2017Quillian et al., , 2019;;Quillian & Lee, 2023;Quillian & Midtbøen, 2021;Thijssen et al., 2021;Zschirnt & Ruedin, 2016).The correspondence audit method allows for a causal estimation of discrimination (Gaddis, 2018).In labour market research, this approach compares callback or invitation rates from recruiters or employers to fictitious job applicants who possess similar qualifications but differ in ascriptive characteristics, such as ethnic origin or race.For instance, to examine racial hiring discrimination, a common strategy involves submitting quasi-identical CVs of fictitious applicants to actual vacancies with differences in names indicating racial background.The most recent meta-analytic estimates, comprising worldwide experimental data on hiring discrimination, reveal that candidates signalling ethnic, racial, or national origin minority group membership receive, on average, 29% fewer positive responses from recruiters than their majority counterparts (Lippens, Vermeiren, & Baert, 2023).Across the examined subgroups, average penalties are as high as 41% (for Arabs) or as low as 8% (for Hispanics).
The current CV screening experiment with ChatGPT zooms in on ethnic and gender identity, given their status as the most researched and easily operationalisable discrimination grounds using name variations (Lippens, Vermeiren, & Baert, 2023).Specifically, I consider the hiring chances of fictitious male and female Arab, Asian, Black American, Central African, Dutch, Eastern European, Hispanic, Turkish, and White American applicants.The task consisted of prompting ChatGPT to assess the suitability of these fictitious candidates, only differing by name (signalling ethnic and gender identity), for a given job using the information provided in their CVs and the vacancy text.Aside from the name at the top of the CV and minor differences across CVs to ensure a fitting vacancy-CV match, the input remained identical.Subsequently, I asked ChatGPT to output ratings for each candidate and regarded differences in ratings as evidence of bias and, thus, discrimination.
This study has three main contributions.First, transposing the correspondence audit method from the field experiment literature into a valid alternative for detecting bias or discrimination in LLMs bridges the gap between social science and computer science research.Bias in LLMs has often been measured through word association tasks; for example, by prompting an LLM with the phrase 'a democrat is [male/female]' and asking the LLM to fill in the gender (Abid et al., 2021;Liu et al., 2022).The current experiment applies an existing, well-established method in audit research for measuring hiring discrimination through name association among humans to the context of AI.Second, the approach is methodologically valuable.Potential bias is experimentally exposed in a real-world selection task using a broad range of ethnically identifiable and gendered names linked to actual CVs and vacancies instead of employing a synthetic association task.The experiment also gains from the scalability and automation of LLMs-an advantage over field audit studies using human recruiters-while I could present similar candidate profiles successively to the model without concerns of spillover (e.g.'Will presenting one candidate affect the bias against another candidate?')or detection ('Will it become apparent that I am running an experiment?').Third, the study adds to mapping out bias in AI algorithms by quantifying to what extent discriminatory output produced by LLMs can sustain pre-existing differences and societal stereotypes in labour market outcomes based on personal identity.

Data and methods
I submitted ChatGPT to a simulated CV screening task.To this end, I (i) collected text data from existing job vacancies and CV templates, (ii) created candidate profiles by supplementing the templates with sets of validated names constructed specifically for experimental studies on ethnic and gender identity, and (iii) instructed the chatbot to assess these profiles, only differing by name, based on the requirements in the vacancies.Given the experimental setup, I analysed the data using standard regression techniques.Below, I describe the data-gathering process, the experimental setup, and the estimation strategy to identify potential bias in CV screening by ChatGPT.

Vacancies
The website of the Flemish public employment agency VDAB in the Dutch-speaking part of Belgium (https://www.vdab.be)was the starting point for retrieving vacancy text data.A total of 1,920 vacancies, balanced by occupation and experience level, were selected for the experiment.I chose 23 occupations across different industries to obtain a representative set of vacancies requiring different skills and professional experience.These occupations ranged from clerical service sector employees (e.g.administrative assistant, HR officer) to IT personnel (e.g.IT analyst, IT project leader) and logistics workers (e.g.industrial logistics planner, trucker-trailer driver).The experience level took on three distinct values: no experience (N = 690; 35.94%), at least two years of experience (N = 690; 35.94%), and at least five years of experience (N = 540; 28.13%).As a general rule, I sampled 30 recent vacancies per occupation and experience level to obtain some critical mass.The vacancies of five occupations with at least five years of experience as a functional requirement (e.g.seller of clothing accessories) appeared too infrequent in the vacancy set and were considered non-representative; these vacancies were consequently excluded from the experiment.Table A1 in the appendix provides an overview of counts by occupation and experience level.Examples of the extracted vacancy text can be retrieved from Table A2.
The vacancies further varied regarding job type, shift system, work hours, language requirements, and location.The most prevalent job types were permanent positions (N = 1,501; 78.18%) followed by interim or temporary positions (N = 382; 19.90%); other job types encompassed independent activities, student positions, and flex jobs.The most common shift system was day work (N = 1,823; 94.95%), with other shift systems including two-and three-shift systems, night work, interrupted service, and continuous systems.

CVs
Each vacancy was paired with 18 CV profiles of fictitious job candidates (differing in ethnic and gender identity) possessing the educational background and professional experience required to perform the job in the vacancy adequately.Table A6 in the appendix shows representative examples of CV text.Like the vacancies, the CV text template was sourced from the Flemish public employment services website.The CV text contained standard information typically found in a resume, such as a residential address, e-mail address, phone number, birth date, nationality (i.e.always Belgian), vehicle ownership and driving ability, bachelor-level degree, general personal characteristics, and language and computer skills.This information remained identical across the 34,560 vacancy-CV combinations to keep the variation between CVs to a minimum and to isolate the effect of ethnic and gender identity.Between vacancies, CVs varied by the specialisation and graduation year of the bachelor's degree and the type and duration of work experience to guarantee a suitable vacancy-CV match.
Within vacancies, CVs only varied in ethnic and gender identity, with all other CV and application details identical.These characteristics were added using candidate names randomly matched to CVs based on assigned ethnic and gender identity.The distinct names signalled diverse ethnic identities-i.e.Arab, Asian, Black American, Central African, Dutch, Eastern European, Hispanic, Turkish, and White American-each accounting for 11.11% of the candidate profiles, and two genders-i.e.female and male-each accounting for 50% of the profiles.Here, Dutch refers to the language spoken in (the Flemish part of) Belgium and The Netherlands, amongst other countries, rather than the country of origin (i.e.The Netherlands).
I drew names from five sources.The first series comprising Asian male, Black American, Hispanic, and White American names were acquired from the recently published dataset of Crabtree et al. (2023).They compiled an extensive set of validated names for use in name experiments based on surveys conducted in the United States, accounting for confounding factors beyond intended race, such as socioeconomic status.Additional Hispanic female first names were taken from the name categorisation test of Gaddis (2017) to increase variety vis-à-vis the single Hispanic female first name in Crabtree et al. (2023).Sets of validated Arab full names and Asian and White American female first names were sourced from Baert et al. (2022), who conducted a similar name experiment.The latter set of names was again used to increase variety due to the limited number of Asian and White American female first names in Crabtree et al. (2023).Dutch names were obtained from Lippens, Dalle, D'hondt, Verhaeghe, & Baert's (2023) correspondence audit study and Martiniello and Verhaeghe's (2022) name perception experiment.Finally, Central African, Eastern European, Turkish, and additional Arab names were sourced from Martiniello and Verhaeghe (2022).

Prompts
The experiment consisted of sending prompts and receiving and logging model output.I repeated this process until all vacancy-CV combinations were presented to ChatGPT.The prompts in the experiment included instructions, the vacancy text, and the CV profile text.
The only differences between CVs were the assigned ethnicity and gender, signalled through names, while all other details remained the same (see Section 2.1.2).All input was written in Dutch; Lai et al. (2023) have shown that the accuracy and performance of ChatGPT in Dutch on specific language tasks are on par with its accuracy and performance in English.
Conducting the experiment in Dutch also enabled the identification of assumed languagespecific discrimination between two majority groups (i.e.White American vs. Dutch candidates) and ethnic-specific discrimination between majority and minority groups (e.g.

White American vs. Black American applicants).
The general instruction ordered ChatGPT to help select suitable candidates: "You assist in making personnel selection decisions.Based on the vacancy below and the candidate's CV, you should indicate how likely you would invite the candidate for an interview".To obtain an outcome variable at the ratio level, ChatGPT was directed to "answer only with a score ranging from 1 (very unlikely) to 100 (very likely)", i.e. an integer.This rating was eventually labelled the 'interview invitation score' in the analyses.Because it was ambiguous how imposing different cutoff scores on ChatGPT's output would influence bias in the eventual selection decision, I carried out post-hoc sensitivity analyses to evaluate the impact of proprietary user-imposed thresholds (see Section 2.3.4 and Section 3.4).

Data exchange
I conducted the correspondence audit-like experiment using OpenAI's Application Programming Interface (API) instead of using ChatGPT's web interface (https://chat.openai.com).The API facilitated the data exchange by automatically transmitting text prompts to ChatGPT and receiving generated responses in return (OpenAI, 2023b).Connection with the API was made through R relying on the {httr} and {jsonlite} packages.
Automating the experiment using the API over ChatGPT's web interface had two significant advantages.The first advantage was the scalability of the design.Compared to regular audit studies with human subjects, I could present a relatively high number of applicants to ChatGPT in a short time.For reference, the automated sequential sending of the 34,560 vacancy-CV combinations was completed in approximately eight hours (in June 2023).Second, the absence of chat history or memory due to the successive and isolated presentation of vacancy-CV combinations to ChatGPT prohibited the chatbot from generating task-trained responses that could undercut the experiment's validity.Because of the absence of this spillover, it was possible to send identical applications, with just the change in treatment, rather than needing to create substantial differences in candidate profiles or CV templates.This approach improved the precision of the results over audit experiments in the field, where differentiation is needed to avoid detection by human recruiters.

Language model
At the time of the experiment, OpenAI's had two main models in its GPT series: GPT-3.5(-Turbo) and GPT-4.Both models were trained on data primarily sourced from the internet, comprising online articles, websites, and other texts, and included information up to September 2021 (OpenAI, 2023a).Although GPT-4 showcased advancements in certain language tasks, GPT-3.5 offered abundant capability and efficient resource utilisation.
Importantly, GPT-3.5 was more accessible, being available free of charge to the broader public, in contrast to GPT-4, which required a paid subscription for access through its web interface.This widespread accessibility of GPT-3.5 underlined its relevance to the average user and supported its use as the model instance for experimentation in this study.For reproducibility purposes, I relied on the 13 June 2023 snapshot of the GPT-3.5 model, which preserved the pre-trained language model's state as of that date.
Notably, the GPT models were equipped with safeguards to prevent unintentional discrimination, especially when directly comparing similar or nearly identical job candidates.
For example, when prompting the GPT-3.5 instance of ChatGPT via its web interface to compare quasi-identical profiles-one Dutch and the other Black American-ChatGPT's response accurately identified minimal differences between the candidates, such as their ethnic background.More explicitly instructing ChatGPT to evaluate the candidates based on this ethnic distinction, the output read: "As an AI language model, I cannot discriminate against candidates based on their background […] It is important to evaluate candidates based solely on their qualifications, skills, and educational backgrounds relevant to the job position.Therefore, it would not be appropriate to assess or score the candidates based on their backgrounds".However, the experimental approach in this study, using isolated prompts exchanged via the API without directly comparing fictitious candidates, circumvented these safeguards and enabled an unrestrained evaluation of the potential ethnic and gender bias in ChatGPT's output.I assume HR professionals would use ChatGPT similarly by sequentially presenting qualifying candidates to the model, possibly intermittently and, thus, isolated.

Sampling strategy
I also altered the model temperature (or sampling strategy) to examine its moderation effect on ChatGPT's bias.This temperature parameter influences the randomness or creativity of ChatGPT's output.Low temperatures produce more deterministic responses based on the chatbot's training data patterns, while higher temperatures generate increased stochastic outputs.The impact of different temperature settings or sampling strategies on ChatGPT's bias is unclear, as raising the temperature may reduce common pre-trained biases but introduce uncommon biases.Following a probability weighting scheme of 60.00%, 8.75%, 8.75%, 8.75%, 8.75%, 2.50%, and 2.50%, temperatures between 0.00 and 1.50 with increments of 0.25 were integrated into the API request.Here, 0.00 was the minimum temperature setting, making ChatGPT mostly deterministic, allowing minimal output variability, while temperatures above 1.50 resulted in the chatbot producing such variable output that it no longer adhered to the prompt's guidelines (i.e.outputting a quantifiable score in [1,100] ∩ ℕ).Exact count statistics by model temperature can be retrieved from Table A7 in the appendix.

Principal analyses
I estimated multiple ordinary least squares (OLS) regression models to assess the relationship between the ratings (i.e.interview invitation scores) outputted by ChatGPT and candidate and job features.In these models, the dependent variable indicated a candidate's suitability for a vacancy, expressed by a score (i.e.integer) ranging from 1-100 (Inv).The candidates' ethnic identity (Ethi), at the individual level i, was the main predictor of interest.
I also held ChatGPT's sampling strategy or temperature (Tmpi) constant in all estimated models since it was altered in about two-fifths of the prompts (see Section 2.2.4).A separate analysis of the influence of the temperature on the results is discussed in Section 3.4.The principal model, shown in Equation 1, consisted of the aforementioned predictor variables, operationalised at the applicant level i, with , an intercept,  and , model coefficients, and   , the error term.
In subsequent OLS models, I expanded the predictor scope by including candidate and job characteristics as covariates.Besides ethnic identity, candidate characteristics (CANi) comprised the candidate's gender (Geni).Work-related job characteristics (JOBv) included the occupation (e.g.administrative assistant), job type (e.g.temporary job), work hours (e.g.part-time), and shift system (e.g.night work), defined at the vacancy level v. Furthermore, I entered several job language proficiency variables concerning Dutch, French, and English.
Other job-level control variables included the level of professional experience requested (e.g. at least five years) and the employment location (e.g.Antwerp).Equation 2shows the extended model containing these covariates with , an intercept, , a coefficient,  and Λ, vectors of model coefficients, and   , the error term.
The third set of OLS models incorporated interaction terms between candidate identity and the candidates' signalled gender.Equation 3 depicts the terminal linear model, including sampling strategy and job-related variables with , an intercept,  and , coefficients, Λ, a vector of model coefficients, and   , the error term.
Each successive model in this series of OLS models contained additional controls related to candidate and job characteristics.By progressively adjusting for these factors, the analyses aimed to further isolate the experimental effect of candidate identity on the interview invitation probability, minimising the influence of potential confounding factors in the vacancy and CV texts used.

Heterogeneity analyses
In the heterogeneity analyses, I explored the interactions between candidate characteristics and job-related variables, specifically focusing on how these interactions correlate with interview invitation scores.These analyses aimed to uncover differential impacts of gender and ethnic identity across various job contexts, thereby providing insights into the mechanisms of CV screening by ChatGPT.
First, I examined the interaction between a candidate's ethnic identity (Ethi at the individual level i) and job characteristics (JOBv at the vacancy level v).Equation 4 explicates the model for assessing whether job characteristics, such as occupation or required language skills, amplify or mitigate ethnic bias in scoring candidates.The model includes gender (Geni), the sampling strategy (Tmpi), and interaction terms between the job-related variables and ethnic identity with , an intercept,  and , coefficients, Λ, a vector of model coefficients, and   , the error term.
Second, I investigated the interaction between gender and job characteristics, revealing the conditions under which gender bias in CV screening by ChatGPT is exacerbated or reduced.Analogous to Equation 4, Equation 5 illustrates the interaction model, consisting of ethnic identity, the sampling strategy and interaction terms between job-related variables and gender, again with , an intercept,  and , coefficients, Λ, a vector of model coefficients, and   , the error term.

Statistical corrections
Standard errors of each OLS model were corrected using cluster-robust wild bootstrapping.This technique produces a distribution of estimated parameters, facilitating the calculation of more precise standard errors (Cameron & Miller, 2015;Cameron et al., 2008).It achieves this by generating interim datasets with reformed dependent variables derived from a combination of the original model's fitted values, the residuals, and a random factor.There were three reasons to perform this correction: to control within-cluster error correlation, address heteroskedasticity, and mitigate violations against the residual normality assumption.Clusters were defined at the vacancy level, given the correlation between the assignment of the candidates and the vacancies presented to ChatGPT, similar to the approach in correspondence audit studies with humans (Abadie et al., 2023;Vuolo et al., 2018).The estimates appeared to stabilise around 2,000 bootstrap replications, which suffices in the context of empirical research (Cameron & Miller, 2015).
Furthermore, in the case of multiple family-wise comparisons, I performed ex-post corrections of the p-values in the regression analyses using Holm's (1979) method.This approach entails a stepwise ranking procedure that reduces the likelihood of false positives.
Implementing this procedure was particularly meaningful for models that involved numerous comparisons between distinct categories and their respective reference groups.
Throughout Section 3, where appropriate, I reported Holm-corrected p-values alongside the original estimates to cross-validate the results.Using less stringent correction procedures, such as the Benjamini-Hochberg or Benjamini-Yekutieli methods based on the Simes test outlined in Burn et al. (2022), produced similar results and did not alter their interpretation (Benjamini & Hochberg, 1995;Benjamini & Yekutieli, 2001).

Sensitivity analyses
In a real-world scenario, decision-makers such as HR professionals would likely rank the candidates using a proprietary cutoff score based on ChatGPT's output to select the optimal number of candidates to invite for the job interview.Therefore, I estimated logistic regression models to analyse sensitivity across different thresholds regarding ethnic identity.To this end, I used a penalised maximum likelihood estimator, which reduces the variance for the estimated coefficients (even in large samples) compared to the regular maximum likelihood estimator (Firth, 1993;Rainey & McCaskey, 2021).The results were robust vis-à-vis using a non-penalised estimator.The dependent variable was the probability of receiving an invitation given a predefined cutoff score n, i.e.Pr(Invn = 1).I ran a total of 100 logit models where n took on every integer between 1 and 100 (i.e.every possible interview invitation score produced by ChatGPT).The probability of receiving an invitation at a given threshold was regressed on the same predictor and covariates defined in Equation 1. Equation 6shows this logistic model with , an intercept,  and , coefficients, and   , the error term.
This analytic approach enabled assessing whether potential bias persisted across cutoff scores.In other words, how does the chosen cutoff score impact potential bias from the perspective of the decision-maker?Is there an optimal cutoff score where the bias is minimised?Which range(s) of cutoff scores exhibit(s) increasing or decreasing discrimination?
Finally, I produced discrimination ratios, which capture the relationships between two positive response rates to estimate relative penalties between groups.These ratios were calculated by transforming the log-odds from the logit model specification in Equation 6 to odds ratios (OR) and, in turn, into discrimination ratios (DR).Discrimination ratios are essentially risk ratios (RR) and constitute a standard measure of discrimination in correspondence audit studies in the social sciences (Lippens, Vermeiren, & Baert, 2023).Equation 7shows the OR-to-DR transformation.The ratios were defined relative to the baseline risk (Prbase), corresponding to the probability of a positive response for the reference group at a given cutoff score.Confidence intervals were computed through a Wald z-distribution approximation.

Scoring bias
ChatGPT generally produces high scores when responding to the prompt "How likely is it that you would invite the candidate for an interview?".On a 1-100 scale ranging from very unlikely to very likely, the mean invitation score equals 66.11 (SD = 13.27).Moreover, ChatGPT exhibits a preference for two numerical values.In over a quarter of the cases, ChatGPT scores the candidate 50 (i.e.8,701 occurrences or 25.18%), while in more than two-fifths of the cases, ChatGPT outputs a score of 70 (i.e.14,604 occurrences or 42.26%).Baert, 2023).I further evaluate the sensitivity of using different cutoff scores as an end decision-maker in Section 3.4.

Ethnic identity
< Figure 2 about here > While Dutch and White American applicants could both be regarded as majority group candidates in their respective geographies, candidates with White American names still face a small but significant penalty compared to their Dutch counterparts (W.Am.-Dutch = W.Am = −0.9563).This result suggests that the prompt language used (i.e.Dutch) at least partly affects ChatGPT's scoring bias and that this score difference could rather be interpreted as a language-specific than an ethnic-specific bias.Conversely, White American applicants receive significantly higher scores than Black American applicants, on average (B.Am.-W.Am.= −0.8765,tWelch = −2.88,p = 0.004).Because the prompt language reasonably should have little effect on the latter difference, I interpret it as a mainly ethnic-specific bias.

Gender identity
ChatGPT's outputted interview invitation scores do not vary statistically significantly with the candidate's gender.Models 2 to 7 in Table 1 include coefficient estimates for female versus male candidates.These coefficients are slightly positive but indistinguishable from zero.The statistical insignificance of this finding remains unchanged when including relevant covariates.The observation aligns with average gender discrimination estimates from the field experimental literature on hiring discrimination (Lippens, Vermeiren, & Baert, 2023).
Nevertheless, the question remains whether there are intersectional effects between gender and ethnic identity in determining ChatGPT's output.
A prominent hypothesis in the discrimination literature is the double minority status or double jeopardy hypothesis (Derous et al., 2012(Derous et al., , 2015)).Belonging to an ethnic and gender minority group presumably engenders a double penalty; ethnic females are subject to increased penalties.Nonetheless, recent research has indicated that ethnic minority males often experience more discrimination than females, especially Arab applicants (Arai et al., 2016;Dahl & Krog, 2018;Derous et al., 2015).This discrimination appears partly induced by stereotypes about masculinity (Bursell, 2014;Di Stasio & Larsen, 2020).Starting from the idea that this intersected genderism is inherently present in ChatGPT's training data, I assess whether ethnic discrimination is significantly moderated by gender.
Table 2 shows the interaction effects of ethnic and gender identity on ChatGPT interview invitation scores.While Turkish male applicants receive marginally worse scores than Dutch male candidates (Turkish = −0.8595,SE = 0.3243, p = 0.009, pHolm = 0.138; see Model 1 in B.Am.:Female = −0.9499,SE = 0.4753, p = 0.041, pHolm = 0.660).Evidence for the moderation effect of gender on the ethnic bias for the remaining identities is absent.Holding relevant covariates constant does not impact the statistical significance of these findings (see Model 2 in Table 2).In other words, Turkish females are worse off than Turkish males in the CV screening task, reflecting the double minority status of the former group.Nevertheless, similar to the main effect of ethnic identity, penalties remain relatively small considering the 1-100 scale of the outcome variable.

Discrimination by name
The differences between groups outlined in Section 3.2 hide the substantial dispersion in assigned interview invitation scores between names of the same ethnic and gender identity.
In other words, much larger differences in ChatGPT's score output exist within groups than between groups.Figure 3 visualises each name's interview invitation score distribution and mean by ethnic and gender identity.The visualised name score dispersion seems to vary across ethnic identities but is generally consistent across genders (see Section 3.2.2 for an exception regarding male and female Turkish applicants).
< Figure 3  Specific names thus elicit different discrimination levels, likely reflecting nuances in ChatGPT's pre-trained bias.Individual applicants can be far worse (or better) off compared to their ethnic and gender peers than when comparing 'average candidates' between groups.Nonetheless, this finding relies on randomly allocating a sufficient number of the same names to the ethnic-gender identities across vacancy-CV combinations.In this context, note that despite successful randomisation, the number of iterations may be too small to infer true differences in ratings between some names.Counts and mean interview invitation scores and probabilities of all 812 first and last name combinations used in the experiment can be retrieved from Table A8 in the appendix.These findings are consistent with the mechanism of statistical discrimination, which posits that, due to information asymmetry in the selection process, (i) screeners want to minimise the risk of a costly wrong hire and are therefore more inclined to offer ethnic minorities poorer labour conditions, such as fixed-term contracts, and (ii) screeners more easily rely on group productivity signals such as language skills, which are generally considered worse among ethnic minorities (Lippens et al., 2022;Lippens, Dalle, D'hondt, Verhaeghe, & Baert, 2023;Martínez-Pastor, 2013;Oreopoulos, 2011).Overall, ethnic discrimination in automated screening is partially shaped by the jobs' specific demands and exhibits features similar to statistical hiring discrimination by humans.

Discrimination by job characteristics
In Section 3.These results align with observations in field experiments with human recruiters, where women are discriminated against in (higher-paying) occupations dominated by men and men are discriminated against in (lower-paying) occupations dominated by women (Galos & Coppock, 2023).They also indicate that the gender bias may be contextually linked to societal stereotypes or gendered expectations associated with particular occupations and work arrangements, likewise corresponding to a statistical discrimination mechanism.

Discrimination by sampling strategy
Another series of analyses concerns the correlation between ChatGPT's sampling strategy and its bias.The sampling strategy is included in ChatGPT's temperature parameter, which modulates the degree of randomness or 'creativity' in ChatGPT's output.Low temperatures (i.e. a deterministic sampling strategy) result in more coherent and consistent responses, while higher temperatures (i.e. a stochastic sampling strategy) produce more varied outputs.As the temperature increases, ChatGPT may diverge from its common pre-trained biases but could also introduce and reinforce uncommon biases.Using OLS regressions to estimate the interaction effects between candidate ethnic identity and model temperature, ChatGPT's sampling strategy appears to have no impact on its ethnic bias (see Table A25 in the appendix).In other words, making the sampling process more stochastic (i.e.introducing randomness) does not significantly change ChatGPT's bias.This finding hints that the uncovered bias arises from the pre-training data rather than being particular to the sampling during post-processing.

Discrimination at different cutoffs
Next, I evaluate whether selecting a proprietary cutoff score as an end decision-maker could impact the sign or strength of the ethnic bias identified in Section 3.2.1.In other words, does the decision of an end user, relying on ChatGPT's ratings, to invite a set of candidates who attain a particular minimum score influence the extent to which they would discriminate in the selection process?
Selecting a proprietary cutoff score does not change the sign of the resulting ethnic bias but does change its magnitude.Baert, 2023).Average discrimination ratios in Belgium based on the most recent experimental evidence from correspondence audits approximate 0.73 for Eastern Europeans, 1.00 for Blacks, 0.79 for Arabs, and 0.85 for Turks (Lippens, Dalle, D'hondt, Verhaeghe, & Baert, 2023).Even though ChatGPT discriminates based on ethnic identity, the above observations suggest that it performs better than the average human recruiter (except for the Hispanic subgroup measured in audits worldwide and the Black subgroup measured in Belgian (Dutch) research).However, note that substantial differences in name sets and control groups between the estimates presented in the current study, with its specific experimental setup, and the average estimates from audit research with humans make a formal comparison difficult.

Conclusion
Through a simulated CV screening task based on the correspondence audit approach, I provide evidence that ChatGPT displays systematic bias in its output, showing noticeable preferences for specific ethnic and gender groups when evaluating job candidates.The chatbot was significantly less inclined to advance equally qualified Arabs, Asians, Black Americans, Central Africans, Eastern Europeans, Hispanics, Turks, and White Americans to the interview stage of the selection process than candidates from the Dutch reference group.The minor penalty for White American-named and more substantial penalty for Black American-named candidates suggest that a prompt language-specific bias is at play alongside an ethnic-specific bias.Levels of ethnic identity bias by ChatGPT appeared lower compared with meta-analytic ethnic hiring discrimination estimates from worldwide correspondence audit research involving human recruiters-although it is essential to highlight the differences in name sets and control groups between this experiment's setup and correspondence audits with humans in existing field research.Moreover, female candidates were not rated lower on average than their male counterparts.At the intersection of ethnic and gender identity, however, ChatGPT discriminated more against Turkish females than Turkish males, consistent with the double minority status of the former group.
The heterogeneity analyses further elucidated a job context-dependent bias.ChatGPT's ethnic bias is particularly pronounced in roles offering favourable labour conditions-fulltime contracts and day shift systems-and roles with greater language proficiency requirements.Gender bias, conversely, surfaces in gender-atypical jobs; female candidates are discriminated against in male-dominated occupations, whereas male candidates are discriminated against in female-dominated occupations and part-time vacancies.The LLM's discrimination mechanism appears to align with statistical discrimination, where ChatGPT relies on group characteristics and productivity signals from its pre-training data, reproducing societal stereotypes.Altering the chatbot's sampling strategy did not significantly influence its bias.Finally, I observed substantial heterogeneity in interview scores between names of the same ethnic and gender identity.
The potential risks associated with integrating LLMs in selection decisions hinge highly on who uses the technology, how, and when rather than on the inherent qualities of LLMs.
While the findings from this study indicate that ChatGPT exhibits a level of bias generally falling behind that of human recruiters, I can envisage a scenario where its use leads to an increase in selection bias.For example, if ChatGPT is deployed for automated pre-screening, bias in its output may be amplified by human prejudice in subsequent manual screening stages.This sequence of events can result in an overall more discriminatory selection process.Such a scenario highlights that reliance on LLMs for selection can inadvertently perpetuate or aggravate discriminatory treatment even if the language model itself is less discriminatory than a human screener, resulting in cumulative discrimination.
I see several avenues for future research.First, scholars could explore whether the uncovered bias transposes to different grounds for discrimination, languages, contexts, selection tasks, or language models.For example, discrimination based on age is a persistent problem in personnel selection, which also merits attention in research on AI bias (Lippens, Vermeiren, & Baert, 2023;Stypinska, 2023).AI-based decision-making may also result in discrimination in contexts such as housing and healthcare through algorithm-based awarding of house rentals or treatment plan recommendations in patients, to name just two examples (Basu, 2023;Rosen et al., 2021).Second, to enhance model explainability, researchers could continue investing in making the decision-making processes of large language models more transparent to understand which features contribute to the bias in their output (Arrieta et al., 2020).Third, to reduce bias, scholars could further evaluate fairness-enhancing techniques applied to the training data, the training process, or the postprocessing (Friedler et al., 2019).One example of a bias-reducing technique applied to the prompt is anonymising personal information by removing explicit minority status markers before parsing.This technique has been proven effective in countering human prejudice (Åslund & Skans, 2012;Blommaert & Coenders, 2023;Derous & Ryan, 2019;Lacroux & Martin-Lacroux, 2019).Nonetheless, LLMs could still pick up on implicit markers in applicant profiles (e.g.organisation affiliations signalling ethnic group membership or years of professional experience signalling age), continuing the output of biased responses (Arnold et al., 2021;Kleinberg et al., 2018).
This study underlines the significance of understanding and addressing systemic bias in large language models, especially when deployed in real-world applications such as hiring    Estimates are based on Model 1 in Table 1.

Figure 1
Figure1illustrates the relative frequency distribution of invitation scores (by ethnic identity).Even though the opaque decision-making process of the GPT model (and LLMs more broadly) makes it unclear why these specific values are most common, the observation of reoccurring identical values is not surprising given the recurrent presentation of quasiidentical CVs (only differing by name) as input for the screening task.Using an interview invitation score threshold of  50, right in the middle of the 1-100 scale, ChatGPT would invite the average candidate in 96.88% of the cases-i.e.almost every time.This proportion is remarkably high compared to response rates in correspondence experiments with human recruiters, where the overall invitation probability is closer to 20%(Lippens, Vermeiren, &
and selection.Three parties can help in this endeavour.First, model developers preferably advance efforts to mitigate biases arising from the pre-training data and process.Second, policymakers may create legal frameworks ensuring equitable use of large language models in decision-making directly impacting humans.The AI Act recently agreed upon by the European Parliament and European Council forms a step in the right direction(European   Commission, 2021).The act bounds LLMs by imposing transparency obligations and restrictions on their use in automatic categorisation and selection, which carry actual potential risks, as illustrated in this study.Third, practitioners considering using ChatGPT and the like should proceed cautiously.At a minimum, their usage requires a thorough assessment of the trade-off between increased efficiency and adverse impact.Taken together, the applicability of LLMs in their current form for activities that involve decisionmaking affecting humans is debatable.

Figure 3 .
Figure 3. Average ChatGPT interview invitation scores by name, ethnic and gender identity

Figure 4 .
Figure 4. Average (predicted) ChatGPT interview invitation probabilities by ethnic identity Table 1 contains the estimates of six OLS regression models where ChatGPT's interview invitation scores are regressed on job candidate ethnic and gender identity, among other covariates (see Section 2.3 for the estimation details).Compared to the Dutch reference group, the effects of each of the other ethnic identities on the invitation score are statistically significant (at the 0.1% significance level) and negative.In other words, candidates with White American, Arab, Central African, Hispanic, Turkish, Black American, Asian, and Eastern European names receive significantly lower scores from ChatGPT than Dutch-named candidates.Although minor, average penalties range from approximately −0.96 to −2.42 points on a 1-100 scale.<Table1abouthere> In contrast with worldwide hiring discrimination observed in human recruiters, where candidates of Arab, Middle Eastern or Northern African origin face the highest disadvantage (see Lippens, Vermeiren, & Baert, 2023), Eastern Europeans are penalised the most by ChatGPT compared to the Dutch reference group (E.Eu.= −2.4170,SE = 0.2275, pHolm < 0.001; see Model 1 in Table

Table 2 )
, Turkish females lose approximately 1.78 points net versus Turkish males and 1.17 points net compared to Dutch male applicants (Female = 0.6063, SE = 0.3221, p = 0.057; Turkish:Female = −1.7765,SE = 0.4810, p < 0.001, pHolm = 0.003).The apparent double penalties for Eastern European and Black American females become statistically insignificant after applying Holm's correction (E.Eu.:Female = −1.0459,SE = 0.4767, p = 0.026, pHolm = 0.412; Further heterogeneity analyses reveal distinct patterns of job context-dependent bias inChatGPT's CV screening.For ethnic identity, biases are notably contingent upon work hours, 2.2, I demonstrated that gender discrimination is less pervasive than ethnic discrimination.However, the gender bias appears to manifest selectively across occupations and work-hour requirements.Fictitious female candidates encounter lower interview invitation scores in roles traditionally dominated by men, such as sales representatives (male , the outcome is nearly the same for all applicants, namely that virtually all or no candidates would be invited, respectively.Decision-makers using ChatGPT as an assistant for personnel selection would logically only choose a cutoff in [50,80)-score differentiation only occurs in this range-and, consequently, discriminate.Panel B of Figure 4 zooms in on the results by ethnic identity in the [50,80) range.The results are similar to the findings based on the OLS model estimates: candidates with White American names are penalised the least compared to Dutch applicants.In contrast, Asian-

Table 1 .
OLS regression of ChatGPT interview invitation scores on ethnic and gender identity