Bipol: A Novel Multi-Axes Bias Evaluation Metric with Explainability for NLP

We introduce bipol, a new metric with explainability, for estimating social bias in text data. Harmful bias is prevalent in many online sources of data that are used for training machine learning (ML) models. As a step toward addressing this challenge, we create a novel metric that involves a two-step process: corpus-level evaluation based on model classification and sentence-level evaluation based on (sensitive) term frequency (TF). After creating new models to detect bias along multiple axes using SotA architectures, we evaluate two popular NLP datasets (COPA and SQuAD). As an additional contribution, we created a large dataset (with almost 2 million labelled samples) for training models in bias detection and make it publicly available, along with our code.


Introduction
Bias can be a difficult subject to tackle, especially as there are different opinions as to the scope of its definition (Hammersley and Gomm, 1997; Dhamala et al., 2021). The origin of the word means a slant or slope. In this work, we define social bias as the unbalanced disposition (or prejudice) in favor of or against a thing, person or group, relative to another, in a way that is deemed unfair (Maddox, 2004; Adewumi et al., 2019; Antoniak and Mimno, 2021). This is harmful bias, and it is related to fairness. In some quarters, bias also involves overgeneralization (Brigham, 1971; Rudinger et al., 2018; Nadeem et al., 2021), fulfilling characteristic 2 in the next paragraph.
As a motivation, we address the challenge of estimating bias in text data along some of the many axes (or dimensions) of bias (e.g. race and gender). Social bias in text usually has some of the following characteristics:
1. It is heavily one-sided (Zhao et al., 2018), as will be observed in the results of this work.
2. It uses extreme or inappropriate language (Rudinger et al., 2018). This forms the basis of the assumption (for some of the samples) in the two datasets used to create the new multi-axes bias dataset (MAB), as discussed in Section 3.
3. It is based on unsupported or unsubstantiated claims, such as stereotypes (Brigham, 1971).
4. It is entertainment-based or a form of parody or satire (Eliot, 2002).
ML models pick up these biases from the data they are trained on. Although classification accuracy has been observed to fall with attempts at mitigating biases in data (Pleiss et al., 2017; Oneto et al., 2019; Cho et al., 2020; Speicher et al., 2018), it is important to estimate and mitigate them nonetheless, because of the ethical implications and the harm that may be involved for the disadvantaged group (Klare et al., 2012; Raji et al., 2020).
Our contributions
We introduce a novel multi-axes bias estimation metric called bipol. Compared to other bias metrics, it is not limited in the number of bias axes it can evaluate and has explainability built in. It will provide researchers with deeper insight into how to mitigate bias in data. Our second contribution is the introduction of the new English MAB dataset, a large, labelled dataset that is aggregated from two other sources. A third contribution is the multi-axes bias lexica we collected from public sources. We perform experiments using state-of-the-art (SotA) models to benchmark on the dataset. Furthermore, we use the trained models to evaluate the bias in two common NLP datasets: SQuADv2 (Rajpurkar et al., 2018) and COPA (Roemmele et al., 2011). We make our models, code, dataset, and lexica publicly available.
The rest of this paper is structured as follows: Section 2 describes in detail the characteristics of the new metric. Section 3 gives details of the new MAB dataset. Section 4 explains the experimental setup. Section 5 presents the results and error analyses. Section 6 discusses some previous related work. In Section 7, we give concluding remarks.

Bipol
Bipol, represented by Equation 1a, involves a two-step mechanism: the corpus-level evaluation (Equation 1b) and the sentence-level evaluation (Equation 1c). It is a score between 0.0 (zero or undetected bias) and 1.0 (extreme bias). This is further described below:
1. In step 1, a bias-trained model is used to classify all the samples as biased or unbiased. The ratio of the biased samples (i.e. predicted positives) to the total samples predicted makes up this evaluation. When the true labels are available, this step is represented by Equation 1b. The predicted positives are the sum of the true positives (tp) and false positives (fp). The total samples predicted are the sum of the true positives (tp), false positives (fp), true negatives (tn), and false negatives (fn).
A more accurate version of the equation would have only tp in the numerator; however, since we want results comparable to when bipol is used in the "wild" with any dataset, we choose the stated version in 1b and report the positive error rate. Hence, in an ideal case, an fp of zero is preferred, but there is hardly a perfect classifier. It is also preferable to maximize tp to capture all the biased samples, if possible. False positives exist in similar classification systems (such as hate speech detection, spam detection, etc.), but such systems are still used (Heron, 2009; Markines et al., 2009; Feng et al., 2018; Adewumi et al., 2022b). New classifiers may also be trained for this purpose without using ours, as long as the dataset used is large and representative enough to capture as many axes of bias as possible. Hence, bipol's two-step mechanism may be seen as a framework.
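The corpus-level evaluation described above can be sketched as follows; this is a minimal illustration of the described ratio, not the authors' released implementation:

```python
def corpus_level_score(tp, fp, tn, fn):
    """Corpus-level component of bipol (Equation 1b as described):
    the ratio of samples predicted biased (tp + fp) to all
    samples predicted (tp + fp + tn + fn)."""
    total = tp + fp + tn + fn
    if total == 0:
        return 0.0
    return (tp + fp) / total

# A perfect classifier on a corpus with 30 biased samples out of 100:
print(corpus_level_score(tp=30, fp=0, tn=70, fn=0))  # 0.3
```

With an imperfect classifier, fp inflates this score, which is why the positive error rate is reported alongside it.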
2. In step 2, if a sample is positive for bias, it is evaluated token-wise along all possible bias axes, using all the lexica of sensitive terms. Table 1 provides the lexica sizes. The lexica are adapted from public sources and may be expanded as the need arises, given that bias terms and attitudes are ever evolving (Haemmerlie and Montgomery, 1991; Antoniak and Mimno, 2021). They include terms that may be stereotypically associated with certain groups (Zhao et al., 2017, 2018) and names associated with a specific gender (Nangia et al., 2020).
Examples of racial terms stereotypically associated with the white race (which may be nationality-specific) include charlie (i.e. the oppressor) and bule (i.e. albino in Indonesian), while darkey and bootlip are examples associated with the black race. Additional examples from the lexica are provided in the appendix. Each lexicon is a text file with the following naming convention: axes_type.txt, e.g. race_white.txt. In more detail, step 2 involves finding the absolute difference between the two maximum summed frequencies among the types of an axis (as lower frequencies cancel out), $\left|\sum_{s=1}^{n} a_s - \sum_{s=1}^{m} c_s\right|$. This is divided by the summed frequencies of all the terms in that axis, $\sum_{s=1}^{p} d_s$. This operation is then carried out for all axes and the average obtained, $\frac{1}{q}\sum_{x=1}^{q}$. Then it is carried out for all the biased samples and the average obtained. The use of the two-step process minimizes the possibility of wrongly scoring a span of text solely because it contains sensitive terms. For example, given the (constructed) sentences below, the first one should ideally be classified as biased by a model in the first step, because the sentence assumes a nurse should be female. The second step can then estimate the level of bias in that sentence, based on the lexica. In the second example, a good classifier should not classify the sentence as biased, since the coreference of Veronica and her is established, with the assumption that Veronica identifies as a female name. The second example becomes difficult to classify, even for humans, if Veronica were anonymised, say with a part-of-speech (PoS) tag. In the case of the third example, an advantage of bipol is that even if it is misclassified as biased, the sentence-level evaluation will evaluate to zero, because the difference between the maximum frequencies of the two types (his and her) is 1 − 1 = 0. Bipol does not differentiate explicitly whether the bias is in favour of or against a targeted group.
1. A nurse should wear her mask as a prerequisite.
2. Veronica, a nurse, wears her mask as a prerequisite.
3. A nurse should wear his or her mask as a prerequisite.
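The sentence-level computation described above can be sketched as follows; the lexicon contents and the exact tokenization are assumptions for illustration:

```python
def sentence_level_score(tokens, lexica):
    """Sentence-level component of bipol (a sketch of Equation 1c).
    `lexica` maps each axis to its types, e.g.
    {"gender": {"male": {"he", "his"}, "female": {"she", "her"}}}.
    For each axis: take the difference of the two largest per-type
    summed term frequencies (lower frequencies cancel out), divide by
    the summed frequency of all the axis's terms, then average over
    the axes that matched any term."""
    axis_scores = []
    for axis, types in lexica.items():
        type_counts = [sum(tokens.count(t) for t in terms)
                       for terms in types.values()]
        total = sum(type_counts)
        if total == 0:
            continue  # no sensitive terms from this axis in the sentence
        top = sorted(type_counts, reverse=True)
        diff = top[0] - (top[1] if len(top) > 1 else 0)
        axis_scores.append(diff / total)
    return sum(axis_scores) / len(axis_scores) if axis_scores else 0.0

lexica = {"gender": {"male": {"he", "his"}, "female": {"she", "her"}}}
print(sentence_level_score("a nurse should wear her mask".split(), lexica))
# 1.0: only female terms appear, so the bias is maximal
print(sentence_level_score("a nurse should wear his or her mask".split(), lexica))
# 0.0: his and her cancel out
```

This mirrors the third example above: even if such a sentence were misclassified in step 1, step 2 drives its contribution to zero.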

Strengths of bipol
1. It is relatively simple to calculate.
2. It is based on existing tools (classifiers and lexica), so it is straightforward to implement.
3. It is a two-step process that captures both semantic and term frequency (TF) aspects of text.
4. It is flexible, as it has no limit on the number of axes or term frequencies (TF) that can be included.
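Putting the two steps together, the overall framework can be sketched as below. How Equation 1a combines the two components did not survive extraction here, so multiplying the corpus-level and sentence-level averages is an assumption (it keeps the score in [0, 1]); `classify` and `sentence_score` stand in for any bias-trained model and any step-2 implementation:

```python
def bipol(samples, classify, sentence_score):
    """Sketch of the two-step bipol framework.
    classify: any bias-trained classifier (step 1).
    sentence_score: term-frequency evaluation over the lexica (step 2).
    NOTE: combining the two averages by multiplication is an
    assumption, not necessarily the paper's Equation 1a."""
    biased = [s for s in samples if classify(s) == "biased"]
    if not samples or not biased:
        return 0.0  # zero or undetected bias
    corpus_level = len(biased) / len(samples)
    sentence_level = sum(sentence_score(s) for s in biased) / len(biased)
    return corpus_level * sentence_level

# Toy stand-ins: a keyword "classifier" and a constant step-2 score.
classify = lambda s: "biased" if "should" in s else "unbiased"
samples = ["a nurse should wear her mask", "veronica wears her mask"]
print(bipol(samples, classify, lambda s: 1.0))  # 0.5
```

A real deployment would replace the stand-ins with one of the trained SotA classifiers and the lexica-based evaluation.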

Its explainability makes up for what is not obvious from a single score. For example, the magnitude of the difference between term frequencies in an axis is not immediately obvious, since (1 − 0)/1 = (1,000 − 0)/1,000 = 1: if he has a frequency of 1 while she has 0 in one instance, the score of 1 is the same as when they have frequencies of 1,000 and 0, respectively, in another instance.

Weakness of bipol
1. Although one of its strengths is that it is based on existing tools, this is also a weakness, since the limitations of those tools limit its accuracy.
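The point about magnitudes can be verified directly: the axis score is identical in both instances, and only the raw frequencies kept for explainability disambiguate them.

```python
# Two instances get the same axis score of 1.0 even though their raw
# frequency gaps differ by three orders of magnitude:
for he, she in [(1, 0), (1000, 0)]:
    score = abs(he - she) / (he + she)
    print({"he": he, "she": she}, "->", score)  # both score 1.0
```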

Datasets
The new MAB dataset
This English bias-detection dataset has a total size of 1,946,975 samples, as given in Table 2, making it one of the largest annotated datasets for bias detection, especially when compared to the Bias in Open-Ended Language Generation Dataset (BOLD), with 23,679 samples (Dhamala et al., 2021), or HolisticBias, with 459,758 samples (Smith et al., 2022). The large size of the dataset increases the chances of training a classifier that identifies a broad range of biased cases. It is a combination of the Jigsaw (of 1,902,194 samples) and the Social Bias Inference Corpus v2 (SBICv2) (of 147,139 samples) by Sap et al. (2020). Hence, it has 12 explicit bias axes (from the combination of both). In creating the data, we dropped duplicates, since both datasets draw some content from a similar source. Examples from the MAB are given in Table 3. In creating the MAB, given that the Jigsaw is a multipurpose dataset that assumes bias correlates with toxicity, rows whose target (training set) and toxicity (test set) columns have values greater than or equal to the bias threshold of 0.1 (on a scale from 0 to 1) are automatically annotated as biased, while those below are annotated as unbiased. The rationale for choosing the threshold of 0.1 (instead of, say, 0.5, as done by the authors of the Jigsaw) is based on random inspection of several examples in the dataset and the fact that a little bias (0.1) is still bias. For example, the comment below, which we consider biased, has a target of 0.2. In addition, adopting a threshold higher than 0.1 would result in further imbalance in the dataset in favour of unbiased samples. The SBICv2 dataset follows a similar assumption as the Jigsaw. This assumption is realistic and has been used in previous work in the literature (Nangia et al., 2020). We use the aggregated version of the dataset and the same bias threshold for its offensiveYN column.
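The automatic annotation rule described above reduces to a simple threshold check; the example scores below are illustrative, not drawn from the dataset:

```python
BIAS_THRESHOLD = 0.1  # a little bias (0.1) is still bias

def label(score):
    """Automatic annotation rule: averaged annotator scores (Jigsaw's
    target/toxicity or SBICv2's offensiveYN columns) at or above the
    threshold are labelled biased, those below unbiased."""
    return "biased" if score >= BIAS_THRESHOLD else "unbiased"

print([label(s) for s in (0.0, 0.2, 0.65)])
# ['unbiased', 'biased', 'biased']
```

Lowering the threshold this way trades some label noise for a less imbalanced biased class.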
In the Jigsaw, we retained the old IDs so that we can always trace back useful features to the original data source, but the SBICv2 did not use IDs. The MAB data statement is provided in the appendix (A.3). More details of the two base datasets are given in the following paragraphs.
Jigsaw
The Jigsaw dataset is a multipurpose dataset that came about as a result of annotations by the Civil Comments platform. It has the following axes: gender, sexual orientation, religion, race/ethnicity, disability, and mental illness. It contains 1,804,874 comments in the training set and 97,320 comments in the test set. A small ratio (0.0539) was taken from the training set as part of the validation set for the MAB, because the Jigsaw has no validation set and we wanted a validation set that is representative of the test set in size. The average of the scores given by all the annotators is calculated to get the final values for all the labels. The Jigsaw was annotated by a total of almost 9,000 human raters, with a range of three to ten raters per comment. It is under a CC0 licence in the public domain.

Experiments & Methods
All the experiments were conducted on two shared Nvidia DGX-1 machines running Ubuntu 18 and 20 with 8 × 32GB V100 and 8 × 40GB A100 GPUs, respectively. Each experiment is conducted multiple times and the average results reported. Wandb (Biewald, 2020), the experiment tracking tool, runs for 16 counts with Bayesian optimization to suggest the best hyper-parameter combination for the initial learning rate (1e-3 to 2e-5) and epochs (6 to 10), given the importance of hyper-parameters (Adewumi et al., 2022a). These are then used to train the final models (on the Jigsaw, SBICv2 and MAB), which are then used to evaluate their test sets, the context of the SQuADv2 validation set, and the premise of the COPA training set. Figure 4 in Appendix A.1 shows the wandb exploration for DeBERTa on MAB in parallel coordinates. We use the pretrained base models of RoBERTa (Liu et al., 2019), DeBERTa (He et al., 2021) and Electra (Clark et al., 2020), from the HuggingFace hub (Wolf et al., 2020). Average training time ranges from 41 minutes to 3 days, depending on the data size. Average test set evaluation time ranges from 4.8 minutes to over 72.3 hours.
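A hedged sketch of such a sweep in wandb's sweep-configuration format is shown below; the metric name and the exact parameter keys beyond the two ranges stated above are assumptions:

```python
# 16 runs of Bayesian optimization over initial learning rate (2e-5 to
# 1e-3) and epochs (6 to 10), in the wandb sweep-config schema.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "eval_loss", "goal": "minimize"},  # assumed metric
    "parameters": {
        "learning_rate": {"min": 2e-5, "max": 1e-3},
        "epochs": {"values": [6, 7, 8, 9, 10]},
    },
}

# With the wandb library installed, this would be launched as:
# sweep_id = wandb.sweep(sweep_config, project="bipol")
# wandb.agent(sweep_id, function=train, count=16)
```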

Results and Discussion
Across the results of the three models for the datasets in Table 4, a consistent trend holds with regards to all the metrics. This trend can be observed in the explainability bar graphs (Figures 1, 2 & 3) of the top-10 most frequent terms in the gender axis, as captured in step 2 of bipol. We also observe from the test set results that RoBERTa appears to be the best classifier, except with SBICv2, possibly because of the suggested hyper-parameters. MAB-trained models are better than the Jigsaw-trained ones, though the Jigsaw shows the lowest bipol scores of the three training datasets, with MAB following closely. The bipol scores for SBICv2 show a more than 100% increase over any of the other datasets, suggesting it contains much more bias relative to its size. The two benchmark datasets (COPA and SQuADv2) also contain bias, though little, partly because the sets have very few unique samples. The models with the lowest positive error rates are those trained on the Social Bias Inference Corpus v2 (SBICv2); however, when choosing a suitable model for evaluating other datasets, it is important to prioritize the size and representativeness of the data the model was trained on. This is why we used the MAB-trained models to estimate bias for COPA and SQuADv2. The error rate provides a lower bound of error for other datasets, while the size and representativeness of the training data determine the extent of the model's generalisation.
A snapshot of the explainability dictionary of lists of terms, which produced the chart in Figure 2, is given in Appendix A.2. From the bar charts, we observe that the MAB dataset has a strong male bias. In Figure 1, the top male term ('he') has a frequency of 6,589 while 'she' has only 1,593. This is in line with observations on other datasets (Fuertes-Olivera, 2007) and on OntoNotes 5.0, a resource for training coreference systems, in which entities with gendered pronouns are over 80% male (Zhao et al., 2018). Furthermore, when highly subjective terms like love, old, favorite, and significant, which are associated with the female gender in the lexica, are removed or put in both the male and female lexica, they cancel out and no longer influence bipol.
Qualitative results: Some qualitative examples of perceived correct predictions using the MAB-trained DeBERTa model are given in Table 5. We chose the DeBERTa-MAB model for this analysis because it has the lowest bipol of the three MAB-trained models and therefore presents the most challenging case for finding successful examples. It appears the term cowgirl, in the first example in the table, is perceived as gendered, the same way chairman or spokesman is (Romaine, 2001; Fuertes-Olivera, 2007). Some qualitative examples of correct predictions in SQuADv2 using the MAB-trained DeBERTa model are given in Table 6. The third example (id: 989) in the table touches so many different viewpoints that it is hard to find a justification for the model not to classify it as biased on at least one of them.

Table 6: Examples of correct predictions in SQuADv2 (example | label | csv id).
many locals and tourists frequent the southern california coast for its popular beaches and the desert city of palm springs is popular for its resort feel and nearby open spaces. | unbiased | 96
rugby is also a growing sport in southern california particularly at the high school level with increasing numbers of schools adding rugby as an official school sport. | unbiased | 126
moderate and reformist islamists who accept and work within the democratic process include parties like the tunisian ennahda movement. jamaateislami of pakistan is basically a sociopolitical and democratic vanguard party but has also gained political influence through military coup détat in past. the islamist groups like hezbollah in lebanon and hamas in palestine participate in democratic and political process as well as armed attacks seeking to abolish the state of israel. radical islamist organizations like alqaeda and the egyptian islamic jihad and groups such as the taliban entirely reject democracy often declaring as kuffar those muslims who support it see takfirism as well as calling for violentoffensive jihad or urging and conducting attacks on a religious basis. | biased | 989

Table 7 shows the prediction distribution for the models trained on MAB. Unbiased samples are more easily detected because there are more of them in the training set. One way to improve the performance, and the MAB dataset itself, is to upsample the biased class. This may be done through counterfactual data augmentation (CDA) or sentence completion with generative models. Although bipol is designed to be data-agnostic, it is important to note that estimating bias on out-of-domain (OOD) datasets may result in weaker performance of the metric, because the trained MAB models are based on MAB's 12 explicit bias axes (7 axes from the Jigsaw and 5 additional axes from SBICv2).

Error Analysis
Some qualitative examples of perceived incorrect predictions in COPA using the MAB-trained DeBERTa model are given in Table 8. The second example (id: 71), in particular, is considered incorrect since the definite article "the" is used to identify the subject "terrorist".

Related Work
Previous studies on quantifying bias have used metrics such as the odds ratio or vector word distance (Cryan et al., 2020). The odds ratio measures the likelihood that a word is associated with one particular gender (e.g. woman) rather than another. Meanwhile, vector word distance measures bias by calculating the difference between the average distance of a word to sets of words belonging to different genders (Mikolov et al., 2013; Cryan et al., 2020). Dhamala et al. (2021) use sentiment to evaluate bias in religion. Cryan et al. (2020) compare model classification against a lexicon method for gender bias. Our approach combines the strengths of both approaches. There have been several methods involving lexica, as observed by Antoniak and Mimno (2021), and they are usually constructed through crowd-sourcing, hand-selection, or drawn from prior work. Sengupta et al. (2021) introduced a library for measuring gender bias, based on word co-occurrence statistical methods. Zhao et al. (2018) introduced WinoBias, which is focused on gender bias only, for coreference resolution, similarly to Winogender by Rudinger et al. (2018). By contrast, bipol is designed to be multi-axes and dataset-agnostic, to the extent the trained classifier and lexica allow. Moreover, both Zhao et al. (2018) and Rudinger et al. (2018) focus on the English language and binary gender bias only (with some cases of neutral in Winogender), and both admit their approaches may demonstrate the presence of gender bias in a system but not prove its absence. CrowS-Pairs, by Nangia et al. (2020), is a dataset of 1,508 pairs of more and less stereotypical examples covering stereotypes in 9 axes of bias, which are presented to language models (LMs) to determine their bias. It is similar to StereoSet (for associative contexts), which measures 4 axes of social bias in LMs (Nadeem et al., 2021). Table 9 below compares some of these metrics and bipol.

Conclusion
We introduce bipol and the MAB dataset. We also demonstrate the explainability of bipol. We believe the metric will help researchers estimate bias in datasets in a more robust way in order to address social bias in text. Future work may explore ways of minimising false positives in bias classifiers, address the data imbalance in the MAB training data, and investigate how this work scales to other languages. A library with bipol may be produced to make it easy for users to deploy. Another open issue is a system that can automatically determine whether bias is in favour of or against a group.

Table 9: Comparison of metrics/evaluators (axes | lexicon terms/sentences).
WinoBias (Zhao et al., 2018) | 1 | 40 occupations
Winogender (Rudinger et al., 2018) | 1 | 60 occupations
CrowS-Pairs (Nangia et al., 2020) | 9 | 3,016
StereoSet (Nadeem et al., 2021) | 4 | 321 terms
Bipol (ours) | >2 (13*) | >45,466*

Limitations
The models for estimating the biases in the datasets in step 1 are limited in scope, as they only cover a certain number of axes (12). Therefore, a result of 0 on any dataset does not necessarily indicate a bias-free dataset. The MAB dataset was aggregated from the Jigsaw and SBICv2, which were annotated by humans who may have biases of their own, based on their cultural background or demographics. Hence, the final annotations should not be seen as absolute ground truth of social biases. Furthermore, satisfying multiple fairness criteria at the same time in ML models is known to be difficult (Speicher et al., 2018; Zafar et al., 2017); thus, bipol and these models, though designed to be robust, are not guaranteed to be completely bias-free. Effort was made to mask examples with offensive content in this paper.

References
Lukas Biewald. 2020. Experiment tracking with Weights and Biases. Software available from wandb.com.