Counterfactually Fair Automatic Speech Recognition

Abstract—Widely used automatic speech recognition (ASR) systems have been empirically demonstrated in various studies to be unfair, having higher error rates for some groups of users than others. One way to define fairness in ASR is to require that changing the demographic group affiliation of any individual (e.g., changing their gender, age, education or race) should not change the probability distribution across possible speech-to-text transcriptions. In the paradigm of counterfactual fairness, all variables independent of group affiliation (e.g., the text being read by the speaker) remain unchanged, while variables dependent on group affiliation (e.g., the speaker's voice) are counterfactually modified. Hence, we approach the fairness of ASR by training the ASR to minimize change in its outcome probabilities despite a counterfactual change in the individual's demographic attributes. Starting from the individualized counterfactual equal odds criterion, we provide relaxations to it and compare their performances for connectionist temporal classification (CTC) based end-to-end ASR systems. We perform our experiments on the Corpus of Regional African American Language (CORAAL) and the LibriSpeech dataset to accommodate for differences due to gender, age, education, and race. We show that with counterfactual training, we can reduce average character error rates while achieving a lower performance gap between demographic groups and a lower error standard deviation among individuals.


I. INTRODUCTION
Machine learning techniques have recently been utilized in decision making for critical settings such as financial services and hiring [1]–[3]. Evaluation of these models has shown that optimizing for commonly used measures such as accuracy can sometimes be unfair in the sense that individuals belonging to certain groups based on their gender, age, race, etc. may be disadvantaged, e.g., by having lower accuracy than a majority group. These concerns have led researchers to focus on fairness in machine learning.
There are several mutually exclusive definitions of the word "fair" [2]. Bickel, Hammel and O'Connell [4] studied a dataset demonstrating significantly lower rates of college admission for women than for men. They argued that the data showed no unfairness, because women were applying to different departments than men; when department was entered as a factor in the statistical model, the apparent unfairness vanished. Srivastava et al. [5], however, present data suggesting that definitions of fairness involving more than two interacting variables are irrelevant to public perception: the users of an algorithm rate its fairness based on the demographic parity of its outcomes.
Pearl [6] re-analyzed the data of Bickel et al. using a causal model, in which all of the variables of interest are explicitly represented. The framework of causal reasoning treats problems of fairness as open discussions about the causal links among variables [7]. For example, a job hiring decision might depend on physical height, habitual clothing, or choice of undergraduate major. All of these variables correlate with gender; any of them might be considered to be either a fair or unfair basis for discrimination, depending on the job and the observer. The framework of counterfactual fairness [8] requires the designer of an algorithm to explicitly state which variables are permitted to change the outcome probabilities, and which variables are protected attributes that are not allowed to change the outcomes. In the original proposal [8], every variable that is causally dependent on a protected attribute must then be regressed against the protected attribute in order to infer a latent residual explanatory variable, independent of the protected attribute, on the basis of which decisions may be made. Many recent models in computer vision [9], [10] and natural language processing [11], [12] train statistical models to infer the residual variables implicitly, based on training using counterfactual training data. In a counterfactual training dataset, the protected attributes of individuals are counterfactually modified, and all variables dependent on the protected attribute are subject to possible consequent dependent modifications. A machine learning algorithm forced to ignore such counterfactual modifications is thereby forced to implicitly learn the residual information contained in each such variable after removing its dependence on the protected attribute.
As an important machine learning application, automatic speech recognition (ASR) is also subject to fairness concerns. Various studies have documented performance gaps between male and female speakers [13] as well as between black and white speakers [14]. Because of the power of modern speaker adaptation methods [15]–[17], the unfairness of ASR is usually cast as a problem of unfair training corpora; e.g., [18] describes the under-performance of ASR for female speakers as a natural consequence of the under-representation of women in ASR training corpora. Counterfactual fairness provides a complementary approach that can provide some benefit even when the dataset is biased, by focusing on the fair treatment of every individual. Counterfactual fairness requires that if the demographic group affiliation of any given speaker (e.g., gender, race, age, education) were changed, the accuracy of the ASR should not change.
In this work, we propose counterfactually fair automatic speech recognition. We train the ASR so that it generates equivalent output label distributions for counterfactual speakers whose voices have been resampled with different protected attributes, but who are otherwise identical in every respect that is not causally dependent on the protected attribute. The main contributions of this study are as follows:
• Differences in ASR error rate between groups, and accuracy standard deviation among individuals, are both reduced by the use of counterfactual training data.
• Counterfactual training data is applied in two ways: (1) data augmentation, i.e., training the ASR to accurately transcribe it, and (2) counterfactual equalized odds [19], i.e., forcing the ASR to ignore the difference between factual and counterfactual data. Of these two, data augmentation is empirically demonstrated to reduce the average error rate of the ASR but increase unfairness, while counterfactual equalized odds reduces both the average error rate and the unfairness of the recognizer.
• The method of counterfactual equalized odds is made applicable to connectionist temporal classification (CTC) [20] by the introduction of three different sequence training criteria, of which only one is observed to be empirically successful.
The rest of the article is organized as follows: Section II reviews CTC and fairness in machine learning. Section III derives our proposed counterfactually fair training criteria for ASR. Section IV presents the experimental settings, and Section V presents the results. Section VI discusses the scale and source of the error rate reductions. We conclude with a summary in Section VII.

II. BACKGROUND
In the following sections, we use the following notation: scalar values are denoted by lowercase normal font ($x$), vectors by lowercase bold letters ($\mathbf{x}$), sequences of vectors (or matrices) by lowercase bold letters with a bar ($\bar{\mathbf{x}}$), and sets by calligraphic symbols ($\mathcal{X}$). For random variables we use a similar notation, with lowercase letters replaced by capitals ($X$, $\mathbf{X}$, $\bar{\mathbf{X}}$, respectively).

A. End-to-end ASR and CTC Loss
In recent years, the availability of very large speech corpora and increased computation power have led speech researchers to investigate end-to-end approaches for ASR. An end-to-end ASR system is viewed as a transducer that maps an acoustic vector sequence $\bar{x} = [x_1, \ldots, x_T]$, $x_t \in \mathcal{X}^m$, where $m$ is the dimension of the vectors, to a sequence of characters $\bar{y} = [y_1, \ldots, y_S]$, $y_s \in \mathcal{Y}$, where we assume that $S \le T$. Although there are several paradigms for end-to-end ASR, including CTC-based models [21], RNN transducers [22], purely attention-based transducers [23], and joint models that combine CTC with encoder-decoder models [24], we focus on CTC-based models in this study, which we review next.
Neural network training using the CTC loss [20], originally proposed for sequence-to-sequence labeling tasks, has become one of the major approaches for end-to-end ASR systems [21], [25]. Since $S \le T$, CTC decomposes the probability $P(\bar{Y}|\bar{X})$ into a sum over alignment paths between input and output. An alignment path is a length-$T$ sequence $\bar{\pi} = [\pi_1, \ldots, \pi_T]$, $\pi_t \in \mathcal{Y} \cup \{-\}$, where the blank character ($-$) is added to the label inventory in order to assist in the definition of a surjective mapping $l$ from any alignment to its corresponding label sequence. Suppose that a neural network generates per-frame softmax outputs $P(\Pi_t = \pi_t|\bar{X} = \bar{x})$ for the input $\bar{x}$; then

$$P(\bar{Y} = \bar{y}|\bar{X} = \bar{x}) = \sum_{\bar{\pi} : l(\bar{\pi}) = \bar{y}} \prod_{t=1}^{T} P(\Pi_t = \pi_t|\bar{X} = \bar{x}). \tag{1}$$

The network that generates the softmax outputs is trained by minimizing the negative log-likelihood, denoted by $\mathcal{L}_{\mathrm{CTC}}$:

$$\mathcal{L}_{\mathrm{CTC}}(\bar{x}, \bar{y}) = -\log P(\bar{Y} = \bar{y}|\bar{X} = \bar{x}). \tag{2}$$

As shown in [20], these probabilities can be computed efficiently using a forward-backward algorithm, in which all transitions satisfying the constraint $l(\bar{\pi}) = \bar{y}$ have equal transition probability, and in which the observation probability is the softmax output of the neural network. The algorithm first augments the original label sequence with the blank symbol ($-$) to produce a new sequence $\bar{y}' = [-, y_1, -, y_2, -, \ldots, -, y_S, -] = [y'_1, \ldots, y'_{2S+1}]$ of length $2S+1$. Using capital letters to denote random variables and small letters to denote their instance values, CTC defines forward and backward probabilities as $\alpha_t(z) = P(Z_t = z, Y'_1 = y'_1, \ldots, Y'_z = y'_z|\bar{X} = \bar{x})$ and $\beta_t(z) = P(Y'_{z+1} = y'_{z+1}, \ldots, Y'_{2S+1} = y'_{2S+1}|Z_t = z, \bar{X} = \bar{x})$, where $z \in \{1, \ldots, 2S+1\}$ is an index into the sequence $\bar{y}'$. The total probability of the observed sequence $\bar{y}$ can then be written for any $t$ as

$$P(\bar{Y} = \bar{y}|\bar{X} = \bar{x}) = \sum_{z=1}^{|\bar{y}'|} \alpha_t(z)\beta_t(z), \tag{3}$$

or specifically for the last time index $T$ as

$$P(\bar{Y} = \bar{y}|\bar{X} = \bar{x}) = \alpha_T(|\bar{y}'|) + \alpha_T(|\bar{y}'| - 1), \tag{4}$$

where $|\bar{y}'| = 2S+1$ denotes the length of $\bar{y}'$.
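As a concrete illustration of the recursion described above, the following NumPy sketch computes the total CTC probability of a label sequence with the forward pass over the blank-augmented label string. The function and variable names are ours, and the blank is assumed to be softmax index 0; this is a didactic sketch without the log-space scaling a real implementation would need.

```python
import numpy as np

BLANK = 0  # assumed index of the CTC blank symbol in the softmax output

def ctc_forward(probs, labels):
    """Total CTC probability P(Y = y | X = x) via the forward recursion.

    probs:  (T, K) per-frame softmax outputs P(Pi_t = k | X = x).
    labels: label index sequence y = [y_1, ..., y_S] (no blanks).
    """
    # Augment the label sequence with blanks: [-, y_1, -, y_2, ..., -, y_S, -]
    aug = [BLANK]
    for y in labels:
        aug += [y, BLANK]
    L = len(aug)            # 2S + 1
    T = probs.shape[0]

    alpha = np.zeros((T, L))
    alpha[0, 0] = probs[0, aug[0]]   # start in the initial blank ...
    alpha[0, 1] = probs[0, aug[1]]   # ... or in the first label
    for t in range(1, T):
        for z in range(L):
            a = alpha[t - 1, z]                       # stay
            if z > 0:
                a += alpha[t - 1, z - 1]              # advance by one
            # Skip transition allowed unless the target is a blank or a repeat
            if z > 1 and aug[z] != BLANK and aug[z] != aug[z - 2]:
                a += alpha[t - 1, z - 2]
            alpha[t, z] = a * probs[t, aug[z]]
    # P(y | x): paths must end in the last label or the final blank
    return alpha[T - 1, L - 1] + alpha[T - 1, L - 2]
```

For short sequences the result can be verified against brute-force enumeration of all alignments that collapse to the label sequence.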

B. Fairness in Machine Learning
It has been observed that machine learning models are liable to make unfair decisions [26] for various reasons, including data bias, missing data from certain groups, and algorithmic bias. To evaluate unfairness in machine learning, several fairness criteria have been defined [27]. Since it is impossible to simultaneously satisfy all possible definitions of fairness [2], recent papers emphasize the need for algorithm designers to specify a causal graph relating the variables of interest, in a manner that permits open discussion of the presumed causal dependencies among them [28].
According to [29], equalized odds is defined as the condition in which a predictor $\hat{Y} \in \mathcal{Y}$ of an outcome $Y \in \mathcal{Y}$ is conditionally independent, given $Y = y$, of the protected attribute $A \in \mathcal{A}$:

$$P(\hat{Y} = \hat{y}|Y = y, A = a) = P(\hat{Y} = \hat{y}|Y = y) \quad \forall \hat{y}, y \in \mathcal{Y},\ a \in \mathcal{A}. \tag{5}$$

Causal analyses of fairness require the algorithm designer to specify a protected attribute, $A$, with respect to which the outcome $\hat{Y}$ should be fair. The observed variables, $X$, are then divided into descendants ($X_d$) and non-descendants ($X_{nd}$) of $A$. Each descendant variable may then be hypothesized to be the deterministic result of interaction between the protected attribute and a latent variable, $U$, which is independent of $A$.
In [8], the notion of counterfactual fairness is introduced. A counterfactually fair algorithm is defined as an algorithm whose outcome probability distribution is unchanged even if the value of the protected attribute is changed. A test for counterfactual fairness involves three steps: abduction, action, and prediction [30]. The abduction step computes the distribution $P(U|X = x, A = a)$ of the latent variables. The action step then modifies the protected attribute, $A \leftarrow \tilde{a}$. Finally, the prediction step computes the distribution of the observed descendants and the outcome variable, averaged over the abducted distribution of $U$; we write this weighted average outcome distribution as

$$P(\hat{Y}_{A \leftarrow \tilde{a}} = \hat{y}|X = x, A = a) = \sum_{u} P(\hat{Y} = \hat{y}|X = x_{A \leftarrow \tilde{a}}(u), A = \tilde{a})\, P(U = u|X = x, A = a). \tag{6}$$

The criterion of counterfactual fairness requires that the outcome distribution should be unchanged by the operations of intervention and prediction, i.e.,

$$P(\hat{Y}_{A \leftarrow \tilde{a}} = \hat{y}|X = x, A = a) = P(\hat{Y} = \hat{y}|X = x, A = a). \tag{7}$$

Many recent papers use deep generative models, such as variational auto-encoders (VAEs) or generative adversarial networks (GANs) [31], [32], to perform the abduction and action steps in a counterfactual fairness training paradigm. A VAE infers $U$ from the variational posterior distribution, $P(U = u(x, a)|A = a, X = x)$, then infers counterfactual observations $x_{A \leftarrow \tilde{a}}$ from the generative distribution $P(X|U = u(x, a), A = \tilde{a})$, resulting in a one-term approximation of Eq. (6). A GAN infers $U$ in order to maximize the joint distribution $P(U = u(x, a), A = a, X = x)$, then infers $x_{A \leftarrow \tilde{a}}$ from the generative distribution $P(X|U = u(x, a), A = \tilde{a})$.
In the case of ASR, we can map the variables of the counterfactual framework as follows: $A$ represents the protected attribute of the speaker (gender, age, education level, etc.). The outcome represents the output unit, which could be a character or a phoneme; in our experiments, we used character models, so our outcomes are characters. The predictor is the estimate of these outcomes, i.e., the character whose probability distribution is given by the softmax output layer. The latent variable, $U$, is the set of all information about the speech signal that is independent of a speaker's protected attribute. Since we do not know this information, we estimate it implicitly. A voice conversion system modifies the speech signal to have a different value of the protected attribute (e.g., male↔female, old↔young, little education↔much education, etc.). The ASR never explicitly calculates the latent variable, but it assumes that the original and modified speech signals have the same value of the latent variable; Eq. (7) is therefore enforced by requiring that the original and modified speech signals should result in the same ASR outcomes.

C. Counterfactual Equalized Odds
In [19], counterfactual fairness is combined with equalized odds in order to introduce counterfactual equalized odds, for all $x \in \mathcal{X}$, $y \in \mathcal{Y}$ and $a \in \mathcal{A}$:

$$P(\hat{Y}_{A \leftarrow \tilde{a}} = y|X = x, Y_{A \leftarrow \tilde{a}} = y, A = a) = P(\hat{Y} = y|X = x, Y = y, A = a). \tag{8}$$

This equation implies that the counterfactual and factual data are required to have matching outcomes only if their resampled ground truth labels match. Enforcing Eq. (8) is simplified, in [19], by assuming that $Y$ is binary. If $Y$ is binary, then equating the factual and counterfactual posteriors is performed by simply equating the probability of $\hat{Y} = 1$ (see Appendix A for details), e.g., by minimizing the squared logit difference $\Delta\sigma^{-1}$ between factual and counterfactual tokens:

$$\Delta\sigma^{-1} = \sigma^{-1}(\phi(x_{A \leftarrow \tilde{a}}, \tilde{a})) - \sigma^{-1}(\phi(x, a)), \tag{9}$$

where $\phi(x, a) = P(\hat{Y} = 1|X = x, A = a)$ is the classifier output given observation $x$ and protected attribute $a$, and $\sigma^{-1}(\phi) = \ln(\phi/(1 - \phi))$ is the logit function. Generated counterfactual training examples are also used to augment the training dataset by the use of a counterfactual data augmentation criterion:

$$\mathcal{L}_{\mathrm{CF}} = J(\phi(x_{A \leftarrow \tilde{a}}, \tilde{a}), \mathbb{1}(Y_{A \leftarrow \tilde{a}} = 1)), \tag{10}$$

where $J$ denotes binary cross-entropy, and $\mathbb{1}$ is the indicator function. The logit-pairing and counterfactual cross-entropy training criteria are balanced with the target cross-entropy in a multi-task training framework with task weights $\lambda_{\mathrm{CF}}$ and $\lambda_{\mathrm{CLM}}$; thus the overall training metric is:

$$\mathcal{L} = J(\phi(x, a), \mathbb{1}(Y = 1)) + \lambda_{\mathrm{CF}}\, \mathcal{L}_{\mathrm{CF}} + \lambda_{\mathrm{CLM}}\, (\Delta\sigma^{-1})^2. \tag{11}$$

III. COUNTERFACTUAL TRAINING FOR ASR

This paper proposes counterfactually fair ASR. Counterfactual fairness is enforced using a data augmentation scheme based on [19], but extended to sequence data. Extension to sequence data requires us to make assumptions about independence or conditional independence between the protected attribute ($A$), the label sequence ($\bar{Y}$), the time-aligned label sequence ($\bar{\Pi}$), and the alignment sequence ($\bar{Z}$). Fig. 1 shows four different causal graphs, representing four possible sets of assumptions. The first graph (Fig. 1a) shows the assumptions made by the counterfactual feature generation step described in Section III-A: the speech spectrogram ($\bar{X}$) is dependent on the talker's protected attribute ($A$) and on an independent latent variable ($\bar{U}$). The remaining three graphs (Figs. 1b, 1c, and 1d) show the additional variables that are necessary to define the three counterfactually fair ASR algorithms proposed in Sections III-B, III-C, and III-D. In all three models, the latent variable depends on the text of the utterance ($\bar{Y}$). The models differ in the relative importance of the time-aligned label characters ($\bar{\Pi}$) or time-aligned label indices ($\bar{Z}$), and in the assumed dependence or independence between the label sequence and the protected attribute.
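For the binary case of [19], the combination of target cross-entropy, counterfactual cross-entropy, and squared logit pairing can be sketched in PyTorch as follows. The function name, the convention that the model emits logits directly, and the default weights are our own assumptions for illustration, not details from [19].

```python
import torch
import torch.nn.functional as F

def counterfactual_loss(model, x, x_cf, y, lam_cf=1.0, lam_pair=1.0):
    """Multi-task binary counterfactual-fairness loss (our sketch).

    x: factual inputs; x_cf: counterfactual inputs; y: binary targets.
    The model is assumed to return logits sigma^{-1}(phi), so the squared
    logit difference is simply (logit(x_cf) - logit(x))^2.
    """
    logit_f = model(x)
    logit_cf = model(x_cf)
    target_bce = F.binary_cross_entropy_with_logits(logit_f, y)
    cf_bce = F.binary_cross_entropy_with_logits(logit_cf, y)  # data augmentation term
    pairing = ((logit_cf - logit_f) ** 2).mean()              # logit pairing term
    return target_bce + lam_cf * cf_bce + lam_pair * pairing
```

When the counterfactual input equals the factual input, the pairing term vanishes and the loss reduces to twice the target cross-entropy (with the default weights), which provides a quick sanity check.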

A. Counterfactual Feature Generation
In order to train an end-to-end ASR model using counterfactual speech data, we need the counterfactual counterparts of each utterance in our dataset. Such data only exist in a hypothetical world. Hence, we need to generate utterances as if the speaker were of a different gender, age, race or education, while keeping the spoken content the same. In our experiments, we use an adversarial auto-encoder model to generate counterfactual observations $\bar{x}_{A \leftarrow \tilde{a}}$ from the distribution $P(\bar{X}|A = \tilde{a}, \bar{U} = \bar{u}(\bar{x}, a))$, where $\bar{u}(\bar{x}, a) = \arg\max_{\bar{u}} P(\bar{U} = \bar{u}, \bar{X} = \bar{x}, A = a)$. The schematic graphical model for counterfactual data generation is given in Fig. 1a. Details of the auto-encoder architecture are provided in Section IV.

B. Counterfactual CTC Matching
In [19], the label variable, $Y$, was binary; therefore the label distributions resulting from factual and counterfactual data could be matched by simply matching their logits. The closest analogue in ASR, perhaps, is the CTC loss itself, the metric $\mathcal{L}_{\mathrm{CTC}}$ shown in Eq. (2). Whereas the logit of a binary classifier is $\ln P(\hat{Y} = 1) - \ln P(\hat{Y} = 0)$, the CTC loss is the negative log probability of the known correct answer, $\mathcal{L}_{\mathrm{CTC}} = -\ln P(\hat{Y} = \bar{y})$. Using reasoning similar to Eq. (8), we provide a relaxed version of counterfactual equalized odds which we call counterfactual CTC matching for ASR:

$$\mathcal{L}_{\mathrm{CTC}}(\bar{x}, \bar{y}) = \mathcal{L}_{\mathrm{CTC}}(\bar{x}_{A \leftarrow \tilde{a}}, \bar{y}). \tag{12}$$

Eq. (12) can be interpreted as only requiring similarity between the probabilities of correct outcomes (predicted outcome matches ground truth) given factual and counterfactual individuals. The left-hand side of Eq. (12) is the original CTC loss, $\mathcal{L}_{\mathrm{CTC}}(\bar{x}, \bar{y})$. The right-hand side is the CTC loss $\mathcal{L}_{\mathrm{CTC}}(\bar{x}_{A \leftarrow \tilde{a}}, \bar{y})$ computed using a counterfactual speech signal generated from the distribution $P(\bar{X}|A = \tilde{a}, \bar{U} = \bar{u}(\bar{x}, a))$. Eq. (12) therefore instantiates the causal graph shown in Fig. 1b: the transcription should depend only on the information that is shared in common between $\bar{x}$ and $\bar{x}_{A \leftarrow \tilde{a}}$, which is the latent variable $\bar{U} = \bar{u}(\bar{x}, a)$ that the voice conversion system used to compute $\bar{x}_{A \leftarrow \tilde{a}}$ from $\bar{x}$. If, in Eq. (11), we replace $\mathcal{L}_{\mathrm{CF}}$ by the CTC loss of counterfactual data, and if we replace the logit pairing term with the difference of the CTC losses between the factual individual, $\bar{x}$, and the counterfactual individual, $\bar{x}_{A \leftarrow \tilde{a}}$, we arrive at the loss function for counterfactual CTC matching:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CTC}}(\bar{x}, \bar{y}) + \lambda_{\mathrm{CF}}\, \mathcal{L}_{\mathrm{CTC}}(\bar{x}_{A \leftarrow \tilde{a}}, \bar{y}) + \lambda_{\mathrm{CCM}} \left( \mathcal{L}_{\mathrm{CTC}}(\bar{x}_{A \leftarrow \tilde{a}}, \bar{y}) - \mathcal{L}_{\mathrm{CTC}}(\bar{x}, \bar{y}) \right)^2, \tag{13}$$

where $\lambda_{\mathrm{CF}}$ and $\lambda_{\mathrm{CCM}}$ are hyper-parameters denoting the importance of counterfactual data augmentation and counterfactual CTC loss matching, respectively.
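The counterfactual CTC matching objective can be sketched with `torch.nn.functional.ctc_loss` applied to both the factual and counterfactual utterances. The function name, the assumption that the model already produces (T, N, K) log-softmax outputs, the blank index 0, and the squared-difference form of the matching term are our own illustrative choices.

```python
import torch
import torch.nn.functional as F

def cf_ctc_matching_loss(log_probs_f, log_probs_cf, targets, in_lens,
                         tgt_lens, lam_cf=0.0, lam_ccm=1.0):
    """Counterfactual CTC matching (our sketch).

    log_probs_f / log_probs_cf: (T, N, K) per-frame log-softmax outputs for
    the factual and counterfactual utterances; targets: (N, S) label indices.
    The lam_cf term is counterfactual data augmentation; the lam_ccm term
    penalizes the squared difference between the two CTC losses.
    """
    ctc_f = F.ctc_loss(log_probs_f, targets, in_lens, tgt_lens, blank=0)
    ctc_cf = F.ctc_loss(log_probs_cf, targets, in_lens, tgt_lens, blank=0)
    return ctc_f + lam_cf * ctc_cf + lam_ccm * (ctc_cf - ctc_f) ** 2
```

With identical factual and counterfactual inputs the matching term is zero, so the loss reduces to the plain CTC loss when the augmentation weight is zero.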

C. Counterfactual Log Probability Matching
Intuitively, the latent variable, $\bar{U}$, should contain some information about the time alignment of labels to the speech spectrogram. It is possible to force such an alignment by representing the alignment variable, $\bar{\Pi} = [\Pi_1, \ldots, \Pi_T]$, explicitly in the causal graph, as shown in Fig. 1c. This graph depicts the assumption that the label sequence $\bar{Y}$, alignment sequence $\bar{\Pi}$, and latent variable $\bar{U}$ should all be independent of $A$. The assumption can be enforced by requiring the ASR to learn a latent representation such that $P(\bar{\Pi}|\bar{U}, A) = P(\bar{\Pi}|\bar{U})$, i.e., $\bar{\Pi}$ is conditionally independent of $A$ given $\bar{U}$.
Define $\phi_{kt}(\bar{x}) = P(\Pi_t = k|\bar{X} = \bar{x})$, the $k$th output of the softmax layer at time $t$. The requirement that $\bar{\Pi}$ is conditionally independent of $A$ given $\bar{U}$ can be enforced by requiring that $\Delta\sigma^{-1}_{kt} = 0$ for all $k$ and $t$, where

$$\Delta\sigma^{-1}_{kt} = \sigma^{-1}(\phi_{kt}(\bar{x}_{A \leftarrow \tilde{a}})) - \sigma^{-1}(\phi_{kt}(\bar{x})). \tag{14}$$

Extending Eq. (11) to include all logits, at all frames, we obtain:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CTC}}(\bar{x}, \bar{y}) + \lambda_{\mathrm{CF}}\, \mathcal{L}_{\mathrm{CTC}}(\bar{x}_{A \leftarrow \tilde{a}}, \bar{y}) + \lambda_{\mathrm{CLM}} \sum_{k} \sum_{t} \left( \Delta\sigma^{-1}_{kt} \right)^2. \tag{15}$$

The neural network in ASR has a multi-class output layer, hence it is convenient to use the log probability rather than the binary logit function: replacing $\sigma^{-1}(\phi_{kt}(\bar{x}))$ by $\log \phi_{kt}(\bar{x})$ yields

$$\mathcal{L} = \mathcal{L}_{\mathrm{CTC}}(\bar{x}, \bar{y}) + \lambda_{\mathrm{CF}}\, \mathcal{L}_{\mathrm{CTC}}(\bar{x}_{A \leftarrow \tilde{a}}, \bar{y}) + \lambda_{\mathrm{CLM}} \sum_{k} \sum_{t} \left( \log \phi_{kt}(\bar{x}_{A \leftarrow \tilde{a}}) - \log \phi_{kt}(\bar{x}) \right)^2. \tag{16}$$

Logits and log-probability give the same result in the limit of zero training corpus error (Appendix A).
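The per-frame log-probability pairing described above can be sketched as follows; as before, the function name, the (T, N, K) log-softmax layout, blank index 0, and the reduction over frames and batch are our own illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cf_logprob_matching_loss(log_probs_f, log_probs_cf, targets, in_lens,
                             tgt_lens, lam_cf=0.0, lam_clm=1.0):
    """Counterfactual log probability matching (our sketch).

    The pairing term penalizes the squared difference of per-frame log
    probabilities over all output characters, summed over the character
    dimension and averaged over frames and batch.
    """
    ctc_f = F.ctc_loss(log_probs_f, targets, in_lens, tgt_lens, blank=0)
    ctc_cf = F.ctc_loss(log_probs_cf, targets, in_lens, tgt_lens, blank=0)
    pairing = ((log_probs_cf - log_probs_f) ** 2).sum(-1).mean()
    return ctc_f + lam_cf * ctc_cf + lam_clm * pairing
```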

D. Counterfactual Log Posterior Matching
In [19], the label, $Y$, is assumed to depend on the protected attribute. A comparable assumption in the case of ASR is shown in Fig. 1d. Here, the variable $\bar{Z} = [Z_1, \ldots, Z_T]$ is a sequence of time-aligned label indices, describing the time alignment of the augmented label string $\bar{y}' = [-, y_1, -, y_2, -, \ldots, -, y_S, -]$, where $Z_t = z$ means that the $z$th character in the string is aligned with spectrogram frame $x_t$. Notice that $\Pi_t$ is a deterministic function of $Z_t$, but not vice versa: for example, $Z_t = 1$ means that $\Pi_t = -$, but knowledge that $\Pi_t = -$ is insufficient to determine $Z_t$. The assumption shown in Fig. 1d is not compatible with independence between $\bar{Z}$ and $A$, because there are two causal chains connecting $\bar{Z}$ to $A$. Instead, as in [19], counterfactual fairness is achieved by breaking both causal chains, i.e., by finding a latent variable $\bar{U}$ such that $P(\bar{Z}|\bar{U}, \bar{Y}, A) = P(\bar{Z}|\bar{U}, \bar{Y})$. As in the previous two sections, the latent variable is unknown; all we know is that the factual speech signal, $\bar{x}$, and the counterfactual speech signal, $\bar{x}_{A \leftarrow \tilde{a}}$, shared the same latent variable. We therefore require that the conditional distribution $P(\bar{Z}|\bar{X}, \bar{Y})$ should be unchanged if $\bar{x}$ is counterfactually modified to $\bar{x}_{A \leftarrow \tilde{a}}$, i.e.,

$$P(Z_t = k|\bar{X} = \bar{x}_{A \leftarrow \tilde{a}}, \bar{Y} = \bar{y}) = P(Z_t = k|\bar{X} = \bar{x}, \bar{Y} = \bar{y}). \tag{17}$$

In other words, similar to Eq. (8), the goal is to match the posterior probability of the outcome labels ($\hat{Y}$ in Eq. (8), $\bar{Z}$ in Eq. (17)) after observing the target outcome $\bar{Y}$. With the above assumption, this corresponds to the probability of characters at the softmax layer after observing the ground truth sequence. In the standard formulation of CTC, this probability is written as $\gamma_{kt}(\bar{x}, \bar{y}) = P(Z_t = k|\bar{X} = \bar{x}, \bar{Y} = \bar{y})$, which is given in terms of the forward variables $\alpha_t(z)$ and backward variables $\beta_t(z)$ of the CTC loss computation as

$$\gamma_{kt}(\bar{x}, \bar{y}) = \frac{\alpha_t(k)\beta_t(k)}{\sum_{z=1}^{|\bar{y}'|} \alpha_t(z)\beta_t(z)}. \tag{18}$$

The counterfactual fairness criterion shown in Eq. (17) can therefore be enforced in a multi-task framework by minimizing

$$\mathcal{L} = \mathcal{L}_{\mathrm{CTC}}(\bar{x}, \bar{y}) + \lambda_{\mathrm{CF}}\, \mathcal{L}_{\mathrm{CTC}}(\bar{x}_{A \leftarrow \tilde{a}}, \bar{y}) + \lambda_{\mathrm{CPM}} \sum_{k} \sum_{t} \left( \Delta \log \gamma_{kt} \right)^2, \tag{19}$$

where $\Delta \log \gamma_{kt} = \log \gamma_{kt}(\bar{x}_{A \leftarrow \tilde{a}}, \bar{y}) - \log \gamma_{kt}(\bar{x}, \bar{y})$.
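The posterior γ can be obtained from the standard forward-backward recursions. The following NumPy sketch is our own simplified implementation (blank index assumed 0, no log-space scaling, so only suitable for short toy sequences); the matching term then pairs the log of these posteriors for factual and counterfactual utterances.

```python
import numpy as np

BLANK = 0  # assumed index of the CTC blank symbol

def ctc_posteriors(probs, labels):
    """gamma_t(z) = P(Z_t = z | X = x, Y = y): posterior over augmented label
    positions z at each frame t, via the CTC forward-backward recursions.

    probs:  (T, K) per-frame softmax outputs; labels: label indices (no blanks).
    Returns a (T, 2S+1) matrix whose rows sum to one.
    """
    aug = [BLANK]
    for y in labels:
        aug += [y, BLANK]
    L, T = len(aug), probs.shape[0]

    def allowed(z_from, z_to):
        # stay or advance by one is always allowed; skipping a blank is
        # allowed only when the target label is neither blank nor a repeat
        if z_to - z_from in (0, 1):
            return True
        return (z_to - z_from == 2 and aug[z_to] != BLANK
                and aug[z_to] != aug[z_from])

    alpha = np.zeros((T, L))
    beta = np.zeros((T, L))
    alpha[0, 0] = probs[0, aug[0]]
    alpha[0, 1] = probs[0, aug[1]]
    beta[T - 1, L - 1] = beta[T - 1, L - 2] = 1.0
    for t in range(1, T):
        for z in range(L):
            alpha[t, z] = probs[t, aug[z]] * sum(
                alpha[t - 1, zp]
                for zp in range(max(0, z - 2), z + 1) if allowed(zp, z))
    for t in range(T - 2, -1, -1):
        for z in range(L):
            beta[t, z] = sum(
                beta[t + 1, zn] * probs[t + 1, aug[zn]]
                for zn in range(z, min(L, z + 3)) if allowed(z, zn))
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```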

IV. EXPERIMENTAL SETTINGS
We performed our experiments on the Corpus of Regional African American Language (CORAAL) [33]. The dataset is split into train, development, and test sets based on speakers. All speakers in the dataset are alphabetically sorted; the utterances belonging to the first 64 male and 64 female speakers are used for training. From the remaining set, 8 male and 8 female speakers are used in the development set, and the remaining 14 male and 6 female speakers' utterances are used for testing. Table I summarizes the total duration per gender in hours.
The baseline system is a DeepSpeech2 model [34] trained on the CORAAL dataset with CTC loss. Input features are log magnitude spectrograms extracted from 20 ms windows with a 10 ms skip; a Hamming window is used for shaping the time-domain data. The network outputs are English alphabet characters along with the blank, apostrophe and end-of-sentence tokens. The baseline DeepSpeech2 model has 2 convolutional layers, each with batch normalization and tanh activation. The convolution kernel sizes are 41 × 11 and 21 × 11, respectively. These layers are followed by 5 batch-normalized bidirectional LSTM layers with 768 cells, whose output is fed into a fully connected layer. The baseline model is trained for 30 epochs with Adam optimization, batch size 16 and a learning rate of 0.001. All models are implemented using PyTorch [35] and each one ran on a single Nvidia Tesla V100 GPU.
In order to generate the counterfactual inputs for training, we use an LSTM-based adversarial auto-encoder, as shown in Fig. 2. This model takes the input audio features (x̄) and encodes them in hidden (latent) vectors denoted by Ū. The adversary is trained to compute an estimated protected attribute (Â) based on Ū, with adversary weights trained to minimize the binary cross-entropy J(Â, A); the gradient of the binary cross-entropy is back-propagated through a gradient reversal layer [36] to remove information about A from the latent vector Ū. The auto-encoder appends, to Ū, a dense vector µ_a representing the protected attribute A = a, then passes these concatenated vectors through an LSTM-based decoder which generates a minimum-MSE estimate (X̂) of the input feature sequence. In our implementation, the vector µ_a is computed by taking the average spectrum over all frames of all training utterances belonging to the attribute group A = a. The Â layer, on the other hand, is a one-hot encoding of the attribute group, computed once per utterance by passing the time-averaged LSTM activations of the encoder to the adversary. In the gender case, there are 2 output nodes in the Â layer: one for male and one for female speakers. Each vector in the input feature sequence x̄ is 161-dimensional, therefore the attribute embedding vectors µ_a are also 161-dimensional. The encoder consists of two LSTM layers, each with 128 cells. The two LSTMs in the decoder have 256 cells, and the fully-connected layer generating X̂ has 161 output units. The adversarial network that generates Â consists of a linear layer of size 64, followed by a ReLU nonlinearity and another linear layer whose number of nodes depends on the experiment (2 for gender, 10 for age, 43 for education level). This network is trained with batch size 16 and learning rate 0.001, using Adam optimization for 50 epochs on the CORAAL training set. During training, the matched target X̂ = x̄ and the true attribute A = a are used.
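The gradient reversal layer [36] can be sketched in PyTorch as a custom autograd function: the forward pass is the identity, while the backward pass negates (and optionally scales) the gradient flowing back to the encoder. The scaling parameter `lam` is our own addition for generality.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity in the forward pass, negated and scaled
    gradient in the backward pass, so the encoder is trained to *remove*
    the information the adversary needs."""

    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad_out):
        # Negate the gradient w.r.t. x; lam is a constant (no gradient).
        return -ctx.lam * grad_out, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)
```

In a training loop, the encoder output would be passed through `grad_reverse` before the adversary, so that minimizing the adversary's cross-entropy maximally confuses it about the protected attribute.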
Once this model is learned, counterfactual examples are generated by computing X̂ from the abducted latent variable Ū = ū and the counterfactual embedding µ_ã.
The proposed ASR models are then trained on both factual and counterfactual data. Each such model is a DeepSpeech2 model similar to the baseline, but trained from scratch with the proposed objectives instead of CTC alone. In order to compute the counterfactual loss shown in Eq. (10), we pair each factual utterance x̄ with one counterfactual utterance x̄_{A←ã}, where the counterfactual attribute ã is chosen uniformly at random from the set {ã : ã ≠ a}. The delta terms (∆L_CTC, ∆σ⁻¹, ∆ log γ) are computed between the reconstructed original features and the counterfactually reconstructed features.
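The uniform sampling of the counterfactual attribute can be sketched as follows; the helper name is ours.

```python
import random

def sample_counterfactual_attribute(a, attributes, rng=random):
    """Draw the counterfactual attribute uniformly at random from the set
    {a~ in attributes : a~ != a}."""
    candidates = [x for x in attributes if x != a]
    return rng.choice(candidates)
```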
We compared the performances of the ASR systems based on the overall character error rate (CER) on the test data, the CER difference between male and female speakers, and the standard deviation of CER across all test speakers. We tested the significance of the CER differences between models using NIST's SCLITE toolkit with the MAPSSWE method [37], and report models with significant change at a p-value of 0.001 where applicable.
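Given per-speaker CERs, the three evaluation quantities above can be sketched in a few lines; the function name, dictionary layout, and 'M'/'F' group codes are illustrative assumptions.

```python
import numpy as np

def fairness_metrics(cer_by_speaker, group_by_speaker):
    """Average CER, male-female CER gap, and inter-speaker standard
    deviation of CER (our sketch of the paper's evaluation quantities).

    cer_by_speaker:   dict speaker_id -> CER.
    group_by_speaker: dict speaker_id -> 'M' or 'F'.
    """
    cers = np.array(list(cer_by_speaker.values()))
    male = [c for s, c in cer_by_speaker.items() if group_by_speaker[s] == 'M']
    female = [c for s, c in cer_by_speaker.items() if group_by_speaker[s] == 'F']
    return {
        'avg_cer': cers.mean(),
        'mf_gap': np.mean(male) - np.mean(female),   # positive: males worse
        'stdev': cers.std(),                         # inter-speaker spread
    }
```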

V. RESULTS
In the first set of experiments, our aim is to determine whether the middle term in Eq. (16), i.e., the CTC loss due to the counterfactual input (L_CF), is crucial. The counterfactual log probability matching model (Eq. (16)) is trained under two conditions, λ_CF ∈ {0, 1}, each while sweeping the log probability matching weight (λ_CLM) from 0.01 to 500. Fig. 3 shows the CERs, and Fig. 4 shows the CER difference between males and females (M-F) and the standard deviation of CER across test speakers (Stdev). As we observe from Fig. 4, including the loss term due to the counterfactual input (denoted as λ_CF = 1 in the legend) leads to a larger gap between gender groups and a larger standard deviation, even though it yields lower CER in Fig. 3 for λ_CLM ≥ 100. Since we are after fair outcomes, we set λ_CF = 0 in the subsequent experiments. Interestingly, irrespective of λ_CF, for most values of λ_CLM that we tested, the gap and the standard deviation were higher than those of the baseline model, which was trained only with CTC loss on the original input features. A reduced male-female gap was only observed when we increased λ_CLM above 100. When we increase λ_CLM further (to 500 or 1000), fairness continues to improve, but the average CER becomes worse than the baseline. As we prefer to improve fairness without harming the average CER, we choose λ_CLM = 200, a setting for which we achieve fairer outcomes than the baseline without hurting the average CER.
In the second set of experiments, our goal is to compare the unadapted baseline, counterfactual log probability matching (CF-LogProb), counterfactual CTC loss matching (CF-CTC) and the log character posterior matching (CF-LogPost) models. As mentioned above, here we set λ CF = 0 and only sweep the λ corresponding to the last term in Eqs. (16), (13) and (19) (respectively λ CLM , λ CCM , or λ CPM ). Figure 5 compares the average CER of the four models as a function of λ. Figure 6 shows the CER gap between male and female and the interspeaker standard deviation of CER.
In terms of the overall CER (Fig. 5), only the log probability matching approach achieves significantly lower CER than the baseline for most values of λ. The log posterior matching approach performs similarly to the baseline for small values of λ, but when λ reaches 100, it performs significantly worse than the baseline. The CTC loss matching approach, on the other hand, results in a large increase in CER as we increase the weight of the counterfactual fairness term λ.
In Fig. 6, we compare the CER gap between males and females as well as the standard deviation of CER over all test speakers. Although the CTC matching approach has a lower gap (fairer) than the other two approaches, it has much higher overall CER (lower accuracy), as shown in Fig. 5. The log probability and log posterior matching approaches have similar performance, which is only better than the baseline when λ ≥ 100.
In Fig. 6, we also compare the standard deviations of CER from the different models. As expected, the curves slope downward: as we increase the fairness weight λ, we achieve fairer outcomes, which tends to correlate with lower inter-speaker standard deviation. When λ < 1, all three approaches have higher standard deviation than the baseline, and the log posterior matching approach has the smallest deviation. When λ = 100, the CTC and log posterior matching approaches perform better than the baseline, but at the expense of higher CER, as shown in Fig. 5. Although the log probability matching method has a lower male-female gap in Fig. 6 at λ = 100, its standard deviation of CER is not lower than the baseline. Still, as discussed above, λ = 200 is the optimal point for the LogProb setting.
In the experiments described above, the protected attribute was always gender. However, the CORAAL dataset comes with speaker metadata, including age and education groups, which can also be considered protected attributes. In the sequel, we investigate the cases where the protected attribute is the age or education group rather than gender. In these experiments, we still use the auto-encoder model described in Fig. 2, except that the number of possible attributes changes depending on the experiment. For example, in the age group experiments, we have 10 classes, as there are 5 age groups for each of the two genders. Having 10 classes instead of 5 allows us to keep the gender attribute the same while generating the counterfactual data from a different age group. As we can see from Table II, when λ_CLM = 200, we operate at an equal or lower CER level than the baseline while reducing the inter-speaker standard deviation of the CER and the gap between male and female speakers. Although the ASR was trained using age as the protected attribute, the CER difference among age groups is the only measure of fairness that did not improve: the inter-speaker standard deviation improved, and the male-female gap improved, but inter-age-group differences did not. We speculate that this is a data sparsity issue, caused by the small number of speakers in each age group in the test set (4-5 male and 0-4 female speakers per age group).
Next, we investigate the case where the protected attribute is the education level of the speaker. According to the results shown in the lower part of Table II, when λ CLM = 200, we operate at a lower CER than the baseline while also having a lower inter-speaker standard deviation. For comparison, we again include the male/female and age group statistics in each case. As we can see, when λ CLM = 200, we are able to reduce the gender gap as well as the CERs per age group. Since there are around ten education categories, those statistics are not provided in the table; however, examining the data, we also observe some decrease in CER for each education category. We provide further discussion in the next section.
It has been shown that speech data augmentation by text-to-speech models helps reduce error rates (e.g., [38]). Hence, it might be argued that counterfactual training should be performed by simply generating counterfactual training data and then training the ASR on the augmented dataset; in our notation, this corresponds to setting λ CF = 1 without any counterfactual matching (λ CLM = 0). Table III compares two experimental conditions: counterfactual data augmentation (λ CF = 1, λ CLM = 0) and counterfactual regularization (λ CF = 0, λ CLM = 200). As shown, counterfactual augmentation improves the average CER but harms fairness (it increases both the inter-speaker standard deviation and the male-female gap). This is because counterfactual data augmentation lowers the CER for both the advantaged and disadvantaged groups, but it provides a greater benefit for the advantaged group. For example, on the CORAAL dataset, as shown in Table III, female performance is in general better than that of male speakers. Counterfactual data augmentation improves CER more for female than for male speakers, hence it increases the male-female gap; it also increases the inter-speaker standard deviation. Counterfactual regularization, by contrast, improves all three metrics. Fairness (standard deviation and the male-female gap) is improved relative to both the baseline system and the counterfactual augmentation strategy. Average CER is improved relative to the baseline but does not beat the average CER achieved by the counterfactual augmentation strategy. While gaining from a fairness perspective, we lose some of the advantage that pure data augmentation would have provided in this case. The experiments described above are on the CORAAL dataset. In the next experiment, we test our method on a standard American English dataset, namely the 100-hour subset of LibriSpeech [39].
This dataset contains only the speakers' gender information; hence, we test the performance only when the protected attribute is gender. The experimental procedure is similar to that for the CORAAL dataset, except that the generator is now trained on the LibriSpeech train subset. Table IV shows the CER performance on LibriSpeech. Since LibriSpeech is a larger dataset with read speech, we generally operate at a lower CER level than on CORAAL. Optimal values of λ CLM were found to be slightly higher for LibriSpeech than for CORAAL; when λ CLM = 300, we observe a reduction in both the overall CER and the inter-speaker standard deviation. Furthermore, we reduce the gender gap from 1.7% to 1.4%.
Note that in LibriSpeech [39], males have a lower CER than females (8.9% vs. 10.6% in Table IV), whereas the opposite is true in CORAAL (in Table III, male speakers have 44% CER, while female speakers have 32.6%). This result has been previously reported: several studies have found lower error rates for male than female speakers in standard dialects [18], including standard American English [13], whereas the CORAAL dataset has been shown to exhibit the opposite trend [14]. Two explanations are possible. First, male and female dialects of African American Language (AAL) have been reported to differ [40]; it is possible that women's AAL poses less difficulty for ASR. Second, although the CORAAL training set has equal numbers of male and female speakers, it contains more female speech, and the greater quantity of training data may reduce CER. We next present results on the combined LibriSpeech and CORAAL datasets, where the protected attribute is the dataset, which, among the variables available to us, is the variable most probably correlated with race. LibriSpeech contains speech in a variety of dialects, but the modal dialect is Standard American English; the race of the speakers is not annotated. Data from CORAAL are assumed to exemplify AAL. Table V shows the CERs of these experiments. The first row shows the results from the individual baselines, i.e., decoding each dataset with its own model. Since the LibriSpeech model performs better than the CORAAL model, the second row shows the case where we decode the CORAAL dataset using the LibriSpeech model. Then, we perform counterfactual training for various λ CLM . As we can see, without any counterfactual adaptation, there is a very large performance gap between Standard American English and African American Language. When we train a model on combined CORAAL and LibriSpeech data (λ CLM = 0), we already see an improvement in terms of fairness, i.e., a lower standard deviation across speakers and a lower CER difference between groups.
If we also apply counterfactual training (λ CLM > 0), we further reduce the standard deviation and the CER difference while slightly improving the overall CER.
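The two training regimes contrasted in these experiments can be written as one objective with weights λ CF and λ CLM. A minimal numerical sketch follows, with stand-in values for the CTC losses and a mean-squared distance as the log-probability matching penalty (the exact matching loss used in the paper may differ):

```python
# Sketch of the combined objective: a CTC term on the factual input, an
# optional CTC term on the counterfactual input (data augmentation,
# weight lambda_cf), and an optional penalty on the distance between the
# factual and counterfactual log-probability outputs (regularization,
# weight lambda_clm). The MSE distance here is an illustrative stand-in.
import numpy as np

def total_loss(logp_fact, logp_cf, ctc_fact, ctc_cf,
               lambda_cf=0.0, lambda_clm=0.0):
    """logp_*: (T, K) per-frame log-softmax outputs; ctc_*: scalar losses."""
    match = np.mean((logp_fact - logp_cf) ** 2)
    return ctc_fact + lambda_cf * ctc_cf + lambda_clm * match

# Stand-in outputs for T=4 frames over K=3 symbols:
rng = np.random.default_rng(0)
logp_fact = np.log(rng.dirichlet(np.ones(3), size=4))
logp_cf = np.log(rng.dirichlet(np.ones(3), size=4))

aug = total_loss(logp_fact, logp_cf, 1.2, 1.1, lambda_cf=1.0)     # augmentation
reg = total_loss(logp_fact, logp_cf, 1.2, 1.1, lambda_clm=200.0)  # regularization
```

Under augmentation the counterfactual input contributes its own recognition loss; under regularization it only contributes through the mismatch between the two output distributions.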

VI. DISCUSSION
Systems are trained with an individual fairness objective, but results are reported using group disparity (overall male CER vs. female CER). This is mainly because counterfactuals do not really exist. In order to compensate for the lack of factual comparisons between matched individuals, we included the standard deviation across all speakers as a proxy for individual differences. A full discussion of whether there is a trade-off between individual and group fairness is out of the scope of this study; we refer to [41] for a relevant discussion. One way to investigate the improvement in individual differences is to look at the CER differences between a speaker and their counterfactual realizations. Next, we visualize these results on the CORAAL dataset.

Fig. 8: Total individual CER differences between real and counterfactual age groups on CORAAL
In Figs. 7-9, we show the total CER difference between the real and counterfactual settings of each protected attribute, where the protected attributes are gender, age, and education, respectively. In each case, the left subfigure shows the absolute CER gap of the baseline system, and the right subfigure shows that of the log probability matching system. The colors in the figures are shaded such that identical colors in the left and right subfigures correspond to the same absolute CER gap. In all three figures, we see that the model obtained from counterfactual training has a CER gap among categories that is lower by an order of magnitude.
The counterfactual feature generation step in this paper is a type of voice conversion system, but since it is not trained for the task of optimal voice conversion, it does not generate counterfactual speech that would fool a human listener: the differences between the factual and counterfactual spectrograms (x and x_{A←ã}) are actually quite small. It is therefore necessary to ask whether the counterfactually generated features are really counterfactual in any meaningful sense: we could pose the null hypothesis that x_{A←ã} is simply a version of x perturbed by random noise. It is well known that noise-augmented training can improve ASR (e.g., [42]), but the effect of noise augmentation on fairness has not been previously reported, so we tested it. Table VI summarizes the results of counterfactual log probability matching with λ CLM ∈ {200, 300}, in which the counterfactual speech has been replaced by a spectrogram with additive Gaussian noise. The noise level is set to σ_n = 10^{-3} ≈ ‖x_{A←ã} − x‖_2, i.e., approximately the RMS difference between the factual and counterfactual spectrograms in the experiments of Table III. When we compare Table VI and Table III, we see that random perturbation and counterfactually generated speech provide similar gains in overall CER (the former is better at λ CLM = 200, the latter at λ CLM = 300), but counterfactual training performs better in terms of fairness (both Stdev and the M-F difference).
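This noise-perturbation control can be sketched as follows, under the assumption that the noise scale is matched to the RMS factual-counterfactual difference as described above; the array shapes and values are illustrative.

```python
# Sketch of the control experiment in Table VI: replace the counterfactual
# spectrogram with the factual one plus Gaussian noise whose scale matches
# the RMS factual-counterfactual difference (~1e-3). Shapes are illustrative.
import numpy as np

def noise_control(x_fact, x_cf, rng):
    """Noisy copy of x_fact with the same RMS perturbation as x_cf has."""
    sigma = np.sqrt(np.mean((x_cf - x_fact) ** 2))  # RMS difference
    return x_fact + rng.normal(0.0, sigma, size=x_fact.shape)

rng = np.random.default_rng(0)
x = rng.random((80, 100))                        # toy 80-bin spectrogram
x_cf = x + 1e-3 * rng.standard_normal(x.shape)   # stand-in counterfactual
x_noise = noise_control(x, x_cf, rng)
```

Training with x_noise in place of x_cf isolates whether the fairness gains come from the counterfactual structure or merely from perturbation of the input.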
VII. CONCLUSION

In this work, we investigated the ASR training problem from an individualized counterfactual fairness point of view. We propose that, for any given individual, if this person were of the opposite gender but spoke the same words in a similar style, a fair ASR should estimate the same characters as its output. We formulated this as an additional loss term added to the CTC loss of the original input. We compared three approaches: matching the log probability outputs of the ASR model, matching the CTC loss, and matching the log-posterior of the characters given the ground-truth sequence.
In the experiments on CORAAL, we showed that there is generally a trade-off between the CER and the fairness of the system. CTC matching and log-posterior matching achieved fairness at the expense of significant increases in average CER, but log probability matching successfully improved both the fairness and the average CER of the recognizer. We verified this observation with gender, age, and education level as protected attributes. The same method was also demonstrated to reduce the female-male CER gap on LibriSpeech, and to reduce the differences between the error rates of speech samples from the LibriSpeech and CORAAL corpora.
The loss function that involves log-probability matching assumes only that we have (log-)softmax outputs from the network and is independent of the specific model architecture. Hence, as future work, we would like to investigate the use of counterfactual log probability matching for other neural architectures, for both voice conversion and ASR. In particular, the effect of improved voice conversion on the final ASR performance should be investigated in future work.

mitigating of bias in AI model and training data). Opinions and findings are those of the authors, and are not endorsed by IITP.

APPENDIX A
LOGITS VERSUS LOG-PROBABILITIES
The logit function is defined to be

logit(p) = log ( p / (1 − p) ).

For a binary classifier, the logit of the positive class is

logit(P(Ŷ = 1|X, A)) = log ( P(Ŷ = 1|X, A) / P(Ŷ = 0|X, A) ),    (21)

so equating logits between real and counterfactual inputs would mean matching the log-probability terms log P(Ŷ |X, A): the numerator and denominator of the fraction in Eq. (21) must add up to 1, so fixing the value of Eq. (21) is equivalent to fixing both the numerator and the denominator. Now, if we consider ASR, or specifically the character recognition problem, we have a multi-class classifier (K > 2), and the logit terms become

logit_k = log ( P(Ŷ = k|X, A) / (1 − P(Ŷ = k|X, A)) ),  k ∈ {1, . . . , K}.

If we achieve a perfect match of the logits for all k ∈ {1, . . . , K}, then this again implies equality of the log-probabilities resulting from the real and counterfactual inputs. However, during training, the difference between real and counterfactual outputs will not be 0 for all k. Since the goal is to match probabilities, and since the log-softmax is easier to compute than the logits in the multi-class case, we use the log-softmax outputs (log P(Ŷ = k|X, A)) rather than the logits in our experiments.
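The equivalence argued above can be checked numerically: recovering each class probability from its one-vs-rest logit via the sigmoid returns exactly the softmax probabilities, so matching either representation pins down the other. A small sketch:

```python
# Numerical check of the appendix argument: the one-vs-rest logits
# log(p_k / (1 - p_k)) and the log-softmax outputs log p_k carry the same
# information, since the sigmoid inverts the logit for each class.
import numpy as np

def log_softmax(z):
    z = z - z.max()                       # stabilized log-softmax
    return z - np.log(np.exp(z).sum())

def ovr_logits(logp):
    """One-vs-rest logit log(p_k / (1 - p_k)) for each class k."""
    p = np.exp(logp)
    return np.log(p / (1.0 - p))

z = np.array([1.0, -0.5, 0.2])            # arbitrary network outputs
logp = log_softmax(z)
p_back = 1.0 / (1.0 + np.exp(-ovr_logits(logp)))  # sigmoid of each logit
```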