Towards Mitigating Gender Bias in a Decoder-Based Neural Machine Translation Model by Adding Contextual Information

Gender bias negatively impacts many natural language processing applications, including machine translation (MT). The motivation behind this work is to study whether recently proposed MT techniques contribute significantly to attenuating biases on document-level and gender-balanced data. For the study, we consider the approaches of adding the previous sentence and the speaker information, implemented in a decoder-based neural MT system. We show improvements both in translation quality (+1 BLEU point) and in gender bias mitigation on WinoMT (+5% accuracy).

This bias has been demonstrated in Neural Machine Translation (NMT), where translations tend to ignore context and render professions with their stereotypical genders (Font and Costa-jussà, 2019; Stanovsky et al., 2019). This occurs because NMT systems generally work on a sentence-by-sentence basis. Several approaches have been proposed to output different gendered translations (Kiritchenko and Mohammad, 2018), to add gender information during training (Vanmassenhove et al., 2018), and to use debiased word embeddings (Font and Costa-jussà, 2019). Other approaches have focused on measuring gender bias in translation systems (Prates et al., 2020; Stanovsky et al., 2019). Finally, prior work presented a non-synthetic gender-balanced data set, which can be used to evaluate NMT.
The main contribution of this work is applying existing contextual NMT methodologies, both the context of the previous sentence (Junczys-Dowmunt, 2019) and speaker identification (Vanmassenhove et al., 2018), in a prominent and competitive NMT architecture (Fonollosa et al., 2019). These approaches are explicitly tested for the purpose of mitigating gender bias while improving translation quality. The architecture in our experiments uses only the decoder part of the popular Transformer (Vaswani et al., 2017; He et al., 2018; Fonollosa et al., 2019), thus reducing the number of training parameters and simplifying the model.

Methodology: adding context and speaker id in a decoder-based NMT model
This study uses the following two recently proposed methodologies to improve the accuracy of NMT. While these methodologies are not new, we add them on top of a different baseline (Fonollosa et al., 2019) and test them specifically on gender-balanced data sets. We describe the baseline system and the techniques below; examples are shown in Table 1.
Neural Machine Translation with joint source-target self-attention. The current state of the art is the encoder-decoder architecture of the Transformer (Vaswani et al., 2017), which avoids recurrence completely and relies on stacked self-attention and fully connected layers in the encoder and decoder. An alternative is the simplified architecture of Fonollosa et al. (2019). Instead of separate encoder and decoder blocks, this model uses only the decoder block and adopts the idea of language modeling for the translation task. Joint source-target representations are learnt in the early layers. Positional embeddings are applied to the source and the target independently, and language embeddings mark the language of the source and the target separately. Unlike the self-attention in standard Transformers, the authors propose a locally constrained attention that attends only to a token's locality, forming a reduced receptive field.
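The joint source-target input and the reduced receptive field described above can be sketched as follows. This is a minimal illustration in plain NumPy; the window size and the toy EN-ES token sequences are assumptions for illustration, not the authors' actual configuration.

```python
import numpy as np

def local_attention_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where position i may attend only to positions j
    with |i - j| <= window, i.e. a reduced receptive field."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

# Joint source-target input: source and target tokens form one sequence.
src = ["I", "have", "only", "done", "this", "once", "before", "."]
tgt = ["Solo", "he", "hecho", "esto", "una", "vez", "antes", "."]
tokens = src + tgt

# Language embeddings mark source vs. target (0 = EN, 1 = ES), and
# positional embeddings restart for source and target independently.
lang_ids = [0] * len(src) + [1] * len(tgt)
positions = list(range(len(src))) + list(range(len(tgt)))

mask = local_attention_mask(len(tokens), window=3)
```

Each token's attention is then restricted to the `True` entries of its row in `mask`, instead of the full sequence as in a standard Transformer.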
Adding the previous context sentence (PreSent): concatenating two sentences with a separator token. This method adopts the idea of increasing context (Junczys-Dowmunt, 2019).
Incorporating the speaker gender identification (SpeakerId): incorporating the gender of the speaker in NMT by adding a gender tag before each sentence (Vanmassenhove et al., 2018). This approach is especially helpful when translating from a less inflected language into a more inflected one, e.g., from English to Spanish.

Table 1: Examples
Baseline:    I have only done this once before.
+PreSent:    I have only done this once before. <sep> This is not a joke.
+SpeakerId:  MALE I have only done this once before.
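A minimal sketch of the two preprocessing steps, following the Table 1 examples. The helper names `add_prev_sentence` and `add_speaker_tag` are hypothetical, not taken from the original implementation.

```python
SEP = "<sep>"

def add_prev_sentence(sentence: str, prev: str = "") -> str:
    """PreSent: concatenate the previous sentence after a separator token."""
    return f"{sentence} {SEP} {prev}" if prev else sentence

def add_speaker_tag(sentence: str, gender: str) -> str:
    """SpeakerId: prepend the speaker's gender tag (e.g. MALE / FEMALE)."""
    return f"{gender.upper()} {sentence}"

s = "I have only done this once before."
print(add_prev_sentence(s, "This is not a joke."))
# I have only done this once before. <sep> This is not a joke.
print(add_speaker_tag(s, "male"))
# MALE I have only done this once before.
```

Both transformations operate on raw training pairs before tokenization, so the model sees the separator and gender tags as ordinary vocabulary items.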

Experimental Framework, Results and Discussion
Data and Parameters: Spanish is a highly gendered morphological language compared to English, associating gender with professions and adjectives. For this reason, the English-Spanish (EN-ES) language pair has been used from the data proposed in Vanmassenhove et al. (2018).

BLEU results (Table 2). These results were obtained on the Europarl test set and on GeBioCorpus. Adding the previous sentence has a higher impact on Europarl (+1.09) than on GeBioCorpus (+0.21) because documents in GeBioCorpus are not coherent (all sentences belong to the same document, but some sentences may not be contiguous). Adding the gender tag shows exactly the same effect on GeBioCorpus as on Europarl (+0.17), even though the speaker identification is not of the same nature in the two corpora: in Europarl it comes from the speaker, whereas in GeBioCorpus it comes from the biography's main character.
Other advantages (Table 4 and Table 5). Evaluating on WinoMT (Table 3 and Figure 1). This step is carried out by translating the WinoMT dataset and evaluating the translations with the system of Stanovsky et al. (2019), which extracts the gender of entities in the translated sentences. These entities are evaluated against the gold annotations provided by the original English dataset. The evaluation covers three aspects: the whole WinoMT dataset, and subsets of pro-stereotypical and anti-stereotypical sentences. An example of an anti-stereotypical sentence is "The developer argued with the designer because she did not like the design.", where the developer is a female entity. An example of a pro-stereotypical sentence is "The CEO helped the nurse because he wanted to help.", where the CEO is a male entity. As shown in Figure 1, the systems perform better on the pro-stereotypical portion of WinoMT than on the anti-stereotypical one.

The accuracy (Acc), shown in Table 3 and Figure 1, indicates that the PreSent methodology performs best among the approaches in this paper (baseline or SpeakerId). PreSent detects gender more accurately than the others, whether pro-stereotyped or anti-stereotyped, and its accuracy reaches 61% with a 12.2% difference between the F1-scores of males and females. This improves over the best results presented in the original paper (Stanovsky et al., 2019), where the best accuracy is 59.4% with a 15.4% difference between the F1-scores of males and females. It is important to note that WinoMT contains neither document-level information nor speaker identification, so translation with our methodologies is done without this information. Adding the information of the previous sentence therefore makes the system more robust: performance does not suffer even when inference is done without this context.
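The accuracy and the male/female F1 gap reported above can be computed from predicted and gold entity genders roughly as follows. This is a simplified sketch: the official WinoMT evaluation of Stanovsky et al. (2019) also performs alignment and gender extraction from the translations, which is omitted here.

```python
def f1(preds, golds, label):
    """Per-class F1 for one gender label."""
    tp = sum(p == g == label for p, g in zip(preds, golds))
    fp = sum(p == label != g for p, g in zip(preds, golds))
    fn = sum(g == label != p for p, g in zip(preds, golds))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def winomt_metrics(pred_genders, gold_genders):
    """Accuracy over all entities and the |F1(male) - F1(female)| gap."""
    acc = sum(p == g for p, g in zip(pred_genders, gold_genders)) / len(gold_genders)
    gap = abs(f1(pred_genders, gold_genders, "male")
              - f1(pred_genders, gold_genders, "female"))
    return acc, gap

# Toy example: four entities, one misgendered female entity.
gold = ["male", "female", "male", "female"]
pred = ["male", "male", "male", "female"]
acc, gap = winomt_metrics(pred, gold)  # acc = 0.75
```

A large gap indicates the system is much better at one gender than the other, which is exactly the asymmetry the PreSent method narrows from 15.4% to 12.2%.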
