A Brief Survey and Comparative Study of Recent Development of Pronoun Coreference Resolution in English

Pronoun Coreference Resolution (PCR) is the task of resolving pronominal expressions to all mentions they refer to. Compared with the general coreference resolution task, the main challenge of PCR lies in coreference relation prediction rather than mention detection. As an important natural language understanding (NLU) component, pronoun resolution is crucial for many downstream tasks yet still challenging for existing models, which motivates us to survey existing approaches and consider how to do better. In this survey, we first introduce representative datasets and models for the ordinary pronoun coreference resolution task. Then we focus on recent progress on hard pronoun coreference resolution problems (e.g., the Winograd Schema Challenge) to analyze how well current models understand commonsense. We conduct extensive experiments to show that even though current models achieve good performance on the standard evaluation set, they are still not ready for real applications (e.g., all SOTA models struggle to correctly resolve pronouns to infrequent objects). All experiment code will be made available upon acceptance.


Introduction
The question of how human beings resolve pronouns has long been of interest to both the linguistics and natural language processing (NLP) communities, because a pronoun by itself carries only weak semantic meaning and thus poses challenges for natural language understanding. To explore solutions to that question, pronoun coreference resolution (PCR) [2] was proposed. As a challenging yet vital natural language understanding task, pronoun coreference resolution is to find the correct reference for a given pronominal anaphor in context, and it has been shown to be crucial for a series of downstream tasks, such as machine translation [6], summarization [7], and dialog systems [8].
To investigate the difference between PCR and the general coreference resolution task, which tries to identify not only the coreference relations between noun phrases (NP) and pronouns (P) but also potential coreference relations between noun phrases or between pronouns, we conduct experiments with one recent breakthrough model (i.e., the end-to-end model [9]) on the CoNLL-2012 shared task [10] under two settings: one without gold mentions and one with gold mentions. In the 'without gold mention' setting, models are required to first identify spans from the documents as mentions and then predict the coreference relations among these mentions. As a comparison, if gold mentions are provided, models only need to predict the coreference relations. From the results in Table 1 we can see that, without gold mentions, the model performs well on P-P coreference relations but not as well on the other two kinds of relations. However, if gold mentions are provided, the model can achieve very good performance on NP-NP coreference relations. Compared with other kinds of coreference relations, no matter whether gold mentions are provided or not, resolving pronouns to noun phrases is always the most challenging.
The correct resolution of pronouns typically requires reasoning over both linguistic knowledge (e.g., 'they' typically can only refer to plural objects) and commonsense knowledge (e.g., in the sentence "The fish ate the worm, it was hungry", 'it' refers to 'fish' because hungry things tend to eat rather than being eaten). Considering that the ordinary PCR task evaluates inference over both types of knowledge at the same time, performance on ordinary PCR tasks cannot clearly reflect models' performance with respect to each knowledge type. To address this problem, the Winograd Schema Challenge (WSC) [12] task was proposed. The influence of all commonly used linguistic knowledge was avoided during the creation of WSC, such that WSC can be used to reflect how well current PCR models understand commonsense knowledge. In Sections 2 and 3, we introduce the progress and remaining challenges on the ordinary PCR and WSC tasks respectively. After that, we introduce other PCR tasks that were developed for different research purposes in Section 4. In the end, we conclude this survey with Section 5. The contribution of this survey is three-fold: (1) we broadly introduce available PCR tasks, datasets, and models; (2) we summarize the main contributions of recent models; (3) we conduct experiments to analyze the limitations of current models, which can help the community think about how to better solve PCR in the future.

Ordinary PCR
Ordinary pronoun coreference resolution tasks are often defined over formal textual corpora (e.g., newspapers) and the annotation is usually conducted by domain experts or linguists. The PCR task can be formally defined as follows. Given a text D, which contains a pronoun p, the goal is to identify all the mentions that p refers to. We denote the correct mentions p refers to as c ∈ C, where C is the correct mention set. Similarly, each candidate span is denoted as s ∈ S, where S is the set of all candidate spans. Note that in the case where no gold mentions are provided, all possible spans in D are used to form S. The task is thus to identify C out of S. In the rest of this section, we introduce the widely used datasets as well as the progress and limitations of current approaches.
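The task definition above can be made concrete with a minimal sketch: when no gold mentions are given, every span of D up to some width forms S, and a scorer selects C out of S. The span enumeration mirrors the definition; the scorer below is a hypothetical toy stand-in, not any model from this survey.

```python
# Minimal sketch of the PCR task framing: enumerate candidate spans S from a
# tokenized document D, then select the referent set C with a scoring function.

def enumerate_spans(tokens, max_width=3):
    """When no gold mentions are provided, every span up to max_width is a candidate."""
    spans = []
    for start in range(len(tokens)):
        for width in range(1, max_width + 1):
            if start + width <= len(tokens):
                spans.append((start, start + width))
    return spans

def resolve_pronoun(tokens, pronoun_idx, score_fn, threshold=0.5):
    """Return all candidate spans whose coreference score with p exceeds a threshold."""
    candidates = [s for s in enumerate_spans(tokens)
                  if not (s[0] <= pronoun_idx < s[1])]  # a span may not contain p itself
    return [s for s in candidates if score_fn(tokens, pronoun_idx, s) > threshold]

# Toy scorer: exact string match against a tiny hand-written lexicon.
def toy_score(tokens, pronoun_idx, span):
    referents = {"it": {"fish", "worm"}}  # illustrative only
    text = " ".join(tokens[span[0]:span[1]])
    return 1.0 if text in referents.get(tokens[pronoun_idx], set()) else 0.0

tokens = "The fish ate the worm , it was hungry".split()
print(resolve_pronoun(tokens, tokens.index("it"), toy_score))
```

In this framing, the difference between the models discussed below is simply which `score_fn` they learn.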

Datasets
Throughout the years, researchers in the NLP community have devoted great effort to developing high-quality coreference resolution datasets, and we introduce representative ones as follows: 1. MUC: MUC-6 [13] and MUC-7 [14], which were developed for the 6th and 7th Message Understanding Conferences respectively, are the earliest coreference resolution datasets. They focus on English news articles and are relatively small compared with modern datasets.

2. ACE:
The ACE dataset [15] was proposed as part of the Automatic Content Extraction program. Compared with the MUC datasets, ACE extends the corpus domain from news to other domains such as telephone speech and broadcast conversations.

Methods
In this subsection, we introduce representative models for the ordinary PCR task. We first briefly introduce conventional approaches that rely on human-designed rules or features, and then introduce the end-to-end model, which is a groundbreaking model for solving coreference resolution tasks. After that, we briefly introduce a few recent improvements over the end-to-end model.

Rule and Feature Based Methods
Before the deep learning era, human-designed rules [2,19], knowledge [20,21], or features [3,22] dominated the general coreference resolution and PCR tasks. Some rules and features are crucial for correctly resolving pronouns [23]. For example, 'he' can only refer to males and 'she' can only refer to females; 'it' can only refer to singular objects and 'them' can only refer to plural objects. The performance of these methods relies heavily on the coverage and quality of the manually defined rules and features. Based on these designed features [24], more advanced machine learning models were later applied to the coreference resolution task. For example, instead of identifying coreference relations pair by pair, [25] proposes an entity-centric coreference system that can learn an effective policy for building coreference chains incrementally. Besides that, a novel model was proposed to predict coreference relations within a deep reinforcement learning framework [26]. Moreover, heuristic rules based on linguistic knowledge can also be incorporated as constraints for machine learning models [27].
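The gender and number rules mentioned above can be sketched as a simple agreement filter. The lexicons below are illustrative toys introduced for this example, not any published rule set.

```python
# Hedged sketch of rule-based candidate filtering via gender/number agreement.

GENDER = {"he": "male", "she": "female", "it": "neuter", "they": "plural"}
NOUN_FEATURES = {  # hypothetical feature dictionary a real system would learn or curate
    "John": "male", "Mary": "female", "fish": "neuter", "worms": "plural",
}

def agrees(pronoun, noun):
    """A candidate survives only if its gender/number feature matches the pronoun's."""
    return GENDER.get(pronoun.lower()) == NOUN_FEATURES.get(noun)

def filter_candidates(pronoun, candidates):
    return [c for c in candidates if agrees(pronoun, c)]

print(filter_candidates("she", ["John", "Mary", "fish"]))  # → ['Mary']
```

As the text notes, such a filter is only as good as the coverage of its lexicons: any noun missing from `NOUN_FEATURES` is silently discarded.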

End-to-end Model
Leveraging human-designed rules or features can help accurately resolve some pronouns, but it is hard to manually design rules that cover all cases. To solve this problem, an end-to-end deep model [9] was proposed. Different from other machine learning based methods, it does not use any human-defined rules, yet achieves surprisingly good performance. Specifically, the end-to-end model first leverages a combination of bi-directional LSTM and inner-attention modules to encode local context and generate representations for all potential mentions. After that, a standard feed-forward neural network is used to predict the coreference relations. Experimental results show that the proposed model is simple yet effective. Its success proves that current deep models are capable of capturing rich contextual information, which is crucial for resolving coreference relations.
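The scoring stage of this pipeline can be sketched structurally: each (candidate, pronoun) pair receives a score and the pronoun links to the highest-scoring antecedent. The toy score table below stands in for the learned encoder and feed-forward network; it is not the actual model.

```python
# Structural sketch of pairwise antecedent selection in end-to-end coreference.

def best_antecedent(pronoun, candidates, score_fn):
    """Link the pronoun to its single highest-scoring antecedent candidate."""
    return max(candidates, key=lambda c: score_fn(c, pronoun))

# Toy pair scores standing in for the feed-forward network over span representations.
TOY_SCORES = {("fish", "it"): 0.9, ("worm", "it"): 0.4}

def toy_pair_score(candidate, pronoun):
    return TOY_SCORES.get((candidate, pronoun), 0.0)

print(best_antecedent("it", ["fish", "worm"], toy_pair_score))  # → fish
```

The key design point is that `score_fn` is learned end-to-end from data, replacing the hand-written rules of earlier systems.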

Further Improvements
Recently, on top of the end-to-end model, a few works were proposed to address different limitations of the original model. 1. Higher-order Information: One limitation of the original end-to-end model is that all predictions are based on mention pairs, which is not sufficient for capturing higher-order coreference relations. To fix this issue, a differentiable approximation module was proposed in [29] to provide higher-order coreference inference ability (i.e., leveraging the coreference cluster to better predict the coreference relations). Moreover, this work was the first to incorporate ELMo [30] as part of the word representation, which proved very effective.
2. Structured Knowledge: Another limitation of the end-to-end model is that its success relies heavily on the quality and coverage of the training data. However, in real applications, it is labor-intensive and almost impossible to annotate a large-scale dataset that contains all scenarios. To solve this problem, two works [5,4] were proposed to inject external structured knowledge into the end-to-end model. Of the two, [5] requires converting external knowledge into features, while [4] directly uses external knowledge in the form of triples.
3. Stronger Language Representation Models: Recently, along with the fast development of language representation models, a few works [31,28] have tried to replace the encoding layer of the original end-to-end model with more powerful language representation models. Taking SpanBERT [28] as an example, replacing ELMo with SpanBERT boosts the performance by 6.6 F1 on the general coreference resolution task.

Performances and Analysis
We follow the experimental setting of [4] and test the performance of representative models [19,25,26,9,4,28] on the CoNLL-2012 dataset [10]. From the results in Table 2, we can observe that with the help of the end-to-end model and further modifications, the community has made great progress on the standard evaluation set. For example, the end-to-end model achieves an F1 score over 70, and adding external knowledge (either in a structured way or through representations) further boosts the performance. Among all pronoun types, all models perform better on third personal and possessive pronouns, and relatively poorly on demonstrative ones. This is mainly because of the imbalanced distribution of the dataset (i.e., third personal and possessive pronouns appear much more often than demonstrative ones).

Cross-domain Performance
To investigate whether current PCR models are good enough to be used in real applications, which could be out of the training domain, we conduct experiments in the cross-domain setting. In detail, we select two PCR datasets from different domains (i.e., CoNLL [10] from news and i2b2 [32] from the medical domain), train the model on one dataset, and test it on the other. We conduct experiments with the three best-performing models and show the results in Table 3, from which we can see that all models perform significantly worse when used across domains.
Compared with the baseline method, adding explicit knowledge can help achieve slightly better performance in the cross-domain setting because its training objective allows models to learn to selectively use suitable knowledge rather than just fitting the training data.

Influence of Frequency
To further analyze the performance of existing models, we split the pronouns based on the frequency of the objects they refer to. If an object appears more than ten times in the whole dataset, we denote it as a frequent object; otherwise, we denote it as an infrequent object. As a result, we collect 1,095 frequent and 470,232 infrequent objects, whose average frequencies are 36.2 and 1.46 respectively. We report the performance of the best-performing models on infrequent and frequent objects separately in Table 4. In general, all models perform better on frequent objects because they appear more often in the training data. Another interesting observation is that even though adding an external KG and using a stronger language representation model can both boost the performance, their improvements come from different types of objects. For example, the main contribution of adding a KG is on infrequent objects, because even though they are less frequent in the training data, they can still be covered by external knowledge. In comparison, using a strong language representation model mainly benefits frequent objects because of its stronger ability to fit the training data. This observation is consistent with our previous observation that adding an external KG has more effect on relatively rare pronouns (i.e., demonstrative pronouns).
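The frequency split described above (threshold of ten occurrences) can be sketched directly; the example data is illustrative only.

```python
# Sketch of the frequency-based split: objects appearing more than ten times
# in the dataset count as frequent, the rest as infrequent.

from collections import Counter

def split_by_frequency(object_mentions, threshold=10):
    counts = Counter(object_mentions)
    frequent = {o for o, c in counts.items() if c > threshold}
    infrequent = set(counts) - frequent
    return frequent, infrequent

mentions = ["president"] * 12 + ["aardvark"] * 2  # toy data
freq, infreq = split_by_frequency(mentions)
print(freq, infreq)  # → {'president'} {'aardvark'}
```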

Hard PCR
As aforementioned, the correct resolution of pronouns requires inference over both linguistic knowledge and commonsense knowledge. To clearly reflect how well models can resolve pronouns that require inference over commonsense knowledge, the hard PCR task was proposed. As the Winograd Schema Challenge (WSC) is the most popular hard PCR task, we use the task definition in WSC to define the hard PCR task. Given a sentence s, which contains a pronoun p and two candidates n1 and n2, the task is to find out which of the candidates p refers to. Different from the ordinary PCR task, the influence of all commonly observed features (e.g., gender or plurality) is removed via careful expert design.
In WSC, all questions are paired up such that the questions in each pair have only minor differences (mostly a one-word difference), but the answers are reversed. One pair of WSC instances is shown in Figure 1. Solving these questions typically requires the support of complex commonsense knowledge. For example, human beings know that the pronoun 'it' in the first sentence refers to 'fish' while the one in the second sentence refers to 'worm', because 'hungry' is a common property of something that eats while 'tasty' is a common property of something being eaten. Without the support of such commonsense knowledge, answering these questions becomes challenging because both the fish and the worm can be hungry or tasty by themselves.

(Human performance on WSC: 92.1% on the original question set [12] and 96.5% in a recent study [34]; see Table 5.)
2. Definite Pronoun Resolution: Another hard pronoun coreference resolution dataset is the Definite Pronoun Resolution (DPR) dataset [33]. Different from WSC, DPR leveraged undergraduates rather than experts to create the dataset. In total, DPR collected 1,886 questions, a slightly larger scale than the official WSC. However, as DPR could not guarantee that all its questions follow the strict design guidelines of WSC, DPR questions are relatively simpler.

3. WinoGrande:
One common problem of WSC and DPR is their small scale. To create larger-scale data, WinoGrande [34] was proposed. By leveraging annotators from Amazon Mechanical Turk, WinoGrande collected 53 thousand WSC-like questions. Moreover, to ensure dataset quality, WinoGrande applied a bias reduction algorithm to filter out examples that may contain annotation bias. Experimental results show that WinoGrande is much more challenging than the original WSC: SOTA models on WSC only achieve 51% accuracy on WinoGrande, which is close to random guessing.
4. KnowRef: Similar to WinoGrande, KnowRef [35] also aimed at creating a larger-scale WSC dataset, but with a different approach. Instead of using the crowd-sourcing plus adversarial filtering framework, KnowRef extracted WSC-like questions from raw sentences. As a result, KnowRef collected eight thousand WSC-like questions.

Methods
In this subsection, we introduce existing approaches for the hard PCR task.As the majority of the methods are evaluated based on WSC, all the discussion and analysis are based on their performance on WSC.

Reasoning with Structured Knowledge
At first, people tried to leverage different commonsense knowledge resources to solve WSC questions in an explainable way. For example, [43] first leveraged the commonsense triplets from ConceptNet [44] to train word embeddings and then applied the embeddings to solve the WSC task. Knowledge Hunter [36] proposed to leverage search engines (e.g., Google) to acquire the needed commonsense knowledge: it first searched WSC questions in search engines and then used the returned results to solve the questions. SP-10K [37] conducted experiments showing that selectional preference (SP) knowledge (e.g., human beings are more likely to eat 'food' than 'rock') can also be helpful for solving WSC questions. Last but not least, ASER [38] tried to use knowledge about eventualities (e.g., 'being hungry' can cause 'eating food') to solve WSC questions. In general, structured commonsense knowledge can help solve one third of the WSC questions, but the overall performance is limited by low coverage, for mainly two reasons: (1) the coverage of existing commonsense resources is not large enough; (2) we lack a principled way of using structured knowledge for NLP tasks. Current methods [36,37,38] mostly rely on string matching, and for many WSC questions it is hard to find supportive knowledge in such a way.

Language Representation Models
Another approach is leveraging language models to solve WSC questions [39]: each WSC question is first converted into two sentences by replacing the target pronoun with the two candidates respectively, and then the language models are employed to compute the probability of both sentences. The sentence with the higher probability is selected as the final prediction. As this method does not require any string matching, it can make predictions for all WSC questions and achieves better overall performance. Recently, a more advanced transformer-based language model, GPT-2 [40], achieved better performance due to its stronger language representation ability. The success of language models demonstrates that rich commonsense knowledge can indeed be encoded within language models implicitly.
Another interesting finding about these language model based approaches is the two settings they propose for predicting the probability: (1) Full: use the probability of the whole sentence as the final prediction; (2) Partial: only consider the probability of the part of the sentence after the target pronoun. Experiments show that the partial model always outperforms the full model. One explanation is that the influence of the imbalanced distribution of candidate words is relieved by only considering the sentence probability after them. This observation also explains why GPT-2 can outperform unsupervised BERT on WSC: models based on BERT, which rely on predicting the probability of the candidate words, cannot get rid of such noise.
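The full versus partial scoring schemes can be sketched as follows. Here `toy_logprob` is a hypothetical stand-in for a real language model's per-token log-probability, introduced only to make the example runnable.

```python
# Sketch of 'full' vs. 'partial' language-model scoring for WSC questions.

def sentence_logprob(tokens, token_logprob, start=0):
    """Sum per-token log-probs; start > 0 gives the 'partial' score after that index."""
    return sum(token_logprob(tokens, i) for i in range(start, len(tokens)))

def predict(question_tokens, pronoun_idx, candidates, token_logprob, partial=True):
    """Fill each candidate into the pronoun slot and pick the higher-scoring sentence."""
    def score(cand):
        filled = question_tokens[:pronoun_idx] + [cand] + question_tokens[pronoun_idx + 1:]
        start = pronoun_idx + 1 if partial else 0
        return sentence_logprob(filled, token_logprob, start)
    return max(candidates, key=score)

def toy_logprob(tokens, i):
    # Toy 'LM': 'hungry' is more probable in a sentence that contains 'fish'.
    return 0.0 if tokens[i] == "hungry" and "fish" in tokens else -1.0

q = ["the", "it", "was", "hungry"]
print(predict(q, 1, ["fish", "worm"], toy_logprob))  # → fish
```

The partial setting simply starts the summation after the filled-in candidate, so the candidate word's own (possibly imbalanced) probability never enters the comparison.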

Fine-tuning Representation Models
Last but not least, we introduce the current best-performing models on the WSC task, which fine-tune pre-trained language representation models (e.g., BERT [41] or RoBERTa [42]) with a similar dataset (e.g., DPR [33] or WinoGrande [34]). This idea was originally proposed by [45], which first converts the original WSC task into a token prediction task and then selects the candidate with the higher probability as the final prediction. In general, the stronger the language model and the larger the fine-tuning dataset, the better the model performs on the WSC task.
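The conversion step can be sketched as follows: substitute each candidate into the pronoun slot and keep the candidate the model scores higher. The scorer is a hypothetical placeholder, and the naive string replacement is a simplification; a real implementation would use token offsets.

```python
# Sketch of converting a WSC instance into a two-way choice over filled sentences.

def wsc_to_choices(sentence, pronoun, candidates):
    """Replace the first occurrence of the pronoun with each candidate.
    Naive: substring replacement can misfire; real systems use token offsets."""
    return [sentence.replace(pronoun, cand, 1) for cand in candidates]

def choose_candidate(sentence, pronoun, candidates, score_fn):
    choices = wsc_to_choices(sentence, pronoun, candidates)
    best = max(range(len(candidates)), key=lambda k: score_fn(choices[k]))
    return candidates[best]
```

A usage example with a toy scorer: `choose_candidate("The fish ate the worm, it was hungry", "it", ["fish", "worm"], lambda s: "fish was hungry" in s)` returns `"fish"`.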

Performances and Analysis
To clearly understand the progress we have made on solving hard PCR problems, we show the performance of all models on the Winograd Schema Challenge in Table 5. From the results, we can make the following observations: 1. Even though methods that leverage structured knowledge can provide explainable solutions to WSC questions, their performance is typically limited due to low coverage.
2. Different from them, language model based methods represent the knowledge contained in human language implicitly, and thus do not suffer from the matching issue and achieve better overall performance.
3. In general, fine-tuning pre-trained language representation models (e.g., BERT and RoBERTa) with similar datasets (e.g., DPR and WinoGrande) achieves the current SOTA performance, and two observations can be made: (1) The stronger the pre-trained model, the better the performance. This shows that current language representation models can indeed cover commonsense knowledge, and as their representation ability increases (e.g., a deeper model or a larger pre-training corpus, as with RoBERTa), more commonsense knowledge can be effectively represented. (2) The larger the fine-tuning dataset, the better the performance. This is probably because the knowledge behind some WSC questions is only covered by WinoGrande but not by DPR.
To investigate the reason behind WinoGrande's success, we divide WinoGrande into subsets based on the instances' relevance to WSC. Assume that the instance sets of WinoGrande and WSC are I_WG and I_WSC respectively; for each instance i ∈ I_WG, we define its relevance score in terms of O(i, i′), the unigram co-occurrence of i and an instance i′ ∈ I_WSC, and L(·), the instance length. We use the released code and dataset to conduct the experiments and follow all hyper-parameters of the original paper [34] except the batch size.
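A relevance score of this shape can be sketched as below. Note the exact normalization in the original formula is not recoverable from the text here, so this sketch assumes a Dice-style length normalization over unigram overlap as one plausible choice; it is an illustration, not the paper's definition.

```python
# Hedged sketch of a length-normalized unigram-overlap relevance score.

def unigram_cooccurrence(a, b):
    """O(i, i'): number of unigram types shared by the two instances."""
    return len(set(a.split()) & set(b.split()))

def relevance(instance, wsc_instances):
    """Max length-normalized unigram overlap with any WSC instance (assumed form)."""
    L = lambda s: len(s.split())
    return max(
        2 * unigram_cooccurrence(instance, w) / (L(instance) + L(w))
        for w in wsc_instances
    )
```

Under this definition, an instance identical to some WSC instance scores 1.0 and one sharing no unigrams scores 0.0, which matches the intent of grouping WinoGrande instances by relevance.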
From the results in Table 6, we can observe that: (1) The most relevant instances contribute the most to the success; in some learning rate settings, this subset performs similarly to or even better than the overall set. (2) Less relevant instances also help, which shows that the current fine-tuning approach is not just fitting the data but also learning some underlying knowledge about solving the task. (3) The model can be sensitive to the hyper-parameters (i.e., the learning rate). Different subsets have different best hyper-parameters, and the learning process can easily fail with a bad hyper-parameter. To achieve good performance on a fixed dataset like WSC, we can tune the hyper-parameters, but to create a reliable PCR system we can rely on in real life, we probably need a more robust model.

Other PCR Tasks
Besides the ordinary and hard PCR tasks, PCR is also an important research topic for many special purposes (e.g., studying gender bias) or in special settings (e.g., visual-aware PCR). In this section, we briefly introduce these tasks: 1. PCR in the Medical Domain: i2b2 [32] is a dataset that focuses on identifying coreference relations in electronic medical records. As reported in [4], the training set of i2b2 contains 2,024 third personal pronouns, 685 possessive pronouns, and 270 demonstrative pronouns. Its test set contains 1,244 third personal pronouns, 367 possessive pronouns, and 166 demonstrative pronouns. For a dataset in such a relatively narrow domain, the usage of domain knowledge becomes important. As shown in [4], i2b2 can be used as an additional dataset to evaluate models' cross-domain abilities.
2. PCR for Machine Translation: ParCor [46] and ParCorFull [47] are datasets focusing on PCR in parallel multilingual corpora, which can be used in downstream machine translation tasks. Different from other PCR works, they focus on how to leverage PCR results for better translation rather than how to solve the PCR problem.
3. PCR for Chatbots: CIC [48] is a dataset focusing on identifying coreference relations in multi-party conversations. Compared with the ordinary PCR tasks, which are mostly annotated on formal textual data (e.g., newswire), identifying coreference relations in conversations is more challenging.
4. PCR for Studying Gender Bias: Nowadays, gender bias has become a hot research topic in the NLP community [49,50]. Among all these works, WinoGender [49] is one of the most popular. The setting of WinoGender is similar to that of WSC [12]: each sentence contains one target pronoun and two candidate noun phrases, and models are required to select the correct antecedent from the two candidates. But the purpose is different: WSC aims at evaluating models' abilities to understand commonsense knowledge, while WinoGender aims at evaluating how well models can predict without the influence of gender bias. The experiments show that some gender bias (e.g., 'he' is more likely to be predicted as the doctor rather than the nurse) indeed exists in pre-trained language representation models. This observation is striking and motivates the community to think about how to minimize the influence of such gender bias.

5. Visual-aware PCR: Recently, a visual-aware PCR dataset [51], which evaluates how well models can ground pronouns to visual objects, was proposed. Similar to CIC [48], Visual-PCR also focuses on pronouns in daily dialogue, where the language usage is informal and a lot of background knowledge can be missing. For example, if one speaker refers to something both speakers can see, they may directly use a pronoun rather than introduce the object first. In such cases, a pronoun may refer to objects not mentioned in the conversation. As analyzed in the original paper, 15% of pronouns in conversations refer to unmentioned objects, and for these, leveraging the visual context becomes crucial. As shown in [52], grounding pronouns to visual objects can significantly help the model better understand the dialog and generate better responses, which further proves that visual PCR is an important research topic worth exploring.

Conclusion
In this paper, we survey the progress on the pronoun coreference resolution (PCR) task and the limitations of existing approaches. Experiments and analysis on both the ordinary and hard PCR tasks demonstrate that even though we have made great progress on the main evaluation metrics, the PCR task is still far from solved. For example, all best-performing ordinary PCR models struggle in the cross-domain setting as well as on infrequent objects, and even though fine-tuning pre-trained language representation models can achieve near-human performance on WSC, it can be sensitive to the hyper-parameters. All code will be released to encourage research on the PCR task.

Table 1 :
The performance of the end-to-end model on the CoNLL-2012 shared task coreference resolution dataset. The model's performance on different coreference types is reported separately.

Table 2 :
Performances of different models on the CoNLL-2012 shared task. Precision (P), recall (R), and the F1 score are reported. The numbers of different types of pronouns in the test set are shown in brackets. The best models are indicated in bold.

Table 3 :
Models' performance (in F1 score) in the cross-domain setting for different training/test data.

Table 4 :
Influence of the frequency.

Table 5 :
Performances of different models on the 273-question version of WSC. NA means that the model cannot give a prediction, Ap means the accuracy on predicted examples (excluding NA examples), and Ao the overall accuracy.

Table 6 :
Performance of fine-tuning RoBERTa with different subsets of WinoGrande and different learning rates. L.R. means learning rate and Rel. means relevance to the WSC data. WinoGrande instances are grouped into three subsets. The numbers of instances are shown in brackets. The best-performing subset for each learning rate is indicated in bold.