Generalisation in Named Entity Recognition: A Quantitative Analysis

Named Entity Recognition (NER) is a key NLP task, which is all the more challenging on Web and user-generated content with their diverse and continuously changing language. This paper aims to quantify how this diversity impacts state-of-the-art NER methods, by measuring named entity (NE) and context variability, feature sparsity, and their effects on precision and recall. In particular, our findings indicate that NER approaches struggle to generalise in diverse genres with limited training data. Unseen NEs, in particular, play an important role; these have a higher incidence in diverse genres such as social media than in more regular genres such as newswire. Coupled with a higher incidence of unseen features more generally and the lack of large training corpora, this leads to significantly lower F1 scores for diverse genres as compared to more regular ones. We also find that leading systems rely heavily on surface forms found in training data, having problems generalising beyond these, and offer explanations for this observation.

As NER is being applied to increasingly diverse and challenging text genres [61; 18; 24], this has led to a noisier, sparser feature space, which in turn requires regularisation [10] and the avoidance of overfitting. This has been the case even for large corpora all of the same genre and with the same entity classification scheme, such as ACE [40]. Recall, in particular, has been a persistent problem, as named entities often have unusual surface forms, e.g. unusual character sequences for the given language (e.g. Szeged in an English-language document) or words that individually are typically not NEs, unless they are combined together (e.g. the White House).
Indeed, the move from ACE and MUC to broader kinds of corpora has presented existing NER systems and resources with a great deal of difficulty [37], which some researchers have tried to address through domain adaptation, specifically with entity recognition in mind [15; 62; 28; 12; 3]. However, more recent performance comparisons of NER methods over different corpora showed that older tools tend to simply fail to adapt, even when given a fair amount of in-domain data and resources [51; 18]. Simultaneously, the value of NER in non-newswire data [51; 34; 47; 53; 5] has rocketed: social media now provides us with a sample of all human discourse, unmolested by editors, publishing guidelines and the like, and all in digital format, leading, for example, to whole new fields of research opening in computational social science [29; 46; 48].
The prevailing assumption has been that this lower NER performance is due to domain differences arising from using newswire (NW) as training data, as well as from the irregular, noisy nature of new media (e.g. [51]). Existing studies [18] further suggest that named entity diversity, discrepancy between named entities in the training set and the test set (entity drift over time in particular), and diverse context, are the likely reasons behind the significantly lower NER performance on social media corpora, as compared to newswire.
No prior studies, however, have investigated these hypotheses quantitatively.
For example, it is not yet established whether this performance drop is really due to a higher proportion of unseen NEs in social media, or whether it is instead due to NEs being situated in different kinds of linguistic context. Accordingly, the contributions of this paper lie in investigating the following open research questions:
RQ1 How does NERC performance differ for corpora between different NER approaches?
RQ2 How does NERC performance differ for corpora over different text types/genres?
RQ3 What is the impact of NE diversity on system performance?
RQ4 What is the relationship between Out-of-Vocabulary (OOV) features (unseen features), OOV entities (unseen NEs) and performance?
RQ5 How well do NERC methods perform out-of-domain and what impact do unseen NEs (i.e. those which appear in the test set, but not the training set) have on out-of-domain performance?
In particular, the paper carries out a comparative analysis of the performance of several different approaches to statistical NER over multiple text genres, with varying NE and lexical diversity. In line with prior analyses of NER performance [44; 18], we carry out corpus analysis and briefly introduce the NER methods used for experimentation. Unlike prior efforts, however, our main objectives are to uncover the impact of NE diversity and context diversity on performance (measured primarily by F1 score), and to study the relationship between OOV NEs, OOV features and F1. See Section 3 for details.
To ensure representativeness and comprehensiveness, our experimental findings are based on key benchmark NER corpora spanning multiple genres, time periods, and corpus annotation methodologies and guidelines. As detailed in Section 2.1, the corpora studied are OntoNotes [30], ACE [60], MUC 7 [11], CoNLL 2003 [58], the Ritter NER corpus [51], the MSM 2013 corpus [52], and the UMBC Twitter corpus [21]. To eliminate potential bias from the choice of statistical NER approach, experiments are carried out with three differently-principled NER approaches, namely Stanford NER [22], SENNA [14] and CRFSuite [43] (see Section 2.2 for details).

Datasets
Since the goal of this study is to compare NER performance on corpora from diverse domains and genres, seven benchmark NER corpora are included, spanning newswire, broadcast conversation, Web content, and social media (see Table 1 for details). These datasets were chosen such that they have been annotated with the same or very similar entity classes, in particular names of people, locations, and organisations. Thus corpora including only domain-specific entities (e.g. biomedical corpora) were excluded. The choice of corpora was also motivated by their chronological age; we wanted to ensure a good temporal spread, in order to study possible effects of entity drift over time.1

A note is required about terminology. This paper refers to text genre and also text domain. These are two dimensions by which a document or corpus can be described. Genre here accounts for the general characteristics of the text, measurable with things like register, tone, reading ease, sentence length, vocabulary and so on. Domain describes the dominant subject matter of a text, which might give specialised vocabulary or specific, unusual word senses. For example, "broadcast news" is a genre, describing the manner of use of language, whereas "financial text" or "popular culture" are domains, describing the topic. One notable exception to this terminology is social media, which tends to be a blend of myriad domains and genres, with huge variation in both these dimensions [31; 4]; for simplicity, we also refer to this as a genre here.

1 Entity drift is a cause of heightened diversity. It happens when the terms used to represent an NE category, such as "person", change over time. For example, "yeltsin" might be one lexicalisation of a person entity in the 1990s, whereas one may see "putin" in later texts. Both are lexical representations of the same kind of underlying entity, which has remained static over time. Entity drift is a form of concept drift [36] specific to NER.

Corpora Used
In chronological order, the first corpus included here is MUC 7, which is the last of the MUC challenges [11]. This is an important corpus, since the Message Understanding Conference (MUC) was the first one to introduce the NER task in 1995 [27], with focus on recognising persons, locations and organisations in newswire text.
A subsequent evaluation campaign was the CoNLL 2003 NER shared task [58], which created gold standard data for newswire in Spanish, Dutch, English and German. The corpus of this evaluation effort is now one of the most popular gold standards for NER, with new NER approaches and methods often reporting performance on that.
Later evaluation campaigns began addressing NER for genres other than newswire, specifically ACE [60] and OntoNotes [30]. Both of those contain subcorpora in several genres, namely newswire, broadcast news, broadcast conversation, weblogs, and conversational telephone speech. ACE, in addition, contains a subcorpus with usenet newsgroups. Like CoNLL 2003, the OntoNotes corpus is also a popular benchmark dataset for NER. The languages covered are English, Arabic and Chinese. A further difference between the ACE and OntoNotes corpora on the one hand, and CoNLL and MUC on the other, is that they contain annotations not only for NER, but also for other tasks such as coreference resolution, relation and event extraction and word sense disambiguation. In this paper, however, we restrict ourselves purely to the English NER annotations, for consistency across datasets. The ACE corpus contains HEAD as well as EXTENT annotations for NE spans. For our experiments we use the EXTENT tags.
With the emergence of social media, studying NER performance on this genre gained momentum. So far, there have been no large evaluation efforts comparable to ACE or OntoNotes that have resulted in substantial amounts of gold standard data. Instead, benchmark corpora were created as part of smaller challenges or individual projects. The first such corpus is the UMBC corpus for Twitter NER [21], where researchers used crowdsourcing to obtain annotations for persons, locations and organisations. A further Twitter NER corpus was created by [51], which, in contrast to other corpora, contains more fine-grained classes defined by the Freebase schema [7]. Next, the Making Sense of Microposts (MSM) initiative [52] provides a single annotated dataset for named entity recognition on Twitter, covering persons, locations, organisations and miscellaneous entities. The MSM challenges from 2014 onwards additionally feature a named entity linking task, but since we only focus on NER here, we use the 2013 corpus.
These corpora are diverse not only in terms of genres and time periods covered, but also in terms of NE classes and their definitions. In particular, the ACE and OntoNotes corpora try to model entity metonymy by introducing facilities and geo-political entities (GPEs). Since the rest of the benchmark datasets do not make this distinction, metonymous entities are mapped to a more common entity class (see below).
In order to ensure consistency across corpora, only Person (PER), Location (LOC) and Organisation (ORG) are used in our experiments, and other NE classes are mapped to O (no NE). For the Ritter corpus, the 10 entity classes are collapsed to three as in [51]. For the ACE and OntoNotes corpora, metonymous classes such as facilities and geo-political entities are mapped to the closest of these three classes.

In order to compare corpus diversity across genres, we measure NE and token/type diversity (following e.g. [44]). Note that types are the unique tokens, so the token/type ratio can be understood as the ratio of total tokens to unique tokens. Table 5 shows the ratios between the number of NEs and the number of unique NEs per corpus, while Table 6 reports the token/type ratios. The lower these ratios are, the more diverse a corpus is. While token/type ratios also count tokens which are NEs, they are a good measure of broader linguistic diversity.
Aside from these metrics, there are other factors which contribute to corpus diversity, including how big a corpus is and how well sampled it is, e.g. if a corpus is only about one story, it should not be surprising to see a high token/type ratio. Therefore, by experimenting on multiple corpora, from different genres and created through different methodologies, we aim to encompass these other aspects of corpus diversity.
Since the original NE and token/type ratios do not account for corpus size, Tables 6 and 5 also present normalised ratios. For those, a number of tokens equal to the size of the smallest corpus, i.e. 7037 tokens for UMBC (Table 6), or, respectively, a number of NEs equal to the number in the smallest corpus (506 for UMBC, Table 5), is selected from each corpus.
An easy choice of sampling method would be to sample tokens and NEs randomly. However, this would not reflect the composition of corpora appropriately. Corpora consist of several documents, tweets or blog entries, which are likely to repeat the words or NEs since they are about one story. The difference between bigger and smaller corpora is then that bigger corpora consist of more of those documents, tweets, blog entries, interviews, etc. Therefore, when we downsample, we take the first n tokens for the token/type ratios or the first n NEs for the NEs/Unique NEs ratios.
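To make this procedure concrete, a minimal sketch of the diversity ratios and the first-n normalisation is given below; the function names, the commented-out corpus-loading helper and the sample size of 7037 tokens are illustrative assumptions, not the exact implementation used in our experiments.

from typing import Callable, List

def token_type_ratio(tokens: List[str]) -> float:
    """Ratio of total tokens to unique tokens; lower means more diverse."""
    return len(tokens) / len(set(tokens))

def ne_unique_ne_ratio(entity_surface_forms: List[str]) -> float:
    """Ratio of NE mentions to unique NE surface forms; lower means more diverse."""
    return len(entity_surface_forms) / len(set(entity_surface_forms))

def normalised_ratio(items: List[str], n: int, ratio_fn: Callable[[List[str]], float]) -> float:
    """Downsample by taking the first n items (not a random sample), so that
    within-document repetition is preserved, then compute the chosen ratio."""
    return ratio_fn(items[:n])

# Hypothetical usage: normalise every corpus to the size of the smallest one.
# corpus_tokens = load_tokens("conll_train")   # assumed helper, not part of this paper
# print(normalised_ratio(corpus_tokens, 7037, token_type_ratio))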
Looking at the normalised diversity metrics, clear differences between the genres remain (Table 5). The telephone conversations in ACE CTS are all about the same story, and newswire and broadcast news tend to contain longer stories (reducing variety in any fixed-size set) and are more regular due to editing.
The token/type ratios (Table 6) reflect similar trends. As for tag density (Table 7), conversational corpora, to which ACE BC also belongs, tend to have many filler words, so it is not surprising that they have a low tag density. There are only minor differences between the tag density and the normalised tag density, since corpus size as such does not impact tag density.
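Reading tag density as the proportion of tokens annotated as part of an NE, a minimal sketch of the measure (under that assumption) is:

def tag_density(labels):
    """Proportion of tokens labelled as part of an NE (anything other than O),
    given one BIO-style label per token."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label != "O") / len(labels)

print(tag_density(["B-PER", "I-PER", "O", "O", "B-LOC", "O"]))  # 0.5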

NER Models and Features
To avoid system-specific bias in our experiments, three widely-used supervised statistical approaches to NER are included: Stanford NER, 2 SENNA, 3 and CRFSuite. 4 These systems each have contrasting notable attributes.
Stanford NER [22] is the most popular of the three, deployed widely in both research and commerce. The system has been developed in terms of both generalising the underlying technology and also specific additions for certain languages. The majority of openly-available additions to Stanford NER, in terms of models, gazetteers, 5 prefix/suffix handling and so on, have been created for newswire-style text. Named entity recognition and classification is modelled as a sequence labelling task with first-order conditional random fields (CRFs) [32].
SENNA [14] is a more recent system for named entity extraction and other NLP tasks. Using word representations and deep learning with deep convolutional neural networks, the general principle for SENNA is to avoid task-specific engineering while also doing well on multiple benchmarks. The approach taken to fit these desiderata is to use representations induced from large unlabelled datasets, including LM2 (introduced in the paper itself) and Brown clusters [9; 16]. The outcome is a flexible system that is readily adaptable, given training data. Although the system is more flexible in general, it relies on learning language models from unlabelled data, which might take a long time to gather and retrain.

Finally, we use the classical NER approach from CRFSuite [43], which also uses first-order CRFs. This frames NER as a structured sequence prediction task, using features derived directly from the training text. Unlike the other systems, no external knowledge (e.g. gazetteers and unsupervised representations) is used. This provides a strong basic supervised system and, unlike Stanford NER and SENNA, has not been tuned for any particular domain, giving potential to reveal more challenging domains without any intrinsic bias.
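To make the contrast concrete, below is a minimal toy sketch of such a first-order CRF sequence labeller in the CRFSuite family, using the python-crfsuite bindings and simple lexical features derived directly from the training text; the feature templates, parameters and file names are illustrative assumptions, not the exact configuration used in our experiments.

import pycrfsuite

def token_features(sent, i):
    """Simple lexical features taken directly from the training text
    (no gazetteers or unsupervised representations)."""
    w = sent[i]
    feats = [
        "word.lower=" + w.lower(),
        "word.istitle=%s" % w.istitle(),
        "word.isupper=%s" % w.isupper(),
        "word.isdigit=%s" % w.isdigit(),
        "prefix3=" + w[:3],
        "suffix3=" + w[-3:],
    ]
    feats.append("-1:word.lower=" + sent[i - 1].lower() if i > 0 else "BOS")
    feats.append("+1:word.lower=" + sent[i + 1].lower() if i < len(sent) - 1 else "EOS")
    return feats

# Toy training data; real experiments use the corpora described above.
train = [(["Barack", "Obama", "visited", "Paris", "."],
          ["B-PER", "I-PER", "O", "B-LOC", "O"])]

trainer = pycrfsuite.Trainer(verbose=False)
for sent, labels in train:
    trainer.append([token_features(sent, i) for i in range(len(sent))], labels)
trainer.set_params({"c1": 0.1, "c2": 0.01, "max_iterations": 50})
trainer.train("toy_ner.crfsuite")

tagger = pycrfsuite.Tagger()
tagger.open("toy_ner.crfsuite")
test_sent = ["Obama", "met", "Merkel"]
print(tagger.tag([token_features(test_sent, i) for i in range(len(test_sent))]))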
We use the feature extractors natively distributed with the NER frameworks.
For Stanford NER we use the feature set "chris2009" without distributional similarity, which has been tuned for the CoNLL 2003 data. This feature set was tuned to handle OOV words through word shape, i.e. the capitalisation pattern of constituent characters. The goal is to reduce feature sparsity, the basic problem underlying poor generalisation to unseen words.
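Word shape features of this kind can be illustrated with the short sketch below; the exact encoding used by the "chris2009" feature set differs in detail, so this is only an approximation.

def word_shape(token: str) -> str:
    """Map each character to a shape class: X for upper case, x for lower case,
    d for digits, and the character itself otherwise (an approximation of the
    shape features used to combat OOV sparsity)."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)

print(word_shape("Szeged"))   # Xxxxxx
print(word_shape("MSM2013"))  # XXXdddd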
RQ1: NER Performance with Different Approaches

Our first research question is how NERC performance differs between corpora and approaches. In order to answer this, Precision (P), Recall (R) and F1 metrics are reported on size-normalised corpora (Table 3) and on the original corpora in Tables 8 and 9, where the F1 of NERC methods trained on the original training data is reported.
Our hypothesis that CRFSuite, being less tuned for NW corpora, might therefore have a more balanced performance across genres does not hold.
Results with CRFSuite for every corpus are worse than the results for that corpus with Stanford NER, which is also CRF-based.
To summarise, our findings are: • F1 is highest with SENNA, followed by Stanford NER and CRFSuite.
• SENNA outperforms the other methods by a large margin (>= 13 points in F1) for CoNLL Test A, ACE CTS and OntoNotes TC.
• Our hypothesis that CRFSuite, being less tuned for NW corpora, would therefore have a more balanced performance across genres does not hold, as results for CRFSuite on every corpus are worse than with Stanford NER.

RQ2: NER Performance in Different Genres
Our second research question is whether existing NER approaches generalise well over corpora in different genres. To do this we study again Precision (P), Recall (R) and F1 metrics on size-normalised corpora (Table 3), on original corpora (Tables 8 and 9), and we further test performance per genre in a separate table (Table 4).
F1 scores over size-normalised corpora vary widely (Table 3). Although OntoNotes NW is more than twice the size of CoNLL in terms of NEs (see Table 5), NERC performance is much higher on CoNLL. NERC performance with respect to training corpus size is shown in Figure 1. The latter figure confirms that although there is some correlation between corpus size and F1, the variance between results on comparably sized corpora is large. This strengthens our argument that there is a need for experimental studies, such as those reported below, to find out what, apart from corpus size, impacts NERC performance.
Another set of results presented in Table 8 are those of the simple NERC memorisation baseline. It can be observed that corpora with a low F1 for NERC methods, such as UMBC and ACE UN, also have a low memorisation performance. Memorisation is discussed in more depth in Section 3.5.
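For clarity, the memorisation baseline can be sketched as follows: each surface form is assigned the NE label it most frequently received in the training data, and anything never seen is labelled O. The data structures and tie-breaking shown are illustrative assumptions.

from collections import Counter, defaultdict

def train_memorisation(train_entities):
    """train_entities: iterable of (surface_form, label) pairs observed in training.
    Returns the most frequent label for each surface form."""
    counts = defaultdict(Counter)
    for surface, label in train_entities:
        counts[surface][label] += 1
    return {surface: c.most_common(1)[0][0] for surface, c in counts.items()}

def memorisation_tag(lookup, candidate_surface):
    """Label a candidate span with its memorised label, or O if never seen."""
    return lookup.get(candidate_surface, "O")

# Hypothetical usage
lookup = train_memorisation([("New York", "LOC"), ("New York", "ORG"), ("New York", "LOC")])
print(memorisation_tag(lookup, "New York"))  # LOC
print(memorisation_tag(lookup, "Szeged"))    # O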
When NERC results are compared to the corpus diversity statistics, i.e.
NE/Unique NE ratios (Table 5), token/type ratios (Table 6), and tag density (Table 7), the strongest predictor of F1 is tag density, as evidenced by the R correlation values between the ratios and the F1 scores of the Stanford NER system, shown in the respective tables.
There is a positive correlation between high F1 and high tag density. However, tag density is not an absolute predictor of NERC performance: while NW corpora have both high NERC performance and high tag density, high density is not necessarily an indicator of high performance.
For example, systems might not find high tag density corpora of other genres necessarily so easy.
One factor that can explain the difference in genre performance between e.g.
newswire and social media is entity drift -the change in observed entity terms over time. In this case, it is evident from the differing surface forms and contexts for a given entity class. For example, the concept of "location" that NER systems try to learn might be frequently represented in English newswire from 1991 with terms like Iraq or Kuwait, but more with Atlanta, Bosnia and Kabul in the same language and genre from 1996. Informally, drift on Twitter is often characterised as both high-frequency and high-magnitude; that is, the changes are both rapid and correspond to a large amount of surface form occurrences (e.g. [24; 17]).
We examined the impact of drift in newswire and Twitter corpora, taking datasets based in different timeframes. The goal is to gauge how much diversity is due to new entities appearing over time. To do this, we used just the surface lexicalisations of entities as the entity representation. The overlap of surface forms was measured across different corpora of the same genre and language.
We used an additional corpus based on recent data, from the W-NUT 2015 shared task. Overlap is measured in terms of occurrences, rather than distinct surface forms, so that the magnitude of the drift is shown instead of having the results skewed by the noisy long tail. Results are given in Table 10 for newswire and Table 11 for Twitter corpora.
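A minimal sketch of this overlap measure, under the assumption that each corpus is reduced to a list of entity mention surface strings, is:

def entity_occurrence_overlap(reference_entities, target_entities):
    """Proportion of entity mentions (occurrences, not distinct forms) in the
    target corpus whose surface form also appears in the reference corpus.
    Both arguments are lists of surface strings, one per mention."""
    reference_forms = set(reference_entities)
    if not target_entities:
        return 0.0
    covered = sum(1 for surface in target_entities if surface in reference_forms)
    return covered / len(target_entities)

# Hypothetical usage: how much of a later newswire corpus is covered by earlier forms.
# print(entity_occurrence_overlap(entities_1991, entities_1996))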
It is evident that the within-class commonalities in surface forms are much higher in newswire than in Twitter. That is to say, observations of entity texts in one newswire corpus are more helpful in labelling other newswire corpora than if the same technique is used to label other Twitter corpora.
This indicates that drift is lower in newswire than in tweets. Certainly, the proportion of entity mentions in the most recent corpora (the rightmost columns) is consistently low compared to entity forms available in earlier data. These results reflect the raised OOV and drift rates found in previous work [24; 19]. Another explanation is that there is higher noise in variation, and that the drift is not longitudinal, but rather general. This is partially addressed by RQ3, which we will address next, in Section 3.3.
To summarise, our findings are: • Overall, F1 scores vary widely across corpora.
• Trends can be marked in some genres. On average, newswire corpora and OntoNotes MZ are the easiest corpora and ACE UN, WEB and TWI are the hardest corpora for NER methods to reach good performance on.
• When corpora are normalised by size, noisier data such as the TWI and WEB corpora achieve results similar to the NW corpora.
• Increasing the amount of available in-domain training data will likely result in improved NERC performance.
• There is a strong positive correlation between high F1 and high tag density, a weak positive correlation for NE/unique NE ratios, and no clear correlation between token/type ratios and F1.
• Temporal NE drift is lower in newswire than in tweets.

The next section takes a closer look at the impact of seen and unseen NEs on NER performance.

RQ3: Impact of NE Diversity
Unseen NEs are those with surface forms present only in the test, but not training data, whereas seen NEs are those also encountered in the training data.
As discussed previously, the ratio between these two measures is an indicator of corpus NE diversity. Smaller corpora such as MUC and UMBC tend to contain a larger proportion of unseen NEs than bigger corpora such as ACE NW. In addition, similar to the token/type ratios listed in Table 6, we observe that the TWI and WEB corpora have a higher proportion of unseen entities.
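The seen/unseen split can be sketched as follows, assuming each corpus is represented simply as a list of NE surface forms, one per mention:

def split_seen_unseen(train_entities, test_entities):
    """Partition test NE mentions into seen (surface form occurs in training)
    and unseen (surface form appears only in the test data)."""
    train_forms = set(train_entities)
    seen = [e for e in test_entities if e in train_forms]
    unseen = [e for e in test_entities if e not in train_forms]
    return seen, unseen

# Hypothetical usage: proportion of unseen NEs as a diversity indicator.
seen, unseen = split_seen_unseen(["Iraq", "Kuwait"], ["Iraq", "Kabul", "Atlanta"])
print(len(unseen) / (len(seen) + len(unseen)))  # 2/3 of test mentions are unseen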
To summarise, our findings are: • F1 on unseen NEs is significantly lower than F1 on seen NEs for all three NERC approaches, which is mostly due to recall on unseen NEs being lower than that on seen NEs.
• Performance on seen NEs is significantly and consistently higher than that of unseen NEs in different corpora, with the lower scores mostly attributable to lower recall.
• However, there are still significant differences in labelling seen NEs across corpora, which means that whether NEs are seen or unseen does not account for all of the difference in F1 between corpora of different genres.

RQ4: Unseen Features, Unseen NEs and Performance

Unseen features, i.e. features which occur in the test data but not in the training data, are not the same as unseen NEs: an unseen NE can still give rise to features that were observed during training. Consider, for example, an unseen NE whose surface form is a seven-letter capitalised word ending in "e". This entity could reasonably have "shape:Xxxxxxx" and "last-letter:e" as part of its feature representation. If the training data contains the entities "Kenneth" and "Simone", each of these will have generated one of these two features respectively.
Thus, these example features will not be unseen features in this case, despite coming from an unseen entity. Conversely, continuing this example, if the training data contains no feature "first-letter:X", which applies to the unseen entity in question, then this will be an unseen feature.
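Continuing this example, a minimal sketch of how the proportion of unseen features can be computed is given below; the feature strings and the assumed training feature set are purely illustrative.

def unseen_feature_proportion(train_features, test_instance_features):
    """Proportion of an instance's features that never occurred in training.
    train_features: set of all feature strings observed in the training data.
    test_instance_features: list of feature strings for one test NE."""
    if not test_instance_features:
        return 0.0
    unseen = [f for f in test_instance_features if f not in train_features]
    return len(unseen) / len(test_instance_features)

# "Kenneth" and "Simone" were seen in training, so "shape:Xxxxxxx" and
# "last-letter:e" are seen features, while "first-letter:X" (hypothetically
# absent from training) is unseen.
train_feats = {"shape:Xxxxxxx", "last-letter:e"}
test_feats = ["shape:Xxxxxxx", "last-letter:e", "first-letter:X"]
print(unseen_feature_proportion(train_feats, test_feats))  # 0.333...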
We therefore measure the proportion of unseen features for the seen and unseen NE portions of different corpora. An analysis of this with Stanford NER is shown in Figure 2. Each data point represents a corpus; the blue squares are data points for seen NEs and the red circles are data points for unseen NEs. For all approaches, the proportion of observed features for seen NEs is bigger than the proportion of observed features for unseen NEs, as it should be.
However, within the seen and unseen testing instances, there is no clear trend indicating whether having more observed features overall increases F1 performance. One trend that is observable is that the smaller the token/type ratio is (Table 6), the bigger the variance between the smallest and biggest n for each corpus, or, in other words, the smaller the token/type ratio is, the more diverse the features.
To summarise, our findings are: • Unseen NEs have more unseen features than seen NEs, and systems score a lower F1 on them.
• Outliers are due to low/high ambiguity of seen NEs.
• The proportion of observed features for seen NEs is bigger than the proportion of observed features for unseen NEs • Within the seen and unseen testing instances, there is no clear trend indicating whether having more observed features overall increases F1 performance.
• The smaller the token/type ratio is, the more diverse the features.

Overall, these results show the difficulty systems have in picking up novel entities in novel settings.

RQ5: Out-Of-Domain NER Performance and Memorisation
Our final research question concerns how well NERC methods perform out-of-domain, and what impact unseen NEs have on out-of-domain performance. One relevant factor is differences in annotation guidelines between datasets. If the CoNLL annotation guidelines were more inclusive than the OntoNotes ones, then even a memorisation evaluation over the same dataset would yield such a result. This is, in fact, the case: OntoNotes divides entities into more classes, not all of which can be readily mapped to PER/LOC/ORG.
For example, OntoNotes includes PRODUCT, EVENT, and WORK OF ART classes, which are not represented in the CoNLL data. It also includes the NORP class, which blends nationalities, religious and political groups. This has some overlap with ORG, but also includes terms such as "muslims" and "Danes", which are too broad for the ACE-related definition of ORGANIZATION. Full details can be found in the OntoNotes 5.0 release notes 8 and the (brief) CoNLL 2003 annotation categories. 9 Notice how the CoNLL guidelines are much more terse, being generally non-prose, but also manage to cram in fairly comprehensive lists of sub-kinds of entities in each case. This is likely to make the CoNLL classes include a diverse range of entities, with the many suggestions acting as generative material for the annotator, and therefore providing a broader range of annotations from which to generalise, i.e. making them slightly easier to tag.
The lowest F1 of 0 is "achieved" on ACE BN. An examination of that corpus reveals the NEs contained in that corpus are all lower case, whereas those in OntoNotes NW have initial capital letters.
Results on unseen NEs for the out-of-domain setting echo observations previously made in the context of named entity recognition [13]. In each of the exceptions except MSM, the difference is relatively small. We note that the MSM test corpus is one of the smallest datasets used in the evaluation, and is based on a noisier genre than most others, so we regard this discrepancy as an outlier. The generalisations that can be drawn from newswire by modern NER systems are not sufficient to give any useful performance in this natural, unconstrained kind of text.
In fact, it is interesting to see that the memorisation baseline is so effective with many genres, including broadcast news, weblog and newswire. This indicates that there is low variation in the topics discussed by these sources: only a few named entities are mentioned by each. When named entities are seen as micro-topics, each indicating a grounded and small topic of interest, this reflects the nature of news having low topic variation, focusing on a few specific issues, e.g. locations referred to tend to be big, persons tend to be politically or financially significant, and organisations rich or governmental [8]. In contrast, social media users also discuss local locations like restaurants, organisations such as music bands and sports clubs, and are content to discuss people that are not necessarily mentioned in Wikipedia. The low overlap and memorisation scores on tweets, when taking entity lexica based on newswire, are therefore symptomatic of the lack of variation in newswire text, which has a limited authorship demographic [20] and often has to comply with editorial guidelines.
The other genre that was particularly difficult for the systems was ACE Usenet. This is a form of user-generated content, not intended for publication but rather discussion among communities. In this sense, it is social media, and so it is not surprising that system performance on ACE UN resembles performance on social media more than other genres.
Crucially, the computationally-cheap memorisation method actually acts as a reasonable predictor of the performance of other methods. This suggests that high entity diversity predicts difficulty for current NER systems. As we know that social media tends to have high entity diversity -certainly higher than other genres examined -this offers an explanation for why NER systems perform so poorly when taken outside the relatively conservative newswire domain.
Indeed, if memorisation offers a consistent prediction of performance, then it is reasonable to say that memorisation and memorisation-like behaviour accounts for a large proportion of NER system performance. To summarise, our findings are: • The time period in which an out-of-domain corpus was collected plays an important role in NER performance.
• The context or other information represented by features shifts significantly more in Twitter than across different samples of newswire.
• The generalisations that can be drawn from newswire by modern NER systems are not sufficient to give any useful performance in this varied kind of text.
• Memorisation consistently predicts NER performance, both inside and outside genres or domains.

Conclusion
This paper investigated the ability of modern NER systems to generalise effectively over a variety of genres. Firstly, by analysing different corpora, we demonstrated that datasets differ widely in many regards: in terms of size; balance of entity classes; proportion of NEs; and how often NEs and tokens are repeated. The most balanced corpus in terms of NE classes is the CoNLL corpus, which, incidentally, is also the most widely used NERC corpus, both for method tuning of off-the-shelf NERC systems (e.g. Stanford NER, SENNA), as well as for comparative evaluation. Corpora traditionally viewed as noisy, i.e.
the Twitter and Web corpora, were found to have a low repetition of NEs and tokens. More surprisingly, however, so does the CoNLL corpus, which indicates that it is well balanced in terms of stories. Newswire corpora have a large proportion of NEs as percentage of all tokens, which indicates high information density. Web, Twitter and telephone conversation corpora, on the other hand, have low information density.
Our second set of findings relates to the NERC approaches studied. Overall, SENNA achieves consistently the highest performance across most corpora, and thus has the best approach to generalising from training to testing data. This can mostly be attributed to SENNA's use of word embeddings, trained with deep convolutional neural nets. The default parameters of SENNA achieve a balanced precision and recall, while for Stanford NER and CRFSuite, precision is almost twice as high as recall.
Our experiments also confirmed the correlation between NERC performance and training corpus size, although size alone is not an absolute predictor. In particular, the biggest NE-annotated corpus amongst those studied is OntoNotes NW -almost twice the size of CoNLL in terms of number of NEs. Nevertheless, the average F1 for CoNLL is the highest of all corpora and, in particular, SENNA has 11 points higher F1 on CoNLL than on OntoNotes NW.
Studying NERC on size-normalised corpora, it becomes clear that there is also a big difference in performance on corpora from the same genre. When normalising training data by size, diverse corpora, such as Web and social media, still yield lower F1 than newswire corpora. This indicates that annotating more training examples for diverse genres would likely lead to a dramatic increase in F1.
What is found to be a good predictor of F1 is a memorisation baseline, which picks the most frequent NE label for each token sequence in the test corpus as observed in the training corpus. This supported our hypothesis that entity diversity plays an important role, being negatively correlated with F1. Studying proportions of unseen entity surface forms, experiments showed corpora with a large proportion of unseen NEs tend to yield lower F1, due to much lower performance on unseen than seen NEs (about 17 points lower averaged over all NERC methods and corpora). This finally explains why the performance is highest for the benchmark CoNLL newswire corpus -it contains the lowest proportion of unseen NEs. It also explains the difference in performance between NERC on other corpora. Out of all the possible indicators for high NER F1 studied, this is found to be the most reliable one. This directly supports our hypothesis that generalising for unseen named entities is both difficult and important.
Also studied is the proportion of unseen features per unseen and seen NE portions of different corpora. However, this is found to not be very helpful. The proportion of seen features is higher for seen NEs, as it should be. However, within the seen and unseen NE splits, there is no clear trend indicating if having more seen features helps.
We also showed that hand-annotating more training examples is a straightforward and reliable way of improving NERC performance. However, this is costly, which is why it can be useful to study whether using different, larger corpora for training might be helpful. Indeed, substituting in-domain training corpora with other training corpora of the same genre created at the same time improves performance, and studying how such corpora can be combined with transfer learning or domain adaptation strategies might improve performance even further. However, for most corpora, there is a significant drop in performance for out-of-domain training. What is again found to be reliable is to check the memorisation baseline: if results for the out-of-domain memorisation baseline are higher than for in-domain memorisation, then using the out-of-domain corpus for training is likely to be helpful.
Across a broad range of corpora and genres, characterised in different ways, we have examined how named entities are embedded and presented. While there is great variation in the range and class of entities found, it is consistent that the more varied texts are harder to do named entity recognition in. This connection with variation occurs to such an extent that, in fact, performance when memorising lexical forms stably predicts system accuracy. The result of this is that systems are not sufficiently effective at generalising beyond the entity surface forms and contexts found in training data. To close this gap, advance NER systems, and cope with the modern reality of streamed NER (as opposed to the prior generation of batch-learning based systems evaluated on static benchmark sets), future work needs to address named entity generalisation and out-of-vocabulary lexical forms.