Temporal expression extraction with extensive feature type selection and a posteriori label adjustment

The automatic extraction of temporal information from written texts is pivotal for many Natural Language Processing applications such as question answering, text summarisation and information retrieval. It allows information to be filtered and temporal flows of events to be inferred. This paper presents ManTIME, a general domain temporal expression identification and normalisation system, and systematically explores the impact of different features and training corpora on the performance. The identification phase combines conditional random fields with a post-processing pipeline, whereas the normalisation phase is carried out using NorMA, an open-source rule-based temporal normaliser. We investigate the performance variation with respect to different feature types. Specifically, we show that the use of WordNet-based features in the identification task negatively affects the overall performance, and that there is no statistically significant difference in the results based on gazetteers, shallow parsing and prepositional noun phrase labels on top of the morpho-lexical features. We also show that the use of silver data (alone or in addition to the human-annotated data) does not improve the performance. We evaluate six combinations of training data and post-processing pipeline with respect to the TempEval-3 benchmark test set. The best run achieved 0.95 (precision), 0.85 (recall) and 0.90 (Fβ=1) in the identification phase. Normalisation accuracies are 0.86 (for type attribute) and 0.77 (for value attribute). The proposed approach ranked 3rd in the TempEval-3 challenge (task A) as the best performing machine learning-based system among 21 participants.


Introduction
A temporal expression, also called timex, refers to any natural language phrase denoting a temporal entity such as an interval or a time point [1]. For example, in a sentence like "The Prime Minister said yesterday that the reform promoted three months ago has been very successful.", the phrases "yesterday" and "three months ago" are temporal expressions.
Timexes elicit a natural binding between the language and the time domain, making it possible to represent such language expressions as a time point, interval or set.
Temporal expressions can be of three different types [2]: fully-qualified, deictic and anaphoric. A timex is fully-qualified when it unambiguously refers to a precise interval or point in the time domain. For example, the following expressions fall in this category: "21st July 1985", "31/04/2011 at 12 o'clock" or "Martin Luther King's day 2013". In the case of deictic expressions, inferring the binding with the time domain requires taking into account the time of utterance (when the document was written or when the speech was given, often referred to as document creation time (DCT)). Typical deictic temporal expressions include "today", "yesterday", "last Sunday" and "two months ago". Finally, anaphoric expressions are a particular case of deictic expressions for which the reference time varies according to the temporal expressions previously mentioned in the text. Examples of this category are "that year", "the same week" or "the previous month".
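To make the role of the document creation time concrete, the following is a minimal sketch of how a few deictic expressions could be anchored to a DCT. The function name `resolve_deictic` and the tiny lexicon are hypothetical and chosen only for illustration; they are not part of ManTIME or NorMA.

```python
from datetime import date, timedelta

# Hypothetical hand-written lexicon: day offsets relative to the DCT.
DAY_OFFSETS = {"today": 0, "yesterday": -1, "tomorrow": 1}


def resolve_deictic(expression, dct):
    """Resolve a handful of deictic expressions against the DCT.

    Returns an ISO-8601 date string, or None when the expression is not
    covered by this sketch (e.g. anaphoric expressions such as "that year",
    which need the previously mentioned timexes rather than the DCT).
    """
    text = expression.lower().strip()
    if text in DAY_OFFSETS:
        return (dct + timedelta(days=DAY_OFFSETS[text])).isoformat()
    if text == "last sunday":
        # weekday(): Monday=0 ... Sunday=6. Step back to the most recent
        # Sunday strictly before the DCT (a full week if the DCT is a Sunday).
        days_back = (dct.weekday() - 6) % 7 or 7
        return (dct - timedelta(days=days_back)).isoformat()
    return None


dct = date(2013, 3, 22)
print(resolve_deictic("yesterday", dct))
print(resolve_deictic("last Sunday", dct))
```

The same surface string yields a different value for every DCT, which is why the DCT has to be supplied alongside the expression at normalisation time.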
Research in temporal expression extraction aims at investigating novel and effective approaches to the extraction of temporal information from texts. Several scientific challenges [3][4][5] have been organised over the years, providing human-annotated data as a gold standard to evaluate the performance of state-of-the-art systems.
Early attempts at automatically annotating temporal expressions in texts started in the late 1990s [6], and attracted increasing interest with the proposal of a temporal annotation scheme [7], mainly aiming at enhancing the performance of question answering systems. Following the work of Ahn et al. [2], the temporal expression extraction task is now conventionally divided into two main steps: identification and normalisation. In the former step, the effort is concentrated on how to detect the right boundary of temporal expressions in the text. In the normalisation step, the aim is to interpret and represent the temporal meaning of each pre-identified expression, often using the TimeML format [8]. TimeML provides a specification for representing temporal expressions, events and temporal relations (see an example in Fig. 1). The normalisation task is usually focussed on predicting the two main attributes of a temporal expression: its type (e.g. SET, DURATION, DATE or TIME) and its full value according to the ISO-8601 format [9].
In this article we introduce ManTIME, a temporal expression extraction system in which the identification uses machine learning on an extensive set of features and an a posteriori label adjustment pipeline, which further improves the performance. The normalisation phase is carried out by using a set of rules. We evaluated ManTIME on the latest TempEval-3 official benchmark data, achieving 0.95 precision, 0.85 recall and 0.90 Fβ=1 in the identification phase, with normalisation accuracies of 0.86 (for type attribute) and 0.77 (for value attribute).
ManTIME uses 93 features of 4 types, which have been engineered following a systematic review of the scientific literature in temporal information extraction. We explore which categories of features provide the best performance.
We also investigate the role that silver training data have on the performance. Such resources are large automatically generated datasets, which have been created by merging the annotations provided by three state-of-the-art temporal extraction systems [10]. We consider different training scenarios: silver data alone or in combination with gold data, with or without the a posteriori label adjustment pipeline.

Related work
The identification step in temporal expression extraction is usually tackled by using machine learning-based approaches. A variety of features have been used, such as morphological and dictionary-based ones. Ahn et al. [2] used morphological features with support vector machines (SVM) [11] and conditional random fields (CRFs) [12], showing a notable improvement in performance [13]. Llorens et al. [14,15] subsequently added semantic features using a similar architecture. Poveda et al. [16] introduced a sophisticated semi-supervised approach which particularly helped to improve the recall, while Mani et al. [7] used rules learned by a decision tree classifier. Ling and Weld [17] tried Markov Logic Networks in order to extract temporal relations. Recently, the results from the last temporal information extraction challenge, TempEval-3 [5], show that the identification performance ranges from 0.81 to 0.90 in terms of lenient Fβ=1 measure (from 0.70 to 0.83 for strict matching).
Fig. 1. TimeML annotation of the sentence "The Prime Minister said yesterday that the reform promoted three months ago has been very successful.". The annotation contains: (I) two temporal expressions ("yesterday" and "three months ago"), (II) two events ("said" and "promoted"), and (III) three temporal relations ("said" → during "yesterday", "promoted" → during "three months ago" and "said" → after "promoted").
The second step in temporal expression extraction is the normalisation, which is typically accomplished using rule-based approaches. Grover et al. [18], for example, used regular expression-based rules on top of a pre-existing identification system. UzZaman and Allen [19] developed TRIOS, an open-source rule-based normaliser, focussing on type and value attribute prediction. Llorens et al. [20] extended this architecture by making it community-driven: Internet users are allowed to propose new rules to be integrated in a central rule repository. Angeli et al. [21] proposed a method that learns to interpret temporal expressions through the use of a compositional grammar. To the best of our knowledge, their system is the only piece of research that diverges from rule-based approaches, although its performance is noticeably lower. Recent TempEval-3 normalisation accuracies ranged from 0.68 to 0.86 (for value) and 0.86 to 0.94 (for type attribute) [5].
There are also monolithic temporal expression extraction systems, in which there is no separation between identification and normalisation. Saquete et al. [22], for example, produced a seminal work proposing a multi-lingual dictionary-based architecture for event ordering, which was subsequently extended into a non-monolithic system [23]. More recently, NavyTime [24] and HeidelTime [25] proposed a set of hand-crafted rules combined with an ad-hoc rule selection algorithm, whereas SUTime [26] used a deterministic rule-based system built on top of the Stanford Core NLP pipeline.
Recently, temporal information extraction has attracted increasing interest in the medical domain [27][28][29][30], where temporal information can be used to automatically extract patient clinical histories or temporal cause-effect relations with respect to particular treatments. In the medical domain, the normalisation phase proved to be harder than in the general domain. More specifically, the results from i2b2 2012 [28] show that the identification accuracies range from 0.84 to 0.90, whereas normalisation accuracies range from 0.54 to 0.73 (for value) and 0.72 to 0.89 (for type attribute).
While a number of architectures, features and datasets are used for temporal expression extraction, we are not aware of any systematic studies on the types of features that are beneficial for temporal expression extraction, or on the effect of different types of training data.
Fig. 2. ManTIME architecture. Documents are pre-processed using TreeTagger [32], which provides tokens, lemmas and POS-tags. The remaining features are extracted in order to build the token-feature matrix. The machine learning-based labeller predicts a label (B, I or O) for each token and the identification post-processing pipeline is applied. The annotations are finally exported in the TimeML format and for each annotated expression the normalisation component (NorMA) is run.

System architecture
The approach proposed in this paper adopts the dichotomy between identification and normalisation [2], and therefore it consists of two components. The general system architecture is depicted in Fig. 2. Each step of the architecture will be illustrated in detail in the next sections. For training and testing we mainly used the TempEval-3 datasets as explained in Section 4.1.

Temporal expression identification
The identification phase concerns the detection of temporal expressions in the text and the effort is concentrated on predicting their correct boundary or span.
We tackled the identification problem as a sequence labelling task, leading to the choice of CRFs. We trained the system using both human-annotated data and silver data (see Section 4.1) in order to investigate the potential contribution of different types of annotated data.
Although the silver data has the advantage of being far larger than the human-annotated data (666 K words vs. 95 K, see Table 6 in Section 4.1), our hypothesis is that manually-annotated corpora are more accurate (i.e. less noisy), and for this reason are still important in the training phase. Because of this trade-off, we developed a post-processing pipeline on top of the CRF sequence labeller to boost the identification performance, similarly to the approach proposed by Adafre and de Rijke [31].
Below we describe the CRF-based labeller, the model selection and the post-processing pipeline components in detail.

Feature engineering
Temporal expression identification can be seen as a Named Entity Recognition (NER) problem. From this perspective, it is naturally approached as a sequence labelling task, for which we decided to use linear chain conditional random fields (LC-CRFs).
LC-CRFs are a machine learning technique that defines a conditional probability distribution taking the following form:

$$P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{k=1}^{K} \lambda_k f_k(y, x) \right)$$

where Z(x) is the normalisation factor, K is the number of features, x represents the observation sequence, y represents the label, and f_k and λ_k represent the feature function and its weight respectively. We used the BIO format (each token is labelled as being at the (B)eginning, (I)nside or (O)utside of a temporal expression entity) in all the experiments presented here. The topology of the factor graph is defined over a window of tokens, where w_0 represents the current token, w_{+k} the following and w_{−k} the previous tokens.
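As an illustration of the BIO format, the sketch below labels the example sentence from the Introduction. The helper `to_bio` and the token-index span representation are assumptions made for this example only.

```python
def to_bio(tokens, spans):
    """Label tokens with B/I/O given (start, end) token-index spans.

    `end` is exclusive. Spans are assumed not to overlap.
    """
    labels = ["O"] * len(tokens)
    for start, end in spans:
        labels[start] = "B"              # first token of the expression
        for i in range(start + 1, end):  # remaining tokens of the expression
            labels[i] = "I"
    return labels


sentence = ("The Prime Minister said yesterday that the reform promoted "
            "three months ago has been very successful .")
tokens = sentence.split()
# "yesterday" is token 4; "three months ago" covers tokens 9 to 11.
labels = to_bio(tokens, [(4, 5), (9, 12)])
for token, label in zip(tokens, labels):
    print(f"{token}/{label}", end=" ")
```

A single-token expression receives only a B label, while a multi-token expression is a B followed by one or more I labels.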
In addition to the labelling (or tagging) scheme (BI, BIO, BIOE or BIOEU) and the topology of the factor graph, the effectiveness of using CRFs mainly depends on the quality of the features.
ManTIME relies on 93 features, which have been collected as a result of a systematic review of the literature in temporal information extraction that we conducted with the aim of exploring feature contributions. These features belong to the following four disjoint categories.
3.1.1.3. Gazetteers. The matching of sub-expressions with gazetteer entries is also represented in the BIO format because gazetteers include multi-token entries. We used the following gazetteers: male and female names along with world festivity names. We also used U.S. cities, nationalities and country names from the NLTK corpora. A total of seven gazetteer-based features have been engineered.
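The BIO representation of gazetteer matches can be sketched as follows. This is a minimal illustration assuming a greedy longest-match strategy and case-insensitive comparison; the function `gazetteer_bio` and the example entries are hypothetical, and the matching strategy actually used by ManTIME may differ.

```python
def gazetteer_bio(tokens, entries):
    """Return one B/I/O feature value per token for a single gazetteer.

    `entries` are whitespace-tokenised strings. At each position the longest
    matching entry wins (an assumption of this sketch).
    """
    lowered = [t.lower() for t in tokens]
    entry_set = {tuple(e.lower().split()) for e in entries}
    max_len = max((len(e) for e in entry_set), default=0)
    feature = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate window first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in entry_set:
                matched = n
                break
        if matched:
            feature[i] = "B"
            for j in range(i + 1, i + matched):
                feature[j] = "I"
            i += matched
        else:
            i += 1
    return feature


festivities = ["Saint Patrick 's Day", "Christmas"]  # illustrative entries
tokens = ["We", "met", "on", "Saint", "Patrick", "'s", "Day", "."]
print(gazetteer_bio(tokens, festivities))
```

Each gazetteer would contribute one such column to the token-feature matrix.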

WordNet.
For each token we use the number of senses associated with the word, the first two most common senses, the first four lemmas, the first four entailments for verbs, antonyms, hypernyms and hyponyms. Each of them is defined as a separate feature. A total of 23 WordNet-based features have been engineered. We note that this group of features constitutes an extension of those previously used in the field [33,34]. In particular, we note that temporal signals (which typically indicate the presence of temporal expressions nearby in text, e.g. 'She slept for just [4 hours]timex.') are known in linguistics to be characterised by having antonyms, whereas the rest of the words in temporal expressions typically do not [35]. We hypothesised that integrating this piece of information would help the machine learning model to highlight temporal expressions.
All the features used in the experiments are presented in Tables 1 and 2 with details.
All the experiments have been carried out using CRF++ with parameters C = 1, η = 0.0001 and ℓ2 regularisation.

Model selection
The 93 features mentioned above have been combined into four different models, each built from a different combination of feature types. We performed an extensive evaluation by repeating the experiments a number of times and assessing whether there is any statistical difference among the models. This allowed us to select the model that provides the highest Fβ=1 score among the four proposed.
All the data provided by TempEval-3 (see Table 6), except for the TempEval-3 official benchmark test set, have been merged, shuffled at sentence level (seed = 490) and split into two sets: 80% as a training set and 20% as a test set. The training set has been shuffled 5 times and, for each of these, 10-fold cross validation has been applied.
Table 3 shows the post-hoc ANOVA analysis and Fig. 3 shows the box-plot comparison of the models (Fβ=1 measure). The analysis is statistically significant (p = 0.0054 with ANOVA test) and provides two important outcomes: 1. There is no statistically significant difference among the first three models (see Table 3), despite the presence of apparently important and computationally expensive information such as chunks, prepositional noun phrases and gazetteers. 2. The set of WordNet-based features negatively affects the overall classification performance, as already noticed in the literature [38]. This is mainly due to the sparseness of these features: many tokens do not have any associated WordNet sense.
By virtue of this analysis, we opted for the smallest feature set, Model 1, which has two positive consequences: it helps mitigate overfitting thanks to the smaller feature space, and it reduces the computational cost of the system.
In order to get an educated estimate of the Precision/Recall performance of the selected model in the wild, we then trained it on the entire training set and tested it against the test set. The results for all the models are shown in Table 4. Model 1 showed a slightly better Fβ=1 score, which corroborated our choice.
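The evaluation protocol described above (sentence-level shuffle, 80/20 split, then 5 reshuffles of the training portion with 10-fold cross validation each) can be sketched as follows. The function name and the use of Python's `random` module are assumptions; only the seed value, the split ratio and the 5×10 scheme come from the text.

```python
import random


def split_and_folds(sentences, seed=490, train_ratio=0.8, repeats=5, k=10):
    """Return (train, test, runs) where runs holds repeats*k (fit, held_out) pairs."""
    rng = random.Random(seed)
    data = list(sentences)
    rng.shuffle(data)                       # sentence-level shuffle
    cut = int(len(data) * train_ratio)
    train, test = data[:cut], data[cut:]    # 80% training, 20% test

    runs = []
    for _ in range(repeats):                # reshuffle the training set
        shuffled = train[:]
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        for held_out in range(k):           # k-fold cross validation
            fit = [s for j, fold in enumerate(folds) if j != held_out
                   for s in fold]
            runs.append((fit, folds[held_out]))
    return train, test, runs


train, test, runs = split_and_folds(range(100))
print(len(train), len(test), len(runs))
```

Each of the 50 runs yields one score per model, which provides the samples needed for a statistical comparison such as ANOVA.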
The models used for the final evaluation of the TempEval-3 benchmark data have been trained using all the data, except for the ones in the benchmark data set.

A posteriori label adjustment pipeline
Although the CRF-based labeller already provided reasonable performance on the training data, equally balanced in terms of precision and recall, we focussed on boosting the baseline performance through a post-processing pipeline composed of three modules, which aimed to adjust the CRF-predicted labels.
3.1.3.1. Probabilistic correction module. We noticed that the CRF-based labeller tends to assign labels with high confidence even for ambiguous tokens. We therefore aimed to design a module that would make predictions less strict and in some cases have the effect of changing the most likely label (mainly expected to bring an improvement in terms of recall).
For each token, we thus average the conditional probabilities from the trained CRF model with the prior probabilities extracted from the gold data only (see Section 4.1 for details about data).
For each token w in the gold data, we extracted the conditional probability P(L|w), where L ∈ {'B', 'I', 'O'}. The probabilities have been estimated using frequencies. The list of tokens taken into account has been restricted to those appearing within temporal expressions at least twice. This process allowed us to obtain the prior label probabilities. For example, P(B|Monday) = 0.97, P(I|Monday) = 0.03 and P(O|the) = 0.95.
From the CRF-based labeller we extracted, for each token, the internal conditional probability of each label. The two probabilities (from the gold data and the CRF) were then averaged for every label of each token.
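A minimal sketch of the frequency-based estimation of these priors is given below. The function name `estimate_priors` and the input format (a flat list of token/label pairs) are assumptions made for illustration; the toy counts are not taken from the gold data.

```python
from collections import Counter, defaultdict


def estimate_priors(tagged_tokens, min_timex_count=2):
    """Estimate P(label | token) by relative frequency.

    Only tokens observed inside temporal expressions (labelled B or I)
    at least `min_timex_count` times are retained.
    """
    counts = defaultdict(Counter)
    for token, label in tagged_tokens:
        counts[token][label] += 1

    priors = {}
    for token, label_counts in counts.items():
        inside = label_counts["B"] + label_counts["I"]
        if inside >= min_timex_count:
            total = sum(label_counts.values())
            priors[token] = {lab: label_counts[lab] / total for lab in "BIO"}
    return priors


# Toy data, for illustration only.
toy = [("Monday", "B")] * 3 + [("Monday", "O")] + [("the", "O")] * 5
priors = estimate_priors(toy)
print(priors)
```

Tokens that never (or only once) occur inside a temporal expression have no prior, so the later modules leave their CRF predictions untouched.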
An example is given in Table 5.
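The averaging step can be sketched as follows. The prior for "Monday" reuses the values quoted above; the CRF distribution is invented for illustration, and leaving tokens without a prior unchanged is an assumption of this sketch.

```python
def probabilistic_correction(crf_probs, prior_probs=None):
    """Average CRF and prior label probabilities for one token.

    Returns (most likely label, averaged distribution). When no prior is
    available the CRF distribution is returned unchanged (assumption).
    """
    if prior_probs is None:
        averaged = dict(crf_probs)
    else:
        averaged = {lab: (crf_probs[lab] + prior_probs[lab]) / 2
                    for lab in crf_probs}
    best = max(averaged, key=averaged.get)
    return best, averaged


crf = {"B": 0.40, "I": 0.05, "O": 0.55}     # hypothetical CRF output
prior = {"B": 0.97, "I": 0.03, "O": 0.00}   # P(L | Monday) from the text
label, averaged = probabilistic_correction(crf, prior)
print(label, averaged)
```

In this example the CRF alone would predict O, whereas the averaged distribution favours B, which is the recall-oriented effect the module is designed to have.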
3.1.3.2. Threshold-based label switcher. Some tokens have a high a priori probability of being part of a temporal expression (e.g., "Monday" or "today"). However, some of these tokens might have been erroneously labelled as 'O' by the CRF labeller. This module changes the predicted label to the most likely one based on the a priori probabilities from the gold data only. This is triggered only when the prior probability of a certain label in the gold data is greater than a given threshold. Therefore, the application of this module enforces the prior probabilities extracted from the human-annotated data. Through repeated empirical experiments on a small sub-set of the training data, we found an optimal threshold value (0.87).
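A minimal sketch of the label switcher is shown below, using the threshold reported above. The function name and the dictionary-based prior format are assumptions; the prior for "May" is invented to show a token that stays below the threshold.

```python
def threshold_switch(tokens, labels, priors, threshold=0.87):
    """Overwrite a predicted label when the gold prior is confident enough."""
    switched = []
    for token, label in zip(tokens, labels):
        dist = priors.get(token)
        if dist:
            best = max(dist, key=dist.get)
            if dist[best] > threshold:      # strict inequality, as in the text
                label = best
        switched.append(label)
    return switched


priors = {
    "Monday": {"B": 0.97, "I": 0.03, "O": 0.00},  # values quoted in the text
    "May": {"B": 0.50, "I": 0.10, "O": 0.40},     # hypothetical ambiguous token
}
print(threshold_switch(["on", "Monday", "May"], ["O", "O", "O"], priors))
```

Ambiguous tokens such as "May" keep the CRF label, since none of their prior probabilities exceeds the threshold.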

Table 1
List of features used in the experiments (first part). Type column indicates whether a feature belongs to the (M)orpho-lexical, (S)yntactic, (G)azetteer or (W)ordNet category. Regular expression-based features, denoted with an *, are presented with a list of matching expressions whereas for the rest of them the notation (tokens → values) has been used. Features #15, #16 and #18 are computed using the Python 2.x built-in operators. Feature #23 uses the Lancaster Stemming Algorithm [36] whereas feature #24 uses the Porter Stemming Algorithm [37].
M* Ordinal number in digits "15th", "100th", "1st", …
47 M* Ordinal trigger "st", "rd", "th", "nd"

3.1.3.3. BIO fixer. Although CRFs are designed to handle sequences, they assign labels token-by-token. This leads to possibly inconsistent sequences of labels. For the BIO labelling scheme, the only possible source of inconsistency is the sequence O-I, as there should be a 'B' in between them. We found that, among the possible corrections (B-I or I-B), B-I applies to most cases (i.e. the first token has been most often incorrectly annotated). For example, "Three/O days/I ago/I ./O" should be converted into "Three/B days/I ago/I ./O". We also merged adjacent expressions such as B-B or I-B, because different temporal expressions are always divided at least by a symbol or a punctuation character (e.g. "Wednesday/B morning/B" becomes "Wednesday/B morning/I", "21st/B November/I 1990/B" becomes "21st/B November/I 1990/I").
We performed an extensive evaluation of the possible label adjustment pipeline configurations, which has been carried out with 5×10-fold cross validation (as described in Section 3.1.2). The results are presented in Fig. 4. The first configuration corresponds to the CRFs only. All the differences among the settings are statistically significant (measured with ANOVA test). Using the pipeline always leads to an improvement in performance, with the BIO fixer component as the major contributor. The optimal pipeline configuration provides a 2.76% averaged statistically significant increment (with respect to the strict Fβ=1 scores of the CRF model) and is composed of:
1. Probabilistic correction module
2. BIO fixer
3. Threshold-based label switcher
4. BIO fixer
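The two corrections performed by the BIO fixer can be sketched as two passes over the label sequence. This is a minimal illustration rather than the ManTIME implementation; in particular, the treatment of an 'I' at the very start of a sequence is an assumption.

```python
def fix_bio(labels):
    """Repair inconsistent BIO sequences and merge adjacent expressions."""
    fixed = list(labels)

    # Pass 1: O-I is inconsistent; relabel the preceding token as B (B-I).
    for i in range(1, len(fixed)):
        if fixed[i] == "I" and fixed[i - 1] == "O":
            fixed[i - 1] = "B"
    # Assumption: a sequence-initial I has no previous token, so it becomes B.
    if fixed and fixed[0] == "I":
        fixed[0] = "B"

    # Pass 2: merge adjacent expressions (B-B and I-B become B-I and I-I).
    for i in range(1, len(fixed)):
        if fixed[i] == "B" and fixed[i - 1] in ("B", "I"):
            fixed[i] = "I"
    return fixed


print(fix_bio(["O", "I", "I", "O"]))   # Three/O days/I ago/I ./O
print(fix_bio(["B", "B"]))             # Wednesday/B morning/B
print(fix_bio(["B", "I", "B"]))        # 21st/B November/I 1990/B
```

Because the first pass extends an expression one token to the left, a spurious 'I' label upstream also pulls the preceding token into the expression.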

Normalisation
The normalisation phase aims to interpret and represent the temporal meaning of each pre-identified expression using the TimeML format [8]. Two attributes are particularly important in this respect: type and value. The first one can be either 'DATE', 'TIME', 'DURATION' or 'SET'. The second one expresses the ISO-8601 representation of each expression.
The proposed temporal expression normalisation approach is based on rules and it extends TRIOS [19]. TRIOS takes as input the temporal expression and the utterance time (document creation time), and its rules have the form of dictionary-driven regular expressions in a switch architecture: the activation of one of them excludes the activation of the remaining ones.
Our normalisation system, called NorMA (depicted in Fig. 5), is composed of three modules: pre-processing rules, extension rules and post-manipulation rules.

Pre-processing rules
This set of rules has been introduced to turn recognised temporal expressions into a more suitable form for normalisation. Some examples from this rule set are: determiners removal (e.g., "the day after" → "day after"), misspelling correction (e.g., "wendsday" → "Wednesday"), and lower-case and trimming transformation (e.g., "every Friday mornin." → "every Friday morning").

Extension rules
The extension rules are new rules that cover temporal expressions not handled by TRIOS. Such rules are matched before TRIOS' own rules. Examples of those are duration expressions (e.g. "3-year", "3-day"), frequency expressions (e.g. "every half an hour", "every two days") or period expressions (e.g. "'90s", "eighties").
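As an illustration of what a dictionary-driven regular expression rule for hyphenated durations might look like, consider the sketch below. It is not taken from NorMA: the pattern, the unit dictionary and the function name are hypothetical, and only the ISO-8601 duration notation (e.g. P3Y for three years) is standard.

```python
import re

# Hypothetical unit dictionary mapping words to ISO-8601 duration designators.
UNIT_CODES = {"year": "Y", "month": "M", "week": "W", "day": "D"}
HYPHENATED_DURATION = re.compile(r"^(\d+)-(year|month|week|day)$")


def normalise_hyphenated_duration(expression):
    """Return TimeML-style type/value attributes, or None if no match."""
    match = HYPHENATED_DURATION.match(expression.strip().lower())
    if not match:
        return None  # in a switch architecture, the next rule would be tried
    amount, unit = match.groups()
    value = "P{}{}".format(int(amount), UNIT_CODES[unit])
    return {"type": "DURATION", "value": value}


print(normalise_hyphenated_duration("3-year"))
print(normalise_hyphenated_duration("3-day"))
```

Returning None on a non-match mirrors the switch behaviour described for TRIOS, where the first rule that fires determines the normalisation.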

Post-manipulation rules
The post-manipulation rules are mainly used to validate the syntax of the predicted value attribute and to normalise frozen expressions transformed by the previous groups of rules. For example, some of the rules are used to normalise expressions of festivity dates such as "Queen's birthday" or "Saint Patrick's day".
Overall, NorMA extends TRIOS with 40 new rules: 16 pre-processing rules, 20 extension rules, and 4 post-manipulation rules. The system has already been shown to provide statistically better performance than TRIOS and consequently state-of-the-art performance against the TempEval-2 benchmark test set [39].

Experiments and results
In this section we present the experiments performed. In particular, we describe the data, the evaluation metrics and the results. The findings of the error analysis are also presented in order to investigate the system annotation errors.

Data
The human-annotated data come from two existing corpora: AQUAINT and TimeBank (see footnote 8). Both data sets have been revised by the TempEval-3 organisers in order to fix erroneous annotations. These two corpora have been used for training purposes, whereas another human-annotated corpus, the TempEval-3 benchmark, has been used as a test set.
In addition, for training we used the TempEval-3 silver corpus, which has been made by merging, through an ad-hoc algorithm [10], the output of three state-of-the-art temporal extraction systems: TIPSem, TIPSem-B [15] and TRIOS [19]. This corpus is much larger than the gold ones, although its annotations are not as reliable. Table 6 summarises the main characteristics of each corpus.
Every document has been annotated using the TimeML standard and released with its document creation time (DCT). Each annotated temporal expression carries its type and value attributes.

Evaluation metrics
The identification phase (prediction of the temporal expression boundaries) has been evaluated using Precision, Recall and Fβ=1 measure, according to the following formulae:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_{\beta=1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

where TP, FP and FN stand for the number of true positive, false positive and false negative examples respectively. Precision, Recall and Fβ=1 measures are computed according to two different definitions of matching: strict and lenient, following TempEval-3 [5]. The strict matching considers a predicted boundary correct only if it strictly matches the gold boundary, whereas the lenient matching considers a predicted boundary correct as long as it overlaps with the gold one.
The performance of the normalisation task is measured on two temporal attributes: type and value (ISO-8601 representation). What is measured here is the prediction accuracy on the correctly identified temporal expressions only, according to the following formulae:

$$\text{Accuracy}_{\text{type}} = \frac{\#\,\text{correctly identified expressions with correct type}}{\#\,\text{correctly identified expressions}}, \qquad \text{Accuracy}_{\text{value}} = \frac{\#\,\text{correctly identified expressions with correct value}}{\#\,\text{correctly identified expressions}}$$

Footnote 8: Both corpora are available at http://www.cs.york.ac.uk/semeval-2013/task1/index.php?id=data.
The type of each temporal expression can be inferred from the value attribute. Consequently, the overall score for temporal information extraction is computed using the following formula (also used at TempEval-3):

$$\text{Score} = F_{\beta=1} \times \text{Accuracy}_{\text{value}} \qquad (8)$$

where $F_{\beta=1}$ denotes the lenient matching measure [5].
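The difference between strict and lenient matching can be sketched as follows. The span representation (token offsets with an exclusive end) and the function name are assumptions made for this example; the official TempEval-3 scorer may count matches differently in corner cases.

```python
def evaluate(gold, predicted, lenient=False):
    """Compute (precision, recall, F1) over (start, end) spans, end exclusive."""
    def match(p, g):
        if lenient:
            return p[0] < g[1] and g[0] < p[1]   # any overlap counts
        return p == g                            # exact boundaries required

    matched_pred = sum(any(match(p, g) for g in gold) for p in predicted)
    matched_gold = sum(any(match(p, g) for p in predicted) for g in gold)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = [(4, 5), (9, 12)]
predicted = [(4, 5), (10, 12), (14, 15)]   # one exact, one partial, one spurious
print(evaluate(gold, predicted))                # strict
print(evaluate(gold, predicted, lenient=True))  # lenient
```

A partially extracted expression such as "months ago" for "three months ago" is an error under strict matching but a hit under lenient matching, which is why the two scores can diverge considerably.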

Results
Six different experimental settings have been evaluated as combinations of different training sets (gold, silver, gold&silver) with or without the application of the label adjustment pipeline. The results are shown in Table 7, where the overall score is computed by Formula (8). We point out that setting #4 was submitted as an official submission for the TempEval-3 challenge (Task A: identification and normalisation of temporal expressions) and has been ranked 3rd out of 21 submitted runs, as the best performing machine learning-based system.
All the settings showed high precision (strict ranging from 0.76 to 0.82, lenient ranging from 0.87 to 0.92) and reasonable coverage (strict ranging from 0.63 to 0.70, lenient ranging from 0.79 to 0.85) in the identification stage. This indicates that the system has partially generalised from the training data.
Training the system on the gold data only, combined with the use of the label adjustment pipeline, gave the best overall result, although it did not lead to the highest normalisation accuracy. Somewhat surprisingly, the use of the silver data did not improve the performance, neither when used alone nor in addition to the gold data (regardless of the label adjustment usage).
The a posteriori label adjustment pipeline showed the highest precision when applied to the silver data only. In this case, the pipeline acted as a reinforcement of the human-annotated data, helping to improve the boundaries. As expected, the post-processing pipeline boosted both precision and recall. Still, we note the best improvement with the human-annotated data.
We also investigated the contribution of each component in the label adjustment pipeline with respect to the test set. Fig. 6 shows the results. The probabilistic correction module negatively affects the performance (making less strict predictions), although its output is then corrected by the BIO fixer module. The threshold-based label switcher introduces an equal number of false and true positives. False positives are always 'I' labels, which are then propagated by the next component in the pipeline, the BIO fixer, by adding a 'B' label to the previous tokens. This explains the slight downward trend visible in the last step of Fig. 6. The limited size of the TempEval-3 benchmark test set, on which this analysis is based, might not be enough to explain this behaviour. Therefore, the effect should be interpreted with caution.
The normalisation task proved to be challenging. Among the correctly typed temporal expressions, there was still about 10% for which an incorrect value is provided (value ranges from 0.76 to 0.78).

Error analysis
The analysis of the predicted annotations against the gold ones allows us to pinpoint errors in both the identification and normalisation phases. We analysed the errors in the experimental setting #4.

Identification errors
The system correctly identified the majority of temporal expressions annotated in the test set, and incorrect annotations are mainly due to specific limitations of the system in addition to some issues in the gold standard data.
Examples of false positives (incorrectly recognised expressions) due to the CRF model are "of flu" and "and". Those expressions have been wrongly classified and the post-processing pipeline has not been able to discard them from the predictions. This is due to a very high confidence from the CRF module.
We noticed a significant number of partial errors, mainly due to errors in the tokenisation phase. For example, in "early 2012." and "2007." the full stop should have been removed, whereas "2009-2010" should have been split into three different tokens. It appears that wrong tokenisation is the major cause of the difference between strict and lenient performance. In a few cases, the system excluded modifiers (e.g., "late" in "late last July") or signals (e.g., "every" in "every morning") at the beginning (or at the end) of the expressions, leading to false negatives. Those errors are due to the CRF model, which discarded such words with very high confidence.
These results suggest that reducing the complexity of the CRF factor graph (see Section 3.1.1) and using a better tokeniser may lead to better performance.
False negatives (missed temporal expressions) are also connected to the low frequency of some types of expression in the training data: "15:00GMT Saturday", "a mere 24 hours". We also noticed cases of false negatives due to rare surrounding morphological contexts in the training data.
In three cases (2%) out of a total of 138 temporal expressions, the errors are due to questionable human annotations in the test set: "digital" alone (in the expression "digital age"), "tenure" and "second term".In five cases (4%), the system correctly annotates expressions missed by the human annotators (e.g., "the next decade" or "every morning").

Normalisation errors
The normalisation error analysis has been carried out on the correctly identified temporal expressions and it consists of checking whether the content of the value and type attributes is equal to the one provided by the human annotators. A total of 33 temporal expressions have been correctly identified but wrongly normalised (VALUE). The major source of error (18/33 cases: 55%) remains the normalisation of partially extracted temporal expressions (e.g., "100" instead of "100 days", or "a mere 24" instead of "a mere 24 hours"). In eight cases (24%), the normaliser failed to correctly distinguish between dates and durations (e.g., "the 99th day" was normalised as a duration of 99 days, instead of a precise day), whereas in five (15%) it failed to detect the right orientation in time (future or past), leading to the choice of a wrong year (e.g., "early August" normalised as "2013-08" instead of "2012-08").
We found only one (3%) possibly wrongly annotated temporal expression in the benchmark test set, i.e. for the expression "20th Century", a value "19" was provided instead of "19XX". In another case, the expression "a decade" was normalised with "P10Y" instead of the more correct "P1E". In both cases the normaliser provided the right value, although these were considered errors.

Conclusions
This paper has presented a novel architecture for temporal information extraction (identification and normalisation) from general-domain texts, with an extensive feature type selection. We also described the results with respect to the TempEval-3 benchmark test set and the error analysis for both the identification and normalisation phases.

Summary of contributions
In summary, the contributions of this paper are:
• We conducted an extensive evaluation of the feature space and training configurations, which, to the best of our knowledge, has never been done before in the context of temporal expression extraction. The results indicate the key importance of morpho-lexical features, to the detriment of syntactic, gazetteer and WordNet-related ones. In particular, while syntactic and gazetteer-related features do not affect the performance, WordNet-related features appear not to have a positive impact. This conclusion, although statistically significant, is necessarily limited by the fact that the feature analysis strictly depends on the way previous work has used WordNet. It does not mean that there is no other way of using WordNet that may contribute positively to temporal expression identification. Also, the feature analysis is meant to be relevant only in the temporal information extraction context. We do not suggest that the features experimented with here will produce the same effects in a different NER task.
• We designed and built an automatic a posteriori label adjustment pipeline on top of the CRF module, which we show to have a statistically significant positive impact on the results. We have also investigated the contribution of different possible configurations. Somewhat surprisingly, the label adjustment pipeline, originally introduced mainly for models trained on silver data, proved effective with the gold data too. We provided an extensive statistical analysis of the a posteriori label adjustment pipeline, which sheds light on the contribution of each pipeline component in isolation and in the context of the others. The experiments also showed that it is promising for enhancing both precision and recall.
• Furthermore, we found that the use of silver data does not improve the performance, although we consider the benchmark test set arguably too small to make this conclusion generalisable.

Future work
The a posteriori label adjustment pipeline proved to be promising and it constitutes, de facto, a novel approach to temporal expression extraction. We believe that it can be improved in several respects, including:
• Using the N most likely predicted sequences from the CRFs-based labeller in order to discriminate the most ambiguous/difficult tokens.
• Using the rules from the normaliser in order to enhance the accuracy of the identification phase: discarding identified expressions not recognised by the normaliser (reduction of false positives) and adding expressions recognised by the normaliser but ignored in the identification phase (increase of true positives).
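The second idea in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not part of ManTIME or NorMA: `toy_normalise` is a hypothetical stand-in for a rule-based normaliser (it returns a value, or `None` when no rule matches), `adjust_with_normaliser` is a hypothetical helper name, and the candidate spans are assumed to be already available as strings.

```python
import re
from typing import Callable, List, Optional

# Hypothetical stand-in for a rule-based normaliser: it only recognises
# durations of the form "<number> days" or "<number> hours".
_DURATION = re.compile(r"^(\d+) (days|hours)$")


def toy_normalise(expression: str) -> Optional[str]:
    """Return a duration value, or None when no rule matches."""
    match = _DURATION.match(expression)
    if match is None:
        return None
    amount, unit = match.groups()
    return f"P{amount}D" if unit == "days" else f"PT{amount}H"


def adjust_with_normaliser(identified: List[str],
                           candidates: List[str],
                           normalise: Callable[[str], Optional[str]]) -> List[str]:
    """Drop identified expressions the normaliser rejects (fewer false
    positives) and add candidates it accepts but the identifier missed
    (more true positives)."""
    kept = [e for e in identified if normalise(e) is not None]
    added = [c for c in candidates
             if c not in kept and normalise(c) is not None]
    return kept + added


identified = ["100 days", "the report"]   # "the report" is a made-up false positive
candidates = ["100 days", "24 hours"]     # spans matched by the normaliser rules
adjusted = adjust_with_normaliser(identified, candidates, toy_normalise)
```

Passing the normaliser as a callable keeps the adjustment step independent of any specific rule set, so a real normaliser could be plugged in without changing the filtering logic.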
Our other future work will focus on the investigation of local semantics representation for temporal expressions [40].This representation provides a way to separate the temporal expressions' semantics from the contextual information.
To aid replicability of this work, the source code of the entire system, the machine learning pre-trained models, the statistical validation details and an online demo are available at: http://www.cs.man.ac.uk/~filannim/mantime.html

Fig. 3. Fβ=1 measure across the four models. 5 × 10-fold cross validated. The box indicates the upper/lower quartiles, the horizontal line inside each of them shows the median value, while the dotted crossbars indicate the maximum/minimum values. There is no significant difference among the first three models, whereas the last one is statistically worse than the rest.

Fig. 4. Analysis of different post-processing pipeline configurations (with respect to the Fβ=1 measure). 5 × 10-fold cross validated. P stands for Probabilistic Correction Module, B for BIO-fixer and T for Threshold-based label switcher. All the differences among the settings are statistically significant (measured with ANOVA test). The configurations have been collapsed when they provided the same result. The box indicates the upper/lower quartiles, the horizontal line inside each of them shows the median value, while the dotted crossbars indicate the maximum/minimum values. The horizontal line is the median of the configuration without pipeline.

Fig. 5. NorMA architecture diagram. Each pre-identified temporal expression, along with the document creation time, is pre-processed and then subjected to rules matching. Post-manipulation rules are activated to cope with exact matchings like season or festival names.

Fig. 6. Analysis of the a posteriori label adjustment pipeline components. The upper group of curves refers to the lenient matching, whereas the bottom refers to the strict matching. Every component on the x-axis is applied on top of the previous ones.

Table 2
List of features used in the experiments (second part).Type column indicates whether a feature belongs to the (M)orpho-lexical, (S)yntactic, (G)azetteer or (W)ordNet category.Regular expression-based features, denoted with an *, are presented with a list of matching expressions whereas for the rest of them the notation (tokens → values) has been used.The WordNet-based features are computed from the TreeTagger lemma of each token.No word-sense disambiguation algorithm has been used.

Table 3
Post-hoc ANOVA analysis of the models (Fβ=1 measure): p-values of two-tailed paired T-tests for each pair of models. Small p-values indicate statistical significance.
Pairs of models denoted with * have a statistically significant difference.Model 4 is significantly worse than the rest of the models.At the same time, there is no statistically significant difference among the first three models.

Table 4
Estimation of the expected results for the benchmark. Precision, Recall and Fβ=1 score have been computed using strict matching. Model 1 performed slightly better with respect to Fβ=1.

Table 5
Probabilities updated for the token 'Saturday' in the sentence "Northern Ireland's World Cup qualifier with Russia has been postponed until Saturday due to heavy snow". The predicted label changes from 'O' (predicted by the CRF) to 'B'.