Time-Aware Evidence Ranking for Fact-Checking

Truth can vary over time. Fact-checking decisions on claim veracity should therefore take into account temporal information of both the claim and supporting or refuting evidence. In this work, we investigate the hypothesis that the timestamp of a Web page is crucial to how it should be ranked for a given claim. We delineate four temporal ranking methods that constrain evidence ranking differently and simulate hypothesis-specific evidence rankings given the evidence timestamps as gold standard. Evidence ranking in three fact-checking models is ultimately optimized using a learning-to-rank loss function. Our study reveals that time-aware evidence ranking not only surpasses relevance assumptions based purely on semantic similarity or position in a search results list, but also improves veracity predictions of time-sensitive claims in particular.


Introduction
While some claims are incontestably true or false at any time (e.g. "Smoking increases the risk of cancer"), the veracity of others is subject to time indications and temporal dynamics (e.g. "Face masks are obligatory on public transport") [1]. Not only their veracity, but also their semantics are time-sensitive or time-dependent, as connotations and real-world references can change over time. Evidence supporting or refuting such time-sensitive claims is likewise time-dependent. The relevance of evidence documents, which reflects both their semantic relatedness to the claim and their suitability for accurate claim veracity prediction, is thus relative to a claim's publication date and/or the documents' publication date. A fact-checking model's inability to correctly frame both claim and evidence in time, and its inability to rank evidence documents by relevance, can result in inaccurate semantic representations, truth predictions and relevance estimations. Nonetheless, automated fact-checking research has paid little attention to the temporal dynamics of truth, semantics and relevance. In this work, we focus on the temporal relevance of Web documents that serve as evidence for a given claim. We introduce four temporal ranking methods which constrain evidence ranking relying on diverse hypotheses for evidence relevance, and explore how time-aware evidence ranking impacts the veracity prediction performance of three fact-checking models (Figure 1).
The first two methods simply rank evidence by date in descending order: evidence-based date ranking sorts all given evidence, while claim-based date ranking only ranks evidence published before and at claim time. The other two methods rank evidence by distance in days to either the claim (claim-centered distance ranking) or the other evidence in the same set (evidence-centered distance ranking), both in ascending order. These methods then simulate method-specific ground-truth rankings given the timestamp of each evidence snippet. Ultimately, the evidence ranking module of a fact-checking model is directly optimized using a dedicated learning-to-rank loss function, which measures the agreement between the model's ranking output and the simulated ground-truth rankings.
In summary, the contributions of this work are as follows.
• We propose to model the temporal dynamics of evidence for content-based fact-checking and show that it outperforms both a ranking of Web documents based purely on semantic similarity, as used in prior work, and search engine ranking.
• We test various hypotheses for evidence relevance using timestamps and explore the performance differences between those hypotheses.
• We train evidence ranking by optimizing a learning-to-rank loss function. This elegant yet effective approach requires only a few adjustments to the model architecture and can easily be added to any fact-checking model.
• Optimizing evidence ranking using a dedicated learning-to-rank loss function is, to our knowledge, novel in automated fact-checking research.

Related Work
Previous work on content-based fact-checking that exploits both claim and evidence has differentiated between pieces of evidence in various manners. Some consider different evidence documents to be equally important [2,3]. Others weigh or rank evidence according to its assumed relevance to the claim. Liu et al. [4], for instance, link evidence relevance to node and edge kernel importance in an evidence graph using neural matching kernels. Li et al. [5,6] define evidence relevance in terms of evidence position in a search engine's ranking list, while Wang et al. [7] relate it to source popularity. Another line of work views evidence documents as interconnected chains and selects them based on their semantic connections, investigating multi-hop fact-checking [8,9]. However, evidence relevance has principally been associated with semantic similarity between claim-evidence pairs and is computed using language models [10], textual entailment/inference models [11,12,13], cosine similarity [14] or token matching and sentence position in the evidence document [15]. In contrast to previous work, we hypothesize that the timestamp of a piece of evidence and reasoning with the temporal information are crucial to how evidence relevance should be defined and how evidence should be ranked for a given claim.
The dynamics of time in fact-checking have not been widely studied yet. Yamamoto et al. [16] incorporate the idea of temporal factuality in a fact-checking model. Uncertain facts are input as queries to a search engine, and a fact's trustworthiness is determined based on the detected sentiment and frequency of alternative and counter facts in the search results in a given time frame. However, the authors point out that the frequency of a fact can be misleading, with incorrect claims possibly having more hits than correct ones [6]. Hidey et al. [17] recently published an adversarial dataset that can be used to evaluate a fact-checking model's temporal reasoning abilities. In this dataset, arithmetic, range and verbalized time indications are altered using date manipulation heuristics. Zhou et al. [18] study pattern-based temporal fact extraction. By first extracting temporal facts from a corpus of unstructured texts using textual pattern-based methods, they model pattern reliability based on time cues, such as text generation timestamps and in-text temporal tags. Unreliable and incorrect temporal facts are then automatically discarded. However, relying on the above method, a large amount of data is needed to determine a claim's veracity, which might not yet be available for new claims.

Time-Aware Evidence Ranking
We encourage a content-based fact-checking model to reason about the temporal semantics and time dependency of both claim and evidence by constraining the model's ranking module. This module assigns a relevance/ranking score to each Web document serving as evidence. During training, the ranking module is optimized using a learning-to-rank loss function, which measures the agreement between the learned ranking output and the expected ranking. As ground-truth evidence rankings are lacking, we introduce four ranking methods relying on several hypotheses on temporal relevance and use these methods to simulate ground-truth rankings.
We first discuss the three fact-checking models whose ranking module we will optimize on the simulated ground-truth evidence rankings. Next, we elaborate on the specific learning-to-rank loss used during training. We then explain how timestamps for a given claim and evidence set are extracted and normalized. Finally, we introduce the four temporal ranking methods that (a) constrain evidence ranking following several hypotheses for evidence relevance and (b) simulate hypothesis-specific ground-truth evidence rankings. From Section 3.3 onwards, we illustrate the time extraction/normalization process and the temporal ranking methods with a sample evidence set and claim taken from the MultiFC dataset [19].

Fact-Checking Model Architecture
In this section, we describe the model architecture of three fact-checking models. In a training setup without any ranking constraints, all model parameters are optimized using a loss function on the verification classification task. Evidence ranking is then learned implicitly. In a training setup with our ranking constraints, a model's ranking parameters are directly optimized using a learning-to-rank loss on the yielded evidence rankings, while the other model parameters are optimized using a loss on the predicted veracity labels. By applying our temporal ranking methods to various neural architectures, we show their advantage over time-unaware approaches in a transparent manner. We take the Joint Veracity Prediction and Evidence Ranking model presented in the MultiFC dataset paper [19] as base model architecture and experiment with different neural architectures for the other two models. For the sake of simplicity, we name the models by their sentence encoder architecture. An overview of the model architectures is given in Figure 2.

BiLSTM
This is the Joint Veracity Prediction and Evidence Ranking model as presented in the dataset paper [19]. The BiLSTM model takes as input claim sentence c_i, evidence set E_i = {e_i1, ..., e_iK} and claim metadata m_i (containing information on speaker, tags and categories), with K ≤ 10 the total number of evidence snippets in evidence set E_i. The sentence encoder, a bidirectional LSTM with skip-connections, transforms c_i and e_ij into their respective hidden representations h_ci and h_eij, while the metadata encoder, a CNN, transforms m_i into h_mi. Each h_eij is then fused with h_ci and h_mi into a joint representation f_ij following a natural language inference matching method proposed by Mou et al. [20]:

f_ij = [h_ci ; h_eij ; h_ci − h_eij ; h_ci • h_eij ; h_mi]

where the semi-colon denotes vector concatenation, "−" element-wise difference and "•" element-wise multiplication. All f_ij are sent through two modules: a label scoring module and a ranking module. In the label scoring module, the similarity between f_ij and each label across all domains is scored by taking the dot product between f_ij and all label embeddings, which are updated during model training. This way, relationships between labels across all fact-check domains are learned. This results in label similarity matrix S_ij. As the model needs to predict a domain-specific label, a domain-specific mask is applied over S_ij, masking all out-of-domain label similarity scores. Ultimately, a fully-connected layer computes domain-specific label score vector l_ij. In the ranking module, two fully-connected layers compute a ranking score r(f_ij) for each f_ij. The ranking score reflects the relevance of e_ij: the higher the score, the higher the evidence snippet's relevance to the claim. The dot product between each label score vector l_ij and its respective ranking score r(f_ij) is taken, the resulting vectors are summed, and domain-specific label probabilities p_i are obtained by applying a softmax function. Finally, the model outputs the domain-specific veracity label with the highest probability.

RNN
This model's architecture is similar to that of the BiLSTM model but differs in terms of sentence encoder architecture, fusion mechanism and number of fully-connected layers in the ranking module. Instead of a bidirectional LSTM with skip-connections, a two-layer unidirectional RNN encodes c_i and e_ij ∈ E_i into their hidden representations h_ci and h_eij. These representations are then fused with the hidden metadata representation h_mi using a simple concatenation operation instead of the natural language inference matching method: f_ij = [h_ci ; h_eij ; h_mi]. Lastly, we halve the number of fully-connected layers in the ranking module so that the ranking module is a shallower network.

DistilBERT
In this model, a DistilBERT Transformer model with a sequence classification head on top jointly takes as input claim sentence c_i and evidence snippet e_ij. It returns the '[CLS]' embedding from its final contextual layer and a probability distribution over all labels p_ij. The '[CLS]' embedding represents the joint claim-evidence representation f_ij. All joint representations f_ij are then sent through two fully-connected layers that compute a ranking score r(f_ij) for each f_ij. As DistilBERT already yields a probability distribution over all labels p_ij, we simply apply a domain-specific mask over p_ij to obtain label score vector l_ij. The dot product between each label score vector l_ij and its respective ranking score r(f_ij) is taken, the resulting vectors are summed, and domain-specific label probabilities p_i are obtained by applying a softmax function. Finally, the model outputs the domain-specific veracity label with the highest probability.

Learning-to-Rank Loss
In order to optimize evidence ranking, we need a loss function that measures how correctly an evidence snippet is ranked with regard to the other snippets in the same evidence set. For this, the ListMLE loss [22] is computed:

L(r; R_i) = − log P(π_i | r) = − Σ_{u=1}^{K} log [ exp(r(e_{π_i(u)})) / Σ_{v=u}^{K} exp(r(e_{π_i(v)})) ]

where π_i is the ground-truth permutation of evidence set E_i and π_i(u) the index of the snippet ranked at the u-th position. ListMLE is a listwise, non-measure-specific learning-to-rank algorithm that uses the negative log-likelihood of a ground-truth permutation as loss function [23]. It is based on the Plackett-Luce model: the probability of a permutation is decomposed into a product of stepwise conditional probabilities, with the u-th conditional probability standing for the probability that the snippet is ranked at the u-th position given that the top u − 1 snippets are ranked correctly. ListMLE is used for optimizing the evidence ranking with each of the four temporal ranking methods.
In case an evidence snippet is excluded from the ground-truth evidence ranking R_i or lacks a timestamp, we apply a mask over the predicted evidence ranking vector and compute the ListMLE loss over the ranking scores of the included evidence snippets with timestamps. We assume that the direct optimization of these evidence snippets' ranking scores will indirectly influence the ranking scores of the others, as they may contain similar explicit time references further in their text or exhibit similar patterns.
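As a minimal illustration of this masked loss computation, the following plain-Python sketch implements the Plackett-Luce negative log-likelihood over only the snippets that are included in the ground-truth ranking (the function name and list-based interface are ours for illustration; the experiments use the allRank implementation over tensors):

```python
import math

def listmle_loss(scores, true_order, mask=None):
    """ListMLE: negative log-likelihood of the ground-truth permutation
    under the Plackett-Luce model.

    scores:     predicted ranking scores, one per evidence snippet
    true_order: snippet indices sorted from most to least relevant
    mask:       optional set of indices to keep (snippets that have a
                timestamp and appear in the ground-truth ranking)
    """
    order = [i for i in true_order if mask is None or i in mask]
    loss = 0.0
    for u in range(len(order)):
        # log-sum-exp over the scores of the snippets not yet ranked,
        # computed stably by subtracting the maximum
        tail = [scores[i] for i in order[u:]]
        m = max(tail)
        lse = m + math.log(sum(math.exp(s - m) for s in tail))
        # u-th stepwise conditional: -log p(snippet at position u)
        loss += lse - scores[order[u]]
    return loss
```

Scores that already agree with the ground-truth permutation yield a lower loss than scores that contradict it, which is what gradient descent on the ranking parameters exploits.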
We are, to our knowledge, the first to specifically constrain evidence ranking in automated fact-checking using a dedicated, learning-to-rank loss on simulated ground-truth time-dependent rankings.

Temporal Relevance and Ranking Methods
We explain how temporal information for both claim and evidence is extracted and normalized. We then introduce the four temporal ranking methods that rank the Web documents in the evidence set. For illustration purposes, we use a claim and evidence set from the MultiFC dataset [19]:

Figure 3: Overview of our temporal ranking methods that constrain evidence ranking following different hypotheses for temporal relevance: evidence-based date ranking, claim-based date ranking, claim-centered distance ranking and evidence-centered distance ranking. The example is taken from the MultiFC dataset [19]. Although the methods actually rank up to ten snippets per claim in the experiments, we display four snippets for simplicity. The evidence timestamps are first extracted and normalized to their distance to the claim in days (∆t). In the fourth ranking method, evidence-centered distance ranking, the medoid of the evidence set and the snippets' distance to that medoid are computed (∆cl). Based on the ∆t or ∆cl values, the temporal ranking methods assign a ranking score r to each evidence snippet. A higher score denotes a higher degree of relevance.

Timestamp Extraction and Normalization
The dataset contains N claim sentences c_i ∈ C and N evidence sets E_i ∈ E, each providing K ≤ 10 evidence snippets e_ij ∈ E_i. A timestamp for each claim c_i is included as metadata. For evidence snippet e_ij, however, we need to extract the timestamp from the evidence text itself. A publication timestamp is frequently given at the beginning of the text, which facilitates timestamp extraction. We then normalize all extracted timestamps to year-month-day, resulting in claim time t(c_i) and evidence time t(e_ij). For each evidence snippet e_ij ∈ E_i, the temporal distance in days between claim time t(c_i) and evidence time t(e_ij) is calculated by subtracting t(c_i) from t(e_ij):

∆t(e_ij) = t(e_ij) − t(c_i)

with ∆E_i^t = {∆t(e_i1), ..., ∆t(e_iK)}. Positive and negative ∆t(e_ij) denote evidence snippets that are published later and earlier than t(c_i), respectively. These integer time values are subsequently used for simulating the ground-truth evidence rankings.
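Assuming timestamps have already been normalized to year-month-day strings, the ∆t computation can be sketched with Python's datetime module (the function name is ours for illustration; snippets without a retrievable timestamp map to None and are later masked out of the loss):

```python
from datetime import date

def temporal_distances(claim_date, evidence_dates):
    """Compute each snippet's distance in days to the claim:
    delta_t(e_ij) = t(e_ij) - t(c_i).
    Dates are ISO year-month-day strings; None marks a snippet
    whose timestamp could not be extracted."""
    t_c = date.fromisoformat(claim_date)
    deltas = []
    for ts in evidence_dates:
        if ts is None:
            deltas.append(None)
        else:
            deltas.append((date.fromisoformat(ts) - t_c).days)
    return deltas
```

For a hypothetical claim dated 2020-06-01, evidence dated 2020-05-29, 2020-06-01 and 2020-09-04 yields ∆t values of −3, 0 and 95, i.e. three and zero days before the claim and 95 days after it.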

Temporal Ranking Methods
All four temporal ranking methods assign ranking scores to the evidence snippets following a method-specific ranking constraint, yielding ground-truth ranking R_i = {r(e_i1), ..., r(e_iK)}. In all methods, an evidence snippet's ranking score depends on its own position in time and that of the other evidence snippets in the same evidence set. The ranking scores are proportional to relevance, i.e., a higher ranking score denotes higher relevance. The ranking methods fall into two categories: date-based and distance-based ranking [24,25]. The date-based ranking methods follow the general hypothesis that more information on a subject becomes gradually available over time and newer evidence should therefore be ranked higher than older evidence [26].
Both evidence-based date ranking and claim-based date ranking rank evidence by date in descending order, but the latter excludes evidence that is published after the claim date. While these date-based ranking methods consider relevance as a property proportional to recency, the distance-based ranking methods define relevance in terms of temporal distance to the claim (claim-centered distance ranking) or to the other evidence in the evidence set (evidence-centered distance ranking). In these ranking methods, evidence is ranked by distance in ascending order. Table 1 provides an overview of all methods, illustrated with simulated ground-truth rankings for sample evidence set E_s.
The dataset used for training and testing contains claims from multiple fact-check domains. In Section 5 (Results), we take the model performance results for each fact-check domain when applying one of the temporal ranking methods and average over all domains to report the overall impact of each ranking method. We also report time-aware evidence ranking performance by taking the best performing temporal ranking method for each domain and averaging over all domains. This way, we show how choosing the appropriate ranking method for each domain influences model performance.

Evidence-based date ranking
Hypothesis 1. Evidence published later in time is more relevant than evidence published earlier in time and should thus be ranked accordingly.
The evidence-based date ranking method ranks the evidence snippets in a given set by their publication date in descending order. The simulated ground-truth ranking R_i = {r(e_i1), ..., r(e_iK)} satisfies the following constraint:

∀ e_ij, e_ik ∈ E_i: ∆t(e_ij) < ∆t(e_ik) ⇒ r(e_ij) < r(e_ik)

For two evidence snippets in the same evidence set, the constraint imposes a higher ranking score r(e_ik) for e_ik and a lower ranking score r(e_ij) for e_ij if ∆t(e_ik) is larger than ∆t(e_ij). Following the intuition that recent evidence is more relevant than past evidence, an evidence snippet that is published more recently is assigned a higher position in the evidence ranking than an evidence snippet that is published further in the past. Example: For E_s = {e_s1, e_s2, e_s3, e_s4} with ∆E_s^t = {−3, 0, −8791, 95}, the simulated ground-truth ranking R_s = {r(e_s1), r(e_s2), r(e_s3), r(e_s4)} should satisfy the method-specific constraint: r(e_s4) > r(e_s2) > r(e_s1) > r(e_s3).
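A sketch of how this method simulates a ground-truth ranking from the ∆t values (the function name is ours for illustration; a higher score denotes higher relevance, equal timestamps share a score, and None marks a snippet without a retrievable timestamp):

```python
def evidence_based_date_ranking(deltas):
    """Hypothesis 1: the more recent the snippet (larger delta_t),
    the higher its ranking score."""
    # distinct timestamps sorted oldest to newest; ties share a score
    with_ts = sorted({d for d in deltas if d is not None})
    score = {d: rank for rank, d in enumerate(with_ts, start=1)}
    return [score[d] if d is not None else None for d in deltas]
```

Applied to the sample set ∆E_s^t = {−3, 0, −8791, 95}, the snippet published 95 days after the claim receives the highest score and the one published 8791 days earlier the lowest.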

Claim-based date ranking
Hypothesis 2. Although newer evidence is more relevant than older evidence, fact-checkers can only base their veracity estimation on information that was available at fact-checking time. In this case, evidence published after the claim date is not relevant and should be excluded from the documents to be ranked.
In this method, we mimic the information accessibility and availability at claim time t(c_i). As a result, only evidence that had been published before or at the same time as the claim is considered in this approach. For ground-truth evidence ranking R_i, all e_ij with ∆t(e_ij) ≤ 0 are ranked according to the following constraint:

∀ e_ij, e_ik ∈ E_i with ∆t(e_ij), ∆t(e_ik) ≤ 0: ∆t(e_ij) < ∆t(e_ik) ⇒ r(e_ij) < r(e_ik)

For two evidence snippets in the same evidence set, the constraint imposes a higher ranking score r(e_ik) for e_ik and a lower ranking score r(e_ij) for e_ij if ∆t(e_ik) is larger than ∆t(e_ij) and both ∆t(e_ik) and ∆t(e_ij) are negative or zero. If ∆t(e_ik) and ∆t(e_ij) are equal, e_ij and e_ik obtain the same ranking score. This is the only temporal ranking method that does not necessarily rank all given e_ij ∈ E_i, because it excludes all ∆t(e_ij) > 0 from simulated ground-truth ranking R_i. Following the intuitions that recent evidence is more relevant than past evidence and that fact-checkers can only rely on information that was available at claim time, an evidence snippet that is published more recently and before or at claim time is assigned a higher position in the evidence ranking than an evidence snippet that is published further in the past.
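The exclusion of future evidence can be sketched as a small variation on the previous simulation (the function name is ours for illustration; excluded snippets, like snippets without a timestamp, receive no score and are masked out of the loss):

```python
def claim_based_date_ranking(deltas):
    """Hypothesis 2: rank only evidence available at claim time
    (delta_t <= 0), newest first; later evidence is excluded."""
    # distinct eligible timestamps, oldest to newest; ties share a score
    eligible = sorted({d for d in deltas if d is not None and d <= 0})
    score = {d: rank for rank, d in enumerate(eligible, start=1)}
    return [score.get(d) if d is not None else None for d in deltas]
```

On the sample set ∆E_s^t = {−3, 0, −8791, 95}, the snippet with ∆t = 95 is excluded and the remaining three are ranked newest first.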

Claim-centered distance ranking
Hypothesis 3. Assuming that a topic and its related subtopics are discussed around the same time [27], evidence is more relevant when it is published around the same time as the claim and becomes less relevant as the temporal distance between claim and evidence grows.
This ranking method assigns ranking scores to evidence snippets in terms of their temporal vicinity to the claim. For ground-truth ranking R_i, we rank all e_ij based on |∆t(e_ij)| in ascending order, satisfying the following constraint:

∀ e_ij, e_ik ∈ E_i: |∆t(e_ij)| < |∆t(e_ik)| ⇒ r(e_ij) > r(e_ik)

For two evidence snippets in the same evidence set, the constraint imposes a higher ranking score r(e_ij) for e_ij and a lower ranking score r(e_ik) for e_ik if |∆t(e_ij)| is smaller than |∆t(e_ik)|. Following the intuition that relevant evidence clusters around the claim in terms of time, an evidence snippet with a timestamp close to claim time is assigned a higher position in the evidence ranking than an evidence snippet with a timestamp distant from claim time.
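A sketch of this simulation (the function name is ours for illustration; a smaller absolute distance to the claim yields a higher score, and equal distances share a score):

```python
def claim_centered_distance_ranking(deltas):
    """Hypothesis 3: the closer a snippet is to the claim in time
    (smaller |delta_t|), the higher its ranking score."""
    # distinct absolute distances, largest first, so the smallest
    # distance ends up with the highest score
    dists = sorted({abs(d) for d in deltas if d is not None}, reverse=True)
    score = {a: rank for rank, a in enumerate(dists, start=1)}
    return [score[abs(d)] if d is not None else None for d in deltas]
```

On the sample set ∆E_s^t = {−3, 0, −8791, 95}, the snippet published on the claim date (|∆t| = 0) is ranked highest and the one 8791 days away lowest.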

Evidence-centered distance ranking
Hypothesis 4. Analogous to the assumption that relevant documents have a tendency to cluster in a shared document space [28], relevant evidence snippets also cluster in time. Therefore, evidence snippets that are clustered in time are more relevant than evidence snippets that are temporally distant from the others.

This ranking method assigns ranking scores to evidence snippets in terms of their temporal vicinity to the other snippets in the same evidence set. We first detect the medoid of all ∆t(e_ij) ∈ ∆E_i^t by computing a pairwise distance matrix, summing the columns and finding the argmin of the summed pairwise distance values. Then, the Euclidean distance between each ∆t(e_ij) and the detected medoid is calculated.

∆cl(e_ij) = |∆t(e_ij) − medoid(∆E_i^t)|

with ∆E_i^cl = {∆cl(e_i1), ..., ∆cl(e_iK)}. We rank all ∆cl(e_ij) ∈ ∆E_i^cl in ascending order, resulting in ground-truth ranking R_i which satisfies the following constraint:

∀ e_ij, e_ik ∈ E_i: ∆cl(e_ij) < ∆cl(e_ik) ⇒ r(e_ij) > r(e_ik)

For two evidence snippets in the same evidence set, the constraint imposes a higher ranking score r(e_ij) for e_ij and a lower ranking score r(e_ik) for e_ik if ∆cl(e_ij) is smaller than ∆cl(e_ik). Following the intuition that relevant evidence clusters in time, an evidence snippet with a timestamp close to the evidence medoid timestamp is assigned a higher position in the evidence ranking than an evidence snippet with a timestamp distant from the evidence medoid timestamp.

Experimental Setup
Dataset. We opt for the MultiFC dataset [19] as it is the only large, publicly available fact-checking dataset which provides temporal information for both claims and evidence pages, and we follow its experimental setup. The dataset contains 34,924 real-world claims extracted from 26 different fact-check websites. The fact-check domains are abbreviated to four-letter contractions (e.g. Snopes to snes, Washington Post to wast). For each claim, metadata such as speaker, tags and categories are included, and a maximum of ten evidence snippets crawled from the Internet using the Google Search API are used to predict a claim's veracity label. To retrieve evidence snippets, [19] submit each claim verbatim as a query without quotes. Regarding temporal information, the dataset provides an explicit timestamp for each claim as structured metadata. For the evidence snippets, however, we need to extract their timestamp from the evidence text itself. The publication date is often contained in the document text, most frequently at the beginning of the text and immediately followed by an ellipsis (i.e., '...'). We split the text at the ellipsis and take the left part as timestamp. If Python's datetime module recognizes the extracted timestamp as a real timestamp, it is regarded as the ground-truth evidence timestamp; otherwise, the timestamp is not included in the dataset. Both claim and evidence date are then automatically formatted as year-month-day using Python's datetime module. We randomly extracted 150 timestamps from claims and evidence snippets in the dataset and manually verified whether datetime correctly parsed them as year-month-day. As zero mistakes were found, we consider datetime sufficiently accurate. Figure 4 displays the distribution of evidence snippets with and without a retrievable timestamp per domain. Finally, the dataset is split into training (27,940), development (3,493) and test (3,491) sets in a label-stratified manner.
Pre-Training and Fine-Tuning. The models are first pre-trained on all domains before they are fine-tuned for each domain separately. Pre-training the models on all domains is advantageous as some domains contain little training data. In the pre-training phase, batches from each domain are alternately fed to the model during each epoch, with the maximum number of batches for all domains equal to the maximum number of batches for the smallest domain. This way, the model is not biased towards the larger domains.
For the veracity classification task, cross-entropy loss is computed over the label probabilities, and RMSProp (BiLSTM/RNN) or AdamW (DistilBERT) optimizes all model parameters except the evidence ranking parameters. For the evidence ranking task, ListMLE loss is computed over the ranking scores, and Adam optimizes only the evidence ranking parameters. We use the ListMLE loss function from the allRank library [29]. An overview of all hyperparameter settings is included in the Appendix. In the fine-tuning phase, we select the best performing pre-trained model based on the development set for each domain individually and fine-tune it on that domain. We found that directly optimizing the evidence ranking on the temporal ranking methods in both the pre-training and fine-tuning phases yields the highest results for all models (Table 2).

Results
We take the best performing model per domain based on the development set and report the average over all domain-specific test results on the veracity prediction task (Table 3). As done in Augenstein et al. [19], we use Micro and Macro F1 score as evaluation metrics. The results of our BiLSTM base model are comparable to those of Augenstein et al. [19]. In the random evidence experiment, we establish how well the models distinguish between the relevance of extracted evidence sets that are considered relevant by the Google Search API and the assumed irrelevance of evidence sets that are randomly assigned to the claims (random seed = 0). As an additional ranking baseline, we optimize evidence ranking on the evidence snippets' position in the Google search ranking list (search ranking). As the MultiFC dataset kept the order in which the Google Search API provided the first ten query results for each claim query, we can simply take that order for the search ranking setup: the higher in the provided evidence set, the higher in the simulated ground-truth evidence ranking. For the time-aware evidence ranking results, we take for each domain the best performing temporal ranking method and report again the average over all domain-specific test results. We present domain-specific test results for the BiLSTM model in Table 4 and include the domain-specific test results for the RNN and DistilBERT model in the Appendix. Given that the number of classification labels per domain ranges from 2 to 40, we discuss the performance results in terms of Micro F1 as we suspect label imbalance.
The random evidence results indicate that the DistilBERT and BiLSTM base models are able to distinguish relevant evidence sets from random evidence sets without any guidance on evidence ranking (-1.38%/-5.18% Micro F1; BiLSTM/DistilBERT). Regarding the temporal ranking methods, the methods generally outperform search engine ranking, which consistently performs worse than the base models (-0.53/-2.71/-7.65% Micro F1; BiLSTM/RNN/DistilBERT). The temporal ranking methods seem to affect model performance to various extents. Constraining evidence ranking in all domains using a date-based ranking method positively influences the BiLSTM and RNN model performance, with evidence-based date ranking (+2.73/+3.06% Micro F1) leading to slightly higher results than claim-based date ranking (+1.47/+2.18% Micro F1). We observe similar performance differences between the date-based ranking methods in the DistilBERT model; however, both methods fall behind the base model (-1.34/-7.67% Micro F1; evidence-based/claim-based date ranking). For this model, only evidence-centered distance ranking yields increased test results (+2.01% Micro F1). Although this ranking method returns higher test results for the RNN model as well, it is the only temporal ranking method that decreases the BiLSTM model performance (-7.65% Micro F1). In contrast to the other three temporal ranking methods, claim-centered distance ranking does not lead to a substantial performance gain in any of the three fact-checking models (+0.11/-0.72/-3.01% Micro F1; BiLSTM/RNN/DistilBERT). While performance gains are often limited when applying a single temporal ranking method to all domains, higher results can be obtained by selecting the best performing temporal ranking method per domain based on the development set. As a result, time-aware evidence ranking increases model performance by 7.44% for the BiLSTM model, 8.46% for the RNN model and 3.63% for the DistilBERT model (Micro F1).

Discussion
Introducing time-awareness in a fact-checking model by directly optimizing evidence ranking using temporal ranking methods positively influences classification performance. Moreover, time-aware evidence ranking consistently outperforms search engine evidence ranking. This suggests that the temporal ranking methods themselves, and not merely the act of direct evidence ranking optimization, lead to higher results.

Do Time-Aware Models Learn Different Rankings?
One could ask to what extent the temporal ranking methods actually change the evidence ranking order. To quantify the difference in returned evidence rankings between the base model and the model optimized on one of the temporal ranking methods, we compute Spearman's rank correlation coefficient r_s. We find that the temporal ranking methods consistently change ranking orders in the BiLSTM model, as the time-aware evidence rankings are rather weakly correlated with the base rankings (r_s = .24/.18/.22/.17 for evidence-based date ranking, claim-based date ranking, claim-centered distance ranking and evidence-centered distance ranking, respectively). Although the changes in ranking order are more drastic for the BiLSTM model, the temporal ranking methods have a weaker but still considerable impact on evidence ranking in the RNN model (r_s = .56/.58/.61/.54). The lower impact could be attributed to the ranking module's depth: the BiLSTM model's ranking module consists of twice as many nonlinear fully-connected layers as the RNN's ranking module. As a result, the deeper ranking module is able to learn more detailed and abstract representations.
While the correlations between time-aware and base ranking order are mainly positive in the BiLSTM and RNN model, evidence rankings are either negatively correlated or completely uncorrelated in the DistilBERT model (r_s = −.23/−.35/−.02/−.07). In that model, the sentence encoder yields both a label probability distribution and a joint representation for a given claim and evidence snippet. For the other two models, the sentence encoder outputs separate hidden representations for claims and evidence snippets, which are later fused into joint representations. The label scoring module then infers label scores for each joint representation. It can thus be argued that the joint representations in the DistilBERT model are more label-aware, which in turn leads to a more label-biased ranking module that possibly runs counter to a time-aware one. Hence the negative correlations between time-aware and base rankings. We can conclude that the temporal ranking methods indeed cause changes in evidence ranking order, although their impact varies with each model.

What Sparks Time-Aware Ranking Success in a Domain?
The temporal ranking methods affect overall model performance to varying extents: the BiLSTM and RNN models mainly prefer date-based ranking methods, while the DistilBERT model's performance only increases with evidence-centered distance ranking. Moreover, no single temporal ranking method increases veracity prediction performance across all models. Similar observations can be made at the domain level: some domains benefit from all four temporal ranking methods, while others consistently perform worse or are not affected at all.
Whether or not time-aware evidence ranking affects a domain's model performance might be ascribed to the share of evidence snippets with retrievable timestamps in that domain. An evidence snippet's computed ranking score can only be optimized by the learning-to-rank loss function if its timestamp can be extracted and, in the case of claim-based date ranking, is included in the method-specific ground-truth evidence ranking. A large share of time-grounded evidence snippets enables the model to learn the expected rankings in a more direct and constrained manner, while a small share allows more flexible ranking learning. Consequently, it could be argued that the temporal ranking methods influence domains with a large share of time-grounded evidence snippets (farg, chct, wast, goop) more strongly and positively than those with a small share (clck, faly, fani). However, that argument does not hold: several small-share domains see their performance increase by large margins, and large-share domains do not consistently benefit from time-aware evidence ranking to a great extent. These findings suggest that the effectiveness of our temporal ranking methods does not rely on a large amount of time-grounded evidence.
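The constraint just described, that only snippets with an extractable timestamp contribute to the learning-to-rank objective, can be illustrated with a simple pairwise margin loss. The function below is a hypothetical sketch under that assumption; it is not the paper's exact loss formulation.

```python
# Illustrative pairwise margin ranking loss: only evidence pairs in which
# both snippets have a retrievable timestamp generate a ranking constraint,
# so untimed snippets remain unconstrained.

def pairwise_ranking_loss(scores, gold_ranks, has_timestamp, margin=1.0):
    """Hinge loss over time-grounded pairs (i, j) where the ground-truth
    ranking places snippet i above snippet j (smaller rank = higher)."""
    loss, pairs = 0.0, 0
    n = len(scores)
    for i in range(n):
        for j in range(n):
            if i == j or not (has_timestamp[i] and has_timestamp[j]):
                continue
            if gold_ranks[i] < gold_ranks[j]:  # i should outrank j
                loss += max(0.0, margin - (scores[i] - scores[j]))
                pairs += 1
    return loss / max(pairs, 1)

# The third snippet has no timestamp, so it never enters a training pair.
print(round(pairwise_ranking_loss([0.2, 0.9, 0.5], [1, 2, 3],
                                  [True, True, False]), 2))  # → 1.7
```

With a large share of time-grounded snippets, many pairs constrain the model directly; with a small share, most ranking scores are only learned indirectly, which is the flexibility discussed above.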
Another possible cause of the inter-domain differences in time-aware ranking effect is the time-sensitivity of those domains. We hypothesize that domains tackling claims on time-sensitive subjects benefit more from time-aware evidence ranking than those discussing time-insensitive claims. We retrieve the categories of several domains and analyze their time-sensitivity. The analysis confirms our hypothesis: domains which mainly tackle time-sensitive subjects such as politics, economy, climate and entertainment (abbc, para, thet) benefit more from time-aware evidence ranking than domains discussing both time-sensitive and time-insensitive subjects such as food, language, humor and animals (snes, tron). We can therefore conclude that relating evidence relevance to time and ranking evidence snippets accordingly is beneficial for time-sensitive claims.

Do Certain Evidence Distributions Prefer Specific Ranking Methods?
We observe not only inter-model and inter-domain but also inter-method differences. Regarding date-based ranking, a domain or model preference for either evidence-based or claim-based date ranking might depend on the share of evidence posted after the claim date. If an evidence set mainly consists of later-posted evidence, only a few evidence snippets' rankings are directly optimized with the claim-based date ranking method, leaving the model to indirectly learn the ranking scores of the others. In that case, the evidence-based method might be favored over the claim-based method. However, the share of later-posted evidence is not consistently different between domains preferring evidence-based date ranking and domains favoring claim-based date ranking.
Concerning distance-based ranking, evidence and claim (claim-centered distance ranking), or two evidence snippets (evidence-centered distance ranking), are more likely to discuss the same topic when they are published around the same time. The distance-based ranking methods would thus increase classification performance for domains in which the temporal dispersion of the claim-specific evidence sets is small. We measure the temporal dispersion of each evidence set using the standard deviation, both in domains which mainly favor distance-based ranking over date-based ranking (afck, faan, pomt, thal; Group 1) and in those with the opposite preference (chct, vogo; Group 2). We then check whether the domains in these groups display similar dispersion values. Kruskal-Wallis H tests indicate that dispersion values differ statistically between domains in the same group (Group 1: H = 258.63, p < 0.01; Group 2: H = 192.71, p < 0.01). Moreover, Mann-Whitney U tests on domain pairs suggest that inter-group differences are not consistently larger than intra-group differences (e.g., thal-chct: Mdn = 337.53, Mdn = 239.26, p = 0.021 > 0.01; thal-pomt: Mdn = 337.53, Mdn = 661.25, p = 9.95e-5 < 0.01). Therefore, the hypothesis that small evidence dispersion causes a preference for distance-based ranking methods is rejected.
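The dispersion measure underlying these tests can be sketched as follows. The timestamps below are made up for illustration; the Kruskal-Wallis and Mann-Whitney tests themselves would then be run on the resulting per-evidence-set values (e.g., with scipy.stats.kruskal and scipy.stats.mannwhitneyu).

```python
# Temporal dispersion of a claim-specific evidence set: the standard
# deviation of its evidence timestamps (here in days since an arbitrary
# reference date). Example timestamps are invented.
from statistics import stdev

def temporal_dispersion(timestamps_in_days):
    """Standard deviation of one evidence set's publication timestamps."""
    return stdev(timestamps_in_days)

tight = [10, 12, 15, 11, 14]       # snippets published within days of each other
spread = [10, 30, 400, 950, 1300]  # snippets spread over several years
print(round(temporal_dispersion(tight), 2))                      # → 2.07
print(temporal_dispersion(tight) < temporal_dispersion(spread))  # → True
```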

On Claim-Specific Ranking and Temporal Semantics
In the experiments, the ground-truth rankings of all evidence sets within a domain are simulated using a single temporal ranking method. Our approach successfully improves classification performance, especially when selecting the best temporal ranking method per domain (time-aware evidence ranking), and significantly influences ranking order. Nevertheless, it could be argued that additional performance gains could be obtained by inferring the most beneficial temporal ranking method for each claim individually. This would require more in-depth reasoning over the temporal semantics and the importance hierarchy of information contained within a claim. Consider the following claims taken from the abbc domain: In the first claim, the temporal cues now in the main clause (Australians now need to visit a doctor to obtain codeine) and after in the subordinate clause (after the drug was taken off pharmacy shelves and made available only with a prescription) indicate that the situation described in the main clause (a) is happening at claim time and (b) occurred after the two events described in the subordinate clause. Moreover, the temporal cues mark a causal discourse relation in which the main clause is caused by the subordinate clause. This causal relation can be considered world knowledge, as prescribed drugs can only be legally obtained after a doctor's visit. As a result, a fact-checking model can rely on either one of the clauses to predict the veracity of the entire claim. In terms of the temporal relevance of the claim's evidence set, evidence published either around the same time as the claim or after the claim is more relevant for predicting the veracity of the claim. As now in the main clause implies that the Australian policy for codeine was different before the claim date, old evidence would provide incorrect information, even if it came from reliable sources. This issue especially highlights the importance of temporal semantics in fact-checking.
In contrast to the first claim, there is a greater importance hierarchy between the clauses of the second and third claims. On the one hand, a fact-checking model can decide to focus on the entire claim and verify whether Greens Senator Sarah Hanson-Young and the Liberal Party actually verbalized the respective subordinate clauses. For the second claim, the model needs to establish when that specific ABC's Q&A session with the senator occurred and look for evidence that reports on the event, preferably published on the same day or the day after. Given the past tense of the main verb (said), the event arguably occurred before the claim date, so the model should look for older evidence. However, the definite article the in the ABC's Q&A suggests that the main clause refers to the latest Q&A session, so the model should not look too far into the past. For the third claim, an explicit date is provided (September 4, 2013). A model can thus look for evidence published on that specific day to check whether the Liberal Party made that promise.
On the other hand, instead of focusing on the entire claim, a fact-checking model could also choose to verify the information given in the subordinate clauses. This would require even more thorough temporal reasoning, as the model should not only place the event or situation in the subordinate clause in time, but also establish the temporal semantics of the main clause and the temporal relation between the two clauses. The temporal cues at the moment and right now in the subordinate clause of the second claim ("Indigenous children at the moment are 10 times more likely to be living out of home right now,") suggest that the situation is happening at claim time. However, the past tense of the verb (said) in the main clause places the entire subordinate clause in the past. As a result, evidence from the time of the Q&A session will be considered more relevant than evidence published around the claim time. For the third claim, the subordinate clause refers to an uncertain event that might happen in the future, given the future tense will establish and the phrase if elected. Checking the information in this subordinate clause requires extensive world knowledge. First, the model needs to check whether the Liberal Party was indeed elected and a Coalition government was formed. It then needs to establish whether or not that government agreed on the new seniors employment incentive payment. Lastly, the model can take it a step further and consider evidence published after the claim date to check whether that government was able to keep its promise before the end of its administration. As the subordinate clause might not be true at claim time, its veracity could have changed over time.
Future work can zoom in on the temporal semantics at claim level and apply various temporal reasoning methodologies to extract temporal information and construct event timelines from text [30]. A comprehensive overview of temporal reasoning methodologies can be found in Leeuwenberg and Moens [31]. Based on a claim's temporal semantics, an appropriate temporal ranking method can then be selected to simulate a claim-specific ground-truth evidence ranking.
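As a purely hypothetical illustration of such claim-specific method selection, a first rule-based baseline could map surface temporal cues to ranking methods. The cue lists and the cue-to-method mapping below are assumptions made for illustration only; they are not part of this work and would be far too coarse for the temporal reasoning discussed above.

```python
# Hypothetical rule-based selector: inspect surface temporal cues in a
# claim and choose a temporal ranking method. Cue lists and the mapping
# are illustrative assumptions, not the paper's method.

PRESENT_CUES = ("right now", "at the moment", " now ", "currently")
PAST_CUES = (" said ", " was ", " were ", " promised ")

def select_ranking_method(claim: str) -> str:
    text = f" {claim.lower()} "  # pad so cue matching sees word boundaries
    if any(cue in text for cue in PRESENT_CUES):
        # Situation holds at claim time: favor evidence near the claim date.
        return "claim-centered distance ranking"
    if any(cue in text for cue in PAST_CUES):
        # Reported past event: favor evidence clustered around the event time.
        return "evidence-centered distance ranking"
    # Default assumption: recent evidence is most relevant.
    return "evidence-based date ranking"

print(select_ranking_method(
    "Australians now need to visit a doctor to obtain codeine"))
```

A real system would replace these string matches with the temporal information extraction and event timeline construction referenced above.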

Limitations
The impact and success of the temporal ranking methods still depend on the informativeness of the Web documents which serve as evidence for the claim. As the Web documents are automatically crawled from the Internet using the claim as query to the Google Search API, it is not guaranteed that they are all useful for predicting that claim's veracity. If a large number of Web documents in the evidence set do not contain useful information for refuting or supporting the claim, enforcing a temporal ranking will have little to no effect on model performance. To analyse this, we randomly pick a claim with a large number of evidence snippets with retrievable timestamps, but for which both base and time-aware models consistently return an incorrect veracity label (Table 5). The veracity and semantics of the claim are time-dependent (i.e., trade balance varies over time), but the majority of Web documents are irrelevant to the claim. Consequently, a fact-checking model cannot make an informed veracity prediction, independent of how it ranks the evidence.
It should be noted that the Google Search engine has been found to be highly effective in returning diverse results for queries [32]. In future work, a larger set of evidence snippets for temporal ranking could be obtained by retrieving more evidence snippets per query, using multiple search engines, or automatically formulating multiple search queries per claim, as opposed to merely submitting the claim as a query.

Conclusion
Introducing time-awareness in evidence ranking arguably leads to more accurate veracity predictions in fact-checking models, especially when they deal with claims about time-sensitive subjects such as politics and entertainment. These performance gains also indicate that evidence relevance should be approached more broadly instead of merely equating it with the semantic similarity between claim and evidence. By integrating temporal ranking constraints in neural architectures via appropriate loss functions, we show that fact-checking models are able to learn time-aware evidence rankings in an elegant, yet effective manner. To our knowledge, evidence ranking optimization using a dedicated ranking loss has not been studied before in the context of fact-checking. Whereas this study is limited to integrating time-awareness in the evidence ranking as part of automated fact-checking, future research could build on these findings to explore the impact of time-awareness at other stages of fact-checking, e.g., document retrieval or evidence selection, and in domains beyond fact-checking. Alternatively, the analogy with spatial relevance can be explored by adopting similar spatial ranking methods for space-aware evidence ranking.

Figure 1: Given a claim and the date at which the claim was published (claim time; CT), the relevance estimation and ranking of evidence snippets positioned on the same timeline as the claim (1-6) can follow various assumptions.

Figure 2: Overview of the fact-checking models. The BiLSTM and RNN model (a) share a similar model structure but differ in sentence encoder architecture (BiLSTM vs. RNN), fusion mechanism (inference matching vs. concatenation) and number of fully-connected layers in the ranking module (two vs. one). The colored boxes with dashed borders indicate vectors, while the grey boxes with solid borders indicate model modules. Each color refers to an evidence snippet, showing how the evidence set is encoded throughout the model.

Figure 4: The share of evidence snippets with retrievable timestamps per domain (in yellow).

... a higher ranking score r(e_ik) for e_ik and a lower ranking score r(e_ij) for e_ij if Δt(e_ik) is larger than Δt(e_ij); if Δt(e_ik) and Δt(e_ij) are equal, e_ik and e_ij obtain the same ranking score. Following the intuition that recent evidence is more relevant than past evidence, an evidence snippet that is ...
... a higher ranking score r(e_ij) for e_ij and a lower ranking score r(e_ik) for e_ik if |Δt(e_ij)| is smaller than |Δt(e_ik)|; if |Δt(e_ij)| and |Δt(e_ik)| are equal, e_ij and e_ik obtain the same ranking score.
... a higher ranking score r(e_ij) for e_ij and a lower ranking score r(e_ik) for e_ik if |Δcl(e_ij)| is smaller than |Δcl(e_ik)|; if |Δcl(e_ij)| and |Δcl(e_ik)| are equal, e_ij and e_ik obtain the same ranking score.
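The tie-aware score assignment in these definitions, where a smaller absolute temporal distance yields a higher ground-truth ranking score and equal distances share a score, can be sketched as follows. This is an illustrative reconstruction with the distances as plain numbers, not the paper's code.

```python
# Illustrative reconstruction of the tie-aware rule: smaller |Δt| (or |Δcl|)
# yields a higher ground-truth ranking score, and snippets with equal
# distances obtain the same score (dense, descending ranking).

def ground_truth_scores(abs_deltas):
    """Map absolute temporal distances to ranking scores: the smallest
    distance gets the highest score; ties share a score."""
    distinct = sorted(set(abs_deltas))  # ascending distinct distances
    score = {d: len(distinct) - i for i, d in enumerate(distinct)}
    return [score[d] for d in abs_deltas]

# Four snippets, two of them at the same distance from the reference point.
print(ground_truth_scores([3, 0, 7, 3]))  # → [2, 3, 1, 2]
```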

Table 3 :
Aggregated test results for the veracity prediction task, with improvements over base model performance underlined and the performance differences as subscript.

Table 5 :
[Table content: the claim on Scotland's positive trade balance compared to UK nations (verdict: Mostly False) and its automatically retrieved evidence snippets, most of which, covering the US trade deficit, Adam Smith's Wealth of Nations, government debt, tariffs, UK fire and rescue statistics, and trust in UK media, are irrelevant to the claim.]
Claim and accompanying evidence set from the MultiFC dataset [19] for which both base and time-aware models consistently predict an incorrect veracity label.

Table 11 :
Overview of classification test results (Micro F1 and Macro F1) for the BiLSTM model (Only FT), with improvements over base model results underlined. We present three optimization approaches in addition to the approach used in the main paper. Instead of optimizing the evidence ranking based on single temporal ranking methods, all four temporal ranking methods (evidence-based date ranking, claim-based date ranking, claim-centered distance ranking and evidence-centered distance ranking; All Temporal Ranking Methods) are considered, and the evidence ranking is optimized based on the sum of the four ranking losses. We also explore a combination of the date-based ranking losses (Evidence Date + Claim Date) and of the distance-based ranking losses (Claim Distance + Evidence Distance). The lower results motivate our choice for the optimization approach presented in the main paper (i.e., optimization based on single temporal ranking methods and selecting the best method for each domain; Best Temporal Ranking Method).