Information Processing and Management

Research on automated social media rumour verification, the task of identifying the veracity of questionable information circulating on social media, has yielded neural models achieving high performance, with accuracy scores that often exceed 90%. However, none of these studies focus on the real-world generalisability of the proposed approaches, that is, whether the models perform well on datasets other than those on which they were initially trained and tested. In this work we aim to fill this gap by assessing the generalisability of top-performing neural rumour verification models covering a range of different architectures from the perspectives of both topic and temporal robustness. For a more complete evaluation of generalisability, we collect and release COVID-RV, a novel dataset of Twitter conversations revolving around COVID-19 rumours. Unlike other existing COVID-19 datasets, COVID-RV contains conversations around rumours that follow the format of prominent rumour verification benchmarks, while differing from them in terms of topic and time scale, thus allowing a better assessment of the temporal robustness of the models. We evaluate model performance on COVID-RV and three popular rumour verification datasets to understand the limitations and advantages of different model architectures, training datasets and evaluation scenarios. We find a dramatic drop in performance when testing models on a dataset different from that used for training. Further, we evaluate the ability of models to generalise in a few-shot learning setup, as well as when word embeddings are updated with the vocabulary of a new, unseen rumour. Drawing upon our experiments, we discuss challenges and make recommendations for future research directions in addressing this important problem.


Automated misinformation detection
The proliferation of misinformation on social media poses a serious threat to the functioning of society and the health and wellbeing of its citizens (Islam, et al., 2020). This issue has motivated efforts by fact checkers, journalists, social media platforms and researchers to develop ways to identify and debunk misinformation so as to mitigate its impact (Graves & Mantzarlis, 2020; Karafillakis, Van Damme, Hendrickx, & Larson, 2022; Shu, et al., 2020). A range of Natural Language Processing (NLP) techniques have been developed to address the challenges of identifying the veracity of content circulating on social media. A usual first step

Our work
In this work we assess and analyse the ability of several top-performing rumour verification models to generalise to unseen rumours and events from the perspectives of both topic and temporal robustness. Our focus here is on rumour verification leveraging conversational threads discussing the rumours, following a line of research that has been widely studied in recent years (Bian, et al., 2020; Gao, Han, Song, & Ciravegna, 2020; Khoo et al., 2020).
To enable an up-to-date evaluation of generalisability across time, we also collect and release a novel dataset containing social media conversations around rumours involving COVID-19, a topic that sparked a high volume of posts and controversy (Pian, Chi, & Ma, 2021). Despite numerous efforts at producing COVID-19 misinformation datasets (Cui & Lee, 2020; Hossain, et al., 2020; Zhou, Mulay, Ferrara, & Zafarani, 2020), existing datasets do not contain conversations around rumours along with their associated thread structures. They contain claims (Dharawat, Lourentzou, Morales, & Zhai, 2020), tweets (Hossain, et al., 2020), news headlines (Cui & Lee, 2020) and scientific publications (Wang, Lo, et al., 2020). The absence of social media conversations in these datasets currently prevents evaluation of verification models operating on the conversation structure (Bian, et al., 2020; Ma, Gao, & Wong, 2018). Following the structure and format of the PHEME and Twitter15/16 datasets, we therefore collect and release a novel, carefully curated dataset of rumours and corresponding Twitter conversations discussing them, which we use to evaluate the effectiveness of state-of-the-art rumour verification models.

Research questions
In evaluating generalisability of social media rumour verification models we are interested both in the performance gap when operating across datasets and also, importantly, in understanding what aspects of models and datasets affect this, so that verification techniques developed in future work may leverage these findings.
Specifically, we address the following research questions:
• RQ1: How well can rumour verification models generalise to unseen rumours across datasets from similar time periods and those more distant in time?
• RQ2: If the models experience a performance drop on unseen datasets, will the ranking of model performance evaluated across datasets align with the ranking of these models when evaluated within one dataset?
• RQ3: Which models or groups of models show better generalisability?
• RQ4: Which training datasets lead to better model generalisability? Does increasing the size of the training data improve generalisability?
• RQ5: Which data properties are related to performance drop?
• RQ6: Does the evaluation strategy used on the original training dataset play a role in training more generalisable models or in estimating their future performance more realistically?
• RQ7: How receptive are different models to strategies such as few-shot learning and using embeddings updated with data from a new rumour?

Contributions
We make the following contributions:
• We are the first to test the generalisability of top-performing rumour verification models across datasets in two settings: (1) between datasets from the same time period; and (2) on test data collected at a later time period. We show that rumour verification models fail to generalise, with a sizeable performance drop when applied across datasets for all models.
• We provide extensive analysis of similarities and differences in model performance for five different models in several training settings.
• We release COVID-RV, a novel COVID-19 dataset of false claims and corresponding Twitter conversations to facilitate generalisability analysis.
• We investigate the difference in vocabulary between datasets and find that a greater distance between vocabularies is accompanied by a higher performance drop.
• We discuss challenges and provide suggestions on ways to improve the generalisability of verification models.

Related work
In this section we describe the landscape of relevant works on rumour verification in terms of models and benchmark datasets, as well as novel datasets focusing specifically on COVID-19 rumours, and discuss how our work fits within the model generalisability domain.

Rumour verification models
Automated social media rumour verification is an active research area in NLP. Models have achieved high performance by leveraging linguistic, network- and user-related features (Chen, Zhou, Zhang, & Bonsangue, 2021; Kumari, Ashok, Ghosal, & Ekbal, 2022; Li, Fan, Yuan, & Zhang, 2022), propagation patterns, stance of the responses and conversation structures (Bian, et al., 2020; Dougrez-Lewis et al., 2021; Ma et al., 2018). Generalisability to new, unseen rumours is a crucial requirement for these models to be useful in real-world settings. Each new rumour introduces a new associated topic, which may be linked to fast-paced, ongoing real-life events and attract discussion from diverse groups of individuals; this makes generalisation a challenging task and a problem inherent to rumour verification. Social media rumour verification models aim to resolve rumours at an early stage, and so they cannot always rely on the availability of confirmation from a specific, reliable source to determine veracity (unlike the setup in Hossain, et al. (2020)). Thus, these models have to find generalisable signals among linguistic, network- and user-related features of Twitter conversations, rather than memorise information from the training set. Twitter15, Twitter16 (Ma, Gao, & Wong, 2017) and PHEME (Kochkina et al., 2018; Zubiaga, et al., 2016) are widely used benchmark datasets for the tasks of rumour detection and verification. Many of the proposed models achieve high accuracy, such as 88% on Twitter15 in Bian, et al. (2020), 85% in Khoo et al. (2020), and 90.5% in Yuan, et al. (2019). This exceeds the performance of a non-professional human annotator on this task, which was estimated at around 60-65% in Kochkina et al. (2018). Every year new approaches advancing the state-of-the-art appear; however, the question of how ready these are for application in a real-world setting remains unexplored. Papers proposing rumour verification models generally focus on either Twitter15 and Twitter16 (Huang et al., 2020; Huang, et al., 2019; Tu, et al., 2021; Wakamiya & Aramaki, 2020; Yuan, et al., 2019; Zhang, Cook, & Yilmaz, 2021) or PHEME (Kumar & Carley, 2019; Lee, et al., 2021; Roy, Bhanu, Saxena, Dandapat, & Chandra, 2022), with few papers using all three datasets (Khoo et al., 2020; Kochkina & Liakata, 2020), and even these evaluate performance on each dataset separately. Ni, Li, and Kao (2021) perform a cross-dataset performance comparison for several rumour datasets, but their work is limited to a single BERT model operating on individual tweets. Furthermore, their work is limited to a binary classification setup and confuses the label definitions across tasks: they perform rumour detection on PHEME using only rumour/non-rumour labels (denoting check-worthiness rather than veracity), yet use the veracity labels 'True' and 'False' from Twitter15 and Twitter16. Hence, we are the first to evaluate the generalisability of a selection of top-performing models across these popular datasets, and also on a novel dataset of Twitter rumour conversations that is distant in time from the training data. We expect a reasonable level of generalisability across the Twitter15/16 and PHEME datasets, because the data were collected around the same time period and share some of the topics. Generalisability to new rumours is currently addressed by using leave-one-event-out (LOEO) cross-validation (CV) evaluation on the PHEME dataset. It has been shown that the performance difference between LOEO CV and random-split CV is sizeable and can reach around 38% (Khoo et al., 2020; Kochkina et al., 2018). Similarly, Zhang, Cao, et al. (2021) used temporal splits on the Weibo dataset (a dataset of rumour conversations in Chinese from the Weibo microblogging platform). However, temporal or event-level splits are not addressed in research using the Twitter15/16 datasets. Thus, we hypothesise that due to the stricter LOEO CV evaluation, performance on PHEME is less likely to be overestimated than on Twitter15/16.

Generalisability
Generalisability is the capacity of a model to perform well on new, unseen data. Models are often evaluated on the test set under the assumption that future cases come from the same distribution as the training data. However, various NLP studies have reported a lack of generalisability among state-of-the-art models when tested on out-of-distribution data (Ettinger, Rao, Daumé, & Bender, 2017; Thakur, Reimers, Rücklé, Srivastava, & Gurevych, 2021), e.g., for hate speech (Yin & Zubiaga, 2021) and sentiment analysis (Moore & Rayson, 2018). Furthermore, recent studies (Alkhalifa, Kochkina, & Zubiaga, 2021; Röttger & Pierrehumbert, 2021) raise the issue of temporal robustness, i.e., performance drops when models are evaluated on data from the same domain but distant in time. This draws attention to the importance of generalisability and reveals domain-specific reasons for its lack, informing future research.
To fulfil their purpose in the real world, rumour verification models need to be able to deal with the constant growth and evolution of rumours. However, existing research has not assessed the generalisability of rumour verification models. We fill this gap by evaluating the generalisability of rumour verification models across datasets, both within a similar time period (between PHEME and Twitter15, Twitter16) and distant in time (testing on COVID-RV).
Domain adaptation (DA) and domain generalisation (DG) focus on models that learn from one or several different but related domains so as to generalise well on unseen testing domains. In DA one can leverage unlabelled target information, but this is not the case for DG. Ramponi and Plank (2020) and Wang, et al. (2021) provide comprehensive recent surveys of the DA and DG areas. One important point that Ramponi and Plank (2020) highlight is that there is no common ground on what constitutes a domain in NLP. For example, in the case of social media rumour verification, we could call each new rumour a new domain; or we could aggregate rumours into larger groups such as politics, celebrity or football, and treat these groups as domains; yet another option would be to treat the source of the data (e.g., Twitter, Facebook, News) or even each individual user as a domain.
Thus, in this paper, we do not consider generalisability of rumour verification models to be strictly a domain adaptation or domain generalisation problem. Rumours are always concerned with new, unexpected topics and events of various scales, and thus rumour verification models are designed and trained to identify features that are common or inherent to rumours and misinformation (such as writing style, stance and propagation patterns), which are expected to perform well across different topics and events. Depending on one's perspective of what constitutes a domain (e.g., each rumour, or Twitter overall), we may or may not need to perform domain adaptation. Furthermore, as we are unable to foretell the topics of future rumours, we are also unable to accurately predict the scale of the events that trigger rumours and the degree of their propagation. It is therefore impossible to tell whether new unsupervised event/rumour data will be accessible in time to support rumour resolution, and it is unclear whether DA or DG methodologies should be applied. Thus, contributing to DA or DG methodology is out of the scope of the current paper. Instead, our goal is to evaluate the extent of generalisability of SOTA rumour verification models and to identify fruitful avenues for future research to improve it. While we study two setups from DA that can improve generalisability (few-shot learning and updated word embeddings), these have been chosen to serve as a baseline for future work in this direction.
In addition to providing a much needed generalisability investigation of rumour verification models, we cater for multiple training data setups and compare leave-one-event-out with random cross-validation splits as evaluation approaches. This is in line with Wang, et al. (2021) and Zhou, Elfardy, Christodoulopoulos, Butler, and Bansal (2021), who stress the importance of training data quality and diversity, as well as the importance of evaluation setups in which the testing domain is unseen during training, such as leave-one-domain-out cross-validation.
Another important question is what has a more significant effect on the generalisability of a rumour verification system: the choice of model architecture or the choice of training dataset. Gröndahl, Pajola, Juuti, Conti, and Asokan (2018) evaluate cross-dataset performance of hate speech detection models and report that for successful hate speech detection, model architecture is less important than the type of data and the labelling criteria. This is somewhat expected, as in their study different datasets include different types of hate speech, such as racist, sexist and offensive speech; furthermore, the notion of hate and offensive speech has a subjectivity component. Unlike Gröndahl, et al. (2018), we use datasets that have a consistent definition of a rumour, and the verification process is arguably less subjective than that of identifying hate speech. Therefore, it remains to be seen whether the influence of the model architecture or of the training data is more significant, and we investigate this in this study.

COVID-19 datasets
The COVID-19 pandemic has been accompanied by a so-called misinfodemic: the wide spread of rumours and conspiracy theories about the coronavirus. To address this, the scientific community has been collecting datasets of true and false information on this topic from various sources. These include scientific publications, news articles and their headlines, social media posts and claims about COVID-19. Table 1 summarises existing datasets and some of their properties. Datasets relevant to our work include:
• Poynter: the CoronaVirusFacts/DatosCoronaVirus Alliance Database, which gathers all of the falsehoods detected by the CoronaVirusFacts/DatosCoronaVirus alliance. This database unites fact-checkers from more than 70 countries and includes articles published in at least 40 languages.
• FakeCOVID (Shahi & Nandini, 2020): a multilingual, cross-domain dataset of 7623 fact-checked news articles about COVID-19 from 92 fact-checking websites, obtained via references from Poynter and Snopes, and manually annotated into 11 categories according to content.
• PANACEA (Arana-Catania, et al., 2022): a dataset consisting of heterogeneous claims on COVID-19 and their respective information sources.
• ReCOVery (Zhou et al., 2020): news articles on the coronavirus annotated by the credibility level of their source, along with tweets that reference these articles up to May 2020.
• COAID (Cui & Lee, 2020): COVID-19 fake news from websites and social media platforms between December 2019 and September 2020, users' social engagement with such news, and tweets automatically identified as relevant to the claims.
• COVID-HERA (Dharawat et al., 2020): individual COVID-19 tweets from COAID (Cui & Lee, 2020) with a health-risk assessment.
• MM-COVID (Li, Jiang, Shu, & Liu, 2020): a multilingual, multimodal dataset containing fake news and the relevant social context from February to July 2020.
• CMU-MisCOV19 (Memon & Carley, 2020): communities of Twitter users, with their posts collected over three weeks beginning 29th March, 15th June and 24th June 2020, classified as either informed or misinformed.
Our dataset COVID-RV (Section 3) extends CovidLies (Hossain, et al., 2020) by associating social media conversations with claims in CovidLies. The set of claims has been further refined compared to CovidLies to remove time-dependent, multi-part, ambiguous and duplicate claims. Furthermore, unlike other datasets, which either use individual posts (Dharawat et al., 2020; Hossain, et al., 2020; Hou et al., 2022; Shaar, et al., 2020) or do not provide the connections between posts (Cheng, et al., 2021), in our new dataset we focus on finding relevant tweets that are sources of conversations around a rumour and then collect the relevant conversations, which are then used in rumour verification models. The conversations in COVID-RV are of the same type as those in PHEME (Zubiaga, et al., 2016) and Twitter15/16 (Ma et al., 2017), which led to the creation of a plethora of automated verification models (see Section 2.1), the robustness of which we test in Sections 4 and 5. It is also worth noting that our dataset covers a wider time period than the above-mentioned prior datasets, namely between January and November 2020.

Claim-tweet matching
We start with a set of 62 false claims (misconceptions) about COVID-19 refined from CovidLies (Hossain, et al., 2020). The claims in CovidLies were sourced from Wikipedia and manually re-written to be a positive expression of a misconception. Claims pertaining to the actions of particular political parties, governments, religious groups or ethnicities were removed, as these do not usually pertain to the topic of the COVID-19 pandemic but rather to events happening during the same time period; claims referring to photos or videos (multi-modal) were also removed, as they require different approaches to verification that take other modalities into account. Compound claims were split into atomic ones, some claims were corrected or edited for brevity, and duplicate claims were removed (see Appendix A).
We use the COVID-19-TweetIDs collection (Chen, et al., 2020) to identify tweets matching the 62 claims so as to collect associated tweet threads. In line with previous work (Zubiaga, et al., 2016), we only use tweets in English with over 100 retweets. These are dated between January and November 2020, resulting in a total of 424,073 tweets. We do not edit the content of the tweets, and each model defines its own input representation based on the text of the tweet. Models that use multimodal aspects of the tweets are not represented in the current study; this is left for future work. To match claims and tweets, Hossain, et al. (2020) used BERTScore (Zhang, Kishore, Wu, Weinberger, & Artzi, 2020), which resulted in only 15% of the pairings being actual matches. We found that BM25-based (Robertson, et al., 1995) re-ranking methods are better at retrieving relevant tweets compared to BERT-based methods, with BM25+Mono T5 re-ranking (Nogueira, Lin, & Epistemic, 2019) being the most effective in our preliminary experiments.
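The BM25 scoring that underlies this first retrieval stage can be sketched in plain Python. This is a minimal illustration, not the paper's actual Pyserini pipeline; the tokenised tweets, the claim and the parameter values are invented, and k1/b are set to Pyserini's defaults since the paper does not state the settings used.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=0.9, b=0.4):
    """Okapi BM25 score of each tokenised doc against the query terms."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency of each term across the collection.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / (tf[t] + norm)
        scores.append(score)
    return scores

# Rank candidate tweets for one (hypothetical) claim; in the paper the top
# hits are then re-ranked by Mono T5 and thresholded before annotation.
tweets = [["masks", "cause", "hypoxia"],
          ["vaccines", "are", "safe"],
          ["masks", "do", "not", "cause", "hypoxia"]]
claim = ["masks", "cause", "hypoxia"]
scores = bm25_scores(claim, tweets)
ranking = sorted(range(len(tweets)), key=lambda i: -scores[i])
```

Note how the length normalisation favours the shorter of the two matching tweets; the tweet sharing no terms with the claim scores zero, which is exactly the weakness the dense DPR method described below (Section on tweet retrieval) is meant to complement.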

Tweet retrieval with BM25+Mono T5
To match claims with relevant tweets we index the set of tweets and use claims as queries. We retrieve the top 100 matches per claim according to their BM25 score and then re-rank these retrieved pairs using the Mono T5 model. We use the implementation of BM25 in the Pyserini package and of Mono T5 re-ranking in the Pygaggle Python package. The scores returned by the re-ranking algorithm are converted to relevance probabilities, and only claim-tweet pairs with a relevance probability above 90% are kept for annotation (resulting in 1215 instances to annotate).

Tweet retrieval with DPR

In addition to BM25+Mono T5 we apply the Dense Passage Retrieval (DPR) (Karpukhin, et al., 2020) method to the same set of tweets and claims as above. We use DPR as it represents a different type of approach compared to BM25, relying on contextual language models rather than exact word matching. We are interested in whether it is a complementary approach that allows us to retrieve instances missed by BM25. DPR employs a dual-encoder framework to produce dense representations of queries and passages using a neural encoder, e.g., BERT (Devlin, Chang, Lee, & Toutanova, 2019).

Once representations are obtained, retrieval is performed using cosine similarity. We obtain our encoder by further fine-tuning, on COAID (Cui & Lee, 2020), the encoder originally trained on Natural Questions (Kwiatkowski, et al., 2019), using the default settings. For annotation we take the top 20 instances for each claim; we chose this number to approximately match the number of instances manually annotated for the BM25+Mono T5 matching method. Among those instances we found that the overlap between the results returned by the two methods is very low (only 8%), showing that the two methods can indeed be used in a complementary manner.
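Given embeddings from the two fine-tuned encoders, the DPR retrieval step reduces to a nearest-neighbour search under cosine similarity. A minimal sketch follows; the three-dimensional vectors are placeholders standing in for real encoder outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_k(claim_vec, tweet_vecs, k=20):
    """Indices of the k tweet embeddings closest to the claim embedding."""
    sims = [cosine(claim_vec, t) for t in tweet_vecs]
    return sorted(range(len(sims)), key=lambda i: -sims[i])[:k]

# Placeholder embeddings: tweet 0 points in nearly the same direction as
# the claim, tweet 1 is orthogonal to it, tweet 2 is somewhere in between.
claim_vec = [1.0, 0.0, 1.0]
tweet_vecs = [[1.0, 0.1, 0.9], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
best = top_k(claim_vec, tweet_vecs, k=2)
```

Because similarity is computed between learned representations rather than exact word overlap, a tweet can rank highly without sharing any vocabulary with the claim, which is why DPR retrieves instances BM25 misses.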

Relevance annotation
We annotated the tweets identified by the BM25+Mono T5 and DPR methods for relevance to the claim they were paired with. For this we used Amazon Mechanical Turk (MTurk) and recruited 3 annotators per claim-tweet instance. As per our annotation guidelines, the relevance of each tweet is judged based on its connection to the specific claim, rather than the general topic of COVID-19. Annotators were given the option to open any links present in the tweet if they judged it useful for determining relevance. Annotators could also flag issues with a tweet (see Appendix B for details on annotator recruitment and guidelines).
Out of 2445 annotated pairs, 1969 received 100% annotator agreement, i.e., all 3 annotators selected the same label (relevant or non-relevant); 21 pairs were flagged as having an issue by at least one of the annotators; and in 681 tweets links were opened by at least one of the annotators. The top two rows of Table 2 show the results of the manual relevance annotation of claim-tweet pairs returned by both matching methods. We found that BM25+Mono T5 re-ranking returns many more relevant claim-tweet pairs than DPR.
We then dropped accidental duplicates and instances flagged as having an issue, and included only relevant tweets with 100% annotator agreement in the next round of annotation, for stance.
Stance annotation

Tweets annotated as relevant were subsequently annotated for stance, which is known to be particularly useful for rumour verification from social media conversations (Zubiaga, Kochkina, et al., 2018). The stance annotation interface was similar to the relevance annotation one but, rather than using MTurk workers, we employed students and university staff volunteers from the United States with English proficiency. The annotation guidelines (Appendix C) provide instructions and examples for labelling the stance of a tweet towards a claim as: Agree if the tweet agrees with the claim, Disagree if it disagrees, and No Stance if the tweet expresses no opinion towards the claim. As before, annotators could flag any issues with tweets, and three annotators worked on each claim-tweet pair. We labelled each claim-tweet pair as Agree, Disagree or No Stance based on the majority agreement between annotators, or as Tie/No label if there was no consensus. The inter-annotator agreement was 88%, computed as the average percent agreement between annotators per instance. The four bottom rows of Table 2 present the results of the stance annotation for each of the matching methods. We observe that the majority of relevant tweets agree with the matched claim.
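The label aggregation described above can be sketched as follows. This is a simple illustration, not the paper's annotation code; the agreement figure is computed here as the average pairwise agreement per instance, which is one reasonable reading of "average percent agreement between annotators per instance".

```python
from collections import Counter
from itertools import combinations

def majority_label(labels):
    """Majority label among the three annotators, or a tie marker."""
    (label, count), = Counter(labels).most_common(1)
    return label if count >= 2 else "Tie/No label"

def percent_agreement(instances):
    """Average pairwise annotator agreement over all instances."""
    total = 0.0
    for labels in instances:
        pairs = list(combinations(labels, 2))
        total += sum(a == b for a, b in pairs) / len(pairs)
    return total / len(instances)

# Three annotators per claim-tweet pair, as in the annotation setup.
label = majority_label(["Agree", "Agree", "No Stance"])      # majority wins
tie = majority_label(["Agree", "Disagree", "No Stance"])     # no consensus
```

With three annotators a majority always exists unless all three disagree, so Tie/No label only arises when each annotator picks a different stance.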
Collecting conversations

As in previous work involving the resolution of social media rumours (see Section 2.1), we collect conversations consisting of tweets discussing the rumour. For tweets labelled as either Agree or Disagree, we collect the associated conversations using the Twitter API v2. The majority of tweets (97%) initiate the conversations (i.e., are source tweets), so we use their IDs to download the reply trees. When an annotated tweet is not a source tweet, we download the tweet object first, then get the conversation ID, and finally get the conversation. For each conversation, we reconstruct the conversation tree using parent-child tweet pairs.
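Reconstructing a conversation tree from parent-child pairs can be sketched as below (the tweet IDs are invented; the real pairs come from the Twitter API's reply metadata):

```python
def build_tree(pairs, root_id):
    """Reconstruct a reply tree from (parent_id, child_id) tweet pairs.

    Returns a nested dict mapping each tweet ID to a list of reply subtrees.
    """
    children = {}
    for parent, child in pairs:
        children.setdefault(parent, []).append(child)

    def subtree(node):
        return {node: [subtree(c) for c in children.get(node, [])]}

    return subtree(root_id)

# A source tweet with two direct replies, one of which has a nested reply.
pairs = [("src", "r1"), ("src", "r2"), ("r1", "r1a")]
tree = build_tree(pairs, "src")
```

The resulting nested structure is what tree-based models such as TD-RvNN and BiGCN consume, while branch-based models flatten it into root-to-leaf sequences.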
We notice that COVID-RV contains more tweets per conversation than other datasets (see Table 3), which may be due to several reasons: (1) we choose as source tweets those that attracted at least 100 retweets; while this is also a criterion for source tweet selection in the PHEME dataset, the COVID-19 pandemic has attracted unprecedented attention from the public worldwide compared to the events in PHEME and other datasets; and (2) the claims in COVID-RV are those that have attracted the most attention among all circulating claims, since they are described in Wikipedia. This property of COVID-RV differentiates it from previous datasets. However, this is a natural effect, mainly linked to the prominence and scale of a rumour, and it cannot be controlled for when collecting new unseen rumours and events. Furthermore, Ma et al. (2018) and Bian, et al. (2020) show that performance increases over time as more information becomes available.
Veracity labels

Rumour verification models operating on conversation threads require veracity labels. We label each conversation on the basis of its source tweet, which introduces the rumour. The conversation is labelled as 'False' if the corresponding source tweet agrees with the rumour and 'True' otherwise. Table 3 shows the resulting number of tweets contained in conversations around the claims, as well as the statistics for the other rumour datasets we use for training in Section 4. Fig. 1 shows an example of an instance from our dataset.

Models
Here we describe the models whose generalisability we test across datasets. These models were selected from among the top-performing rumour verification models with publicly available code that enabled reproducibility. In all cases, we keep the original model parameters proposed in the corresponding articles. This selection enables comparison of models of different types in terms of: (1) the input word-level representation, including models that take as input one-hot embeddings and Bag-of-Words representations (TD-RvNN, BiGCN, SAVED), ones that take Word2vec (Mikolov, Chen, Corrado, & Dean, 2013) embeddings (branchLSTM), and large contextual LMs (BERT, CT-BERT); (2) the representation of the conversation structure, including models that use trees (TD-RvNN, BiGCN), linear sequences (branchLSTM, SAVED) or individual tweets only (BERT, CT-BERT); (3) the vocabulary associated with the rumour, with most models unaware of COVID vocabulary and a couple exposed to unannotated COVID tweets (CT-BERT, CT-SAVED). A summary of model properties is given in Table 4. Here we focus our study on single-task learning models; since multi-task learning has been a very successful method for rumour verification (Lee, et al., 2021), we plan to explore this setting in future work. An overview of the chosen models is provided below.

branchLSTM Kochkina and Liakata (2020), Kochkina et al. (2018) uses linear tweet branches from rumour conversations as input and an LSTM-based model to process them. An average of the per-branch predictions is used to obtain the final veracity prediction for the full tree. This model was originally proposed for the RumourEval-2017 dataset (Kochkina, Liakata, & Augenstein, 2017) and was then tested on PHEME (Kochkina et al., 2018) and Twitter15/16 (Kochkina & Liakata, 2020). It was also a strong baseline for RumourEval-2019 (Gorrell, et al., 2019).
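The branch decomposition and prediction averaging used by branchLSTM can be sketched as follows. The per-branch classifier here is a placeholder standing in for the LSTM, and the tweet IDs and scores are invented; only the decomposition and averaging steps mirror the description above.

```python
def extract_branches(children, root):
    """Enumerate the linear root-to-leaf branches of a conversation tree.

    `children` maps each tweet ID to the list of IDs of its direct replies.
    """
    if not children.get(root):
        return [[root]]
    branches = []
    for child in children[root]:
        for path in extract_branches(children, child):
            branches.append([root] + path)
    return branches

def predict_tree(children, root, predict_branch):
    """Average per-branch class scores into a tree-level veracity prediction."""
    per_branch = [predict_branch(b) for b in extract_branches(children, root)]
    n_classes = len(per_branch[0])
    return [sum(p[i] for p in per_branch) / len(per_branch)
            for i in range(n_classes)]

# Source tweet with two branches: src -> r1 -> r1a and src -> r2.
children = {"src": ["r1", "r2"], "r1": ["r1a"]}
branches = extract_branches(children, "src")
# Placeholder classifier returning [P(True), P(False)] per branch.
scores = predict_tree(children, "src",
                      lambda b: [0.8, 0.2] if "r2" in b else [0.2, 0.8])
```

Averaging over branches lets the model emit one prediction per conversation even though the LSTM only ever sees linear sequences of tweets.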
Top-Down Recursive Neural Networks (TD-RvNN) Ma et al. (2018) are top-down tree-structured neural networks for rumour representation learning and classification, which naturally conform to the propagation layout of tweets, i.e., the tree structure of a conversation. This model was originally proposed and tested on the Twitter15/16 datasets (Ma et al., 2018).
Bidirectional Graph Convolutional Neural Network (BiGCN) Bian, et al. (2020) operates on both top-down and bottom-up propagation of rumours. It leverages a GCN with a top-down directed graph of rumour spreading to learn the patterns of rumour propagation, and a GCN with an oppositely directed graph of rumour diffusion to capture the structures of rumour dispersion. This model was originally proposed and tested on the Twitter15/16 and Weibo datasets in Bian, et al. (2020).
Stance-Augmented VAE Disentanglement framework (SAVED) Dougrez-Lewis et al. (2021) is a two-stage approach to rumour verification and the current state-of-the-art on the PHEME dataset. First, a Variational Autoencoder is used to obtain representations of each rumour by disentangling the informational content of a tweet from the manner in which it is written. This is achieved by obtaining latent topic vectors in an adversarial learning setting using the auxiliary task of stance classification. The resulting latent vectors are then used to predict rumour veracity. This model was originally proposed and tested on the PHEME-5 dataset, i.e., using only the 5 largest events from PHEME. Here we use the full PHEME dataset with 9 events, therefore the reported results differ from Dougrez-Lewis et al. (2021). We have also trained CT-SAVED, a variant of SAVED in which the Variational Autoencoder is trained using the data from the training set as well as additional unlabelled COVID-19 Twitter conversations. Thus the topic-discourse module of the CT-SAVED model has been exposed to COVID-19 vocabulary and can therefore produce COVID-19-aware tweet representations.
Large pre-trained language models BERT Devlin et al. (2019) and CT-BERT (Müller, Salathé, & Kummervold, 2020). 16We use these to obtain a representation and then classify rumours based purely on the source tweet of the conversation.We compare a generalpurpose pre-COVID BERT model with a Twitter-specific one that includes a more up-to-date lexicon of COVID tweets (CT-BERT).
BERT has been previously tested on the PHEME-5 dataset in Dougrez-Lewis et al. (2021) in a similar setting.

Data
To train the models we use publicly available datasets of Twitter rumour conversations (see Table 3 for details). These include: PHEME Twitter conversations discussing rumours about nine breaking news events, which were labelled as True, False or Unverified by journalists (Zubiaga, et al., 2016). The PHEME dataset also contains conversations that discuss the same events but are labelled as Non-Rumours (Kochkina et al., 2018).
Twitter15/16 The Twitter15 and Twitter16 datasets (Ma et al., 2017) were created using reference datasets from Ma, et al. (2016) and Liu, Nourbakhsh, Li, Fang, and Shah (2015). Claims were annotated with veracity labels on the basis of articles corresponding to claims found on rumour-debunking websites such as snopes.com and emergent.info. These datasets merge rumour detection and verification into a single, four-way classification task containing True, False and Unverified rumours as well as Non-Rumours.
Both datasets are split into 5 folds for cross-validation and, contrary to the PHEME dataset, the folds are of approximately equal size with a balanced class distribution. It is not possible to apply leave-one-event-out cross-validation to the Twitter15 and Twitter16 datasets as the event split is not provided. The overall data format is practically equivalent across all of the datasets (as shown in Fig. 1), which enables their use within the same rumour verification models interchangeably.

Experiment setup
We split our experiments into 4 groups: 1. In-dataset evaluation of models (on PHEME, Twitter15, Twitter16) using the evaluation strategy per dataset as published in previous works (leave-one-event-out cross-validation for PHEME and 5-fold cross-validation for Twitter15, Twitter16); 2. Cross-dataset evaluation of models on datasets from a similar time period (training on PHEME, testing on Twitter15/Twitter16 and vice versa); 3. Cross-dataset temporal robustness assessment by evaluating all models on COVID-RV. This presents models with new rumours distant in time from the training data. When evaluating the models on COVID-RV we also compare results obtained using different training datasets and their combinations, to evaluate the effect of existing resources on model performance. 4. Assessing few-shot learning capabilities of the models on COVID-RV.
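The leave-one-event-out strategy used for PHEME can be sketched as follows. This is a minimal illustration that assumes each conversation is represented as a dict with an `event` field, which is a hypothetical layout rather than the released data format:

```python
def leave_one_event_out(instances):
    """Yield one (held_out_event, train, test) split per event,
    holding out all conversations from that event for testing."""
    events = sorted({inst["event"] for inst in instances})
    for held_out in events:
        train = [inst for inst in instances if inst["event"] != held_out]
        test = [inst for inst in instances if inst["event"] == held_out]
        yield held_out, train, test
```

Unlike 5-fold cross-validation, the resulting folds differ in size and class balance, which is part of what makes the PHEME evaluation closer to a real-world setting.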
We use the following training combinations: 3-class classification with the PHEME dataset (PHEME3, i.e. True(T)/ False(F)/ Unverified(U)), 4-class classification with the PHEME dataset (PHEME4, i.e. True(T)/ False(F)/ Unverified(U)/ Non-Rumour(NR)), 4-class classification with the Twitter15 and Twitter16 datasets, and the combinations Twitter15 + Twitter16 and Twitter15 + Twitter16 + PHEME3. The new COVID-RV dataset is used exclusively for testing. While the PHEME dataset also has 4 classes, in previous work it is usually split into two separate tasks: binary rumour detection (Rumour vs Non-Rumour) and 3-way veracity classification (T/F/U). Here, to enable cross-dataset evaluation between PHEME and Twitter15/16, we use all available conversations from PHEME together for 4-way classification. While COVID-RV, used for testing, only has True and False classes, we chose to train the models using the original number of labels in the training datasets, including the Unverified and Non-Rumour classes. We aim to imitate a realistic scenario in which an existing pre-trained model is used for predictions.
In our experiments we truncate the largest conversations from COVID-RV to have a maximum of 1000 branches, and use the first 20 responses from each branch in order to make it computationally feasible.
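This truncation step can be sketched as below, assuming a conversation is stored as a list of branches, each branch being a list of tweets starting with the source tweet (a hypothetical representation, not the released format):

```python
def truncate_conversation(branches, max_branches=1000, max_responses=20):
    """Keep at most max_branches branches; within each branch keep the
    source tweet plus the first max_responses responses."""
    return [branch[: max_responses + 1] for branch in branches[:max_branches]]
```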
We keep all the original hyper-parameter values as fixing hyper-parameters allows us to compare the models in different training settings (see the values in Appendix D).We report the result of a single run in our results tables.
We evaluate the performance of our models in terms of accuracy and macro-averaged F1-score. Macro-averaged F1-score (MF) is particularly suitable for evaluating performance on the PHEME dataset as it contains significant class imbalance. Evaluation on COVID-RV mainly focuses on per-class performance for the True and False classes.
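For reference, macro-averaged F1 gives each class equal weight regardless of its frequency, which is why it is suited to imbalanced data. A minimal pure-Python sketch (equivalent to scikit-learn's `f1_score` with `average='macro'`):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then take the unweighted mean."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Because every class contributes equally to the mean, a model that ignores a minority class is penalised even if its overall accuracy is high.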

Generalisability across datasets from the same time period
Table 5 shows in- and cross-dataset model performance for datasets from a similar time period (PHEME and Twitter15/16, both from 2014-2016). In the in-dataset experiments we do not observe a consistent ranking of model performance. We acknowledge that this observation can be somewhat affected by tuning the hyper-parameters of each model to each dataset individually. However, it is not the goal of this paper to reach the highest possible performance with each model, but to evaluate their generalisability in various settings. Thus we preserve the hyper-parameters across our experiments for fair comparison between setups. The BERT and BiGCN models do consistently better than branchLSTM, TD-RvNN and SAVED in this setting. Performance on the PHEME dataset is noticeably lower than on Twitter15/16. This could be due to several important differences between the datasets. Firstly, performance on the PHEME dataset is evaluated using leave-one-event-out cross-validation, while Twitter15/16 are split into 5 folds without separation between events. The folds in PHEME are of different sizes and class proportions, while the Twitter15/16 folds are balanced. Evaluation on the PHEME dataset is more challenging and closer to a real-world setting.
In the cross-dataset experiments between PHEME and Twitter15/16, we observe a sizeable performance drop (RQ1). This highlights that none of the datasets realistically represent the distribution of the unseen data and that the original performance was overestimated. We notice that for models trained on PHEME4 and evaluated on Twitter15/16 the drop in macro-averaged F1-score is not as dramatic as that of the models trained on Twitter15. This indicates that the performance on PHEME is not overestimated to the same degree because of its more realistic evaluation setup (RQ6).
We find that for models trained on PHEME4 the performance ranking of models remains similar, with BiGCN and BERT being the top-performing models (RQ3). However, for models trained on Twitter15 and evaluated on PHEME, there is a dramatic drop in performance, the performance ranking is flipped, and the simpler branchLSTM model gives the best results (RQ3). This shows that previously reported performance scores and even model rankings can be unreliable when models are tested on a dataset different from the one used for training (RQ2). The performance of models trained on the PHEME4 dataset is also higher compared to that of models trained on Twitter15. This shows the robustness of models trained on a dataset which contains cross-event variability, such as PHEME.

Generalisability to COVID-RV
Table 6 shows the performance of rumour verification models on COVID-RV using different training data scenarios. In all of the scenarios, due to its size, COVID-RV is only used for testing and not for training. The training dataset used is shown in the column names. We observe very low performance scores in terms of accuracy and macro-averaged F1-score; all the models score lower than a majority-class baseline in terms of macro-averaged F1-score, demonstrating the challenge of generalising across time (RQ1). These low performance scores can be partly explained by the fact that the models are trained to predict three or four classes rather than two, and in experiments with 3-way classification we see higher performance than in those with 4 classes. For example, zero performance occurs in several cases with the CT-BERT model when it predicts all of the instances as either 'Non-Rumour' or 'Unverified'. However, we chose to preserve the number of classes in the data, as our task is to imitate the realistic scenario in which ready-made models face new data with new vocabularies and class balance.
We find that model ranking differs from what we observed in previous experiments (RQ2). Each of the models outperforms the rest for some training dataset in Table 6, especially SAVED and CT-SAVED. This is no longer the case for BiGCN and BERT, which now have the lowest performance. When we calculate the mean of scores per model across all training settings, SAVED (the model which exploits the difference between the topics discussed in a conversation and the way they are discussed) has the highest performance in terms of both accuracy and macro-averaged F1-score (RQ3).
Looking at overall performance in Table 6, we notice that not all of the models benefit from using all of the datasets for training. In particular, some models experience changes in performance when using combinations of the datasets (e.g., the CT-BERT model trained on Twitter15 or Twitter16 individually has better performance than on their combination, and when trained on Twitter15 it performs better than when trained on all three datasets combined). This is somewhat expected and can be explained by the combinations of datasets affecting per-class performance differently, e.g., due to differences in class balance. As a result the models will in some cases start predicting the classes not present in the testing data more frequently. When we calculate the average performance for each of the datasets from Table 6 across all pre-COVID models (i.e. excluding CT-BERT and CT-SAVED), we find that indeed on average the best performance is achieved by the combination of all three training datasets (RQ4).
In our experiments, both across the PHEME and Twitter15/16 datasets and when testing on COVID-RV, we do not observe that a change of model architecture has a significantly higher or lower impact on rumour verification performance than a change of training dataset. This contrasts with Gröndahl, et al. (2018), who found that the choice of training data is more significant than the choice of model architecture for model generalisability (see Section 2.2).
Given the low overall scores, we focus on per-class evaluation on the True and False classes to compare the performance of models in various training setups. Fig. 2 shows per-class F-scores for the True and False classes for each model and training set. These results in table format, including per-class precision and recall, can be found in Table 7, along with Figs. 3 and 4 visualising per-class precision and recall. The best performing model needs to score high on both the True and False classes, thus models that have balanced performance on both classes would lie along the x = y line in Fig. 2. On the plot we can see that models trained on PHEME3 tend to perform better on the True class as it is the majority class in that dataset, while models trained on Twitter15 perform better on the False class. Models trained on the combination of all of the training data are the closest to the x = y line, i.e., have the most balanced performance. The CT-SAVED model trained on the Twitter15 dataset stands out in Fig. 2 as having the best and most balanced performance, followed by TD-RvNN and CT-SAVED trained on all of the training datasets. We also found per-class precision to be consistently higher than recall across the majority of the experiments, which is expected due to a high number of instances being classified as either Non-Rumour or Unverified.
We also test the benefits of using embeddings trained with data covering the topic and time period of the test set via CT-BERT and CT-SAVED. Results in Table 6 show that CT-BERT performs better than BERT in most cases. CT-SAVED also outperforms SAVED in most setups. This confirms that updating embeddings is a promising method for improving the performance of rumour verification models on new rumour events, in line with work in other fields (Alkhalifa et al., 2021) (RQ7).

Few-shot learning experiments
Few-shot learning (Wang, Yao, et al., 2020) enables testing the ability of models to learn effectively from a small number of instances. We evaluate the benefits of few-shot learning in experiments with COVID-RV, combining it with the use of up-to-date COVID tweet (CT) embeddings (RQ7). We consider two settings: (1) adding one conversation from each claim into the training data; and (2) adding three conversations from each claim. We chose these numbers of instances so that all of the claims would receive equal coverage among the few-shot learning examples added to the training. The COVID-RV dataset is not very large, and some claims only have 5 tweets associated with them. A few-shot learning approach implies only using a small amount of data, and therefore we chose to use 1 example per claim (the lowest possible) and 3 examples per claim (towards the higher end, whilst still being applicable to all of the claims). This setup somewhat changes our task, as the model is now exposed to a set of annotated conversations around each claim during training and thus 'knows' the correct facts or has a chance of memorising them.
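The construction of such a few-shot training sample can be sketched as follows; the `claim_id` field and random sampling are illustrative assumptions rather than details taken from the released code:

```python
import random

def few_shot_split(conversations, shots_per_claim, seed=0):
    """Move shots_per_claim randomly chosen conversations per claim into
    the training set; the remainder stays in the test set."""
    rng = random.Random(seed)
    by_claim = {}
    for conv in conversations:
        by_claim.setdefault(conv["claim_id"], []).append(conv)
    train, test = [], []
    for convs in by_claim.values():
        convs = list(convs)
        rng.shuffle(convs)
        train.extend(convs[:shots_per_claim])
        test.extend(convs[shots_per_claim:])
    return train, test
```

Sampling per claim rather than globally ensures each claim contributes equally to the few-shot examples, matching the equal-coverage motivation above.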
Here we used the combination of all datasets (Twitter15+Twitter16+PHEME3) as our training data. Table 8 shows the performance of the four-class classification models in a few-shot learning setup in terms of accuracy and macro-averaged F1-score. We compare these to the results in the last column of Table 6 (denoted as zero-shot in Table 8). We see an improvement in performance for all of the models. Adding more instances is also beneficial in most of the cases. The observed performance is now on par with performance in the evaluation across datasets from a similar time period. Therefore, few-shot learning helps bridge the gap between datasets distant in topic and time. However, there is still need for improvement to reach in-dataset performance.
We have also analysed these performance improvements from the per-class perspective (see Table 9). The combination of the few-shot approach with updated CT embeddings leads to further improvement in per-class performance. As COVID-RV is imbalanced towards the False class, our few-shot sample is also imbalanced towards False; therefore, for most of the models we see a higher improvement in per-class performance for the False class. We find that CT-SAVED and CT-BERT benefit the most from few-shot training and achieve relatively balanced performance on the True and False classes compared to TD-RvNN and BiGCN, which perform poorly and do not show improvement on the True class. This could be explained by the ability of CT-SAVED and CT-BERT to make better use of the COVID vocabulary, recognising post-COVID words as meaningful rather than treating them as unknown tokens, which would make them harder to learn from few-shot examples. The strengths of CT-SAVED and CT-BERT appear complementary, with CT-BERT performing best on the False class and CT-SAVED performing best on the True class.

Effect of distance between datasets
We hypothesise that the performance drop arises from differences between training and test data and that the performance gap decreases the closer the datasets are to each other (RQ5). To test this hypothesis we measure the difference between datasets using the Kullback-Leibler divergence KL(P ∥ Q), the Jaccard Index (Intersection over Union, IoU) and the DICE coefficient. We chose these metrics because they are commonly used to measure distance between corpora (Lu, Henchion, & Mac Namee, 2021; Peinelt, Liakata, & Nguyen, 2019). We define them below.
A corpus can be regarded as a probability distribution over the words in its vocabulary, and the KL divergence between two corpora P and Q can be calculated as KL(P ∥ Q) = Σ_{i=1}^{N} P_i log(P_i / Q_i), where N is the number of unique words in the two corpora, and P_i and Q_i are the probabilities of observing word i in corpus P and corpus Q respectively, estimated by dividing the frequency of the i-th word by the total number of words in the corpus.
The Jaccard Index (Intersection over Union, IoU) and the DICE coefficient are calculated as IoU = |V_P ∩ V_Q| / |V_P ∪ V_Q| and DICE = 2 |V_P ∩ V_Q| / (|V_P| + |V_Q|), where V_P and V_Q are the sets of unique words from the two corpora, |V| denotes the size of a set, |V_P ∩ V_Q| is the number of unique words present in the intersection of the corpora vocabularies, and |V_P ∪ V_Q| is the number of unique words present in their union.
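These three distance metrics can be computed directly from tokenised corpora. In this sketch a small additive constant smooths zero counts in the KL computation; the smoothing scheme is our assumption, as the text does not specify how unseen words are handled:

```python
import math
from collections import Counter

def kl_divergence(corpus_p, corpus_q, eps=1e-10):
    """KL(P || Q) over unigram distributions; eps smooths zero counts
    (assumed smoothing, not specified in the paper)."""
    cp, cq = Counter(corpus_p), Counter(corpus_q)
    vocab = set(cp) | set(cq)
    total_p, total_q = sum(cp.values()), sum(cq.values())
    kl = 0.0
    for w in vocab:
        p = cp[w] / total_p + eps
        q = cq[w] / total_q + eps
        kl += p * math.log(p / q)
    return kl

def iou(corpus_p, corpus_q):
    """Jaccard Index over the two vocabularies."""
    vp, vq = set(corpus_p), set(corpus_q)
    return len(vp & vq) / len(vp | vq)

def dice(corpus_p, corpus_q):
    """DICE coefficient over the two vocabularies."""
    vp, vq = set(corpus_p), set(corpus_q)
    return 2 * len(vp & vq) / (len(vp) + len(vq))
```

Note that `kl_divergence` is asymmetric in its arguments, matching the use of KL(P ∥ Q) to distinguish which dataset is used for training.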
IoU and DICE equal 1 when the datasets are identical, and zero for datasets with no vocabulary overlap. KL divergence is zero when the datasets are identical and is unbounded above when the datasets are dissimilar. Further, we calculate the Pearson correlation coefficient between the distance scores and the mean of accuracy scores across models for the corresponding dataset pairs. Table 10 shows the distance metrics and average performance scores for the dataset pairs, as well as the Pearson correlation coefficients between the two. Note that IoU and DICE are symmetric metrics (they do not depend on the order of P and Q), while KL divergence is not. A non-symmetric metric is beneficial in our case, as it distinguishes between setups in which the training and testing datasets switch places. For example, cross-dataset evaluation between PHEME4 and Twitter15 leads to different model performance depending on which dataset was used for training. In Table 10 the dataset shown on the left is used for training (P), and the one on the right for testing (Q). Within-dataset evaluations follow a cross-validation procedure, so the distance metrics shown in Table 10 are the mean of the distance scores calculated for each fold.
We find a negative correlation between model performance and the distance between datasets for all metrics, i.e. the higher the distance, the lower the performance scores. Adding few-shot examples to the training data somewhat decreases the distance between datasets, in line with the improvement in experimental results when using few-shot learning. This highlights that vocabulary difference is indeed an important factor in the performance drop. The correlation is strong for KL divergence (coefficient of −0.76), and only moderate for IoU and DICE (coefficient magnitudes around 0.3); therefore KL divergence is a more suitable metric and holds some predictive power over the expected performance drop.

Effect of conversation length
Investigating which data properties may be related to the performance drop, another aspect to be explored is the role of conversation length across rumours. For example, Zubiaga, et al. (2016) found that, in the PHEME dataset, the longer conversations are, the more likely they are to diverge from the original topic. By contrast, Ma et al. (2018) and Bian, et al. (2020) show that for Twitter15/16 performance increases over time as more information becomes available. These contradictory observations indicate a difference in the speed of topic shift within conversations across different datasets, which may affect model performance. Such topic shifts should therefore be accounted for both in training and in evaluating robust models.
We consider whether the lengths of the conversations in COVID-RV affect the performance of the models.For each instance in COVID-RV and for each model we calculate in how many training settings the model made a correct prediction, as a proxy for the instance 'difficulty', as well as the length and the depth of the corresponding conversation.We then calculate Pearson correlation coefficients between the conversation 'difficulty' and its length and depth.Table 11 shows these Pearson correlation coefficients for each of the models.We find that the correlation coefficients are very low and thus we cannot establish either a positive or negative effect of the conversation length on model performance in this case.
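The difficulty-versus-length analysis reduces to a Pearson correlation between two score vectors; a minimal sketch, where the difficulty values (number of training settings with a correct prediction) and lengths are illustrative toy data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy example: per-conversation 'difficulty' vs conversation length.
difficulty = [5, 2, 4, 1]
lengths = [12, 40, 18, 55]
```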

Discussion
This section discusses the implications of this research for future work on automated social media rumour verification. The main goals were to analyse whether automated rumour verification models encounter generalisability issues (RQ1) and, if so, where the challenges lie, such that future research can focus on those aspects (RQ2-7). RQ2, RQ3 and RQ7 concern the role of model architecture and understanding what properties of models may affect generalisability, while RQ4-6 concern the role of training data and setup.
RQ1: How well can rumour verification models generalise to unseen rumours across datasets from similar time periods and those more distant in time? For RQ1 we found a drastic performance drop for both types of cross-dataset experiments. This lack of generalisability undermines the practical value of rumour verification models and highlights the need for further efforts to create generalisable methods. The fact that the original performance of each model was overestimated highlights that none of the datasets can realistically represent the distribution of the unseen data.
Discussion on the role of models in the generalisability of rumour verification systems. RQ2: If the models experience a performance drop on unseen datasets, will the ranking of model performance evaluated across datasets align with the ranking of these models when evaluated within one dataset? Considering RQ2, we discovered variability in model rankings when models are tested outside the original dataset. This suggests that developing models and evaluating them within a single dataset is not enough to ensure the creation of generalisable approaches to rumour verification for real-world applications. It is crucial to test models on events unseen during training (see also RQ6) and to include cross-dataset evaluation. Developing novel techniques that better promote the generalisability of rumour verification models is an important research direction.
RQ3: Which models or groups of models show better generalisability? No model or group of models consistently generalises better across datasets. However, in the experiments with COVID-RV, SAVED (the model that represents conversations around rumours by disentangling the topic from the manner of speech) has shown promising results. For future work this suggests that we need to develop models that use or find generalisable features indicative of rumour veracity (potential examples include user stance or propagation patterns). We should draw on developments in domain adaptation and generalisation research (Ramponi & Plank, 2020; Wang, et al., 2021), incorporating generalisability tools and approaches into rumour verification models.
RQ7: How receptive are different models to strategies such as few-shot learning and using embeddings updated with data from a new rumour? Our investigation of RQ7 found that updating word embeddings and providing models with a few training examples from a new event (few-shot learning) help bridge the gap between datasets distant in topic and time, with the BERT and SAVED models benefiting the most. Thus, updating and/or temporally aligning (Alkhalifa et al., 2021) embeddings may be important in improving performance across time. Ni et al. (2021) show that BERT fine-tuned for rumour detection cannot identify common-sense rumours with more than 50% accuracy. Incorporating common-sense knowledge and other inductive biases along with few-shot learning could be fruitful avenues for improving generalisability in future research. Bragg, Cohan, Lo, and Beltagy (2021) introduce a few-shot NLP benchmark and provide recommendations for reliable few-shot evaluation, which can aid future work on developing strong few-shot learners. However, these strategies are only available when there is access to some labelled or unlabelled data from the domain of interest.

RQ4: Which training datasets lead to better model generalisability? Does increasing the size of the training data improve generalisability?
The quality of the training data is one of the key elements in training a generalisable model (Wang, et al., 2021). In this work we performed experiments with the widely used benchmark datasets for automated rumour verification and investigated the impact of different combinations of datasets in training. While we found that aggregating all datasets does not always lead to performance improvement, on average the best performance is indeed achieved by combining all three training datasets. A considerable limitation in rumour verification research is the small size of existing datasets. A possible way to address this would be to develop novel strategies for effectively leveraging combinations of existing datasets with differences in annotation, e.g., through transfer and/or multitask learning. Alternatively, creating synthetic data instances or whole datasets can be beneficial and also lead to modelling innovation, as shown in Liu, Lee, Jia, and Liang (2021). Overall, we did not find strong effects that would suggest weighing the contribution of training data over model architecture or vice versa; thus we recommend that future research look into both of these directions.
RQ5: Which data properties may be related to performance drop? We have shown that the difference between training and testing datasets is an important factor in the performance drop. We have measured the distance between dataset vocabularies and argue that metrics such as KL divergence can potentially be used to estimate the expected performance drop of a model. Other metrics defining the 'distance' between benchmarks can also be explored, e.g. benchmark concurrence as defined in Liu et al. (2021). However, we recommend that other data properties should also be covered in future work. These could be model-specific and depend on the produced embeddings, e.g., for fake news detection, Zhou, et al. (2021) show that the similarity between RoBERTa embeddings of article titles in training and testing datasets is correlated with performance. The differences in stance and propagation patterns could also be addressed more explicitly in future work given extra annotations, rather than implicitly through models utilising conversations.
An approach to assessing model behaviour, and thus revealing potential deficiencies in the data used to train models, is to use a checklist (Ribeiro, Wu, Guestrin, & Singh, 2020). A checklist is a set of unit tests assessing different aspects of model functionality.
Task-specific checklists can be created, e.g. Röttger, et al. (2021) created one for hate speech, which includes low-level functional tests such as 'leet speak' as well as higher-level test instances containing 'hate expressed using slurs' vs 'hate expressed using profanity'. Ni et al. (2021) show that rumour detection models can learn shortcuts due to spurious correlations between words and veracity labels in training datasets. Checklists could be used to identify such data artefacts and to further expand training data by creating artificial and, perhaps, adversarial examples. In the case of rumour verification, handling negations correctly would be a very important test. Human-created negative variations of the claims used in COVID-RV are made available by Hossain, et al. (2020). Additional negations can be created using checklists. Furthermore, Zhou, et al. (2021) find that unreliable news detection datasets can be biased by the ways they are curated, annotated, and split. Steps should be taken to identify and mitigate these biases in rumour verification datasets.
RQ6: Does the evaluation strategy used with the original training dataset play a role in training more generalisable models or in more realistically estimating their future performance? We find that the evaluation strategy used with the original training dataset does indeed play a role in realistically estimating model performance. Performance on the PHEME dataset is not overestimated to the same degree because the dataset allows evaluation in a leave-one-event-out cross-validation setting. This enables a more challenging and realistic evaluation scenario, leading to lower but more reliable scores. Thus, here we highlight again that releasing datasets that cover multiple events is a good practice that should be followed in future work.
There are important limitations that make rumour verification a challenging task. Rumour conversations may not always contain sufficient information to support a veracity verdict. Models may rely on the stances of users or their choice of words (learning rules like 'formal language is more trustworthy' or 'expression of doubt is indicative of unverified rumours'), rather than on evidence. We believe that this may be addressed by combining social media signals with signals extracted from a range of trusted sources, such as peer-reviewed publications, trustworthy news organisations and independent fact-checking organisations. Finally, annotating datasets in such a way that helps models learn to identify and provide explanations for their predictions (rationales) is also important if they are to be trusted and thus be effective in real-world settings (Jain, Kumar, & Shrivastava, 2022).

Conclusions
We have evaluated, quantified and characterised the generalisability of social media rumour verification models in two settings: across datasets from similar time periods, and on a newly collected COVID-19 dataset distant in time from the training data. We have demonstrated a significant performance drop in both settings, which is further pronounced when the datasets are distant in time. The extent of the divergence between training and testing datasets correlates with the drop in performance. We found that few-shot learning and updating unsupervised embeddings with posts from new events reduce the drop in performance. However, significant scope for improvement remains and, based on our findings, we have outlined directions for future work.

Ethical considerations
This work involves ethical considerations concerning the spread of rumours and misinformation on social networks such as Twitter and Facebook. Although the systems analysed in this work are intended to prevent such information from being disseminated, the data we collect for evaluating these systems could potentially be misused by bad actors to adversarially construct misinformation that avoids detection.
Pending publication, the COVID-RV dataset will be released in compliance with the Twitter Developer guidelines, which require further compliance of all downstream users of our data.These guidelines include provisions that users of our dataset do not disclose the identities of Twitter users who have protected or deleted their accounts or tweets during or after data collection.
Annotations were collected using a combination of paid crowd workers and student volunteers.Crowd workers from Amazon Mechanical Turk were paid $2.05 per HIT (Human Intelligence Task), which was calculated using the wage of $11.93 per hour and our estimation of average time to complete the HIT, where each HIT included annotation of 15 claim-tweet pairs.

Data availability
Data will be made available on request.

Fig. 1 .
Fig. 1. Example instance from the COVID-RV dataset. Each instance contains a False claim, a Relevant source tweet annotated as Agreeing or Disagreeing with the claim, as well as a tree-structured conversation around the source tweet conveying the rumour.

Fig. 2 .
Fig. 2. The plot of F-scores for the True and False classes of COVID-RV. The colour and shape of a marker identify a training dataset and each point is labelled with the model used to obtain the result. The best performing models are the ones closest to the dotted x = y line. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Table 2
Relevance and stance annotation outcomes for the two claim-tweet matching methods.

Table 3
Number of posts, conversation trees and class distribution in the datasets (T - True, F - False, U - Unverified, NR - Non-Rumour).

Table 4
Summary of model properties.

Table 6
Performance of the models trained on PHEME, Twitter15/16 and their combinations, evaluated on COVID-RV (Acc. - accuracy, MF - macro-averaged F1-score). The majority class in the training data is True, so if a model predicts True all the time we will see an accuracy of 0.398 and a macro-averaged F1-score of 0.285.

Table 7
Per-class performance for the True and False classes of models trained on PHEME, Twitter15/16 and their combinations, evaluated on COVID-RV. Bold: highest result in a column; underscore: highest result in a row. A majority baseline (always the True class) would score 0.57 True-class F-score.

Table 8
Performance of the models on COVID-RV in a few-shot learning setup compared to zero-shot setup (first column).Best performance per row is highlighted in bold.

Table 9
Per-class performance of the models on COVID-RV in our few-shot learning setup.Each column shows the results of including the additional tweets for each claim into the training data.

Table 11
Pearson correlation coefficient between model performance and conversation length and depth.