CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts

Social media platforms play an essential role in crisis communication, but analyzing crisis-related social media texts is challenging due to their informal nature. Transformer-based pre-trained models like BERT and RoBERTa have shown success in various NLP tasks, but they are not tailored for crisis-related texts. Furthermore, general-purpose sentence encoders are used to generate sentence embeddings, regardless of the textual complexities in crisis-related texts. Advances in applications like text classification, semantic search, and clustering contribute to the effective processing of crisis-related texts, which is essential for emergency responders to gain a comprehensive view of a crisis event, whether historical or real-time. To address these gaps in crisis informatics literature, this study introduces CrisisTransformers, an ensemble of pre-trained language models and sentence encoders trained on an extensive corpus of over 15 billion word tokens from tweets associated with more than 30 crisis events, including disease outbreaks, natural disasters, conflicts, and other critical incidents. We evaluate existing models and CrisisTransformers on 18 crisis-specific public datasets. Our pre-trained models outperform strong baselines across all datasets in classification tasks, and our best-performing sentence encoder improves the state-of-the-art by 17.43% in sentence encoding tasks. Additionally, we investigate the impact of model initialization on convergence and evaluate the significance of domain-specific models in generating semantically meaningful sentence embeddings. The models are publicly available at: https://huggingface.co/crisistransformers


Introduction
Social media platforms, such as Facebook and Twitter, have become an essential medium for information sharing and communication during times of crisis [1,2]. Particularly during disasters, such as wildfires, earthquakes, hurricanes, tsunamis, floods, cyclones, and epidemics, social media platforms play a pivotal role in the timely dissemination of information. These platforms are critical information sources for affected individuals and emergency responders, enabling real-time updates on evolving situations and providing firsthand accounts from those directly and indirectly impacted. In general, social media contributes to community mobilization, i.e., enabling individuals to seek and offer assistance and organize relief efforts. The enormous amount of user-generated content on social media platforms acts as a rich source of historical as well as real-time data. However, the volume and textual complexity of crisis-related social media texts give rise to multiple challenges for effective analysis and understanding. The volume necessitates automated analysis as the number of conversations increases exponentially during a crisis, and the textual complexity involves dealing with informally written texts with a significant presence of acronyms, misspellings, hashtags, mentions, etc.
Domain-specific pre-trained language models have helped produce state-of-the-art results for numerous NLP tasks in various areas such as biomedical research [3], scientific literature analysis [4], clinical text analysis [5] and financial text analysis [6]. Trained on massive amounts of domain-specific texts, these models produce contextual text representations within their respective domains. Likewise, the potential of transformer-based [7] pre-trained models can be harnessed to understand and analyze crisis-related social media texts for effective and efficient crisis response and management. Despite the broad array of domains in which pre-trained models have been employed, a notable gap exists, i.e., the absence of pre-trained models explicitly tailored for crisis-related social media texts. Furthermore, pre-trained language models do not produce semantically rich sentence embeddings, which are critical for tasks like semantic search and clustering [8]. Currently, the generation of semantically meaningful sentence embeddings, regardless of the domain, relies on general-purpose sentence embedding models (sentence encoders) [8,9]. These models utilize pre-trained models that have been trained on corpora comprising texts from broad and general domains. Hence, there exists a necessity to investigate the efficacy of utilizing domain-specific pre-trained language models and sentence encoders for processing crisis-related social media texts.
To address the above-discussed gaps in the crisis informatics literature, this study proposes CrisisTransformers, an ensemble of pre-trained language models and sentence encoders trained on hundreds of millions of crisis-related tweets from over 30 different crisis events, including the COVID-19 pandemic. CrisisTransformers provide valuable embeddings that enhance crisis response and management. These embeddings can be utilized in various tasks, including text classification, semantic search, clustering, and topic modelling. Advancements in these applications contribute to a more comprehensive understanding of crisis-related social media texts, thereby aiding decision-making processes and facilitating targeted interventions and communication strategies during times of crisis.
This study contributes the following to the crisis informatics literature:
• We provide the first set of experiments relative to domain-specific pre-training to address the following research questions:
  - How does the choice of model initialization impact pre-training in terms of loss convergence?
  - With BERTweet [10] and other strong baselines in place, can yet another domain-specific pre-trained model demonstrate superior performance in crisis-related social media text classification?
  - To what extent do domain-specific pre-trained models help generate sentence embeddings with semantic richness, in comparison to current pre-trained models and sentence encoders?
  - Can the performance gains achieved by increasing training data size lead to substantial improvements in domain-specific sentence encoding tasks?
• We introduce CrisisTransformers, the first pre-trained language models and sentence encoders designed for processing crisis-related social media texts. The pre-training of CrisisTransformers was done on 6 NVIDIA A100 GPUs over a period of 2 months.
• Our pre-trained models outperform existing models across all 18 crisis-related datasets in classification tasks, and our best-performing sentence encoder improves the current state-of-the-art by 17.43% in sentence encoding tasks. Results confirm that CrisisTransformers can capture distinct linguistic nuances, informal language structures, and unique contextual cues present in crisis contexts.
• We publicly release CrisisTransformers, which can be used with the Transformers [11] library. We anticipate that these models will serve as a robust baseline for tasks involving the analysis of crisis-related social media texts.
The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 details the materials and methods used in designing CrisisTransformers, Section 4 presents evaluation results and discussions, and Section 5 concludes the paper.

Related Work
Transformer-based models have shown remarkable success in various NLP tasks, outperforming traditional approaches and significantly advancing the state-of-the-art. The key to their performance lies in pre-training, a stage that involves training the model with unsupervised learning objectives on large-scale corpora such that the model captures contextual information and learns rich language representations. Researchers have used contextualized embeddings generated by the transformer-based models to design powerful classification/regression models and further adjusted the models to make their embeddings suitable for semantic search and clustering tasks. In this section, we review the literature associated with encoder-only models, such as BERT, that are predominantly used to extract embeddings for tasks such as classification/regression, semantic search, and clustering.
BERT, introduced in [12], has become a ubiquitous baseline in NLP tasks. BERT uses two pre-training objectives: masked language modelling (MLM) and next sentence prediction (NSP). The MLM objective involves randomly masking specific tokens of an input sentence and training the model to predict the original masked tokens based on the context (surrounding words). Through this objective, BERT learns relationships between words and captures rich contextualized representations. Since the introduction of BERT, MLM has become a standard pre-training objective for many transformer-based models. Various improvements in training approaches and variants of MLM have been explored in subsequent research. In [13], Liu et al. proposed RoBERTa, which outperformed BERT in various downstream tasks with some changes in the pre-training process: large batch size, longer training, more training data, and removal of the NSP objective. In [14], ALBERT was introduced, which offered competitive results with reduced parameters through factorized embedding parameterization and cross-layer parameter sharing. MPNet was introduced in [15], combining MLM and permuted language modelling (PLM). In PLM [16], a sequence is randomly permuted, and the model autoregressively predicts the tokens. In [17], with XLM-RoBERTa, Conneau et al. confirmed the usefulness of pre-training multilingual language models on large-scale data containing 100 languages for cross-lingual transfer tasks. In [18], Clark et al. introduced ELECTRA, a pre-training objective where two models (generator and discriminator) are involved: the generator replaces tokens in a sequence, and the discriminator predicts which tokens are originals and which are the ones replaced by the generator. The above-discussed models were pre-trained on datasets such as Wikipedia, BooksCorpus, OpenWebText, CC-News, etc., which contain general domain texts. Researchers have also introduced domain-specific pre-trained models; we discuss some of those models next.
BERTweet [10] is a transformer-based model specifically designed for processing Twitter data and other social media texts. It leverages the BERT model configuration and incorporates RoBERTa's pre-training approach. During pre-training, it was exposed to a massive corpus containing 16 billion word tokens. BioBERT, which was introduced in [3], was pre-trained on biomedical texts, including PubMed abstracts (PubMed) and PubMed Central full-text articles (PMC), using the same architecture as BERT. Similarly, SciBERT [4] also shared the architecture of BERT but was pre-trained on a random sample of over 1 million papers. Its pre-training corpus consisted of 18% computer science and 82% biomedical domain full-text papers. Additionally, a variant of the BERT architecture called ClinicalBERT [5] was developed by pre-training on electronic health records. This specific pre-training made ClinicalBERT suitable for processing clinical text and medical data. BERT's application has also been extended to the finance domain. FinBERT, introduced in [6], is a pre-trained model trained on an extensive financial communication corpus containing 4.9 billion tokens.
When the pre-trained models utilize either the embedding of the CLS token or the mean-pooling of all tokens to generate sentence embeddings and subsequently undergo fine-tuning with a classification head, they produce state-of-the-art results in text classification/regression tasks. However, previous research shows that such sentence embeddings are not semantically meaningful and can perform worse than averaged GloVe embeddings [8]. For effective semantic search and clustering tasks, it is critical to have semantically meaningful embeddings that position sentences in a vector space such that semantically similar sentences are located closely together. Generating such sentence embeddings is an extensively researched area, and various methods have been proposed, which we discuss next.
In [19], Kiros et al. trained an encoder-decoder model to reconstruct the surrounding sentences of an encoded sequence so that the sentences that share semantic properties are mapped to similar vector representations. In [20], a siamese BiLSTM network was trained with max-pooling on the Stanford Natural Language Inference (SNLI) dataset, which outperformed previous unsupervised methods [19,21]. In [22], a transformer network was trained and unsupervised learning was extended with training on the SNLI dataset. Additionally, in [23], Yang et al. presented an unsupervised learning approach to sentence-level semantic similarity based on conversational data. Until this period, the sentence encoding approaches involved training the respective networks from scratch. After the introduction of BERT in 2018, replacing the unsupervised training part of designing sentence encoders became possible. In [8], BERT was fine-tuned through siamese and triplet networks on SNLI and Multi-Genre Natural Language Inference (MultiNLI) datasets, with a softmax classifier over "contradiction", "entailment", and "neutral" labels. Similarly, Gao et al. proposed SimCSE [9], a contrastive approach to fine-tune pre-trained models with natural language inference datasets using "contradiction" pairs as hard negatives. Following [9], Reimers and Gurevych [8] fine-tuned multiple pre-trained models using the contrastive training objective on over 1 billion sentence pairs and publicly released all their models as Sentence-Transformers.

The crisis corpus
A large-scale social media text corpus was curated for unsupervised pre-training, with Twitter serving as the primary data source. Our main objective was to create a comprehensive corpus containing texts discussing a diverse range of crisis events, such as disease outbreaks, natural disasters, terrorist attacks, conflicts, and other critical incidents. In general, as illustrated in Figure 1, the corpus underwent curation across three distinct stages: (i) consideration of an in-house dataset consisting of billions of tweets, (ii) hydration of Twitter identifiers collected from various data repositories, and (iii) utilization of Twitter's full-archive endpoint to search historical tweets. We maintained an in-house billion-scale COVID-19 tweets dataset from the onset of the COVID-19 outbreak until March 2023. The initial version of the dataset, COV19Tweets [24], comprised more than 2.2 billion tweets. Subsequently, we created the second version, BillionCOV [25], by filtering out unavailable tweets, resulting in over 1.4 billion tweets. For this study, we considered all the tweets present in BillionCOV, excluding retweets. Although BillionCOV contains COVID-19-related tweets, the COVID-19 discourse was not solely limited to discussions about the virus. Numerous other events unfolded worldwide along with the pandemic, including economic crises, natural disasters, humanitarian crises, social unrest, mental health concerns, and social issues.
Next, we collected tweet identifiers from multiple data repositories such as CrisisNLP [26] and DocNow Catalog. Tweets collected from these sources needed to be hydrated to re-create the datasets locally, as Twitter's data re-distribution policy restricts sharing data other than tweet identifiers. At this stage, the corpus had texts related to more than 30 crisis events that occurred after 2014. Furthermore, to fill the temporal gap in the corpus, we utilized Twitter's full-archive endpoint to search for historical tweets created between 2006 and 2013. We applied the lang:en condition and used the following keywords (along with their #hashtag and plural variants): crisis, disaster, earthquake, typhoon, volcano, flood, landslide, hurricane, tornado, cyclone, wildfire, famine, drought, tsunami, avalanche, epidemic, hailstorm, storm, protest, virus, war, and riot. Below are some of the crisis events covered in the corpus.
• Civil War: Fall of Aleppo.

Text pre-processing
Each tweet in the corpus was pre-processed as follows: We (i) replaced URLs with the "HTTPURL" token, (ii) replaced mentions (usernames) with the "@USER" token, (iii) decoded HTML entities to their original form (e.g., &amp; to &), (iv) removed newline characters and replaced multiple consecutive whitespaces with a single space, (v) fixed text encoding to correct various encoding issues and improve consistency in text representation, and (vi) replaced emojis with their textual representation, as their descriptive text counterparts are meaningful. We considered only the tweets with more than ten tokens. Refer to Table 1 for the descriptive statistics of the corpus.
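As a rough illustration, steps (i) to (iv) and the length filter could be sketched in Python using only the standard library. The regular expressions, the function names preprocess_tweet and has_enough_tokens, and the use of whitespace splitting to count tokens are assumptions made for this sketch and are not necessarily the rules used to build the corpus. Steps (v) and (vi) are left out because they would typically rely on third-party packages.

```python
import html
import re

# Assumed patterns; the exact matching rules are not given in the text.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def preprocess_tweet(text: str) -> str:
    """Apply steps (i)-(iv): URLs, mentions, HTML entities, whitespace."""
    text = html.unescape(text)                    # (iii) e.g. &amp; -> &
    text = URL_RE.sub("HTTPURL", text)            # (i) URLs -> HTTPURL
    text = MENTION_RE.sub("@USER", text)          # (ii) mentions -> @USER
    text = WHITESPACE_RE.sub(" ", text).strip()   # (iv) newlines, repeated spaces
    return text


def has_enough_tokens(text: str, min_tokens: int = 10) -> bool:
    """Keep tweets with more than `min_tokens` whitespace-separated tokens.

    Assumption: tokens are counted by whitespace splitting.
    """
    return len(text.split()) > min_tokens


if __name__ == "__main__":
    raw = "Flood warning &amp; updates\n\n@redcross  see https://t.co/abc123"
    print(preprocess_tweet(raw))
```

Note that the simple mention pattern would also rewrite the domain part of an e-mail address, so a production pipeline would likely need stricter rules.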

Unsupervised pre-training

Architecture and pre-training procedure
CrisisTransformers use the same architecture as BERT_BASE. In contrast to existing studies [3,6,5,10], we adopted a more versatile approach to selecting a pre-training procedure for our models. Instead of starting with a specific pre-training procedure, we experimented with multiple state-of-the-art models, namely MPNet, BERTweet, BERT, RoBERTa, XLM-RoBERTa, ALBERT, and ELECTRA, on classification tasks using 18 crisis-related labelled datasets (detailed in Section 3.3.1). We observed that RoBERTa's pre-training procedure outperforms others in our domain, as RoBERTa and BERTweet emerged as the top-performing models on average. Therefore, we selected RoBERTa's pre-training procedure for training CrisisTransformers. Due to the extensive adoption of BERT and RoBERTa, we do not provide an in-depth explanation of the architecture in this paper; for more comprehensive insights, please refer to [12,13].

Pre-training data
We trained a Byte-Level BPE (Byte-Pair Encoding) tokenizer using the Tokenizers library [11] for our domain, utilizing the pre-processed crisis corpus discussed in Section 3. media texts (the crisis corpus had 36 million unique tokens), we also set the vocabulary size to 64k [10]. Next, we used the trained tokenizer to tokenize the crisis corpus, thus generating sequence blocks of size 128, on which we trained the CrisisTransformers. Table 2 provides a comparative analysis of token counts in the vocabularies of established pre-trained models and CrisisTransformers. Among the existing models, RoBERTa and BERTweet share the highest similarity in vocabulary with CrisisTransformers.
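The grouping of tokenized tweets into sequence blocks of size 128 might look like the following sketch. It assumes that token ids are concatenated across tweets and that a trailing partial block is discarded; the text does not state either choice, and group_into_blocks is a hypothetical helper name used only for illustration.

```python
from typing import Iterable, Iterator, List

BLOCK_SIZE = 128  # block size stated in the text


def group_into_blocks(
    tokenized: Iterable[List[int]], block_size: int = BLOCK_SIZE
) -> Iterator[List[int]]:
    """Concatenate token-id sequences and yield fixed-size blocks.

    Assumptions (not confirmed by the text): sequences are concatenated
    across tweets, and the final partial block is dropped.
    """
    buffer: List[int] = []
    for ids in tokenized:
        buffer.extend(ids)
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]


if __name__ == "__main__":
    fake_corpus = [[1] * 100, [2] * 100, [3] * 100]  # 300 token ids in total
    blocks = list(group_into_blocks(fake_corpus))
    print(len(blocks), [len(b) for b in blocks])
```

An alternative design would pad or truncate each tweet to 128 tokens individually; concatenation is used here only because it wastes fewer positions on padding.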

Optimization
We pre-trained three models (as shown in Figure 2), utilizing 6 NVIDIA A100 GPUs (each with 80GB of memory). The training configurations for these models were as follows: CT-M1 (or CrisisTransformer-Model1) was pre-trained from scratch with randomly initialized weights; CT-M2 had weights initialized with pre-trained RoBERTa's weights; and CT-M3 had weights initialized with pre-trained BERTweet's weights. CT-M1 was trained for 40 epochs, while CT-M2 and CT-M3 were trained for 20 epochs each. We used the Transformers library [11] to implement these models.
For optimization, we employed the AdamW optimizer with a peak learning rate set to 0.0004. To utilize the available GPU memory efficiently, we used a batch size of 8k with gradient accumulation steps of 16. Additionally, we set 5% of the total training steps for warming up the learning rate. All three models finished training in two months.
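To make the warm-up setting concrete, the sketch below computes a learning rate for a given step using the stated peak of 0.0004 and a warm-up of 5% of the total steps. The linear shape of the warm-up and the linear decay to zero afterwards are assumptions for illustration; the text only states the peak value and the warm-up fraction. The function name learning_rate is hypothetical.

```python
PEAK_LR = 4e-4          # peak learning rate stated in the text
WARMUP_FRACTION = 0.05  # 5% of the total training steps


def learning_rate(
    step: int,
    total_steps: int,
    peak_lr: float = PEAK_LR,
    warmup_fraction: float = WARMUP_FRACTION,
) -> float:
    """Return the learning rate at `step` (0-indexed).

    Assumption: linear warm-up followed by linear decay to zero.
    """
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak.
        return peak_lr * step / warmup_steps
    # Assumed linear decay from the peak down to 0 at the last step.
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)


if __name__ == "__main__":
    for s in (0, 25, 50, 500, 1000):
        print(s, learning_rate(s, total_steps=1000))
```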

Fine-tuning
For fine-tuning the pre-trained models for text classification, as outlined in [10], we added a linear prediction layer to the pooled output. We implemented mean pooling over the token embeddings of an input sequence while considering the attention mask. Both baselines and CrisisTransformers were fine-tuned under identical conditions. Each model was fine-tuned across 18 labelled crisis-related datasets for a maximum of 30 epochs with a batch size of 32, a learning rate of 1e-5, and the AdamW optimizer. During each epoch, classification performance was assessed on a validation set. Early stopping was configured with a patience of 5 and a threshold of 0.0001. The final checkpoint was then used for evaluation on a test set. The fine-tuning procedure was repeated 5 times per model and dataset, with average performance scores being reported.
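Mean pooling with an attention mask can be illustrated with a small dependency-free sketch. It averages only the token embeddings whose mask value is 1, so padding positions do not affect the pooled vector. The function name mean_pool and the plain-list representation are assumptions for readability; an actual implementation would operate on batched tensors.

```python
from typing import List, Sequence


def mean_pool(
    token_embeddings: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Average token embeddings over positions where attention_mask == 1."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                summed[d] += vector[d]
    count = max(count, 1)  # guard against a sequence that is all padding
    return [value / count for value in summed]


if __name__ == "__main__":
    embeddings = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row is padding
    mask = [1, 1, 0]
    print(mean_pool(embeddings, mask))
```

The pooled vector is what the linear prediction layer would receive as input during fine-tuning.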

Labelled crisis-related datasets
Table 3 lists the datasets we considered to evaluate both baselines and CrisisTransformers. Evaluating the performance across such diverse datasets from the crisis informatics domain was essential to test the robustness of our proposed models. CrisisBench [27] provided the train/validation/test splits for datasets D-01 through D-06. For the remaining datasets, we implemented stratified sampling, allocating 70% for training, 10% for validation, and 20% for testing, using scikit-learn's train_test_split function with a random state of 42.

Enriching sentence encoding
By default, CrisisTransformers do not produce semantically rich embeddings, even though they were trained on a domain-specific corpus. Such pre-trained models require additional fine-tuning to learn to represent semantically similar sentences closer together within the vector space. These enhanced embeddings, capable of capturing semantic meanings, can then be effectively compared using cosine similarity. Their significance becomes particularly evident in tasks involving semantic search and clustering.
Our sentence encoders (CT-M1-*-SE, CT-M2-*-SE, and CT-M3-*-SE, where "SE" stands for Sentence Encoder) are built upon the recent success of utilizing siamese and triplet networks on sentence pairs [20] with pre-trained transformers [8] while leveraging the idea that adding corresponding contradicting pairs as "hard negatives" alongside in-batch negatives further improves the performance [9]. Expanding upon the method introduced in [9], we adapt it to utilize domain-specific pre-trained models instead of the existing general pre-trained models like BERT and RoBERTa. We used the following contrastive learning objectives to train our sentence encoders:
• Multiple Negative Ranking (MNR): This loss incorporates the (anchor, positive) pairs. Given a batch of pairs $(a_1, a_1^+), (a_2, a_2^+), \ldots, (a_n, a_n^+)$, the pairs $(a_i, a_i^+)$ are positive pairs and $(a_i, a_j^+)$ for $i \neq j$ are considered negative pairs. The training objective for $(a_i, a_i^+)$ with a mini-batch of $N$ pairs is:
$$\ell_i = -\log \frac{e^{\mathrm{similarity}(r_i, r_i^+)/\tau}}{\sum_{j=1}^{N} e^{\mathrm{similarity}(r_i, r_j^+)/\tau}} \qquad (1)$$
where $r_i$ and $r_i^+$ are the embeddings of $a_i$ and $a_i^+$ generated by our CrisisTransformers, $\mathrm{similarity}(r_i, r_i^+)$ is the cosine similarity, and $\tau$ is a temperature hyperparameter.
• MNR with hard negatives: This loss incorporates the (anchor, positive, hard negative) triplets, i.e., $(a_n, a_n^+, a_n^-)$. The training objective in Equation 1 can be modified to:
$$\ell_i = -\log \frac{e^{\mathrm{similarity}(r_i, r_i^+)/\tau}}{\sum_{j=1}^{N} \left( e^{\mathrm{similarity}(r_i, r_j^+)/\tau} + e^{\mathrm{similarity}(r_i, r_j^-)/\tau} \right)} \qquad (2)$$
where $r_j^-$ is the embedding of the hard negative $a_j^-$.
The MNR loss maximizes the similarity between an anchor sentence and its positive sentence while considering all other positives in a batch as negatives. In MNR with hard negatives, the similarity between an anchor sentence and its positive sentence is maximized while using its hard negative and all other positive sentences in the same batch as negatives. We include the MNR training objective in the experiments for comparison purposes, even though MNR with hard negatives has been shown to outperform it [9]. We train our sentence encoders (as shown in Figure 3) with these two objectives on (Question, Answer) pairs from GooAQ [41], (anchor, positive, hard negative) triplets from QQP [42] and (anchor, entailment, contradiction) triplets from AllNLI [42,43,44] with a large batch size of 512 for a maximum of 20 epochs. We utilize a learning rate of 2e-05 and allocate 1% of the total training steps for warm-up.
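The two objectives can be illustrated with a small, dependency-free sketch that computes the batch-averaged loss from raw embedding vectors. Several details are assumptions rather than statements from the text: the temperature value of 0.05, the function names cosine and mnr_loss, and the choice to place every hard negative in the batch in the denominator (as in the formulation of [9]). Actual training would use a tensor library with automatic differentiation.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mnr_loss(
    anchors: Sequence[Vector],
    positives: Sequence[Vector],
    hard_negatives: Optional[Sequence[Vector]] = None,
    temperature: float = 0.05,  # assumed value, not given in the text
) -> float:
    """Batch-averaged MNR loss, optionally with hard negatives."""
    candidates: List[Vector] = list(positives)
    if hard_negatives is not None:
        # Assumption: all hard negatives in the batch act as negatives.
        candidates += list(hard_negatives)
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp for the denominator.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        # The i-th candidate is the positive for the i-th anchor.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    anchors = [[1.0, 0.0], [0.0, 1.0]]
    positives = [[1.0, 0.0], [0.0, 1.0]]
    negatives = [[0.9, 0.1], [0.1, 0.9]]
    print(mnr_loss(anchors, positives))
    print(mnr_loss(anchors, positives, negatives))
```

In this toy batch the hard negatives lie close to the anchors, so adding them raises the loss, which is the pressure that pushes the encoder to separate near-duplicates with different meanings.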

Evaluation setup

Classification task
In line with prior research [10,4], we evaluate baselines and CrisisTransformers for the classification task using F1-macro, which considers the precision and recall of each class and provides an overall evaluation of the models' classification performance. For each dataset, we compute the F1-macro score as follows:
$$\text{F1-macro} = \frac{1}{N_{\text{classes}}} \sum_{i=1}^{N_{\text{classes}}} \frac{2 \cdot TP_{\text{class}_i}}{2 \cdot TP_{\text{class}_i} + FP_{\text{class}_i} + FN_{\text{class}_i}}$$
where $TP_{\text{class}_i}$ is the number of true positive predictions for class $i$, $FP_{\text{class}_i}$ is the number of false positive predictions for class $i$, $FN_{\text{class}_i}$ is the number of false negative predictions for class $i$, and $N_{\text{classes}}$ is the total number of classes in the dataset.
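As a worked illustration of F1-macro, the sketch below counts true positives, false positives, and false negatives per class and averages the per-class F1 scores with equal weight. The function name f1_macro is hypothetical, and taking the class set as the union of true and predicted labels is an assumption of this sketch.

```python
from typing import Hashable, List, Sequence


def f1_macro(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 scores."""
    # Assumption: the class set is the union of true and predicted labels.
    classes = set(y_true) | set(y_pred)
    per_class_f1: List[float] = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denominator = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denominator if denominator else 0.0)
    return sum(per_class_f1) / len(per_class_f1)


if __name__ == "__main__":
    truth = ["a", "a", "b", "b"]
    predicted = ["a", "b", "b", "b"]
    # Class "a": TP=1, FP=0, FN=1 -> F1 = 2/3; class "b": TP=2, FP=1, FN=0 -> F1 = 4/5
    print(f1_macro(truth, predicted))
```

Because every class contributes equally, a rare class that is classified poorly lowers the score as much as a frequent one, which suits the imbalanced label distributions common in crisis datasets.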

Sentence encoding task
There is an absence of standard benchmark datasets in the crisis informatics domain to assess the semantic quality of the generated embeddings. In agreement with Reimers and Gurevych (2019) [8] and Gao et al. (2021) [9] that the primary objective of the embeddings is to capture and represent semantic relationships in text data, we designed an alternative task. Our evaluation method involved calculating the weighted average cosine similarity among encoded tweets within individual classes in a labelled dataset, thereby measuring the semantic consistency of tweets belonging to the same class. This approach allowed us to capture the complexities and semantics of crisis-related content, resulting in a more insightful evaluation of the sentence embeddings.
Task definition: Let $N$ represent the total number of crisis-related tweets in a dataset and $K$ denote the number of unique classes within the dataset. Let $E$ be a matrix of sentence embeddings, where each row $e_i$ corresponds to the normalized embedding of the $i$-th tweet. Additionally, let $y$ be a vector containing the class labels associated with each tweet.
For each unique class $c_k$, let $C_k = \{e_i : y_i = c_k\}$ denote the set of embeddings of the tweets belonging to that class. The class weight $w_k$ is computed as the inverse of the count of tweets belonging to that class:
$$w_k = \frac{1}{|C_k|}$$
These class weights are then normalized to obtain $\hat{w}_k$:
$$\hat{w}_k = \frac{w_k}{\sum_{i=1}^{K} w_i}$$
For each unique class $c_k$, the intra-class cosine similarity $d_k$ is computed. For each tweet $e_i$ within class $c_k$, the average cosine similarity to other tweets within the same class is determined:
$$d_k = \frac{1}{|C_k|} \sum_{e_i \in C_k} \frac{1}{|C_k| - 1} \sum_{e_j \in C_k,\; j \neq i} \mathrm{similarity}(e_i, e_j)$$
Here, $\mathrm{similarity}(e_i, e_j)$ calculates the cosine similarity between tweet embeddings $e_i$ and $e_j$, where $e_j$ is a tweet within the same class as $e_i$.
The weighted average similarity $D_{avg}$ is computed across all classes, considering their respective normalized class weights $\hat{w}_k$:
$$D_{avg} = \sum_{k=1}^{K} \hat{w}_k \, d_k$$
$D_{avg}$ quantifies the average within-class semantic similarity of crisis-related tweets while accounting for the distribution of class weights.
The cosine similarity between sentence embeddings reflects how semantically similar or related the sentences are. If the embeddings are better at capturing the semantic content of crisis-related tweets within each class, the cosine similarity values within a class would be high. A higher cosine similarity within each class indicates that the embeddings effectively represent tweets that share similar content or context related to a specific crisis-related class. In summary, the higher the value of $D_{avg}$, the better the performance of a sentence encoder. We considered all the datasets listed in Table 3 for this task.
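The task definition can be turned into a short, dependency-free sketch: embeddings are normalized, grouped by class, averaged pairwise within each class, and combined using normalized inverse-frequency class weights. The function name weighted_intra_class_similarity is hypothetical, and two details are assumptions of this sketch: a tweet is not compared with itself, and every class is required to contain at least two tweets.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def weighted_intra_class_similarity(
    embeddings: Sequence[Sequence[float]], labels: Sequence[Hashable]
) -> float:
    """Weighted average of within-class mean pairwise cosine similarity."""
    by_class: Dict[Hashable, List[List[float]]] = defaultdict(list)
    for vector, label in zip(embeddings, labels):
        norm = math.sqrt(sum(x * x for x in vector))
        by_class[label].append([x / norm for x in vector])  # unit length

    # Inverse-frequency class weights, normalized to sum to 1.
    weights = {label: 1.0 / len(vecs) for label, vecs in by_class.items()}
    weight_total = sum(weights.values())

    d_avg = 0.0
    for label, vecs in by_class.items():
        n = len(vecs)
        if n < 2:
            # Assumption: a class needs at least two tweets to be scored.
            raise ValueError(f"class {label!r} has fewer than two tweets")
        pair_sum = 0.0
        for i in range(n):
            for j in range(n):
                if i != j:
                    # Dot product of unit vectors equals cosine similarity.
                    pair_sum += sum(a * b for a, b in zip(vecs[i], vecs[j]))
        d_k = pair_sum / (n * (n - 1))
        d_avg += (weights[label] / weight_total) * d_k
    return d_avg


if __name__ == "__main__":
    vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    classes = ["x", "x", "y", "y"]
    print(weighted_intra_class_similarity(vectors, classes))
```

The nested loop is quadratic in class size; for large datasets the same quantity can be obtained from a matrix product of the normalized embeddings.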

Checkpoints and convergence
After the pre-training, we were interested in multiple checkpoints of CrisisTransformers: CT-M1-*, CT-M2-*, and CT-M3-*. CT-M1 was built from scratch and had two variants: CT-M1-BestLoss, representing the model at its lowest loss achieved during training at the 26th epoch, and CT-M1-Complete, representing the model after 40 epochs. On the other hand, CT-M2 and CT-M3 were initialized using weights from pre-trained RoBERTa and BERTweet, respectively, and were trained up to 20 epochs each. CT-M2-OneLook represents the model after 1 epoch, while CT-M2-BestLoss and CT-M2-Complete represent the model at its lowest loss and the model after 20 epochs, respectively. The same setup was applied to CT-M3 models. In total, CrisisTransformers has 8 variants based on different checkpoints of CT-M1, CT-M2, and CT-M3 models.
Figure 4 visualizes the validation loss versus epoch for each model. The graph provides insights into the impact of different initialization on the models' convergence. The loss patterns of the three models revealed distinct behaviours. CT-M1 demonstrated a gradual and consistent reduction in loss throughout the training period, suggesting steady convergence. CT-M2, on the other hand, exhibited a sharp initial drop in the loss within a few training steps, indicating rapid convergence and a smoother decline. Similarly, CT-M3 also displayed a significant initial loss drop. While CT-M3 initially shared a sharp loss drop with CT-M2, its convergence pattern aligned more with CT-M1 in the later epochs. The final loss of CT-M3 ultimately converged closer to that of CT-M1. All models seemed to plateau in their loss during the later epochs, indicating a potential convergence point. These loss patterns highlight the influence of different initializations on the time and trajectory of loss convergence; the pre-trained models seem to leverage their existing knowledge for a more efficient initial convergence than the model whose weights were randomly initialized.

Evaluations
For the classification task, we considered MPNet, BERTweet, BERT, RoBERTa, XLM-RoBERTa, ALBERT, and ELECTRA as baselines for CrisisTransformers. As discussed in Section 3.3, we fine-tuned the baselines and CrisisTransformers for the classification task across 18 different crisis-related datasets, each identified by a unique identifier (D-01 through D-18; refer to Table 3). Results from the experiments are summarized in Table 4. Amongst the baselines, RoBERTa consistently outperforms other models with high F1-macro scores across several datasets. However, with the introduction of CrisisTransformers, the checkpoints of CT-M1 and CT-M2 stand out; overall, CrisisTransformers outperform the existing pre-trained models across all 18 datasets. The following models outperformed others in the respective number of datasets: CT-M1-BestLoss (in 4 datasets), CT-M1-Complete (in 3 datasets), CT-M2-OneLook (in 4 datasets), CT-M2-BestLoss (in 1 dataset), CT-M2-Complete (in 4 datasets), and CT-M3-OneLook (in 2 datasets). These results confirm the potential of CrisisTransformers for generalization and applicability in various crisis text classification tasks, which is particularly valuable for real-world applications such as disaster response, emergency communication, and crisis management.
Next, we performed sentence encoding tasks across all 18 datasets with the existing pre-trained models, CrisisTransformers, Sentence-Transformers, SimCSE, and CrisisTransformers-based sentence encoders.The results from the sentence encoding task are summarized in Tables 5-7.
The pre-trained models do not yield semantically meaningful sentence embeddings out-of-the-box. Nevertheless, one of our objectives was to investigate how well domain-adapted models perform in generating semantically meaningful sentence embeddings. Results (see Table 5) show that, among the existing pre-trained models, BERTweet emerged as a robust performer, consistently achieving competitive weighted average cosine similarity scores. However, CrisisTransformers, particularly the CT-M3 variants, invariably achieved the highest or second-highest scores regardless of the specific configuration (OneLook, BestLoss, or Complete). The performance of BERTweet and the CT-M3 variants can be attributed to BERTweet's pre-training on an extensive corpus of tweets. The results further indicate that the performance of the pre-trained RoBERTa is subpar. Consequently, the CT-M2 variants do not notably enhance performance. In contrast, the CT-M1 variants demonstrate a significant performance advantage over the CT-M2 variants. These findings suggest that further pre-training a domain-specific model on a sub-domain corpus (where "tweets" is the domain and "crisis-related tweets" the sub-domain) yields better sentence embeddings.
Furthermore, we trained CrisisTransformers using siamese and triplet networks with the MNR and MNR with hard negatives training objectives, as discussed in Section 3.4, to create sentence encoders specifically designed for crisis-related social media texts. We used GooAQ (Question, Answer) pairs for MNR, and QQP (anchor, positive, hard negative) triplets for MNR with hard negatives. As baselines for our sentence encoders, we utilized Sentence-Transformers and SimCSE. We considered the "all-mpnet-base-v2" model, which is the highest-performing model in Sentence-Transformers, and the "sup-simcse-roberta-base" model, a high-performing base architecture model for SimCSE. For both training objectives, we used only the first 10k pairs from GooAQ and QQP. We explored different sample sizes and ultimately found that using 10k pairs struck a balance between model performance and having fewer training samples. This is in contrast to our baselines, where Sentence-Transformers was trained on over 1 billion samples, and SimCSE was trained on 314k samples.
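To make the two training objectives concrete, the following is a minimal sketch of an MNR-style loss with in-batch negatives and optional hard negatives. It is an illustration under stated assumptions, not the training code used in this study: the function names, the toy 2-dimensional embeddings, the use of cosine similarity, and the scale factor of 20 are all illustrative choices that this section does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(anchors, positives, hard_negatives=None, scale=20.0):
    """Mean cross-entropy over scaled similarities (hypothetical helper).

    For anchor i, positives[i] is the correct candidate; every other
    positive in the batch, and every hard negative if given, acts as a
    negative. `scale` is an assumed temperature-like factor.
    """
    candidates = list(positives)
    if hard_negatives is not None:
        candidates += list(hard_negatives)
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)


# Toy embeddings standing in for encoder outputs (illustrative only).
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
hard_negatives = [[0.6, 0.8], [0.8, 0.6]]

loss_pairs = mnr_loss(anchors, positives)                     # pair-style objective
loss_triplets = mnr_loss(anchors, positives, hard_negatives)  # with hard negatives
loss_shuffled = mnr_loss(anchors, positives[::-1])            # mismatched pairs
```

In this sketch, adding hard negatives only enlarges the candidate set each anchor is contrasted against, so the loss for the same batch can only increase, and mismatched pairs are penalized more heavily than correctly aligned ones.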
Table 6 and Table 7 summarize the performance of the baselines and our sentence encoders in terms of the weighted average cosine similarity, and Table 8 reports the overall performance. Across all 18 datasets, our sentence encoders outperform both Sentence-Transformers and SimCSE. Notably, CT-M1-Complete-SE (MNR) and CT-M2-Complete-SE (hard negatives) each achieved the best performance on 4 datasets, and CT-M1-BestLoss-SE (MNR) and CT-M2-BestLoss-SE (hard negatives) each on 3 datasets. Overall, CT-M1 variants performed best on 11 datasets, CT-M2 variants on 6 datasets, and CT-M3 on 1 dataset. Considering training objectives, models trained with hard negatives achieved the highest scores on 11 datasets. CT-M1-Complete-SE (hard negatives), although trained on only 10k samples, achieved an average score of 0.7140, surpassing the current state-of-the-art, Sentence-Transformers (average score of 0.6374), by 12%. These results highlight the adaptability and effectiveness of CrisisTransformers-based sentence encoders in capturing semantic similarity within sentences, particularly in crisis-related contexts. This reinforces the idea that tailoring models to specific domains, like crisis situations, can yield significant improvements over more general-purpose models in sentence encoding tasks, even when trained with less data. Among the baselines, Sentence-Transformers outperformed SimCSE across all datasets. In fact, our CT-M3 variants (average scores ranging from 0.2663 to 0.2792) outperformed SimCSE (average score of 0.1765). The noticeable performance advantage of Sentence-Transformers over SimCSE can be attributed to the comprehensive training of its "all-mpnet-base-v2" model, which involved training on more than 1 billion sentence pairs/triplets. This extensive training likely provided the model with a broader and richer understanding of general language nuances, thus contributing to its superior performance.
To study the effect of training sample size, we re-trained CT-M1-Complete-SE (hard negatives) while increasing the training samples from 10k to 102k (complete QQP), and further augmented it with the AllNLI dataset for a training size of 378k. After this re-training, we observed an improvement of approx. 3.56% with complete QQP and approx. 4.83% with QQP+AllNLI. Overall, our best-performing sentence encoder improved the current state-of-the-art by around 17.43%. This observation sets the stage for potential enhancements to our sentence encoder. Going forward, our future objectives include training our sentence encoders on a scale similar to Sentence-Transformers for an even more substantial improvement.
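The reported percentages can be cross-checked with a short calculation. The sketch below assumes that the 3.56% and 4.83% gains are relative to the 10k-sample average score of 0.7140, and that the 12% and 17.43% figures are relative to the Sentence-Transformers average of 0.6374; the intermediate scores it derives are implied values under that assumption, not numbers reported in the tables.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


baseline = 0.6374  # Sentence-Transformers average score (reported)
ct_10k = 0.7140    # CT-M1-Complete-SE (hard negatives), 10k samples (reported)

gain_10k = relative_gain(ct_10k, baseline)  # about 12.02

# Assumption: the re-training gains are relative to the 10k-sample score.
ct_qqp = ct_10k * (1 + 3.56 / 100)         # implied score, complete QQP
ct_qqp_allnli = ct_10k * (1 + 4.83 / 100)  # implied score, QQP+AllNLI

gain_best = relative_gain(ct_qqp_allnli, baseline)  # about 17.43

print(f"10k samples vs. baseline: {gain_10k:.2f}%")
print(f"QQP+AllNLI vs. baseline:  {gain_best:.2f}%")
```

Under these assumptions the numbers are mutually consistent: a 12% lead over the baseline compounded with a further 4.83% gain gives roughly the 17.43% overall improvement.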

Conclusion
In this study, we introduced CrisisTransformers, an ensemble of pre-trained language models and sentence encoders designed for processing crisis-related social media texts. The pre-trained models were trained on a large-scale corpus of over 15 billion word tokens sourced from tweets associated with more than 30 crisis events that occurred between 2006 and 2023. Additionally, we fine-tuned the pre-trained models using siamese and triplet networks to create sentence encoders. Existing models and CrisisTransformers were evaluated on 18 crisis-specific datasets for classification and sentence encoding tasks. Our pre-trained models outperform strong baselines across all 18 datasets in classification tasks, and our best-performing sentence encoder improves the state-of-the-art by 17.43% in sentence encoding tasks. We publicly release CrisisTransformers, which include 8 variants of pre-trained models and the best-performing sentence encoder, hoping that they will serve as a robust baseline for tasks that involve processing crisis-related social media texts.
CrisisTransformers offers checkpoints of models trained from scratch (CT-M1) and those initialized with RoBERTa's weights (CT-M2) and BERTweet's weights (CT-M3). During experimentation, we observed that the models initialized from pre-trained weights (CT-M2 and CT-M3), which undergo further pre-training, leverage existing knowledge for efficient initial learning, unlike the randomly initialized CT-M1. CT-M2 and CT-M3 exhibited rapid initial drops in loss; CT-M3 later aligned with CT-M1 in terms of final loss. All models plateaued, implying convergence. In classification, CT-M1 performed best on 7 datasets, CT-M2 on 9, and CT-M3 on 2. Regarding sentence encoding, CT-M1 performed best on 11 datasets, CT-M2 on 6, and CT-M3 on 1. Considering the training objectives, models trained with hard negatives achieved the highest scores on 11 datasets, which is in line with what has been reported in the literature. We noticed that CT-M1 at the lowest loss, utilizing only 10k training samples with the MNR with hard negatives training objective, outperformed the state-of-the-art Sentence-Transformers (trained on 1 billion samples) by a significant margin of 12%. By increasing the training samples to 378k using the QQP+AllNLI datasets, the improvement rose further to 17.43%. This observation confirmed that domain-specific pre-trained models demonstrate significant improvements over general-purpose models in sentence encoding tasks. Going forward, our future objectives include training the sentence encoders on a scale similar to Sentence-Transformers. Also, the proposed models process only English-language tweets. As a future task, we aim to release their multi-lingual versions.

Disclaimer
The training corpus used by CrisisTransformers had a significant volume of unfiltered tweets, which inherently carry non-neutral content. As a result, both the pre-trained models and their fine-tuned versions are susceptible to biased predictions.
LIEF HPC-GPGPU Facility hosted at the University of Melbourne, which was established with the assistance of LIEF Grant LE170100200. The cloud infrastructure required to maintain COV19Tweets over the last three years was provided by DigitalOcean. We appreciate the insights provided by Dat Quoc Nguyen (BERTweet's co-author) during the pre-training phase of CrisisTransformers.

Figure 3: Training of our sentence encoders.

Figure 4: Validation loss versus epoch for CrisisTransformers' CT-M1-*, CT-M2-*, and CT-M3-* checkpoints, showing the impact of different initializations. The loss for CT-M1 at Epoch 0 was 9.841, and it achieved its lowest loss at the 26th epoch. For CT-M2, the loss at Epoch 0 was 2.26, and it achieved its lowest loss at the 8th epoch. Lastly, CT-M3 started with a loss of 2.856 at Epoch 0 and reached its lowest loss at the 15th epoch. The maximum value for the y-axis in this figure has been set to 3.

Table 1: Descriptive statistics of the preprocessed corpus. Note: A tweet can have multiple sentences.

Table 2: Vocabulary similarity between existing pre-trained models and CrisisTransformers. Note: intersection denotes the number of tokens shared between the existing models and CrisisTransformers, while unique indicates the tokens exclusive to the vocabulary of the existing models.

Table 3: Labelled crisis datasets considered in this study for evaluating the performance of baselines and CrisisTransformers.

Table 4: Performance of the existing pre-trained models and CrisisTransformers (CT-*) on the classification task across 18 crisis datasets (D-01 through D-18), with average F1-macro being reported. For the dataset name corresponding to each dataset identifier, please refer to Table 3. The best scores are shown in bold.

Table 5: Performance of the existing pre-trained models and CrisisTransformers on the sentence encoding task across 18 crisis datasets (D-01 through D-18), with weighted average cosine similarity being reported. The best scores are shown in bold, and the second-best scores are underlined. Note that results reported in this table are intended for comparative purposes only; pre-trained models do not produce semantically meaningful sentence embeddings out-of-the-box.

Table 8: Overall performance of the evaluated sentence encoders across 18 datasets.