The Effect of Training Data Size on Disaster Classification from Twitter

Abstract: In the realm of disaster-related tweet classification, this study presents a comprehensive analysis of various machine learning algorithms, shedding light on crucial factors influencing algorithm performance. The exceptional efficacy of simpler models is attributed to the quality and size of the dataset, enabling them to discern meaningful patterns. While powerful, complex models are time-consuming and prone to overfitting, particularly with smaller or noisier datasets. Hyperparameter tuning, notably through Bayesian optimization, emerges as a pivotal tool for enhancing the performance of simpler models. A practical guideline for algorithm selection based on dataset size is proposed: Bernoulli Naive Bayes for datasets below 5000 tweets and Logistic Regression for larger datasets exceeding 5000 tweets. Notably, Logistic Regression shines with 20,000 tweets, delivering an impressive combination of performance, speed, and interpretability. A further improvement of 0.5% is achieved by applying ensemble and stacking methods.


Introduction
The increasing use of social media platforms during disaster events has opened up new avenues for extracting valuable information and enhancing crisis response efforts. The identification of informative messages and the classification of crisis-related content have become essential tasks in crisis informatics.
In particular, natural disasters such as floods, earthquakes, storms, extreme weather events, landslides, and wildfires could see better operational management and assessment by utilizing the immediate and valuable information provided by social media platforms such as Facebook, Twitter, Instagram, Flickr, and others. Real-time information disseminated through public posts could promote community disaster awareness and warnings [1], helping to mobilize the scientific community to produce more accurate forecasting of the evolution of extreme events as well as to improve authorities' actions and response [2]. For example, in [3] the authors studied the capacity of Twitter to spread vital emergency information to the public based on real-time posts uploaded on the platform during Hurricane Sandy. Belcastro et al. [4] focused on identifying secondary disaster-related events from social media. Annis and Nardi [5] integrated crowdsourced data and images into a hydraulic model to improve flood forecasting and support decision-making in early warning situations. Peary et al. [6] examined the use of social media in disaster preparedness for earthquakes and tsunamis. Styve et al. [7] utilized Twitter text data to assess the level of extreme weather events such as heavy precipitation, storms, and sea level rise events to enhance the adaptive capacity to mitigate extreme events.
The training strategies employed vary, with some studies utilizing publicly available crisis-related datasets such as CrisisNLP, CrisisMMD, and SMERP and others leveraging their own collected data from specific events such as Hurricanes Sandy [15] and Harvey [17]. Training sets are typically partitioned into subsets for training, validation, and testing purposes. In some cases, only one set is used and the evaluation is performed through cross-validation [11,14,16]. Evaluation metrics, including accuracy, F1 score, precision, recall, area under the curve (AUC), and weighted average precision and recall, are employed to assess the performance of the classification models.
Comparing the results across relevant studies reveals valuable insights into the effectiveness of different approaches. The application of CNN often demonstrates superior performance compared to other algorithms such as SVM, NB, RF, and CART in identifying informative messages and classifying crisis-related content [8,9,13,20]. Deep learning models, particularly CNNs, exhibit the ability to capture complex patterns and representations from the noisy and diverse nature of social media data during crisis events.
Furthermore, the utilization of advanced language models such as BERT, DistilBERT, and RoBERTa combined with careful preprocessing steps has shown promising results in benchmarking crisis-related social media datasets [10]. The removal of symbols, emoticons, invisible and non-ASCII characters, punctuation, numbers, URLs, and hashtag signs enhances the generalization capabilities of these models [9,10,18,19]. Using the RoBERTa and BERT models with a combination of datasets has demonstrated improved performance in generalization and classification accuracy [10,16].
Several papers have challenged the notion that deep learning models consistently outperform other algorithms or have produced results showing that the performance difference is small. These studies provide valuable insights into alternative approaches that yield competitive results. While deep learning models excel at capturing complex patterns and representations in crisis-related data, other algorithms can offer complementary advantages such as interpretability, efficiency, and robustness.
For example, a number of studies have explored the effectiveness of traditional machine learning algorithms such as SVM, NB, RF, and DT for information identification and content classification. These algorithms leverage carefully engineered features such as unigrams, bigrams, and trigrams as well as topic modeling techniques such as LDA. When coupled with non-deep learning algorithms, these feature engineering techniques have demonstrated competitive performance in various scenarios [15]. Moreover, research comparing deep learning models against NB in identifying informative tweets during disasters has shown contrasting results. While CNNs consistently outperform NB classifiers, the performance of NB classifiers depends heavily on the nature of the data. Notably, NB classifiers perform considerably worse when trained on natural disaster events and evaluated on non-natural disasters, and vice versa. These findings emphasize the importance of understanding the characteristics and distribution of the data when selecting appropriate algorithms [20].
Additionally, several studies have explored the use of ensemble methods and boosted algorithms such as XGBoost, AdaBoost, and Deep Belief Networks (DBN) to leverage the strengths of multiple models. These techniques aim to improve overall classification performance by combining the outputs of individual classifiers or by employing sophisticated learning algorithms that capture complex relationships in the data [13].
By considering these alternative approaches, it is possible to gain a more nuanced understanding of the strengths and limitations of different algorithms in crisis informatics. While deep learning models, particularly CNNs, have shown remarkable performance, they are not always the optimal choice in all scenarios. Other algorithms, such as traditional machine learning models and ensemble methods, offer valuable alternatives that balance performance, interpretability, efficiency, and robustness.

Related Work
Several studies have investigated the impact of training set size and dataset size on supervised machine learning. The authors of [21] examined various algorithms in the field of land cover classification using large-area high-resolution remotely sensed data, including Artificial Neural Network (ANN), Support Vector Machine (SVM), Random Forest (RF), and others, across a wide range of data sizes from 40 to 10,000 samples. The study found that RF achieved a high accuracy of almost 95% with small training sample sets, and there were minimal variations in overall accuracy between small and even very large sample sets.
Another study explored the impact of training and testing data splits by investigating the performance of Gaussian Process models on time series forecasting tasks. The study examined data sizes ranging from 2 months to 12 months, and found that the performance of the models varied depending on the specific training data splits [22].
Furthermore, the work of [23] focused on image datasets for plant disease classification. Their study emphasized the need for a substantial amount of data to achieve optimal performance.
In the context of machine learning algorithm validation with limited sample sizes, in [24] the authors examined simulated datasets ranging from 20 to 1000 samples. Their study compared Support Vector Machine (SVM), Logistic Regression (LR), and different validation methods. They concluded that K-fold Cross-Validation (CV) can lead to biased performance estimates, while Nested CV and training/testing split approaches provide more robust and unbiased estimates.
Additionally, the impact of dataset size on training sentiment classifiers was explored in [25]. The study used seven datasets ranging from 1000 to 243,000 samples, and evaluated algorithms such as Decision Tree (DT), Naive Bayes (NB), K-Nearest Neighbors (KNN), and Support Vector Machine (SVM). The findings showed that all algorithms improved with increased training set size, with NB performing the best overall and reaching a plateau after 81,000 samples.
In the context of political text classification, the study of [26] addressed data scarcity using deep transfer learning with BERT-NLI. BERT-NLI fine-tuned on 100-2500 texts outperformed classical models by 10.7-18.3 points. It matched the performance of classical models trained on 5000 texts with just 500 texts, showing significant data efficiency and improved handling of imbalanced data.
In the domain of sentiment analysis of Twitter data, [27] investigated the influence of different training set sizes ranging from 10% to 100%. The results indicated that changing the training set size did not significantly affect the sentiment classification accuracy, with SVM outperforming Naive Bayes.
Another Twitter study on pharmacovigilance [28] used weak supervision with noisy labeling to train machine learning models, exploring various training set sizes from 100,000 to 3 million tweets. Classical models such as SVM and deep learning models such as BERT performed well, showing similar accuracy to gold standard data.
Finally, [29] examined the impact of dataset size on Twitter classification tasks using various datasets and models, including BERT, BERTweet, LSTM, CNN, SVM, and NB. The results indicated that adding more data is not always beneficial in supervised learning. More than 1000 samples were recommended for reliable performance; notably, transformer-based models remained competitive even with smaller sample datasets.
The contributions of the present paper are summarized as follows:

• Algorithm performance analysis: This study systematically evaluates the performance of multiple machine learning algorithms for tweet classification in the context of disaster events. It provides valuable insights into the strengths and weaknesses of each algorithm, aiding practitioners in making informed choices. It also employs ensemble and stacking techniques to further boost performance.
• Hyperparameter tuning importance: By emphasizing the significance of hyperparameter tuning, particularly through Bayesian optimization, this work underscores the potential performance gains achievable by systematically exploring the hyperparameter space. This knowledge can guide practitioners in optimizing their models effectively.
• Occam's razor in machine learning: The application of Occam's razor to machine learning model selection is explored, emphasizing the advantages of simpler models in terms of interpretability and reduced overfitting risk.
• Impact of dataset size on model choice: This research establishes a practical guideline for selecting the most suitable algorithm based on dataset size. This contribution can aid practitioners in making efficient and effective algorithm choices based on the scale of their data.

Materials and Methods
This section offers all the necessary resources for comprehending the basis of our experiments and grasping their outcomes.

Classification Models
Naive Bayes. We used a variant of Naive Bayes called Bernoulli Naive Bayes [36], which assumes binary features. Given the class, it models the conditional probability of each feature as a Bernoulli distribution. It is commonly used for text classification tasks where features represent the presence or absence of words. Despite its simplifying assumptions, Bernoulli Naive Bayes can achieve good performance and is computationally efficient. It works by calculating the likelihood of each feature given the class, and then combining the results with prior probabilities to make predictions using Bayes' theorem.
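As a minimal sketch of this presence/absence formulation with scikit-learn (the library used later in this paper), binary word features can be produced with `CountVectorizer(binary=True)`; the toy tweets and labels below are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy corpus: 1 = informative, 0 = not informative (labels are illustrative)
tweets = [
    "flood warning issued for the river valley",
    "earthquake damage reported downtown",
    "having coffee with friends this morning",
    "what a lovely sunny day",
]
labels = [1, 1, 0, 0]

# binary=True encodes the presence or absence of each word, matching the
# Bernoulli assumption of binary features
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(tweets)

clf = BernoulliNB()
clf.fit(X, labels)

pred = clf.predict(vectorizer.transform(["flood damage reported in the valley"]))
print(pred[0])
```

Note that, unlike Multinomial Naive Bayes, the Bernoulli variant also penalizes the absence of class-typical words, which is why the binarized encoding matters.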
Light Gradient Boosting. Light Gradient Boosting (LightGBM) [37] is a gradient boosting framework that uses tree-based learning algorithms. It aims to provide a highly efficient and scalable solution for handling large-scale datasets. LightGBM builds trees in a leaf-wise manner, making it faster and more memory-efficient than other boosting algorithms. It uses a gradient-based approach to iteratively optimize the model by minimizing the loss function. LightGBM also implements features such as regularized learning, bagging, and feature parallelism to improve performance. Thanks to its ability to handle large datasets and high predictive accuracy, LightGBM has become popular in various machine learning tasks, including classification, regression, and ranking.
Linear SVC. Linear Support Vector Classifier (Linear SVC) [38] is a linear classification algorithm that belongs to the Support Vector Machine (SVM) family. It aims to find the optimal hyperplane that separates the data points of different classes. Linear SVC works by maximizing the margin between the classes while minimizing the classification error. Unlike traditional SVM, Linear SVC uses a linear kernel function, making it computationally efficient and suitable for large-scale datasets. It performs well in scenarios where the classes are linearly separable. Linear SVC is widely used in various applications such as text classification, image recognition, and sentiment analysis. It provides a powerful and interpretable solution for binary and multiclass classification problems, offering good generalization and robustness to noisy data.
Logistic Regression. Logistic Regression [39] is a popular statistical model used for binary classification tasks. It predicts the probability of an instance belonging to a certain class by fitting a logistic function to the input features. The model estimates the coefficients for each feature, which represent the influence of the corresponding feature on the outcome. Logistic Regression assumes a linear relationship between the features and the log-odds of the target variable. It is a simple and interpretable algorithm that performs well when the classes are linearly separable. Logistic Regression is widely used in domains such as healthcare, finance, and marketing for tasks including churn prediction, fraud detection, and customer segmentation. It is computationally efficient and provides probabilistic outputs, making it suitable for both binary and multi-class classification problems.
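A minimal end-to-end sketch of Logistic Regression for tweet classification combines TF-IDF features with scikit-learn's `LogisticRegression` in a pipeline; the data below is invented for illustration and is not the paper's:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = [
    "flood warning issued for the river valley",
    "earthquake damage reported downtown",
    "having coffee with friends this morning",
    "what a lovely sunny day",
]
labels = [1, 1, 0, 0]  # 1 = informative, 0 = not informative (illustrative)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(tweets, labels)

new_tweet = ["flood reported near the river"]
pred = model.predict(new_tweet)[0]
proba = model.predict_proba(new_tweet)[0]  # probabilistic output, sums to 1
print(pred, proba)
```

The `predict_proba` output is what makes Logistic Regression convenient when calibrated probabilities, rather than hard labels, are needed downstream.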
XGBoost. XGBoost (Extreme Gradient Boosting) [40] is an advanced gradient boosting algorithm that has gained popularity in machine learning competitions and real-world applications. It is designed to optimize performance and scalability by utilizing a gradient boosting framework. XGBoost trains an ensemble of weak decision tree models sequentially, with each subsequent model correcting the errors made by the previous models. It incorporates regularization techniques to prevent overfitting and provides options for customizing the learning objective and evaluation metrics. XGBoost supports both classification and regression tasks, and can efficiently handle missing values and sparse data. It also offers features such as early stopping, parallel processing, and built-in cross-validation. XGBoost excels in capturing complex patterns and interactions in the data, making it a powerful tool for predictive modeling. Its high performance and flexibility have made it a popular choice across various domains, including finance, healthcare, and online advertising.
Convolutional Neural Networks. CNNs are deep learning models commonly used for analyzing and classifying text data, including tweets. They excel at capturing local patterns and dependencies within the text through the use of convolutional layers. By applying filters to the input text, CNNs extract features and learn representations that are relevant for classification tasks. These models have proven effective in tasks such as sentiment analysis, topic classification, and spam detection. CNNs offer a robust approach for understanding and categorizing tweet content based on its textual characteristics. We use a similar architecture to that proposed by [41].

Setup
Our preprocessing step included various techniques that have previously shown increased performance on Twitter texts [42,43].

The hyperparameters of the machine learning algorithms were tuned using Bayesian optimization [44]. Bayesian optimization combines statistical modeling and sequential decision-making to find the optimal set of hyperparameters that maximizes the performance of the model. Unlike traditional grid or random search methods, Bayesian optimization intelligently explores the hyperparameter space by learning from previous evaluations. The process begins by constructing a probabilistic surrogate model, such as a Gaussian process or tree-based model, to approximate the unknown performance function. This surrogate model captures the trade-off between exploration and exploitation, allowing for informed decisions to be made about which hyperparameters to evaluate next. Bayesian optimization uses an acquisition function to balance exploration (sampling uncertain regions) and exploitation (focusing on promising areas). By iteratively evaluating the model's performance with different hyperparameter configurations, Bayesian optimization updates the surrogate model and refines its estimation of the performance landscape. This iterative process guides the search towards promising regions, ultimately converging to the set of hyperparameters that yield the best performance.
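To make this loop concrete, the following is a deliberately minimal sketch (not the paper's implementation) of Bayesian optimization over a single hyperparameter: a Gaussian process surrogate plus the expected improvement acquisition function. The quadratic "validation score" surface with its optimum at 0.7 is purely illustrative and stands in for a real cross-validation objective:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def objective(x):
    # Toy 1-D "validation score"; in practice this would be a CV score
    # evaluated at hyperparameter value x
    return -(x - 0.7) ** 2 + 1.0


rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 3).reshape(-1, 1)  # initial random evaluations
y = objective(X).ravel()

grid = np.linspace(0, 1, 201).reshape(-1, 1)
for _ in range(10):
    # Surrogate model of the unknown performance function
    gp = GaussianProcessRegressor(
        kernel=Matern(nu=2.5), normalize_y=True, alpha=1e-6
    ).fit(X, y)
    mu, sigma = gp.predict(grid, return_std=True)

    # Expected improvement: balances exploitation (mu - best) and
    # exploration (sigma)
    best = y.max()
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (mu - best) / sigma
        ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
    ei[sigma == 0] = 0.0

    x_next = grid[np.argmax(ei)]          # most promising point to try next
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next[0]))

best_x = float(X[np.argmax(y), 0])
print(best_x)  # should lie near the true optimum at 0.7
```

In practice, libraries such as scikit-optimize wrap exactly this loop behind a search interface, so one rarely writes the acquisition function by hand.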
The CNN model was trained by following the guidelines from the CrisisBench work [10]. We used the architecture proposed by [41] and the Adam optimizer [45]. The batch size was 128, the maximum number of epochs was 1000, there were 300 filters with window sizes and pooling lengths of 2, 3, and 4, and the dropout rate was 0.02. The early stopping criterion was based on the accuracy on the development set with a patience of 200.
All models were evaluated using the F-measure due to the class imbalance between the two labels (informative vs. not informative).
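For reference, the F-measure is the harmonic mean of precision and recall, which makes it more informative than accuracy under class imbalance. With scikit-learn this is `f1_score`; the labels below are invented to illustrate the computation:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Imbalanced toy labels: 1 = informative, 0 = not informative
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 0, 1]

precision = precision_score(y_true, y_pred)  # 5 true positives / 6 predicted positives
recall = recall_score(y_true, y_pred)        # 5 true positives / 6 actual positives
f1 = f1_score(y_true, y_pred)                # 2 * P * R / (P + R)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```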
The machine learning models were developed in Python using the scikit-learn package [46], while the CNN models were developed using the keras package [47].

Results and Discussion
The CrisisBench dataset used in this work is already split into three sets: a training set of 109,441 samples, a development set of 15,870 samples, and a test set of 31,037 samples. We trained and tuned our models on the training set and validated them separately on both the development set and the test set. We performed both validations because we are interested in whether the models generalize well and avoid overfitting to one set. For each machine learning algorithm, we created 21 models using subsamples of the training set, starting at 1% of its size and then increasing in steps of 5%. The resulting samples were 1094, 5472, 10,944, 16,416, . . ., 109,441 in size.
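The reported subsample sizes are consistent with an initial 1% subsample followed by 5% increments of the 109,441-tweet training set, assuming rounding to the nearest integer:

```python
total = 109_441  # CrisisBench training set size
fractions = [0.01] + [0.05 * k for k in range(1, 21)]  # 1%, then 5% .. 100%
sizes = [round(total * f) for f in fractions]
print(len(sizes), sizes[:4], sizes[-1])
```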
The results of this study are presented in Table 2 for the development set and Table 3 for the test set. These tables are visualized in Figures 1 and 2, respectively.

One notable observation lies in the striking similarity between the results obtained on the development and test datasets. This resemblance manifests not only in the performance metrics but also in the characteristic trends seen in the line plots as the training size increases. Notably, the F-measure exhibits a consistent trend, with a marginal 0.4% increment on average in the test set, which possesses twice the sample size of the development set (31,037 vs. 15,870). This convergence of outcomes hints at the high quality of the dataset and implies a degree of linguistic similarity between these three subsets. Indeed, it is reasonable to anticipate such lexical overlaps, particularly when dealing with disaster-related events; new and unseen data would seldom introduce a multitude of previously unencountered terms. Another noteworthy implication is that, in the context of this specific problem, the algorithms we employed demonstrate a remarkable capacity for generalization. This assertion is substantiated by their nearly indistinguishable performance on the development and test sets, suggesting a high likelihood of continued success when applied to entirely new and unfamiliar Twitter datasets.

Another notable observation centers on the strong performance achieved by the models. When trained on the entire training dataset, all algorithms consistently exhibit F-measure scores exceeding 85%. Specifically, in ascending order of performance on the test set, the scores are as follows: Multinomial Naive Bayes (85.59%), Bernoulli Naive Bayes (86.20%), Convolutional Neural Network (86.61%), XGBoost (87.91%), LightGBM (88.49%), Linear Support Vector Classification (88.53%), and Logistic Regression (88.54%).
It is noteworthy to highlight that our CNN model's results align closely with those reported in the paper introducing this dataset, as cited in [10], where the authors reported an F-measure of 86.6%. Intriguingly, our simpler Bayesian-optimized models outperform the CNN, with Logistic Regression surpassing it by nearly 2%. This observation suggests that the problem at hand can be characterized as relatively tractable, since even straightforward algorithms can attain nearly 90% F-measure performance with some parameter tuning.
The phenomenon of simpler algorithms consistently outperforming their more complex counterparts on certain machine learning tasks, as evidenced in this particular case, can be attributed to a combination of several key factors.
First and foremost, the quality and size of the dataset wield substantial influence. In this instance, the dataset is not only sizable but also characterized by its cleanliness and meticulous structure. Such a favorable data environment allows even simpler algorithms to adeptly capture meaningful patterns, and in these circumstances the introduction of complex models may not yield a significant advantage.
Another pivotal factor contributing to this phenomenon is the propensity for complex models, exemplified in this case by deep neural networks such as the Convolutional Neural Network (CNN), to succumb to overfitting. This vulnerability becomes particularly pronounced when dealing with smaller or noisy datasets. Complex models have a higher capacity to memorize the idiosyncrasies and noise present in the training data, resulting in suboptimal generalization when confronted with unseen data. In stark contrast, simpler models are characterized by their reduced complexity, which renders them more resilient to overfitting, ultimately bolstering their robustness.
Hyperparameter tuning also plays a crucial role in elucidating the superior performance of simpler algorithms over complex ones. Bayesian optimization systematically explores and identifies optimal hyperparameters for simpler models. This meticulous tuning process can propel these models to deliver exceptional performance, often surpassing their more complex counterparts.
Moreover, opting for simpler models aligns with the age-old principle of Occam's razor. This principle posits that when all other factors are equal, simpler models should be preferred. Simpler models are inherently more interpretable and carry a reduced risk of overfitting. In many real-world scenarios, these streamlined models can effectively approximate the underlying data distribution, especially when the problem at hand lacks excessive complexity.
The improved performance of simpler models is not solely evident when training with the entire dataset; rather, it becomes increasingly apparent as the size of the training dataset grows. This trend is clearly illustrated in the figures, where the CNN and Bernoulli models consistently exhibit lower performance while the other four algorithms demonstrate similar scores.
Notably, Bernoulli Naive Bayes stands out, achieving its highest F-measure at just 10% of the training set (10,944 samples). In contrast, the remaining algorithms continue to improve their F-measures as the size of the training set increases. Remarkably, at as little as 1% of the data, Bernoulli already outperforms the other algorithms. For the rest of the algorithms, if we were to determine a point where performance gains become marginal, it would be around 60% of the training set (65,664 samples).
As demonstrated above, this problem appears relatively easy to solve, as even at a conservative cutoff of 20% all algorithms except CNN achieve F-measures above 85%. Beyond this point, the algorithms exhibit only marginal improvements, typically in the range of 1-2%. Considering the substantial increase in training data required for this minor gain, it becomes evident that the effort may not be justified.
Based on these findings, it is possible to establish a practical rule for selecting the most suitable algorithm for a given dataset size. When dealing with datasets containing fewer than 5000 tweets, the Bernoulli Naive Bayes model emerges as a promising choice. Conversely, when confronted with larger datasets exceeding 5000 tweets, Logistic Regression proves to be the preferred option. Both of these algorithms are also the fastest in terms of execution.
Notably, with a dataset size of 20,000 tweets, Logistic Regression offers a compelling advantage, delivering an impressive 87% F-measure. This is achieved with remarkable speed and a high level of interpretability, making Logistic Regression an attractive choice for such datasets.

Further Improvements using Ensemble and Stacking Approaches
As a final experiment, we tried to push the performance even further by utilizing ensemble and stacking methods.
In an ensemble methodology for classification, majority voting is applied to predictions derived from multiple algorithmic models. This synergistic approach aims to enhance the predictive F-measure beyond the capability of any single model within the ensemble. By capitalizing on the divergent strengths of various models and mitigating their individual weaknesses, ensemble approaches can offer a robust solution showcasing superior performance, diminished susceptibility to overfitting, and enhanced predictive stability.
Stacking involves a hierarchical model architecture in which a secondary model (the meta-learner) is trained to synthesize the predictions of multiple primary models (the base learners). The base learners are trained on the complete dataset, while the meta-learner's training leverages the base learners' predictions as input features. The meta-learning phase aims to capture the essence of the predictions made by the base models, yielding a refined final prediction.
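Both techniques map directly onto scikit-learn's ensemble API. The sketch below, using synthetic data and three of the simpler base models (an illustration, not the paper's exact configuration), shows hard majority voting alongside stacking with a Logistic Regression meta-learner:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC

# Synthetic stand-in for vectorized tweet features
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

base = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", BernoulliNB()),
    ("svc", LinearSVC()),
]

# Hard (majority) voting over the base models' class predictions
voter = VotingClassifier(estimators=base, voting="hard").fit(X, y)

# Stacking: a Logistic Regression meta-learner is trained on the base
# models' out-of-fold predictions
stack = StackingClassifier(
    estimators=base, final_estimator=LogisticRegression()
).fit(X, y)

v_acc = voter.score(X, y)
s_acc = stack.score(X, y)
print(v_acc, s_acc)
```

Note that `StackingClassifier` handles the cross-validated generation of base predictions internally, which avoids leaking training labels into the meta-learner.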
The efficacy of both voting and stacking methodologies is significantly bolstered by incorporating models that are uncorrelated or exhibit low correlation in terms of their errors. Figure 3 shows the results of computing the Pearson correlation between the predictions on the validation set (training with 20% of the data) of the six algorithms that we used. It can be observed that the highest correlations are between XGBoost and Light Gradient Boosting (0.89) and between Logistic Regression and Linear SVC (0.86). The CNN's predictions show the lowest overall correlations, ranging from 0.63 to 0.7.
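Such a correlation matrix can be computed directly from the models' prediction vectors with `numpy.corrcoef`; the binary predictions below are invented for three hypothetical models, not taken from Figure 3:

```python
import numpy as np

# Illustrative binary predictions from three hypothetical models on 10 tweets
preds = {
    "logreg": np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 1]),
    "svc":    np.array([1, 1, 0, 1, 0, 1, 0, 0, 0, 1]),  # differs in 1 position
    "cnn":    np.array([1, 0, 0, 1, 1, 1, 1, 0, 1, 1]),  # differs in 3 positions
}

names = list(preds)
mat = np.corrcoef(np.vstack([preds[n] for n in names]))
# mat[i, j] is the Pearson correlation between model i's and model j's predictions
print(names)
print(np.round(mat, 2))
```

As expected, the two near-identical prediction vectors correlate more strongly with each other than either does with the more divergent one.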
According to these results, we decided not to use the XGBoost model for the ensemble and stacking experiment due to its high correlation with Light Gradient Boosting; as a single model, XGBoost is also the inferior of the two and requires more computational time. While we could have also dropped Linear SVC, that would result in four final models, and majority voting works best with an odd number of models, as this avoids ties.
For the stacking method, we used Logistic Regression as the meta-learner. Because stacking uses the predictions of the individual models as features for the meta-learner, the validation set was used as the training set.
Table 4 shows the F-measure results for the individual models trained on 20% of the training set in comparison to the ensemble and stacking models. Both validation and test results are shown. The majority voting ensemble technique shows improved performance over the best individual model by about 0.5% in both validation and test sets, reaching 87.29% and 87.57%, respectively. Stacking shows the same behavior, improving the test set's F-measure by about 0.5%, ultimately reaching 87.61%.

Conclusions
After conducting a comprehensive analysis of various machine learning algorithms and ensembles for tweet classification in disaster contexts, our study reveals several key findings.
The exceptional performance of simpler models can be attributed in part to the quality and size of the dataset. This dataset, characterized by its substantial size and meticulous organization, allowed even basic models to capture meaningful patterns. However, it is important to note that the advantages of simpler models may diminish when dealing with more complex or noisy datasets.
Complex models such as deep neural networks are susceptible to overfitting, especially in scenarios involving smaller or noisier datasets. These models tend to memorize noise, which hinders their ability to generalize effectively. In contrast, simpler models with reduced complexity exhibit greater resilience to overfitting.
Hyperparameter tuning, particularly through Bayesian optimization, played a pivotal role in enhancing the performance of the simpler models. Systematically exploring the hyperparameter space enabled these models to outperform their more complex counterparts.
The principle of Occam's razor, favoring simpler models when all other factors are equal, holds true in this context. Simpler models are not only easier to interpret, they are less prone to overfitting; thus, in many real-world scenarios these streamlined models can effectively approximate the underlying data distribution, particularly for problems that lack excessive complexity.
Our findings also underscore the significance of training dataset size. The superiority of simpler models becomes increasingly evident as the size of the training dataset grows. Even with as little as 20% of the data, simpler models consistently achieve F-measures above 85%, while complex models exhibit only marginal improvements with more data.
Based on these insights, we propose a practical guideline for algorithm selection based on dataset size. For datasets containing fewer than 5000 tweets, Bernoulli Naive Bayes emerges as a promising choice. Conversely, for larger datasets exceeding 5000 tweets, Logistic Regression proves to be the preferred option. Notably, Logistic Regression offers a compelling advantage with a dataset size of 20,000 tweets, delivering an impressive 87% F-measure along with speed and interpretability benefits. Ensemble and stacking approaches can also be used to further boost the results by approximately 0.5%.
These findings could be used in operational forecasting and management of extreme events. Text-related information transmitted in real time through social media channels could be used to support response operations in the case of extreme weather events, floods, earthquakes, storms, and wildfires. Earth-related Digital Twin technologies could capitalize on the explosion of these new data sources to enable simultaneous communication with real-world systems and models.
In summary, our study highlights the importance of dataset quality, model complexity, hyperparameter tuning, and the principle of simplicity when choosing a machine learning algorithm. These insights provide valuable guidance for practitioners in text classification, especially when dealing with disaster-related tweet data. Ultimately, our findings emphasize the substantial performance potential of simpler models when applied judiciously, even in scenarios that might initially be perceived as more complex.
In future work, we intend to explore and expand these findings on dataset quality and model complexity across more diverse classifications, such as identifying multiple types of disaster-related tweets.
The preprocessing techniques were: (a) removal of non-ASCII characters; (b) replacing URLs and user mentions; (c) removing hashtags; (d) removing numbers; (e) replacing multiple repetitions of exclamation marks, question marks, and stop marks; and (f) lemmatization.
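A regex-based sketch of steps (a)-(e) is shown below. Step (c) is implemented here as stripping the "#" sign while keeping the word, one common interpretation, and the replacement tokens `<url>` and `<user>` are assumptions; lemmatization (f) is omitted to keep the snippet dependency-free (it is typically done with, e.g., NLTK's WordNetLemmatizer):

```python
import re


def preprocess(tweet: str) -> str:
    t = tweet.encode("ascii", "ignore").decode()      # (a) drop non-ASCII chars
    t = re.sub(r"https?://\S+|www\.\S+", "<url>", t)  # (b) replace URLs
    t = re.sub(r"@\w+", "<user>", t)                  # (b) replace user mentions
    t = re.sub(r"#(\w+)", r"\1", t)                   # (c) strip hashtag signs
    t = re.sub(r"\d+", "", t)                         # (d) remove numbers
    t = re.sub(r"([!?.])\1+", r"\1", t)               # (e) collapse repeated !, ?, .
    # (f) lemmatization would follow here, e.g., via NLTK's WordNetLemmatizer
    return re.sub(r"\s+", " ", t).strip()


print(preprocess("Flooding!!! @alice see https://t.co/abc #flood2023 caf\u00e9"))
```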

Figure 1. Comparison of Bernoulli Naive Bayes, Light Gradient Boosting, Linear Support Vector Machine, Logistic Regression, Extreme Gradient Boosting, and Convolutional Neural Network as the training set size increases with steps of 5%, followed by validation on the development set using the F-measure.

Figure 2. Comparison of Bernoulli Naive Bayes, Light Gradient Boosting, Linear Support Vector Machine, Logistic Regression, Extreme Gradient Boosting, and Convolutional Neural Network as the training set size increases with steps of 5%, followed by validation on the test set using the F-measure.

Figure 3. Pearson correlation between predictions on the validation set when training with 20% of the training set.

Table 1. Dataset tweet distribution across the different sources.

Table 2. F-measure scores of the Machine Learning and Deep Learning methods obtained by fitting while increasing the training set size and evaluating on the development set.

Table 3. F-measure scores of the Machine Learning and Deep Learning models obtained by fitting while increasing the training set size and evaluating on the test set.

Table 4. F-measures on the validation and test sets when training with 20% of the data across individual algorithms and ensembles.