Know an Emotion by the Company It Keeps: Word Embeddings from Reddit/Coronavirus



Introduction
Recent statistics show that there are 4.55 billion social media users around the world, equating to 57.6% of the total global population [1].
With the outbreak of the COVID-19 pandemic, social media on platforms such as Reddit [2] has become a critical communication tool for the generation, dissemination, and consumption of information [3].
Therefore, social media analysis has become one of the most active research areas in recent years [4]. Many studies apply various Natural Language Processing (NLP) techniques to social media content [5]. Among them, sentiment analysis and topic models are two of the most researched NLP topics, as concluded in a Lancet Digital Health scoping review [3]. Much less studied, word embeddings have recently been reported as a valuable text analysis technique in the pandemic context [6][7][8].
Understanding the meaning of a word is at the heart of NLP [9]; the approach followed by word embeddings is based on Firth's notion of "context of situation," and in particular on his famous quotation: "You shall know a word by the company it keeps" [10]. Words that occur in similar contexts are prone to have similar meanings [11]; Firth's distributional view thus underpins word embeddings.
The r/Coronavirus subreddit is a curated information platform. As presented in the r/Coronavirus official description [22], "This subreddit seeks to monitor the spread of the disease COVID-19, declared a pandemic by the WHO. This subreddit is for high-quality posts and discussion." As emphasized in the r/Coronavirus Rules: "There are many places online to discuss conspiracies and speculate, we ask you not to do so here." Otherwise users get the message: "Your post or comment was removed due to being low quality information" [22]. It is also worth noting that reposts are removed. A repost is a post created by taking a post from a while ago and posting it again in the same subreddit; the concept of reposting also covers new posts containing only information that has already been posted [22].
The numbers of subscribers and posts in the other COVID-19 subreddits are clearly lower, and those subreddits address more specific aspects; therefore, in this work, our data source was r/Coronavirus.
Users submit top-level postings, known as submissions, to each subreddit, while others respond with comments on the submissions. Submissions consist of a title (up to 300 characters) and either a web link or a user-supplied body text; in the latter case, the submission is also known as a self-post, while comments are always made up of a body text.
In this work, we focus on analyzing the titles of Reddit posts. There are two reasons why we believe titles will be a useful basis for NLP analysis.
First, Reddit strongly recommends double-checking the grammar, spelling, and punctuation of the titles: "Read over your submission for mistakes before submitting, especially the title of the submission. Comments and the content of self-posts can be edited after being submitted; however, the title of a post cannot be. Make sure the facts you provide are accurate to avoid any confusion" [23].
Second, Reddit also requests that posters make their titles factual, accurate, and relevant to the content of the post. As remarked in Rediquette: "Please don't editorialize or sensationalize your submission title, keep your submission titles factual and opinion free. If it is an outrageous topic, share your crazy outrage in the comment section. Do not be vague. Make sure redditors know what they are getting. People do not have time to click on every submission to find out what is inside. Contribute value to the community by writing titles that accurately describe what is being shared. Be relevant. Subreddit subscribers like to read about specific topics that are related to their subreddit. If your submission is out of place, it will not gain any attention" [23].
Another advantage of focusing on Reddit post titles is that Twitter increased its character limit from 140 to 280 characters in November 2017, which is close to Reddit's 300-character title limit. This provides an opportunity for linguistic comparisons between tweets and Reddit titles.
It is for these reasons that we focused our analysis on the titles of all posts extracted from r/Coronavirus.
A word embedding is a vector-based representation of a word. The vector representing a word can be understood as the coordinates of the word's position within a multidimensional feature space (where the dimensions of the feature space are equal to the size of the vector). Within the vector-based representation, the meaning of a word is encoded by its position within the feature space relative to other words in the space. From a linguistic semantics perspective, the concept of word embedding is related to Firth's distributional hypothesis [10], which can be paraphrased as "you shall know the meaning of a word by the company it keeps." The relationship between the distributional hypothesis and word embeddings is that in well-trained word embedding models, words that occur in similar contexts (i.e., that keep the same company) are positioned close to each other in the feature space (i.e., they have similar vector representations).
Word2vec was created, patented, and published in two papers in 2013 by a team of researchers led by Tomas Mikolov at Google to learn word embeddings from a corpus of language [12]. It creates embeddings for the words in a corpus by training a neural network to predict words that co-occur with other words in the corpus.
Word2Vec includes two alternative strategies for training the neural network: Continuous Bag of Words (CBOW) and Skip-gram. In both of them, a preset length window is moved along the corpus. Using the CBOW strategy, at each step, the network is trained to predict the word in the center of the window based on the surrounding words. In the Skip-gram strategy, the network is trained to predict the other words in the window based on the central word. In both strategies, the learning signal for the network (and hence the information that is encoded in the embeddings the network generates) is the likelihood of one word co-occurring in the surrounding context of another word (i.e., within the same window). In the present paper, we use the Skip-gram model, which has shown better performance in semantic tasks [24].
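To make the two strategies concrete, here is a minimal, self-contained Python sketch (our own illustration; the paper's training itself uses the R wordVectors package, described in the Methods section) of the training examples each strategy derives from a sliding window:

```python
def skipgram_pairs(tokens, window):
    """Skip-gram signal: pair each center word with every word
    inside the +/- window around it (center predicts context)."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """CBOW signal: pair the surrounding context words (input)
    with the center word (prediction target)."""
    examples = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        if context:
            examples.append((context, center))
    return examples

title = ["masks", "required", "in", "public", "spaces"]
print(skipgram_pairs(title, 2)[:4])
# [('masks', 'required'), ('masks', 'in'), ('required', 'masks'), ('required', 'in')]
```

Either set of examples is then used to train the network whose hidden-layer weights become the embeddings; the paper uses the Skip-gram variant with a window of 12.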
Psychological resilience, as a general term, deals with how people manage stress and how they recover from traumatic events, encouraging constructive growth and promoting an optimistic outlook on the future [25]. Evidence suggests that when resilience-based abilities are applied to people's lives, they have many advantages (for example, a carry-over effect on other life domains) [26]. Resilience may be improved with deliberate practice; it is not necessary to be born with it [27]. However, within the research community, there is a lack of a unified definition for the concept [28]. This lack of consensus in definition can also be linked to the lack of consensus on how the concept should be operationalized in order to address community disasters [29]. As recently reported [30], positive and negative emotions have varied effects on developing a resilient attitude. People who go through higher levels of positive emotions (i.e., gratitude, compassion, love, relief, hope, calm, or admiration) exhibit a higher degree of resilience, whereas those who feel high levels of negative emotions (i.e., anger, loneliness, boredom, fear, anxiety, confusion, sadness) are associated with poorer resilience.
Typically, large general-purpose corpora (e.g., Wikipedia dumps with 3 billion words [31]) are used to learn word embeddings. Nevertheless, in this work, we hypothesized that word embeddings could be extracted from publicly available social media, using open-source software, in sufficient numbers such that the embeddings (1) are relevant enough to provide meaningful context for emotions linked to an ill-defined domain such as psychological resilience, (2) are verifiable by sound theoretical semantic tests such as the Battig and Montague norm [32], (3) are consistent with current related scientific publications, and (4) offer the possibility of providing actionable knowledge to on-field specialists.
Therefore, the objectives of this work are to (1) train a word2vec model (W2V) for creating word associations (also known as embeddings), able to retrieve meaningful closest terms, using a publicly available dataset (a subreddit on Coronavirus from January 2020 to July 2021, a period when emotions were exacerbated) and open-access software (R libraries); (2) formally validate this W2V model with a semantic categorization test based on an updated and expanded version [33] of the Battig and Montague norm, with 65 categories, computing the silhouette coefficient of the model for each category, and, as a complementary validation step, support our findings with the extensive scientific literature; (3) run W2V to discover the context for seven specific positive and seven negative emotions recently reported as related to resilience during the COVID-19 pandemic; and (4) support such specific context with related scientific publications.
The article is organized as follows. A literature review is presented in Section 2. Materials and methods are introduced in Section 3. In Section 4, we first report a descriptive analysis of the sample; we then present the results of our W2V model at three different levels: toy-example analogies, representative terms from a COVID-19 glossary, and resilience-related terms. We support our findings with extensive scientific literature and then discuss performance using the Battig and Montague evaluation. The discussion and limitations are presented in Section 5. Lastly, in Section 6, we conclude the paper.

Related Work
For computer scientists and researchers, social media data are valuable assets for understanding people's sentiments regarding current events, especially those related to events with worldwide impacts, such as the COVID-19 pandemic. Therefore, the classification of these sentiments yields remarkable findings. For example, in one of the earliest related publications, Rajput and colleagues [34] classified tweets (negative, positive, and neutral) based on word-level, bi-gram, and tri-gram frequencies to represent word rates by a power-law distribution and applied the Python TextBlob package to perform sentiment analysis. Samuel and colleagues [35] proposed machine learning models (naïve Bayes and logistic regression) to categorize sentiment tweets into two classes (positive and negative). Similarly, Aljameel et al. [36] analyzed a large Arabic COVID-19-related tweets dataset, applying uni-gram and bi-gram TF-IDF with SVM, naïve Bayes, and KNN classifiers to enhance accuracy. Muthausami et al. [37] classified the tweets into three classes (positive, neutral, and negative), utilizing different classifiers, such as random forest, SVM, decision tree, naïve Bayes, LogitBoost, and MaxEntropy. More recently, Jalil and colleagues [38] classified positive, negative, and neutral tweets using various feature sets and the XGBoost (eXtreme Gradient Boosting) classifier. Rustam et al. [39] proposed a COVID-19 tweets classification approach based on a decision tree, XGBoost, an extra tree classifier (ETC), random forest, and LSTM. Similarly, Dangi et al. [40] proposed a novel approach known as Sentimental Analysis of Twitter social media Data (SATD) based on five different machine learning models (logistic regression, random forest classifier, multinomial NB classifier, support vector machine, and decision tree classifier). Rahman et al. [41] explored the performance of ensemble machine learning classifiers for sentiment analysis of COVID-19 tweets from the United Kingdom. Es-Sabery et al. [42] applied MapReduce opinion mining for COVID-19-related tweets classification using an enhanced ID3 decision tree classifier.
Basiri et al. [43] presented a model that combines five models, namely naïve Bayes support vector machines (NBSVM), FastText, DistilBERT, CNN, and bidirectional gated recurrent unit (BiGRU), on COVID-19 tweets in eight highly affected countries. Ibrahim et al. [44] proposed a hierarchical Twitter sentiment model (HTSM) to show people's opinions in short texts. Bonifazi et al. [45] proposed a novel approach for investigating COVID-19 discussions on Twitter through a multilayer network-based model, which enabled the identification of influential users, whose activity can provide especially valuable information for analysis.
Naseem et al. [46] correspondingly proposed the use of various pre-trained embedding representations (FastText, GloVe, Word2Vec, and BERT) to extract features from a Twitter dataset. Furthermore, for the classification, they applied the deep learning method Bi-LSTM and several classical machine learning classifiers, such as SVM and naïve Bayes.
Yan et al. [47] reported public sentiment toward COVID-19 vaccines across Canadian cities by analyzing comments on Reddit. Jelodar et al. [48] examined 563,079 comments from Reddit in order to identify significant latent topics and classify sentiments in COVID-19-related English comments between January and March 2020. Lai et al. [49] analyzed 522 comments from a Reddit Ask Me Anything session about COVID-19; the Reddit posts evaluated in that study were manually coded by two of its authors.
Pal et al. [50] showed that new knowledge could be captured and tracked using the temporal change in word embeddings from the abstracts of COVID-19 published articles. They found that thromboembolic complications were detected as an emerging theme as of August 2020. A shift toward the symptoms of long COVID complications was observed in March 2021, and neurological complications gained significance in June 2021.
Jha et al. [51] observed that the word2vec model performed better than the GloVe model on a COVID-19 Kaggle dataset. Another point highlighted by this work is that latent information about potential future discoveries was significantly contained in past papers and publications.
Batzdorfer et al. [52] used word embeddings to distinguish non-conspiracy theory content from conspiracy theory-related content and analyzed which element of conspiracy theory content emerged during the pandemic.
Didi et al. [6] proposed a tweets classification approach (negative, positive, and neutral) based on a hybrid word embedding method, combining several widely used techniques, such as TF-IDF, word2vec, GloVe, and FastText, to represent posts.
Bhandari et al. [53] proposed a deep learning model with stacked word embeddings for the multi-class classification problem with three and five classes (extremely negative, negative, neutral, positive, and extremely positive). It outperformed the individual static pre-trained embedding representations and classical machine learning and deep learning approaches.
To our knowledge, no previous analysis applied word embeddings to extract knowledge from Reddit to provide context about specific emotions involved in psychological resilience during the pandemic. Acute crisis and loss events, disruptions in many facets of life, continuous multi-stress problems, and always-changing conditions made the COVID-19 pandemic a perfect storm of stressors. The rapid spread of COVID-19 during the 2020-2021 period, when emotions were exacerbated [54], created a unique opportunity to extract knowledge about resilience in the face of global adversity, yet to be explored using NLP. We believe that a better understanding of resilience is important in developing strategies to cultivate and promote resilience.

Methods
Our research comprised several sequential phases: data collection of publicly available Reddit titles from the r/Coronavirus subreddit, data cleaning using open-access R libraries, an initial descriptive analysis of the available data, word2vec model training, formal model validation using a semantic categorization test, and visualization using hierarchical clustering and heatmaps. Each phase is described in this section.

Data Collection
Data from Reddit were obtained through the pushshift.io API (Pushshift, 2023) [55]. Pushshift.io is a service that archives all publicly accessible Reddit submissions and comments so that academics can collect and distribute Reddit datasets for research purposes, and it has been used in a large number of related publications (e.g., Lama et al. [56]). In this work, the pushshiftr R package [57] was used as a wrapper for the pushshift.io API.
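As an illustrative sketch of such a query (the paper itself used the pushshiftr R wrapper; the endpoint and parameter names below reflect the historical public pushshift.io API and may have changed since), a submission search for r/Coronavirus titles over the study period could be constructed as:

```python
from urllib.parse import urlencode

# Historical pushshift.io submission-search endpoint (may have changed).
BASE = "https://api.pushshift.io/reddit/search/submission"

params = {
    "subreddit": "Coronavirus",
    "after": "1579478400",   # 2020-01-20 00:00 UTC as an epoch timestamp
    "before": "1626220800",  # 2021-07-14 00:00 UTC
    "fields": "title,author,created_utc,score",
    "size": 100,             # page size; paginate by advancing `after`
}

url = f"{BASE}?{urlencode(params)}"
print(url)
```

In practice a client would issue this request repeatedly, advancing the `after` timestamp past the last retrieved submission until the whole period is covered; the pushshiftr wrapper handles this pagination internally.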

Data Cleaning
The quanteda R library [58] was used to create the final sample for analysis. The data cleaning process included lemmatization (where the forms "dog," "dogs," and "dog's" are all changed to "dog"), removal of nonprintable characters (such as emojis), and basic normalization (such as removing punctuation and lowercasing all text).
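A minimal Python sketch of the normalization steps just described (our own illustration; the paper used the quanteda R library, and lemmatization, which requires a dictionary, is omitted here):

```python
import re
import unicodedata

def normalize_title(title: str) -> list[str]:
    """Lowercase, strip punctuation, and drop non-printable symbols
    such as emojis, returning the cleaned tokens."""
    # Keep only letters (L), numbers (N), separators (Z), and
    # punctuation (P); this drops emojis and other symbols.
    title = "".join(
        ch for ch in title
        if unicodedata.category(ch)[0] in ("L", "N", "Z", "P")
    )
    title = title.lower()
    title = re.sub(r"[^\w\s]", " ", title)  # strip punctuation
    return title.split()

print(normalize_title("Masks required!! 😷 In ALL public spaces."))
# ['masks', 'required', 'in', 'all', 'public', 'spaces']
```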
All analyses use publicly available, anonymized data and comply with Reddit's terms of service, usage rules, and privacy guidelines. They were also carried out with institutional review board clearance from the authors' institutions.

Descriptive Initial Analysis
For descriptive analysis, we first processed the data into the tidy text format as one token (word) per row. The process of breaking text into tokens is known as tokenization. This one-token-per-row structure differs from how text is commonly kept in current studies (e.g., in a document-term matrix). For tidy text pre-processing, we used the tidytext [59], dplyr [60], ggplot2 [61], and broom [62] R packages.
In order to determine whether the frequency of each word is rising or falling over time, we fitted a logistic regression model using the broom R package; each term is thus associated with a growth rate (represented by an exponential term).
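As a simplified stand-in for that per-term model (the paper fits a logistic regression via broom; here we fit a plain log-linear trend by least squares, which likewise yields an exponential growth factor per term):

```python
import math

def growth_rate(weekly_counts):
    """Least-squares slope of log(count) against week index;
    exp(slope) is the weekly multiplicative growth factor."""
    xs = range(len(weekly_counts))
    ys = [math.log(c) for c in weekly_counts]
    n = len(weekly_counts)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return math.exp(slope)

# A term mentioned 10, 20, 40, 80 times in consecutive weeks
# doubles every week:
print(round(growth_rate([10, 20, 40, 80]), 2))  # 2.0
```

A factor above 1 marks a term whose frequency is rising (e.g., "vaccine" toward the end of the period); below 1, one that is fading.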
In the Supplementary Materials, Figure S1 presents the number of titles per week; the distribution closely matches the plot provided by the official Reddit statistics presented in Figure 1. Figure S2 shows the most frequent words (after excluding COVID-19, Coronavirus, and pandemic, whose very high frequencies would make all other terms invisible in the same plot). The top 10 are people, vaccine, China, positive, health, home, masks, world, death, and Trump. Figure S3 shows the terms with the steepest increase in frequency. The steepest rise is for Donald Trump, right before the day of the Presidential Election in the United States (3 November 2020), followed by the steepest decrease after it. When visualizing all four sub-plots in Figure S2, shown from left to right and from top to bottom, it can be seen that each of them refers to a specific aspect of this pandemic, each with special relevance at different time points: lockdown at the early stage, masks and Trump at intermediate stages, and vaccine increasing steadily until the final stages.
In Figure S4, we present a word cloud created using all the titles containing the term "stress."

Model Training: Word2vec
We applied the wordVectors [63] R package to train the word2vec model. It runs the original C code for word2vec [12].
A metric of similarity between the embedding vectors of two words measures how similar the words are. Given two vectors u and v, the cosine similarity is defined as follows [12]:

cos(θ) = (u · v) / (‖u‖ ‖v‖)

where u · v is the dot product (or inner product) of the two vectors, ‖u‖ is the norm (or length) of the vector u, and θ is the angle between u and v.
The cosine distance is defined as one minus the cosine similarity; the shorter the cosine distance, the more similar the two vectors (words).
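The two measures can be sketched in a few lines of Python (a generic implementation of the standard formulas, not the paper's R code):

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (||u|| ||v||)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    # One minus the similarity: 0 for identical directions,
    # up to 2 for exactly opposite vectors.
    return 1.0 - cosine_similarity(u, v)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0 (same direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
```

Note that because the distance ranges from 0 to 2, values slightly above 1 (such as some of the between-category distances reported later) are possible.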

Model Validation: Semantic Categorization Test
We measured the capacity of the W2V model to represent the semantic categories based on the Battig and Montague category norms, which have been applied by researchers in several fields in over 1600 publications in more than 200 different journals [33]. In this work, we use Van Overschelde's [33] expanded and updated version of the Battig and Montague original norms.
In order to measure how well a word i is grouped in relation to the other words in its semantic category, we used the silhouette coefficient s(i), defined as:

s(i) = (b(i) − a(i)) / max{a(i), b(i)}

where a(i) is the mean distance of word i to all other words within the same category, and b(i) is the minimum mean distance of word i to the words of any other category (i.e., the mean distance to the neighboring category). Therefore, silhouette coefficients measure how close a word is to other words within the same category compared to the words of the closest other category [64].
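A minimal sketch of the coefficient, illustrated with round numbers like the "diamond" distances reported in the Results (0.66 within its own category, around 1.00 to other categories):

```python
def silhouette(a_i: float, b_i: float) -> float:
    """s(i) = (b(i) - a(i)) / max(a(i), b(i)); values near +1 mean
    word i sits much closer to its own category than to any other."""
    return (b_i - a_i) / max(a_i, b_i)

# "diamond": mean distance 0.66 within "A precious stone",
# 1.00 to the nearest other category -> well placed.
print(round(silhouette(0.66, 1.00), 2))  # 0.34

# But if another category (a type of ship/boat) is even closer
# (mean distance 0.55), the silhouette turns negative.
print(round(silhouette(0.66, 0.55), 2))  # -0.17
```

The negative case corresponds to the miscategorizations discussed later, such as "diamond" drifting toward the ship/boat category because of the Diamond Princess cruise.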

Model Visualization: Hierarchical Clustering and Heatmaps
We used the superheat R package [65] to visualize the word vectors (obtained from Word2vec), highlighting contextual similarity. "The rows and columns are ordered based on a hierarchical clustering and are accompanied by dendrograms describing this hierarchical cluster structure" [65].

Sample Description
We collected all 374,421 titles submitted by 104,351 different Redditors to the r/Coronavirus subreddit between 20 January 2020 and 14 July 2021.
In Figure 2, we show representative examples of the collected titles: the top three containing the term "resilience" and the bottom three randomly selected.

A Three-Step Validation of the Word2vec Embeddings
The train_word2vec function of the wordVectors R package was used to obtain the model (W2V) once the data had been generated. The following settings were used: "vectors = 200, threads = 4, window = 12, iter = 5, negative_samples = 0". These parameters have been applied by the wordVectors authors in related research [63].
We performed a three-step validation of W2V as in previous related research [66]. We utilized a subset of the original Mikolov article analogies [12] for the first one.
In NLP, the task of finding a word analogy is represented as "a is to b as c is to ___." The classic Mikolov example is: man is to king as woman is to ___, also represented as king − man + woman = ?
The human brain can recognize that the answer is the word 'queen'. However, for a machine to understand this pattern and fill in the blank with the most appropriate word requires a lot of training using a huge corpus (for example, the whole of Wikipedia; in our case, we are using only the obtained 374,421 titles from r/Coronavirus). Using our obtained model (namely W2V), the example analogy is represented as: W2V("king") − W2V("man") + W2V("woman") = ?
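With toy vectors (illustrative three-dimensional values of our own, not taken from the trained W2V model), the analogy arithmetic and nearest-neighbor lookup look like this:

```python
import math

# Toy 3-d embeddings (illustrative values only).
emb = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.8, 0.1, 0.1],
    "woman": [0.8, 0.1, 0.9],
    "queen": [0.9, 0.8, 0.9],
    "virus": [0.1, 0.1, 0.2],
}

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

# W2V("king") - W2V("man") + W2V("woman")
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]

# Nearest word to the target vector, excluding the three inputs.
best = max(
    (w for w in emb if w not in {"king", "man", "woman"}),
    key=lambda w: cos(emb[w], target),
)
print(best)  # queen
```

A real model performs the same arithmetic over 200-dimensional vectors learned from the corpus, ranking every vocabulary word by cosine similarity to the target.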
We obtained promising results (as presented in Table 2) for several analogies from previous research [66], for example: Analogy: brother − sister + husband = ? Answer: wife (0.5985). The number in brackets is the cosine distance between the vector embedding for the term 'wife' and the vector that results from the operations on the left-hand side of the equation.
As the second step of W2V validation, from a representative list of specific terms related to COVID-19, we ran our W2V model on each of them (for example, the term "anosmia") to identify its three closest terms using the following command: nearest_to(W2V[["anosmia"]],3) = ? As a result, we obtained the following set of the three closest terms to "anosmia": {olfactory (0.463); parkinson (0.459); aspirin (0.496)}.
In Table 3, we present the closest terms retrieved by our model and their cosine distances to several representative terms of a known COVID-19 glossary [67], starting with the initial terms of the glossary (terms beginning with the letter 'A'). We proceeded through the closest terms and identified related publications and evidence supporting them, noting the high relevance of all the discovered terms, in order to demonstrate the capacity of our W2V model to uncover relevant related terms (Table 3). For each of the terms identified by our trained model, we included relevant published scientific literature. For example, the first term in the glossary was "ards" (acute respiratory distress syndrome); our model retrieved Remestemcel, with a cosine distance of 0.364. We referenced Mahendiratta et al. [68] because their systematic review of stem cell therapy in COVID-19 recently reported results on Remestemcel. Similarly, for glucose, we referenced the work of Lazzeri et al. [69], who address the prognostic role of hyperglycemia and glucose variability in COVID-related acute respiratory distress, and likewise for all other terms in Table 3.
As the third step of W2V validation, we identified the closest terms to "resilience." Then we searched for all appearances of "resilience" in all 374,421 titles and identified the titles with the highest upvotes. We present them in Figure 2.

Semantic Categorization Test
For each of the first 65 semantic categories of the updated version of the Battig and Montague norm [33], we calculated the silhouette coefficients. The complete list of all the terms included in each category, as well as distances and silhouette calculations, is presented in Supplementary Materials Table S1. A representative screenshot of the distances from the first eight semantic categories to representative terms is presented in Figure 3. For example, the first semantic category is "1. A precious stone", as detailed in Table S1; it comprises four terms (diamond, ruby, gold, and gem). We ran our W2V model to calculate the distances from a representative term of each category to all the other terms. As shown in Figure 3, the mean distance from the term "diamond" to all other terms in the "1. A precious stone" category is 0.66, whereas it is 1.01 to the "2. A unit of time" category, represented by the term "hour", 1.00 to the "3. A relative" category, represented by the term "mother", and so forth. Figure 3 represents such distances as a heatmap, with greener values for the closest distances. It can be seen that, for every semantic category, the closest distances for each term are to the terms of the category where it belongs, therefore showing encouraging results.
When analyzing the lower Silhouette scores, we identified remarkable reasons for the miscategorization of the terms. For example, as presented in Figure 3, the mean distance from the "diamond" term to all other terms in the "1. A precious stone" category is 0.66, but as shown in Table S1, when considering the "51. A type of ship/boat" category, represented by the "cruise" term, such mean distance is 0.55, remarkably lower. A possible explanation for this is the existence of the Diamond Princess Cruise, which is mentioned in some of the Reddit titles used for training our W2V model.

Context for Positive and Negative Emotions
In Table 6, we present a list of specific positive emotions (gratitude, compassion, love, relief, hope, calm, and admiration) [30]. We ran our W2V model for each of them and identified several closest terms, providing the context where such emotions took place.
Table 6. Positive emotions and their obtained closest terms.

Similarly, Table 7 presents the list of negative emotions [30] (anger, loneliness, boredom, fear, anxiety, confusion, sadness) and their closest terms retrieved using W2V.
Table 7. Negative emotions and the obtained closest terms.
Figure 4 graphically shows a dendrogram for the closest terms to two positive emotions (hope and gratitude) and two negative ones (anger and anxiety), presented as clusters of the most similar closest terms. The darker the color in the heatmap, the closer the two terms; therefore, three clear clusters emerge along the heatmap diagonal.

Discussion
In this study, we proposed social media (particularly a Reddit subforum) as a connection between word associations (also known as embeddings) and emotion research. Although both share context as a critical component, to the best of our knowledge, word embeddings have rarely been used in the field of emotion research. Furthermore, COVID-19 created a unique opportunity for doing so.
Therefore, we trained a model for producing word embeddings using a publicly accessible dataset (a Coronavirus subreddit) and open-source tools (R libraries) capable of retrieving relevant content (closest words). This content was formally validated using a standard tool and supported by public evidence (scientific publications), and applied to the discovery of context for seven specific positive and seven negative emotions recently reported as related to resilience during the COVID-19 pandemic.
Our results confirmed our first three hypotheses: word embeddings may be recovered in sufficient numbers from public domain-specific social media for the embedding to (1) be relevant enough to offer meaningful context for specific emotions, (2) be verifiable by sound theoretical semantic tests such as the Battig and Montague norm, and (3) be consistent with recent related publications, in spite of working with a relatively "small" number of Reddit titles.
In relation to our fourth hypothesis (providing actionable knowledge to on-field specialists), current research on the COVID-19 pandemic concluded that developing a resilient mentality differs depending on whether positive or negative emotions are present. Higher levels of positive emotions are correlated with higher levels of resilience, whereas high levels of negative emotions are associated with lower levels of resilience [30]. We associated seven positive and seven negative emotions with experienced situations. Specialists could therefore promote actions encouraging participation in activities related to positive emotions. For example, as shown in Table 6, "gratitude" and "admiration" were expressed through activities taking place worldwide: people congregated on balconies while confined to their apartments to acclaim medical personnel working on the front lines, as well as to sing or take part in impromptu flash mobs [96]. Calm and compassion were associated with meditation and mindfulness. Hope was associated with humor, smiling, laughing, fun, and funny.
When analyzing negative emotions, we found racism and xenophobia mainly related to fear. Globally, migrants and minority groups were disproportionately affected by racism and xenophobia linked to COVID-19 [97]. These have an especially negative effect on people who already experience overlapping social, economic, and health-related vulnerabilities, and they intensify existing patterns of discrimination and unfairness. Minority groups in both the United States and Europe have endured discrimination and hate crimes [98,99]. Anger was mainly related to frustration, bureaucracy, and confusion, as in related research (e.g., Selman et al. [100]); loneliness was associated with addictions, while boredom was related to specific activities to overcome it, such as meditation, illustration, piano, Spotify, playlists, or videogames (Halo, Fortnite).
Several recent studies addressed social media (particularly Reddit) during the pandemic. For example, Gozzi et al. [101] analyzed collective responses to media coverage. They performed a mixed-methods analysis of web-based news articles, YouTube videos, English user posts and comments on Reddit, and views of Wikipedia pages related to COVID-19. They concluded that "collective attention was mainly driven by media coverage rather than epidemic progression" [101]. Compared to other social media platforms, Reddit users were generally more concerned about health, data related to the new disease, and interventions needed to stop its spread [101]. In order to identify significant latent topics and classify sentiments in COVID-19-related English comments between January and March 2020, Jelodar et al. examined 563,079 comments from Reddit [48]. Lai et al. [49] analyzed 522 comments from a Reddit Ask Me Anything session about COVID-19 on 11 March 2020. Most posts addressed symptoms, followed by prevention recommendations. COVID-19 symptoms were also the most requested topic suggested by users for further discussion.
Word2vec has scarcely been used on small corpora. García-Rudolph et al. applied word2vec to a small corpus using the semantic categorization test. In another study applying word2vec to small corpora, Stetten used corpora of 37 k and 140 k documents to analyze and disambiguate the content of dreams [102]. This research area addresses questions such as "How do gender, cultural background, and waking life experiences shape the content of dreams?". To our knowledge, no previous work studied Reddit submission titles using word embeddings in order to expand on the concept of resilience. We offer a tool for identifying terms of interest that can be useful to practitioners in the fields of psychology and social work.
A number of limitations to this study need to be highlighted. The analyzed sample was not meant to be exhaustive or representative of all titles posted by everyone living in any specific region during the period under study. It included all titles from only one of the COVID-19 subreddits; therefore, we did not include data from other subreddits addressing specific COVID-19 aspects (e.g., CovidVaccinated or COVID-19Positive). Nevertheless, r/Coronavirus was by far the subreddit with the highest number of subscribers and posts, and it was the most active subreddit during the period under study (between 20 January 2020 and 14 July 2021). We did not include comments in our analysis, only submission titles. The length limit of Reddit comments is 40,000 characters, more than 100 times larger than the titles' limit (300 characters); therefore, including comments would involve a different analysis, with different hypotheses, which is left as future work.
The potential impact of the data-cleaning process should also be mentioned as a limitation, particularly regarding the context of the text. By removing emojis and other non-printable characters, we may have removed contextual information relevant to understanding sentiments or emotions. For example, Li et al. [103] presented an approach to classifying microblog review sentiments that incorporated emojis through an emoji-text-incorporating bi-LSTM (ET-BiLSTM) model. Their results showed that ET-BiLSTM enhances the performance of sentiment classification.
Another aspect of Reddit worth analyzing, not included in this study, involves NSFW (Not Safe For Work) posts. This term refers to user-submitted content not suitable for viewing in public or professional contexts. The phenomenon of NSFW posts on Reddit has been investigated very little, although it is very common on this social medium [104].
Other relevant factors to mention as limitations of our study include geographic location, spatial trajectory, and the time of day a submission was posted. Such factors, as noted by Padilla et al. [105] and Gore et al. [106], are relevant in social media. Geographic aspects were not analyzed in our study, but Reddit is most popular in the U.S., with American users far outnumbering those from any other country at 54% of Reddit users. After the U.S., the United Kingdom has the second-highest share of data traffic with 8%, while Canada ranks third with 6.4%. Reddit is most popular with young adults aged 25 to 34, who comprise more than half of the site's users. Nevertheless, there is also a large number of middle-aged users on Reddit: previous studies have found that 33% of users are between the ages of 30 and 49, suggesting that Reddit is a viable platform for reaching both young and middle-aged adults. More than two-thirds of Reddit users are men, who are particularly active on the site [107]. Compared to people living in rural areas, urban and suburban residents use Reddit much more frequently. Gozzi et al. also pointed out that Reddit has developed into a self-referential community, reinforcing the site's propensity to concentrate on its own content rather than outside sources [101].

Conclusions
This study opens up interesting opportunities for exploration and discovery using, for the first time, a word2vec model trained on a small Coronavirus dataset of Reddit titles. The model yields immediate and accurate terms that can be used to expand our knowledge of specific concepts such as resilience by identifying the context in which they take place. We presented a step forward in developing a tool that practitioners in the fields of psychology and social work can use to identify terms of interest describing the context in which specific positive and/or negative emotions related to psychological resilience took place. These findings may support clinicians in specific situations where individuals can be encouraged to engage in, or promote, activities related to positive emotions associated with psychological resilience.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app13116713/s1, Figure S1: Number of titles per week of the r/Coronavirus subreddit; Figure S2: Top 50 most frequent words; Figure S3: Terms with the steepest increase in frequency; Figure S4: Wordcloud of all titles containing the "stress" term; Table S1: Semantic categorization test.
Institutional Review Board Statement: All analyses relied on public, anonymized data; adhered to the terms and conditions, terms of use, and privacy policies of Reddit; and were performed under Institutional Review Board approval from the authors' institution.

Informed Consent Statement: Not applicable.
Data Availability Statement: The datasets generated and/or analyzed during the current study are available from the corresponding author upon reasonable request.