Using large language models to assess public perceptions around glucagon-like peptide-1 receptor agonists on social media

Background The prevalence of obesity has been increasing worldwide, with substantial implications for public health. Obesity is independently associated with cardiovascular morbidity and mortality and is estimated to cost the health system over $200 billion annually. Glucagon-like peptide-1 receptor agonists (GLP-1 RAs) have emerged as a practice-changing therapy for weight loss and cardiovascular risk reduction independent of diabetes. Methods We used large language models to augment our previously reported artificial intelligence-enabled topic modeling pipeline to analyze over 390,000 unique GLP-1 RA-related Reddit discussions. Results We find high interest around GLP-1 RAs, with a total of 168 topics and 33 groups focused on the GLP-1 RA experience with weight loss, comparison of side effects between differing GLP-1 RAs and alternate therapies, issues with GLP-1 RA access and supply, and the positive psychological benefits of GLP-1 RAs and associated weight loss. Notably, public sentiment in these discussions was mostly neutral-to-positive. Conclusions These findings have important implications for monitoring new side effects not captured in randomized controlled trials and understanding the public health challenge of drug shortages.


Preprocessing
To prepare the raw text scraped from Reddit for topic modeling, the following series of operations was performed. Since every post is composed of both a title and a body, these two attributes were concatenated and separated by a period. Next, hyperlinks and HTML tags were removed from the body of each discussion to reduce content size and the risk of erroneous embeddings during topic modeling. Discussions fewer than five characters in length were also removed.
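The steps above can be sketched as follows; this is a minimal illustration, assuming each scraped post is a dict with hypothetical "title" and "body" keys (the actual field names depend on the scraping pipeline).

```python
import re

def preprocess(post, min_len=5):
    """Concatenate title and body, strip hyperlinks and HTML tags,
    and drop discussions shorter than `min_len` characters."""
    # Title and body are concatenated, separated by a period.
    text = f'{post["title"]}. {post["body"]}'
    text = re.sub(r"https?://\S+", "", text)  # remove hyperlinks
    text = re.sub(r"<[^>]+>", "", text)       # remove HTML tags
    text = text.strip()
    return text if len(text) >= min_len else None

posts = [
    {"title": "Week 4 on semaglutide", "body": "Down 6 lbs! <br> See https://example.com"},
    {"title": "hi", "body": ""},  # too short after cleaning; filtered out
]
cleaned = [t for t in (preprocess(p) for p in posts) if t is not None]
```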

Topic Modeling
Topic modeling is a use case of natural language processing that identifies clusters (or topics) of similar information from a set of text documents. Bidirectional Encoder Representations from Transformers (BERT) models are a state-of-the-art architecture that have generally outperformed prior models in embedding words, sentences, and documents with greater contextual and semantic meaning. Notably, the foundational transformer block in BERT models is also a crucial component of large language models (LLMs) and state-of-the-art computer vision models, highlighting the valuable role of this novel architectural design in modern artificial intelligence models.
We first embed documents using a pretrained, document-level Bidirectional Encoder Representations from Transformers (BERT)-like model, the Beijing Academy of Artificial Intelligence (BAAI) Generalized Embeddings model (bge-base-en-v1.5), a 102M-parameter model that uses the final layer's hidden state (768 dimensions) of the [CLS] (sentence classification prediction) token as the embedding representation. 1 This class of embedding models achieves state-of-the-art performance on the Massive Text Embedding Benchmark, an evaluation pipeline that assesses the performance of embedding models across 8 embedding tasks (including clustering tasks) using 58 different datasets. 2 Clustering discussions with a large feature size (i.e., 768 as the output dimensionality of the embedding) can be computationally prohibitive; as such, we further reduce the dimensionality of this representation using Uniform Manifold Approximation and Projection (UMAP), an unsupervised dimensionality reduction algorithm that better preserves the global topology of higher-dimensional data in lower dimensions. 3 We then cluster these discussions into relevant topics in two steps. First, we find an initial guess of document clusters corresponding to particular topics using Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), which chooses the optimal number of clusters using techniques from graph theory that assess cluster stability. 4 However, this technique risks excessively labeling points as outliers for smaller topics and may miss topics that are otherwise relevant to capture. For this reason, we use the number and locations of clusters obtained from HDBSCAN as an initial guess for a less stringent and more commonly used clustering algorithm, KMeans clustering. After topics are identified, we use a class-based term frequency-inverse document frequency (c-TF-IDF) technique, similar to a Bag-of-Words representation, on the discussions contained in each topic to identify keyword representations of each topic.
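The two-step clustering idea can be sketched on synthetic data; this is an illustration only, using scikit-learn's DBSCAN as a stand-in for HDBSCAN and random blobs as stand-ins for the UMAP-reduced discussion embeddings.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for UMAP-reduced discussion embeddings.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Step 1: density-based clustering for an initial guess of the number and
# locations of clusters (DBSCAN here as a stand-in for HDBSCAN).
db = DBSCAN(eps=1.0, min_samples=5).fit(X)
labels = db.labels_
cluster_ids = sorted(set(labels) - {-1})  # -1 marks outliers
centers = np.stack([X[labels == c].mean(axis=0) for c in cluster_ids])

# Step 2: seed KMeans with those centers so every discussion, including
# points the density-based step called outliers, receives a topic label.
km = KMeans(n_clusters=len(centers), init=centers, n_init=1,
            random_state=42).fit(X)
```

Because KMeans assigns every point to its nearest center, no discussion is left unlabeled, while the density-based initial guess still determines how many topics are fit.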
To create topic labels that better represent each topic, we leverage LLMs and engineer prompts for this task, as shown in Supplementary Figure 1. We chose Llama2 (7B) as our LLM, given its free availability, transparent development, and the state-of-the-art performance of its family of models compared to other LLMs across a variety of domains in the Holistic Evaluation of Language Models (HELM) benchmark. 5,6 Unique prompts for each topic are created with a context ("You are a honest, scientific chatbot that helps me, a Cardiologist, create unique, diverse labels for a topic based on representative discussions and keywords."), constraint instructions ("Do not be creative or loquacious. Please present the topic label in a short and direct manner."), input data ("I have a topic that is described by the following keywords: <keywords>… In this topic, the following documents are a small but representative subset of all other documents in the topic…"), and an output instruction ("Based on the information above, can you create three short, direct labels without descriptions for this topic?"). Our input data consist of five representative discussions, five random discussions, and the keyword representation of each topic. Representative discussions are chosen as those closest in Euclidean distance, in the native embedding space, to the topic centroid (the mean of all embedded discussions in that topic). Five randomly sampled discussions from within the topic, without specifying to the LLM that they are random, are also provided to capture the variety of discussions in each topic.
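The selection of representative and random discussions for the prompt can be sketched as follows; this is a minimal illustration with random vectors standing in for one topic's bge embeddings.

```python
import numpy as np

rng = np.random.default_rng(42)
emb = rng.normal(size=(50, 768))  # stand-in embeddings for one topic

# Representative discussions: those closest in Euclidean distance, in the
# native embedding space, to the topic centroid (mean of all embeddings).
centroid = emb.mean(axis=0)
dists = np.linalg.norm(emb - centroid, axis=1)
representative_idx = np.argsort(dists)[:5]

# Five randomly sampled discussions to capture within-topic variety.
random_idx = rng.choice(len(emb), size=5, replace=False)
```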
Since these topics may be granular, identifying further clusters of topics is important to provide an appreciation of the overarching themes within our dataset. To group these topics, we use UMAP and spectral clustering 7 on the c-TF-IDF representation of each topic. Since spectral clustering requires a prespecified number of clusters, a sensitivity analysis was performed by measuring the Silhouette Coefficient across a range of prespecified cluster counts (2 to 40) to find the optimal number of clusters. 8,9 The Silhouette Coefficient measures the overlap and mislabeling across clusters by calculating the mean intra-cluster distance and the mean nearest-cluster distance. Values range between -1, which suggests likely misassignment of a discussion to an incorrect cluster, and 1, which suggests optimal assignment of a discussion to a cluster; a value of zero indicates overlapping clusters. Labels for the resulting groups are generated using Llama2 and a prompt similar in context and instructions to that described above, using the topic labels of all topics within each group, as outlined in Supplementary Figure 1.
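The sensitivity analysis over the prespecified cluster count can be sketched as below; this is an illustration on synthetic blobs standing in for the c-TF-IDF topic representations, and it sweeps a narrower range (2 to 9) than the paper's 2 to 40 to keep the example fast.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the c-TF-IDF representations of each topic.
X, _ = make_blobs(n_samples=60, centers=5, random_state=42)

# Sweep the prespecified number of clusters and score each partition
# with the Silhouette Coefficient (range -1 to 1; higher is better).
scores = {}
for k in range(2, 10):
    labels = SpectralClustering(n_clusters=k, random_state=42,
                                assign_labels="kmeans").fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # silhouette-optimal cluster count
```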
All analysis above was performed using Python 3.11. Hyperparameters for the above models are specified in Supplementary Table 2. All models with built-in stochasticity were initialized with a random state set to 42 to enable reproducibility. All code is publicly available at www.github.com/sssomani/glp1_reddit.

Sentiment Analysis
Sentiment analysis is a form of natural language processing that classifies the sentiment of text documents into distinct categories, most commonly "positive" ("I am so happy about the weight I lost on Ozempic!") or "negative" ("I had the worst nausea and vomiting on Ozempic!"). Multiclass models have also been developed that additionally classify documents into a neutral category when the text is not polarized towards a positive or negative sentiment. For example, the phrase "tirzepatide is a drug" is not opinionated and as such does not fit the mold of most traditional sentiment analysis models. Emerging techniques for sentiment analysis leverage the same transformer model architecture, with pretrained models available on open-source frameworks like Huggingface. 10 To assess the sentiment of each post, a separate BERT-like model, Robustly Optimized BERT Pretraining Approach (RoBERTa), pretrained to characterize sentiments of social media posts, was used to classify sentiment. 11 This model is useful since it offers multiclass labels (i.e., negative, neutral, and positive). 13,14,15 The length of the input phrase was limited to 512 characters for this model. The output comprised three probabilities assigning the likelihood that the input text carried a negative, neutral, or positive sentiment. For phrases of 512 characters or fewer, the preprocessed text was passed directly, with the sentiment of highest probability assigned to that phrase; for instance, if a phrase had the probability array (0.1, 0.3, 0.6) for (negative, neutral, positive), then that phrase would be labeled as 'positive'. To process longer phrases, multiple sub-phrases were created from each post or comment exceeding 512 characters by searching for all instances of any of the search words (e.g., "ozempic") and taking the 256 characters before and after each match location. The sentiment value ('positive', 'negative', or 'neutral') for that phrase was assigned by taking the mean of the probabilities across all subsampled regions and choosing the sentiment with the highest probability. To understand how sentiments varied across topics and groups, the sentiment labels 'negative', 'neutral', and 'positive' were transformed to -1, 0, and 1, respectively.
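The chunking and probability-averaging scheme for long posts can be sketched as follows; this is an illustration only, with a dummy constant-probability function standing in for the RoBERTa sentiment model, and a hypothetical keyword list.

```python
import re
import numpy as np

LABELS = ("negative", "neutral", "positive")
KEYWORDS = ("ozempic", "wegovy", "tirzepatide")  # hypothetical search terms

def chunk_long_text(text, max_len=512, window=256):
    """Return the text itself if short enough; otherwise return one chunk of
    `window` characters before and after each keyword match."""
    if len(text) <= max_len:
        return [text]
    chunks = []
    for m in re.finditer("|".join(KEYWORDS), text, flags=re.IGNORECASE):
        start = max(0, m.start() - window)
        chunks.append(text[start:m.end() + window])
    return chunks or [text[:max_len]]

def classify_post(text, predict_proba):
    """Average (negative, neutral, positive) probabilities over all chunks
    and return the argmax label. `predict_proba` stands in for the model."""
    probs = np.mean([predict_proba(c) for c in chunk_long_text(text)], axis=0)
    return LABELS[int(np.argmax(probs))]

# Dummy stand-in classifier, for illustration only.
dummy = lambda chunk: np.array([0.1, 0.3, 0.6])
```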
Supplementary Figure 1. AI Topic Modeling Pipeline. The topic modeling pipeline in this study is described.
First, candidate discussions about GLP-1 RAs are extracted from all Reddit discussions using candidate key terms and preprocessed for our AI models ("Dataset Curation", red box). Next, each discussion is embedded, dimensionally reduced, and clustered using KMeans to identify salient topics ("Topic Identification", yellow box), with subsequent identification of groups based on clustering each topic's cumulative term frequency-inverse document frequency representation ("Group Identification", blue box). Labels for both topics and groups are generated using a large language model, Llama, with the specified prompt template ("Topic and Group Labeling using Llama", purple box). The topic labeling prompt template is shown in the center with a yellow highlight, while the group labeling prompt template is shown in blue on the right. Sentiment analysis is performed directly on the preprocessed set of discussions ("Sentiment Analysis", green box).