Social media engagement analysis of U.S. Federal health agencies on Facebook

Background: It is becoming increasingly common for individuals and organizations to use social media platforms such as Facebook. These are being used for a wide variety of purposes including disseminating, discussing and seeking health related information. U.S. Federal health agencies are leveraging these platforms to ‘engage’ social media users to read, spread, promote and encourage health related discussions. However, different agencies and their communications get varying levels of engagement. In this study we use statistical models to identify factors that associate with engagement.
Methods: We analyze over 45,000 Facebook posts from 72 Facebook accounts belonging to 24 health agencies. Account usage, user activity, sentiment and content of these posts are studied. We use the hurdle regression model to identify factors associated with the level of engagement and Cox proportional hazards model to identify factors associated with duration of engagement.
Results: In our analysis we find that agencies and accounts vary widely in their usage of social media and activity they generate. Statistical analysis shows, for instance, that Facebook posts with more visual cues such as photos or videos or those which express positive sentiment generate more engagement. We further find that posts on certain topics such as occupation or organizations negatively affect the duration of engagement.
Conclusions: We present the first comprehensive analyses of engagement with U.S. Federal health agencies on Facebook. In addition, we briefly compare and contrast findings from this study to our earlier study with similar focus but on Twitter to show the robustness of our methods.


Background
An increasing percentage of the population uses various social media platforms such as Facebook, Twitter, and Tumblr for reasons varying from casual conversations to debating social issues. Around 68% of U.S. adults use Facebook [1] which has over 180 million daily active users in the U.S. and Canada [2] who spend around 40 min per day on this medium [3]. A recent study by PricewaterhouseCoopers showed that in the United States, 24% of adults post about their health experiences on social media with 16% of them posting reviews of medications, treatments, doctors or health [4]. A survey on social media preference among medical students showed 77% of first year medical students and 80% of graduating medical students use Facebook and prefer online media as their primary source of information [5].
Facebook, the most popular social networking website [1], has invigorated a wide range of health sciences studies. Facebook use for disease surveillance [6] or public health issues [7–9] shows its broad scope for improving public health. Researchers have also used Facebook to address specific health concerns. For example, studies have been conducted to assess Facebook's potential in engaging smokers in smoking cessation treatment [10] and to evaluate its scope in recruitment and retention of young adult American veterans into an online alcohol intervention study [11]. While most Facebook-based health studies focus on information dissemination to individual users, surprisingly few have focused on how health agencies are involved in Facebook-based communications [12–14]. This paper addresses this gap.
We ask the general question: How can health agencies be more engaging on social media? We perceive 'engagement' as interactions designed to promote some common goal as seen for example in [15]. In the context of this study the interactions between the U.S. Federal health agencies and Facebook users are meant to promote better healthcare knowledge through successful information dissemination and consumption.
The importance of social media for communicating to a broad audience is well acknowledged in journalism [16], politics [17], marketing [18], entertainment [19], etc. Healthcare organizations such as the Centers for Disease Control and Prevention (CDC) or Food and Drug Administration (FDA) have a crucial responsibility to inform the public of critical pandemic events like the spread of H1N1 [6,20] or Coronavirus [21] and about drug recalls [22] and sexual health information [23]. Interestingly, while the two organizations differ significantly in the number of Facebook posts, they are quite similar in the response (activity/post) generated. Like the CDC, the National Cancer Institute of the National Institutes of Health (NIH) also has several thousand posts, but its response is quite low compared to the other two organizations. While it may be that the intent behind a post is to inform rather than to generate a response, differences in engagement are notable. We do not yet understand if there are factors associated with these differences. The nature of public engagement with an organization's messages is an active focus of research in health sciences and in marketing [24]. This is traditionally studied by surveys of health information seekers [25,26]. Studies on engagement can inform organizations about topics of public interest [27] or strategies to increase public reach [28]. In contrast to surveys, our study of engagement on social media is 'observational': we assess public activities in response to posts by U.S. Federal health agencies.
We address two specific questions with respect to Facebook posts from U.S. Federal health agencies and the responses they generate. First, which Facebook account and post features are associated with the level of engagement, i.e., level of public response in the form of Facebook activity (likes, shares, comments)? Second, which Facebook account and post features are associated with the interval length between an agency's Facebook post and the last activity it generates?
We analyze an almost comprehensive set of Facebook posts from 72 Facebook accounts of 24 U.S. Federal health agencies. We explore associations between various features and level of activity using hurdle models. We explore the features related to our second question using survival models. Features we examine include standard ones such as the number of page likes as well as less studied features relating to the semantic content of a post.

Data collection

Agencies & accounts
We selected health agencies through the Health and Human Services (HHS) Social Hub website [29], which lists all Facebook accounts affiliated with various U.S. Federal health agencies.

Posts & activity
The Facebook Graph API [30] was used to collect all posts from an account's timeline as of late January 2013. For each post, we recorded its unique identifier, number of likes, shares, comments and other metadata as described below.
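The per-post record described above can be pictured with a minimal sketch. The class and field names below are hypothetical and are not Graph API field names; the sketch only assumes that total activity is the sum of likes, shares and comments, as used throughout this paper.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class PostRecord:
    """One collected post. Field names are illustrative, not Graph API names."""
    post_id: str
    likes: int
    shares: int
    comments: int
    posted_on: date
    last_activity_on: Optional[date]  # None when the post drew no activity

    @property
    def activity(self) -> int:
        # Total activity: likes + shares + comments.
        return self.likes + self.shares + self.comments

    @property
    def days_to_last_activity(self) -> Optional[int]:
        # Interval between posting and the last recorded activity, in days.
        if self.last_activity_on is None:
            return None
        return (self.last_activity_on - self.posted_on).days


example = PostRecord("hypothetical-id", 3, 1, 2, date(2013, 1, 1), date(2013, 1, 4))
print(example.activity, example.days_to_last_activity)
```

The two derived quantities correspond to the two outcomes studied here: the level of engagement and the time to last activity.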

Account and post features
We included features that are generally used in Facebook-based studies [12,31,32] as well as those that are seldom considered (see Table 1).

Page likes
The number of page likes shows the number of users endorsing an account. A page like is different from a post like which is considered an engagement activity. Users liking a page receive all posts from the account in their news feeds [33]. It seems reasonable to expect page likes to associate with engagement.

Post types
The Facebook Graph API provides information about the type of a particular post. Posts are classified into six self-explanatory categories, namely link, music, photo, question, status (a post is an uncategorized status if it is simply text-based and does not belong to any of the other categories), and video/Adobe's ShockWave Flash format (SWF).

Sentiment
We hypothesize that the sentiment of a Facebook post may be associated with engagement. Perhaps more positive sentiment is linked with greater activity, or maybe the reverse holds. We analyze sentiment using a state-of-the-art lexicon-based sentiment classifier, SentiStrength [34]. SentiStrength has been widely applied to social media postings [35] and has been shown to outperform other lexical classifiers [36]. SentiStrength assigns each Facebook post both a positive score, from +1 (neutral) to +5 (extreme), and a negative score, from −1 (neutral) to −5 (extreme).

Content
One aspect of Facebook analysis that is often overlooked is post content. We hypothesize that some topics are more attractive to a wider group than others. For example, a post about information dissemination of the outbreak of West Nile virus ("West Nile virus is a potentially serious illness. What you need to know: http://go.usa.gov/r9g4") generated far more activity compared to a job posting from U.S. Public Health Service Nurses ("National Park Service has a Registered Nurse Manager position open in Yosemite, CA. This position closes on November 19. If interested, please send a cover letter and CV to S**** C**** at email@nps.gov.").
We use the National Library of Medicine's Medical Text Indexer (MTI) [37] to assign Medical Subject Headings (MeSH) [38,39] recommendations to each post. MTI is commonly used to recommend MeSH terms for titles and abstracts of biomedical literature and has been shown to be useful in other domains such as clinical text [40]. As an aside, we show a novel application of MTI in the social media domain. The semantic types of the MeSH terms are mapped to the fifteen higher level semantic groups defined by the National Library of Medicine [41]. For example, the high level semantic group "Disorders" comprises 12 semantic types, namely, Acquired Abnormality, Anatomical Abnormality, Cell or Molecular Dysfunction, Congenital Abnormality, Disease or Syndrome, Experimental Model of Disease, Finding, Injury or Poisoning, Mental or Behavioral Dysfunction, Neoplastic Process, Pathologic Function, and Sign or Symptom.

Choice of model
As shown later, around 20% of Facebook posts have zero activity (i.e. they receive no likes, shares or comments). This type of distribution, where the variance (of activity count) is much greater than the mean, implies overdispersed data [42] with zero-inflation [43]. Typically, regression models such as Poisson or negative binomial regression are used to model count data. However, the zero-inflation and overdispersion (p < 0.001) require using two-part count data models such as the hurdle regression model [44]. Hurdle models have two separate components: a zero-portion used to fit the sizeable portion of zero counts in the data and a count-portion to fit the non-zero counts of the data. The zero-portion models whether a count is zero (no activity) or not using a binomial probability model. The count-portion determines the conditional distribution of the non-zero counts using a zero-truncated negative binomial or Poisson model. Previous studies on social media engagement [10,45–47] have shown the power of hurdle models for modeling data with similar characteristics.
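To make the two-part structure concrete, the following is a minimal sketch of the probability mass function of a hurdle negative binomial model. It is not the fitting code used in this study. The value `p_zero = 0.2` mirrors the roughly 20% of posts with zero activity, while the mean `mu` and dispersion `alpha` are assumed values chosen only for illustration.

```python
import math


def nb_pmf(k: int, mu: float, alpha: float) -> float:
    """Negative binomial pmf with mean mu and variance mu + alpha * mu**2."""
    r = 1.0 / alpha
    p = r / (r + mu)
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)


def hurdle_nb_pmf(k: int, p_zero: float, mu: float, alpha: float) -> float:
    """Hurdle model: a binary zero-portion plus a zero-truncated count-portion."""
    if k == 0:
        # Zero-portion: probability that a post gets no activity at all.
        return p_zero
    # Count-portion: negative binomial renormalised over k >= 1.
    truncated = nb_pmf(k, mu, alpha) / (1.0 - nb_pmf(0, mu, alpha))
    return (1.0 - p_zero) * truncated


# Assumed parameters for illustration only.
probabilities = [hurdle_nb_pmf(k, p_zero=0.2, mu=5.0, alpha=2.0) for k in range(5000)]
print(round(sum(probabilities), 6))
```

Because the zero probability is set independently of the count distribution, the model can match any share of zero-activity posts without distorting the fit to the positive counts.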
We compared different count data regression models (namely, the Poisson, negative binomial, hurdle Poisson and hurdle negative binomial (HNB) models) using standard goodness-of-fit measures. The HNB model had the lowest AIC value (297,667.3) compared to the Poisson (1,443,334), negative binomial (304,590.7) and hurdle Poisson (1,292,709) models, signifying a better fit. The Vuong statistic also indicates that the hurdle negative binomial model fits better than the other models. Our comparison of full and nested models, such as hurdle negative binomial and negative binomial, using the likelihood ratio test also indicates that the former model fits our data best. Variance inflation factor (VIF) scores for all independent variables in our regression analysis were within the range of zero to five, indicating no multicollinearity issues.
The temporal characteristics of a post are also of interest. We use methods from survival analysis [48], the branch of statistics dedicated to modeling such temporal behavior. Similar to other social media based studies [49,50], we use the Cox proportional hazards regression model [51] to assess how the different features (see Table 1) are associated with the time between a Facebook post and the last activity in response.
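The AIC comparison can be reproduced from the reported values with a short sketch. The dictionary below holds only the four AIC values stated above; the helper `aic_value` shows the standard definition (AIC = 2k − 2 ln L) and is included for reference, since the underlying log-likelihoods and parameter counts are not reported here.

```python
def aic_value(log_likelihood: float, n_params: int) -> float:
    """Standard Akaike information criterion: 2k - 2 ln(L)."""
    return 2.0 * n_params - 2.0 * log_likelihood


# AIC values as reported in the text.
aic = {
    "Poisson": 1_443_334.0,
    "negative binomial": 304_590.7,
    "hurdle Poisson": 1_292_709.0,
    "hurdle negative binomial": 297_667.3,
}

best = min(aic, key=aic.get)
# Difference of each model's AIC from the best (lowest) AIC.
delta = {name: value - aic[best] for name, value in aic.items()}

for name in sorted(delta, key=delta.get):
    print(f"{name}: delta AIC = {delta[name]:.1f}")
```

Lower AIC indicates a better trade-off between fit and model complexity, so the model with a delta of zero is the preferred one.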

Agencies & accounts
Seventy-two Facebook accounts corresponding to 24 health agencies were identified. Seventeen are NIH divisions such as NIH/NIDA, NIH/NIMH and NIH/NICHD. Some agencies have several accounts, such as NIH/NLM (6 accounts: Women's_Health_Resources, NLM_4_Caregivers, etc.), CDC (10 accounts: CDC_Tobacco_Free, Health_Hazard_Evaluation_Program, etc.) and OS (16 accounts: HealthCare.gov, Medical_Reserve_Corps, etc.), while several others, such as ACF, FDA and NIH/NCCAM, have just one account. Table 2 lists the various agencies and the number of accounts for each.
Table 4 shows the top 10 accounts ranked by activity per post (see also Table 3). We note, for example, that one of the six NLM Facebook accounts is in the top 10 list. Let's Move, affiliated with the Office of the Secretary (OS), has the highest activity per post (246.2) when excluding posts with no activity. CDC's official account, with the largest number of posts (2867), also leads in total number of activities (285,347).

Post types
Table 6 shows the various types of post as well as their counts. Links are the most common (28,830) while questions are the least common (74).

Sentiment
In Table 7, we see that Facebook posts are generally positive (the percentage of moderate to extreme positive posts is 61.89%, while for negative this percentage is 47.04%).

Modeling activity using hurdle model
Table 9 presents results from the hurdle regression model. Regression coefficients in the zero-portion are exponentiated as odds ratios (OR) while the exponentiated regression coefficients in the count-portion are treated as incidence rate ratios (IRR) [52]. When we interpret the results of a particular variable we consider all other variables to remain constant.
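The step from a regression coefficient to an OR or IRR, and from there to a percentage change, can be sketched as follows. The function names are hypothetical; the ratio values 1.126 and 0.934 are the sentiment IRRs reported for the count-portion in this section, and 6.033 is the IRR reported for log-transformed page likes.

```python
import math


def ratio_from_coefficient(beta: float) -> float:
    """Exponentiate a coefficient to obtain an OR (zero-portion) or IRR (count-portion)."""
    return math.exp(beta)


def percent_change(ratio: float) -> float:
    """Percentage change in the odds or rate for a one-unit increase in the feature."""
    return (ratio - 1.0) * 100.0


# Sentiment IRRs reported for the count-portion of the hurdle model.
print(round(percent_change(1.126), 1))  # positive sentiment
print(round(percent_change(0.934), 1))  # negative sentiment

# Recover the underlying coefficient for page likes from its reported IRR.
beta_page_likes = math.log(6.033)
print(round(ratio_from_coefficient(beta_page_likes), 3))
```

A ratio above one corresponds to a positive coefficient and an increase in the odds or rate, and a ratio below one corresponds to a negative coefficient and a decrease.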

Analysis for activity presence
The coefficients of the logit regression in the zero portion of the model indicate how the features relate to crossing the 'hurdle' of obtaining at least one activity (i.e. either a like, share or comment).

Analysis for activity abundance
We now analyze the coefficients of the negative binomial regression in the count portion of the hurdle model (Table 9). This allows us to focus on posts that cross the 'hurdle' of getting at least one activity.
Given a unit increase in the log-transformed count of page likes, the rate of activity is expected to increase by a factor of 6.033, while holding all other variables in the model constant. For sentiment, a unit increase in positive sentiment increases the rate of activity by a factor of 1.126 while a unit increase in negative sentiment decreases the rate of activity by a factor of 0.934, with all other variables held constant.

Modeling activity life span
The median number of days between the date of posting and the date of last activity is zero. Almost 80% of posts have their last activity on the same day as the post date, but there are posts garnering attention for months or even years. Regression coefficients from the Cox proportional hazards model are exponentiated as hazard ratios (HR) and used in the interpretation of the survival models. It is important to note here that a longer interval is desirable for the time to last activity. Thus features with negative coefficients are beneficial. The coefficients are interpreted as follows. For continuous variables such as log-transformed counts of page likes, a unit increase in these values may change the time to last activity with all other variables remaining constant. For binary variables (each post type or each semantic group) the time to last activity may increase or decrease based on the presence of a feature compared to its absence in a post.

In Table 10 we find that a unit increase in the number of log-transformed page likes increases the time to last activity by 34.6% with all other variables remaining constant. A unit increase in positive sentiment increases the time to last activity by 2.1% while a unit increase in negative sentiment has no significant association with the time to last activity. Of the various post types, photos and videos are both linked to an increase in the time to last activity. The other post types are not significantly associated with the time to last activity. Amongst the 15 semantic groups, only eight are significantly related to the time to last activity. Posts containing the semantic groups 'Activities & Behavior', 'Concepts & Ideas', 'Genes & Molecular Sequences', 'Phenomena' and 'Procedures' are positively related by 2.9, 2.3, 13.6, 6.5 and 2.7% respectively. 'Devices', 'Organizations' and 'Occupations' are the only ones that decrease the time to last activity, by 14.7, 4.3 and 5.6% respectively.
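The interpretation rule described above, in which negative coefficients are beneficial, can be sketched in a few lines. Here the modeled event is a post's last activity, so a hazard ratio below one means that event tends to occur later. The function names and the example coefficients are hypothetical and are not values from Table 10.

```python
import math


def hazard_ratio(beta: float) -> float:
    """Exponentiate a Cox regression coefficient to obtain the hazard ratio (HR)."""
    return math.exp(beta)


def hazard_percent_change(beta: float) -> float:
    """Percentage change in the hazard of the last-activity event per unit increase."""
    return (hazard_ratio(beta) - 1.0) * 100.0


def effect_on_duration(beta: float) -> str:
    """A lower hazard delays the last activity, so the engagement interval is longer."""
    if beta < 0:
        return "longer"
    if beta > 0:
        return "shorter"
    return "unchanged"


# Hypothetical coefficients, for illustration only.
for beta in (-0.1, 0.0, 0.1):
    print(beta, round(hazard_percent_change(beta), 2), effect_on_duration(beta))
```

Under this reading, a feature whose coefficient is negative lowers the hazard and is therefore associated with a longer time to last activity.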

Discussion
Our results show that there is considerable difference between levels of Facebook use and public engagement among organizations. OS and CDC have the most Facebook posts while NIH/NINDS and NIH/NIGMS have fewer than 200 posts. In terms of engagement, CDC, with more than 7000 posts, generates the most Facebook activity among agencies. Overall, fewer than 5% of Facebook posts get more than 100 total shares, likes or comments. We also found that an account's page likes have a strong positive relationship with Facebook activity. This is in line with previous research where page likes have been used as a proxy for engagement with specific health condition pages on Facebook [53]. While it is not an easy task for agencies to increase the number of users liking a page [54], it is still an easy metric to follow. Results also show that photos, videos or interactive links may increase the likelihood of getting activity, consistent with previous studies [31,55,56], which show that media content and links are key to engaging Facebook users. Quite surprisingly, question-related posts, which are typically posted to encourage public participation or interaction, are apparently not useful in engaging the public. As observed in previous research [31], it can be argued that while questions might encourage user comments, they are unlikely to encourage likes or shares. Organizations could explore more innovative ways to frame questions that would encourage user engagement. The presence of positive sentiment in posts from these government agencies is associated with higher activity. We speculate that positive posts generate greater readership and thus higher engagement compared to negative posts on Facebook, especially in the healthcare domain. This is in contrast to previous research, albeit in a different domain, which shows that users participate more in discussions regarding problems or concerns in political posts with negative affect [57].
Semantic groups have not been previously studied in the context of Facebook activities. We found that posts about activities and behaviors, and phenomena, are positively associated with level of engagement. In contrast, posts about organizations and occupations tend to lower engagement. It may be that such posts are meant to be more informative than engaging.

Comparison with other studies
With goals similar to this research (i.e. to identify factors associated with engagement), we previously published an article where we analyzed tweets from 130 U.S. Federal health agency Twitter accounts [47]. Nineteen out of the 24 Facebook agencies studied here also had accounts on Twitter. Here we compare and contrast the findings from our previous Twitter-based study with our findings from this study. Comparison of accounts from the same agencies across the two platforms shows that Twitter-based accounts post more than Facebook-based accounts. This is likely due to the relative simplicity of Twitter postings. However, Facebook posts on average get more likes, shares or comments than tweets get retweets. In fact, around 27% of Facebook posts get more than 15 total likes, shares and comments, compared to only 10% of tweets that get more than 15 retweets. Comparison of the results of the statistical models from the two platforms reveals many interesting findings. As in Facebook, the use of URLs in tweets translates to higher engagement. Interestingly, while positive sentiment in Facebook posts correlates with higher engagement, it has negative or no association with the level of engagement on Twitter. The reasons for this are not obvious and we would like to investigate this in future research. In terms of semantic categorization, we find that across both social media platforms posts about activities and behaviors, and phenomena, are positively associated with level of engagement. In contrast, posts about organizations and occupations tend to lower engagement across both platforms. Overall, we find our results to be consistent and our methods to be robust for engagement analysis on Facebook and Twitter.

Limitations
Our research has a few limitations. First, the social media landscape is extremely dynamic. We captured the number of likes, shares and comments as well as the time to last activity of a Facebook post as a snapshot within this dynamic system. Hence the recorded numbers may have changed over time. While our longitudinal data analysis shows that for four out of five posts all activities are generated on the date of the posting itself, we cannot guarantee that a post won't gather any activity after months or years. This limitation, however, is bound to affect almost any social media based research that is conducted at a specific point in time and uses these counts or similar ones as metrics. Second, our study focused only on U.S. Federal health agencies and thus our findings may not be generalizable to other organizations. While we find ample evidence that our findings mirror those of Facebook studies in other domains (as shown in the Discussion section), we would like to investigate the generalizability of our approach in future studies. Third, the intent behind a post is known only to the posting agency. It could be to encourage discussion or to disseminate information. Engagement may not always be the primary motivation of every post or every agency. Hence our results should not be interpreted as general performance metrics for these agencies. Finally, we studied a specific set of features and their correlation with the extent and duration of engagement. While we included many commonly used features as well as some novel ones in this study, there could be other features, such as post frequency [58] or posting time [59], that are also correlated with engagement.

Conclusion
While some previous studies (referenced earlier) have focused on engagement of health departments at a local level, to the best of our knowledge, we present the first comprehensive analyses of engagement with U.S. Federal health agencies on Facebook. Examination of over 45,000 Facebook posts from 72 Facebook accounts belonging to 24 U.S. Federal health agencies reveals a wide range of activity across these accounts. We find that a very small fraction of the 45,000 posts get more than 100 likes, shares or comments, while one-fifth of posts see no activity at all. Content analyses of the posts show, for example, that the majority of posts contain links and are generally positive in sentiment. Statistical analyses show that the number of page likes of an account is associated with higher engagement. We also find that posts containing media or links and expressing positive sentiment correlate with higher or longer engagement. Depending on their goals and objectives, U.S. Federal health agencies may use these findings as recommendations for communications on Facebook.