A New Source of Data for Public Health Surveillance: Facebook Likes

Background: Investigation into personal health has become focused on conditions at an increasingly local level, while response rates have declined and complicated the process of collecting data at an individual level. Simultaneously, social media data have exploded in availability and have been shown to correlate with the prevalence of certain health conditions.
Objective: Facebook likes may be a source of digital data that can complement traditional public health surveillance systems and provide data at a local level. We explored the use of Facebook likes as potential predictors of health outcomes and their behavioral determinants.
Methods: We performed principal components and regression analyses to examine the predictive qualities of Facebook likes with regard to mortality, diseases, and lifestyle behaviors in 214 counties across the United States and 61 of 67 counties in Florida. These results were compared with those obtainable from a demographic model. Health data were obtained from both the 2010 and 2011 Behavioral Risk Factor Surveillance System (BRFSS) and mortality data were obtained from the National Vital Statistics System.
Results: Facebook likes added significant value in predicting most examined health outcomes and behaviors even when controlling for age, race, and socioeconomic status, with model fit improvements (adjusted R²) of an average of 58% across models for 13 different health-related metrics over basic sociodemographic models. Small area data were not available in sufficient abundance to test the accuracy of the model in estimating health conditions in less populated markets, but initial analysis using data from Florida showed a strong model fit for obesity data (adjusted R² = .77).
Conclusions: Facebook likes provide estimates for examined health outcomes and health behaviors that are comparable to those obtained from the BRFSS. Online sources may provide more reliable, timely, and cost-effective county-level data than that obtainable from traditional public health surveillance systems as well as serve as an adjunct to those systems.


Introduction
Big Data has the potential to revolutionize public health surveillance. The development of the Internet and the explosion of social media have provided many new opportunities for health surveillance. In 2013, Internet use among U.S. adults and adolescents aged 12 to 17 years reached 80%-85% 1, 2 and 95%, 3 respectively, with the majority using wireless technologies to access the Internet, such as laptop computers, tablet computers, and cell phones or smartphones. 4,5 Moreover, the use of the Internet for personal health and participatory health research has exploded, largely due to the availability of online resources and healthcare information technology applications. [6][7][8][9][10][11][12][13] These online developments, plus a demand for more timely, widely available, and cost-effective data, have led to new ways of collecting epidemiological data, such as digital disease surveillance, opt-in Internet panels, and Internet surveys. [13][14][15][16][17][18][19][20][21][22][23][24][25][26][27][28][29][30] For example, over the past two decades, Internet technology has been used to identify disease outbreaks, track the spread of infectious disease, monitor self-care practices among those with chronic conditions, and assess, respond to, and evaluate natural and manmade disasters at a population level. 11,13,16,17,19,20,22,27,[31][32][33] Use of these modern communication tools for public health surveillance has proven to be less costly and more timely than traditional population surveillance modes (e.g., mail surveys, telephone surveys, and face-to-face household surveys).
The Internet has spawned several sources of "Big Data," such as Facebook, 34 Twitter, 35 Instagram, 36 Tumblr, 37 Google, 38 and Amazon. 39 These online communication channels and marketplaces provide a wealth of passively collected data that may be mined for purposes of public health, such as socio-demographic characteristics, lifestyle behaviors, and social and cultural constructs. Public health researchers need cost-effective and readily available sources of health data at the local level, and the Big Data revolution may provide a partial answer. Social networking sites, such as Facebook, have expanded to include over half of the US population, 40 allowing for digital data on Facebook users from virtually every area of the country. Moreover, researchers have demonstrated that these digital data sources can be used to predict otherwise unavailable information, such as socio-demographic characteristics among anonymous Internet users. [41][42][43][44] For example, Goel et al. 42 found no difference by demographic characteristics in the usage of social media and e-mail. However, the frequency with which individuals accessed the Web for news, healthcare, and research was a predictor of gender, race/ethnicity, and educational attainment, potentially providing useful targeting information based on ethnicity and income. 42 Integrating these Big Data sources into the practice of public health surveillance is vital to move the field of epidemiology into the 21st century, as called for in the 2012 U.S. "Big Data" initiative. 24,45 Understanding how "Big Data" can be used to predict lifestyle behavior and health-related data is one step toward the use of these electronic data sources for epidemiologic needs. 42, 46
Facebook has been used by individuals and public health researchers for novel surveillance applications. 18,43,44,[47][48][49][50] Tong 44 reported on the use of Facebook as a surveillance tool among individuals involved in intimate partner break-ups. Chunara et al. 18 used Facebook to examine the association between activity-related interests and sedentary-related interests and population obesity prevalence. These researchers found that populations with higher activity-related interests had a lower predicted prevalence of overweight and/or obesity. Facebook Likes are a means by which Facebook users can identify their own preferred Internet sites and interests. While Facebook Likes are not explicitly health-related, researchers have shown that, taken together, the 'network' of an individual's Likes is predictive of socio-demographic characteristics, health behaviors, obesity, and health outcomes. 18,43,48,50 Timian et al. 50 examined whether Facebook Likes for a hospital could be used to quickly and inexpensively evaluate two quality measures (i.e., 30-day mortality rates and patient recommendations). Facebook Likes have also been shown to be predictors of a variety of user attributes, such as intelligence, happiness, race, religious and political views, sexual orientation, and a spectrum of personality traits. 43 For example, Likes correctly distinguish between homosexuality and heterosexuality, African American and White, and Democrat and Republican at levels above 85%. Researchers have proposed that Facebook Likes be used as a new behavioral measure in a fashion similar to traditional questionnaires. 43 The power of Likes is that they represent behavior.
In this study, we focus on harnessing the predictive power of Facebook Likes for the purpose of enhancing population health surveillance. Toward this end, we view Facebook Likes as a class of "Big Data" that may help us understand population health at a local level. To do this, the data we derive from Facebook Likes must be relevant to the health metrics we seek to address. Likes must predict life expectancy, the ultimate outcome of one's quality of health. Predicting intermediary causes of a shortened lifespan, such as obesity and diabetes, is also a worthwhile stepping stone to that goal. But in order to specifically target the risk factors associated with these conditions, Likes must also be able to predict the lifestyle behaviors that contribute to poor health outcomes. Given that risk factors and the associated diseases are often clustered in populations geographically, 15, 51, 52 it is possible to identify, monitor, and intervene at a population level. If the Facebook characteristics of a region can predict physical activity, smoking, and self-care of chronic conditions (health maintenance), then a strong argument can be made in favor of the use of these data to target, monitor, and intervene on adverse lifestyle behaviors.
In this paper, we attempt to add to the scientific evidence-base on how "Big Data" might be used to complement traditional surveillance systems. We explored the use of Facebook Likes as potential predictors of health outcomes (e.g., morbidity, injury, disability, and mortality) and the behavioral determinants of poor health outcomes. Specifically, we hypothesized that: 1) Facebook Likes provide a means of characterizing communities; 2) Facebook Likes can be used as an indicator of chronic disease outcomes (obesity, diabetes, and heart disease); 3) Facebook Likes can be used as an indicator of mortality; and 4) Facebook Likes can be used as an indicator of adverse lifestyle behaviors that impact disease. If these hypotheses hold, then Facebook Likes can ultimately be used to enhance population health surveillance.

Data Sources
Data for the analysis were collected from a number of sources. Health outcome and risk behavior data were obtained from the Behavioral Risk Factor Surveillance System (BRFSS). The BRFSS is an ongoing random-digit-dialed telephone survey operated by state health agencies with assistance from the Centers for Disease Control and Prevention (CDC). The surveillance system collects data on many of the behaviors and conditions that place adults (aged ≥18 years) at risk for chronic disease, disability, and death. The large sample size of the 2011 BRFSS (n = 506,467) facilitated the calculation of reliable estimates for 224 counties with 500 or more respondents. County-level risk factor data were obtained from the 2011 Selected Metropolitan/Micropolitan Area Risk Trends (SMART) subset of the BRFSS. Health outcomes data (i.e., life expectancy, mortality, and low birth weight) were collected from the National Vital Statistics System (NVSS), which provides population data on deaths and births in the United States. 53 Given the comprehensiveness of vital records, these data represent as complete a body of information on these statistics as can be achieved. As such, they may be considered the most reliable estimates employed by this study.
Facebook Likes data were collected in February 2013 using the Facebook Advertising API, 54 which aggregates the number of users who express interest in certain categories of items by zip code. These zip code data were aggregated to the county level to allow for direct comparisons to the health data (footnote 1). The data reflect the cumulative total of Facebook users' Likes at the time they are drawn. Out of 127 categories, 40 were selected for the model from the 'super-categories' of activities, interests, and retail and shopping (footnote 2). Super-categories were selected for their theorized relationship to health. For example, "Interests" contains the "Health & Wellbeing" category, whose relationship to health is self-explanatory. The "Activities" category was chosen because it included "Outdoor Fitness & Activities," which seemed directly applicable to measures of physical activity, while "Retail & Shopping" was chosen due to its apparent linkage to socioeconomic status, a powerful driver of health outcomes. 55, 56 Other super-categories lacked these explicit links, though we acknowledge the possibility that potentially powerful indirect relationships may exist. Because rounding performed automatically by the API routinely led to overestimates, counties with fewer than 1,000 profiles overall were excluded from the analysis. Facebook Likes were scored as a percentage of completed profiles in an area. Finally, in order to reduce multicollinearity caused by variation in levels of Facebook usage by county, values were divided by the average percentage of Likes across all categories. The resulting variables can be characterized as a measure of popularity relative to that of other categories (footnote 3).
Population data, such as average income, median age, and sex ratio, were collected using the 2010 U.S. Census 57 and broken into county aggregates. Supporting county-level statistics unrelated to health were collected using "USA Counties Information" provided by the Census Bureau. 58 Overall, 214 counties in the continental United States contained sufficient data for all variables in the analysis, while analysis of mortality data was possible in 2,879 counties.
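The scoring of Likes described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the "Gardening" category, and all counts are hypothetical, and we assume the "average percentage of Likes across all categories" is computed within each county.

```python
MIN_PROFILES = 1000  # counties below this profile count were excluded in the text


def eligible_counties(profile_counts):
    """Return the names of counties with at least MIN_PROFILES profiles."""
    return [name for name, n in profile_counts.items() if n >= MIN_PROFILES]


def relative_popularity(like_counts, completed_profiles):
    """Scale one county's Like counts to relative popularity scores.

    like_counts: dict mapping category name -> number of users with that Like
    completed_profiles: number of completed profiles in the county
    """
    if completed_profiles <= 0:
        raise ValueError("completed_profiles must be positive")
    # Step 1: express each category as a percentage of completed profiles.
    pct = {cat: 100.0 * n / completed_profiles for cat, n in like_counts.items()}
    # Step 2: divide by the mean percentage across categories (assumed to be
    # the within-county mean), so scores reflect popularity relative to other
    # categories rather than overall Facebook usage.
    mean_pct = sum(pct.values()) / len(pct)
    if mean_pct == 0:
        raise ValueError("county has no Likes in any category")
    return {cat: p / mean_pct for cat, p in pct.items()}


# Hypothetical county; the counts are invented for illustration only.
county = {
    "Outdoor Fitness & Activities": 300,
    "Health & Wellbeing": 200,
    "Gardening": 100,
}
scores = relative_popularity(county, completed_profiles=2000)
```

A score above 1 means the category is more popular than the average category in that county; a score below 1 means it is less popular.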

Variables of Interest
Several sociodemographic, health outcome, and risk factor variables were selected for analysis. These include income, age, education, employment, obesity, diabetes, physical activity, and smoking, as well as other measures such as general health status. A comprehensive listing, as well as the data source and assessment of each variable of interest, is available in the online appendix (see Appendix 1).

Data Analysis
1 Zip codes crossing county borders were assigned to the county containing the largest geographic share.
2 The exact method for determining these categories has not been reported by Facebook.
3 Though the individual variables resulting from this transformation were sometimes entirely uncorrelated with the originals, estimates using the raw and transformed variables correlated at R=0.9. Thus, we conclude that the results of the subsequent analyses are not an artifact of this transformation.
First, we used principal components analysis to reduce the 40 selected Facebook Likes categories to a more parsimonious set of factors that described the variation in these categories. We then used these factors in ordinary least squares (OLS) regression to determine whether Facebook Likes could predict life expectancy. Finally, we used these Facebook factors to predict other variables, beginning with the incidence of diabetes and obesity and continuing on to a series of health-related behaviors.

Results
The first stage in the analysis was to establish that health outcomes could indeed be predicted by Facebook Likes. Through principal components analysis, the forty categories were reduced to nine factors (varimax rotation; footnote 4). Due to the complex structure of Likes contributing to these factors, we have not attempted to describe their meaning. Instead, each is numbered in accordance with the amount of variance it explains. The full matrix of loadings for the analysis can be found in Appendix 2.
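As a rough illustration of the dimension-reduction step, the sketch below extracts unrotated principal components from a covariance matrix by power iteration with deflation. It is a simplified stand-in, not the authors' procedure: the paper does not state which software was used, the varimax rotation is not implemented here, and the data and function names are hypothetical.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows` (a list of lists)."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    centered = [[r[j] - means[j] for j in range(k)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(k)]
            for i in range(k)]


def principal_components(cov, n_components, iters=500):
    """Return (eigenvalues, eigenvectors) of the largest components of `cov`."""
    k = len(cov)
    work = [row[:] for row in cov]
    values, vectors = [], []
    for _ in range(n_components):
        # Arbitrary non-uniform start vector, normalized to unit length.
        v = [float(j + 1) for j in range(k)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _step in range(iters):
            w = [sum(work[i][j] * v[j] for j in range(k)) for i in range(k)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left to extract
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the variance explained by this component.
        lam = sum(v[i] * sum(work[i][j] * v[j] for j in range(k)) for i in range(k))
        values.append(lam)
        vectors.append(v)
        # Deflate: remove the component just found before seeking the next.
        for i in range(k):
            for j in range(k):
                work[i][j] -= lam * v[i] * v[j]
    return values, vectors


# Toy data: four "counties" by two Like categories (invented numbers).
rows = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
values, vectors = principal_components(covariance_matrix(rows), n_components=2)
explained_share = values[0] / sum(values)
```

In practice one would keep only the leading components (nine of forty in the text) and use the county scores on those components as regression predictors.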
In order to test our hypothesis that Facebook Likes can be used to predict mortality on their own, we used OLS regression. We used our nine Facebook factors to predict life expectancy, with no other controls included in this initial model. The results, as shown in the Facebook Only column of Table 1, were quite strong (model R² = .69). Despite this relationship, Facebook data have value only insofar as they provide predictive value beyond that of reliable data already available through the census or other means. Regression results for an OLS model predicting life expectancy from demographic information on average age and socioeconomic status (as represented by average household income, unemployment rate, and percentage with a bachelor's degree) are shown in the socioeconomic status (SES) only column of Table 1. There is a very strong relationship to be found there as well, although it is less strong than for Facebook factors alone. Finally, the two groups of variables are combined in the last column of Table 1, indicating that while a great deal of the variance in life expectancy is shared by both the Facebook and SES variables, the addition of Facebook improves the model fit above and beyond readily available socioeconomic measures. The resulting R² = 0.80 also indicates that a considerable amount of the variation in county-level life expectancy can be explained by SES factors and Facebook Likes. Table 2 summarizes regressions run across an array of health variables and indicates the percent improvement in variance explained when Facebook Likes are added to SES, compared with SES alone. There are two conclusions we can draw from this model. First, Facebook Likes do prove to be an effective identifier of all tested disease outcomes. Second, there is a persistent benefit of Facebook Likes above and beyond that contributed by SES, though its magnitude varies widely.
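The nested-model comparison above (SES only versus SES plus Facebook factors) can be sketched as follows. This is an illustrative example with invented data, one SES variable, and one Facebook factor; the helper names are hypothetical, and we assume the percent improvement is the relative gain in adjusted R², which the text does not spell out.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols_r2(x_rows, y):
    """Fit OLS with an intercept via the normal equations; return R squared."""
    design = [[1.0] + [float(v) for v in row] for row in x_rows]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in design]
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def adjusted_r2(r2, n_obs, n_predictors):
    """Penalize R squared for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Invented data for eight hypothetical counties.
ses = [1, 2, 3, 4, 5, 6, 7, 8]             # one SES indicator
fb = [1, -1, 1, -1, -1, 1, -1, 1]          # one Facebook factor score
life_exp = [73, 70, 75, 72, 73, 78, 75, 80]

r2_ses = ols_r2([[s] for s in ses], life_exp)
r2_full = ols_r2([[s, f] for s, f in zip(ses, fb)], life_exp)
adj_ses = adjusted_r2(r2_ses, len(life_exp), 1)
adj_full = adjusted_r2(r2_full, len(life_exp), 2)
improvement_pct = 100.0 * (adj_full - adj_ses) / adj_ses
```

Using adjusted rather than raw R² matters here because adding predictors can never lower raw R², so only the adjusted figure shows whether the Facebook factors earn their place in the model.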
Our next hypothesis stated that Facebook Likes, as a measure of personality or behavior, should be able to predict the behaviors that drive health outcomes. The results in Table 2 clearly show that the Facebook Likes factors had a sizeable impact on the predictive models of all tested health-related behaviors, and in some cases, such as health insurance and exercise, the total model fit was quite strong.

Predicting Health Conditions
We have established the need for better estimates of health in small communities where survey data are insufficient. We believe a statistical model that incorporates Facebook Likes can be used for this purpose, but it is not necessary that Facebook Likes be the dominant force in the model. In our view, any variable that is available and reliable at a county level should be included in predictive models, regardless of the direction of its relationship with the measure in question. A number of the health measures used as dependent variables previously are extremely reliable non-survey statistics, and can incrementally increase model fit beyond what Facebook Likes and SES can do on their own.
Attempting to apply predictions from the 2011 SMART data creates a problem. Though predictions correlate well with actual levels in non-SMART data, mean levels are consistently upwardly biased. We hypothesize that the selection method that leads counties to be weighted according to the SMART program creates a non-representative sample with better levels of general health than we see in the United States in general, particularly in areas that are more rural. As an alternative without such problematic selection issues, we have limited our predictive model to 2010 Florida data. Florida collects over 500 interviews in 61 of its 67 counties every three years, leading to a dataset that has neither sample size shortages nor selection biases.
Using data exclusively from one state creates its own problems for a predictive model. Though the integrity of the data is very good, there is no easy way to correct for the various cultural differences between Florida and other states. Attempting to apply Florida-based models to the full set of SMART counties results in only a fair level of correlation (R = 0.63). Though it indicates that relationships exist, this is not a sufficient level of accuracy upon which to base policy decisions. Instead, we have limited our analysis to Florida in order to demonstrate the level of accuracy we feel can be achieved at a national level once a somewhat more representative selection of county-level data is made available.
The results of a predictive model are shown in Table 3. These are the averages of a 10-fold cross-validation, where ~6 counties were randomly excluded and predicted with the remaining counties in each iteration. The inclusion of vital statistics reduces but does not eliminate the contribution of Facebook Likes to the model. Although we would expect demographics and vital statistics to be very effective at predicting "healthy" versus "unhealthy" communities, we believe that the additional data provided by Facebook Likes should help to clarify the finer distinctions between communities with similar general levels of health (footnote 5).
Figures 1 & 2 show a graphical comparison of estimates versus source data in Florida, where nearly all counties were sufficiently sampled for reliable estimates. These maps are dynamically shaded from light to dark in accordance with the level of obesity. As should be apparent visually, the fit is generally good: 90% of errors in the model fall within ±2.1% (0.4 standard deviations) of CDC estimated values. The same process is repeated for general health in Figure 2.
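The 10-fold cross-validation described above can be sketched as follows. This is a simplified illustration, not the authors' model: it uses a single hypothetical predictor and synthetic data for 61 counties, whereas the paper's models combine Facebook factors, SES, and vital statistics. All function names are assumptions.

```python
import random


def make_folds(n_items, k, seed=0):
    """Randomly split the indices 0..n_items-1 into k folds of near-equal size."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def cross_validated_predictions(xs, ys, folds):
    """Predict each county from a model fitted without that county's fold."""
    preds = [None] * len(xs)
    for held_out in folds:
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        intercept, slope = fit_line(train_x, train_y)
        for i in held_out:
            preds[i] = intercept + slope * xs[i]
    return preds


def share_within(errors, tolerance):
    """Fraction of absolute errors no larger than `tolerance`."""
    return sum(1 for e in errors if abs(e) <= tolerance) / len(errors)


# Synthetic data for 61 hypothetical counties (invented, not study data).
xs = [i / 10.0 for i in range(61)]
ys = [25.0 + 2.0 * x + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]

folds = make_folds(len(xs), k=10)
preds = cross_validated_predictions(xs, ys, folds)
errors = [p - y for p, y in zip(preds, ys)]
coverage = share_within(errors, tolerance=2.1)
```

With 61 counties and 10 folds, each fold holds six or seven counties, which matches the "~6 counties" excluded per iteration in the text. The `share_within` helper mirrors the way the fit is summarized as the share of errors falling inside a fixed band.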

Discussion
When we first undertook this research plan, it was our expectation that the larger part of the measurement error that would impact our results would come through the imprecise categorization and geographic aggregation of the Facebook data. But while there are some exceptions, the consistency and strength of fit we have found seem manifest. Our models do extremely well in predicting levels of health variables across counties where data are plentiful and often diverge from BRFSS estimates where they are not. This suggests the possibility that data imputed from Facebook and vital statistics may provide a more accurate picture in small counties than attempting to aggregate improperly balanced data across several years.
Thus, we argue that Facebook can serve an intermediary role in augmenting sparse data at a community level. We have shown that it can do so already, but additional health survey data, especially in less extensively measured regions (e.g., rural), could only help. Ensuring that communities of all types are represented in sufficient number when estimating the model is a necessary step in avoiding the risk of systematic error in its predictions.
The ultimate goal of our analysis of Facebook Likes is to establish the potential contribution of "Big Data" to research that directly impacts government spending and public policy, and, most importantly, contributes to improved population health. At a fraction of the cost of traditional research, data that might seem on its face to have little to do with health can predict life expectancy and epidemic-level health problems such as diabetes and obesity. With the need to augment traditional public health surveillance systems with readily available, cost-effective, and geographically relevant health data, the use of "big epidemiologic data" comes at just the right time.
[…]en predictions of two diseases dropped from 0.94 to 0.85 with the addition of Facebook likes (Z = 2.6, p<.05), which supports this theory.

Limitations
The nature of the Facebook data source prevents it from being a useful tool in several situations. In the case of very small counties (about 9%) and in any smaller geographic areas, rounding error becomes so great that estimates cannot be reliably used, even though they may be provided by Facebook. Facebook profiles are untested as a tool for tracking the prevalence of infectious diseases. They are better suited to predicting endemic and ongoing conditions that are unlikely to fluctuate over the course of short periods of time.

Conclusion
The relationships examined here demonstrate convincingly that social media can be used as an indicator of local conditions, even those that have little relationship to the activity that takes place on Facebook. As we predicted, significant relationships that extend beyond the predictive power of local demographics exist between an area's aggregate Facebook behavior and life expectancy, the incidence of diseases, and of health-related behaviors that very well may lead to those diseases.
We have also indicated the severe shortage of health data that is available in the great majority of American counties. While even Facebook data may not reach into every corner of the United States, it seems an effective enough tool to augment the existing county-level data in the majority of counties. With demand for local health data growing, such tools seem far more cost-effective than an increase in survey surveillance, regardless of the mode through which it might be conducted.
Whether this data ultimately comes from Facebook or not is of little importance. The online landscape may change, and it may provide a different source of data that proves more viable in the future. So long as the source reflects people's activities in daily life, the same relationships should hold. Even if Facebook does prove to endure as a social institution, however, there is still room for a great deal of improvement on the models presented here. With cooperation from the social media outlets themselves, we may be able to obtain better estimates in categories that align better with our needs. In the end, our data may not suffer as a result of the rising costs of research. Instead, exploring newly opened avenues of data collection online could lead to more reliable, timely, and cost effective data than ever.
