Measuring online social bubbles

Social media have become a prevalent channel to access information, spread ideas, and influence opinions. However, it has been suggested that social and algorithmic filtering may cause exposure to less diverse points of view. Here we quantitatively measure this kind of social bias at the collective level by mining a massive dataset of web clicks. Our analysis shows that collectively, people access information from a significantly narrower spectrum of sources through social media and email, compared to a search baseline. The significance of this finding for individual exposure is revealed by investigating the relationship between the diversity of information sources experienced by users at both the collective and individual levels in two datasets where individual users can be analyzed: Twitter posts and search logs. There is a strong correlation between collective and individual diversity, supporting the notion that when we use social media we find ourselves inside "social bubbles." Our results could lead to a deeper understanding of how technology biases our exposure to new information.


The rapid adoption of the Web as a source of knowledge and a social space has made it ever more difficult for people to manage the constant stream of news and information arriving on their screens. Content providers and users have responded to this problem by adopting a wide range of tools and behaviors that filter and/or rank items in the information stream. One important result of this process has been higher personalization (Mobasher et al., 2000): people see more content tailored specifically to them based on their past behaviors or social networks. Recommendation systems (Ricci et al., 2011), for example, suggest items in which one is more likely to be interested based on previous purchases, past actions of similar users, or other criteria based on one's past behavior and friends. Search engines provide personalized results as well, based on browsing histories and social connections (Google, 2009b,a).

It is common for users themselves to adopt filters in their online behavior, whether they do so consciously or not. For example, on social platforms such as Facebook, a large portion of users are exposed to news shared by their friends (Bakshy et al., 2012; Matsa and Mitchell, 2014). Because of the limited time and attention people possess and the large popularity of online social networks, the discovery of information is being transformed from an individual into a social endeavor. While the tendency to selectively expose ourselves to the opinions of like-minded people was present in the pre-digital world (Hart et al., 2009; Kastenmüller et al., 2010), the ease with which we can find, follow, and focus on such people and exclude others in the online world may enhance this tendency. Regardless of whether biases in information exposure are stronger today than in the pre-digital era, the traces of online behavior provide a valuable opportunity to quantify such biases.
While useful, personalization filters, whether they are algorithmic, social, or a combination of both, and whether they are used with or without user awareness, have biases that affect our access to information in important ways. In one line of reasoning, Sunstein (2002, 2009) and Pariser (2011) have argued that [...] exposure to some ideologically challenging news. But how does this compare to other ways of discovering information?

In a different Facebook study, users, especially partisan ones, were more likely to share articles with which they agree (An et al., 2014). Similar patterns can be seen on other platforms. On blogs, commenters are several times more likely to agree with each other than not (Gilbert et al., 2009), and liberals and conservatives primarily link within their own communities (Adamic and Glance, 2005). On Twitter, political polarization is even more evident (Conover et al., 2011, 2012). When browsing news, people are more likely to be exposed to like-minded opinion pieces (Flaxman et al., 2013), and to stay connected and share articles with others having similar interests and values (Grevet et al., 2014). In the context of controversial events that are highly polarizing, web sources tend to be partial and unbalanced, and only a small fraction of online readers visit more than two different sources (Koutra et al., 2014). To respond to such narrowing of online horizons, researchers have started to concentrate on more engaging presentation [...] could lead to highly diverse exposure. In this study we look at the diversity of information exposure more broadly. Our goal is to examine biases inherent in different types of online activity: information search, one-to-one communication from email exchanges, and many-to-many communication captured from social media streams.
How large is the diversity of information sources to which we are exposed through interpersonal communication channels, such as social media and email, compared to a baseline of information seeking? We answer this question at the collective level by analyzing a massive dataset of Web clicks. In addition, we investigate how this analysis relates to the diversity of information accessed by individual users through an analysis of two additional datasets: Twitter posts and search logs. Figure 1 illustrates our empirical analysis: we measure how the visits by people engaged in different types of online activities are distributed across a broad set of websites (Fig. 1(a, c)) or concentrated within a few (Fig. 1(b, d)).

We carry out our analyses on all web targets as well as on targets restricted to news sites. The latter are of particular relevance when examining bias in public discourse. We do not make any additional distinctions regarding the type of content people visit, such as opinion pieces versus reporting, or differing ideological slant. We do not consider beliefs, past behaviors, or specific interests of information consumers. These deliberate choices are designed to yield quantitative measures of bias that do not depend on subjective assessments. Our results are therefore general and applicable to different topics, geographical regions, interests, and media sources.

To study the diversity of information exposure we use a massive collection of Web clicks as our primary dataset, and two supplementary datasets of link shares on Twitter and AOL search clicks. Code is available to reproduce our entire data processing procedure, described next.

The click data we use comes from a publicly available dataset collected at the edge of the Indiana University network (Meiss et al., 2008), which allows us to obtain a trace of web requests. Each request record has a target page, a referrer (source) page, and a timestamp. Privacy concerns prevent any identifying information about individual users or clients from being collected, making it impossible to associate any request with any particular computer or person. We only use the traffic coming from [...]

Since in the click data it is not possible to distinguish with full certainty requests resulting from human clicks from requests auto-generated by pages, we filter out any requests for files other than web pages, such as JavaScript, images, and video, based on the file extension. This shrinks the dataset by a factor of five. Since the file extension is not always present in the URL, this method is not guaranteed to remove all non-human click data. However, it provides a good first approximation of human clicks, and we further address this issue with additional data filtering described later in this section.
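The extension-based filter can be sketched as follows. This is an illustrative reconstruction, not the paper's exact code, and the extension list here is a small sample standing in for whatever list was actually used:

```python
from urllib.parse import urlparse

# Illustrative set of non-page extensions; the paper's exact list is not specified.
NON_PAGE_EXTENSIONS = {
    ".js", ".css", ".png", ".jpg", ".jpeg", ".gif", ".ico",
    ".mp4", ".flv", ".swf", ".xml", ".json",
}

def is_probable_page(url):
    """Keep a request only if its path does not end in a known non-page extension."""
    path = urlparse(url).path.lower()
    dot = path.rfind(".")
    if dot == -1:
        return True  # no extension at all: assume a web page
    return path[dot:] not in NON_PAGE_EXTENSIONS

requests = [
    "http://example.com/story.html",
    "http://cdn.example.com/logo.png",
    "http://example.com/about",
]
pages = [u for u in requests if is_probable_page(u)]
```

As the text notes, URLs with no extension (like the `/about` path above) are kept by default, which is why this filter is only a first approximation.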

Once non-human traffic is removed from the dataset based on file extensions, the path in the URL is discarded and the resulting clicks are identified only by the referrer and target domains. We take referrer and target domains as proxies for websites. This level of granularity allows us to address the research question while avoiding the problem of the sparseness of traffic at the page level: users typically visit most pages only once.

Even if we identify a domain with a website, not all sites are equal: wikipedia.org has more diverse content than kinseyinstitute.org. Furthermore, one needs to decide whether to represent domains at the second or higher level. In many instances, higher-level domains reflect infrastructural or organizational differences that are not important for measuring diversity (e.g., m.facebook.com vs. l.facebook.com). In other cases, using second-level domains may miss important content differences (e.g., sports.yahoo.com vs. finance.yahoo.com). To address this issue, we performed our analysis using both second- and third-level domains. As discussed below, these analyses yield very similar results. In the remainder of the paper we consider second-level domains, but account for special country cases; for example, domains such as theguardian.co.uk are considered as separate websites. Once we have a definition of a website, we use the number of clicks in the data to compute a diversity measure as discussed below.
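A minimal sketch of this domain reduction, with a tiny hand-picked set of country suffixes standing in for a full public-suffix list (the paper's actual country-case handling is not specified):

```python
# Illustrative, not exhaustive; a real implementation would use a public-suffix list.
COUNTRY_SUFFIXES = {"co.uk", "com.au", "co.jp"}

def website(hostname):
    """Reduce a hostname to its second-level domain, keeping one extra
    label when the last two labels form a country suffix."""
    parts = hostname.lower().split(".")
    if len(parts) >= 3 and ".".join(parts[-2:]) in COUNTRY_SUFFIXES:
        return ".".join(parts[-3:])   # e.g. theguardian.co.uk stays intact
    return ".".join(parts[-2:])       # e.g. m.facebook.com -> facebook.com
```

Under this mapping, m.facebook.com and l.facebook.com collapse to the same website, while theguardian.co.uk is preserved as its own site.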

After extracting the domain at the end points of each click, we examined the most popular referrers in the dataset and manually assigned them to the search, social media, and email categories. We then filtered the click data to only include referrers from these categories. In addition, we excluded targets from these same categories, because we are specifically interested in the acquisition of new information.

For example, activities such as refining searches on Google and socializing on Facebook are unlikely to represent such discovery.

Subsequent data filtering was performed to exclude other likely non-human traffic, such as traffic to ad and image servers, traffic resulting from game playing or from browser applications such as RSS readers, and traffic to URL shortening services. Since it is impossible to exclude all non-human traffic, we focused on filtering out those target domains that constitute a significant portion of overall traffic. We used an iterative procedure in which we examined the top 100 targets for each category and manually identified traffic that is non-human. This procedure was repeated until the list of top 100 domains in each category was composed of legitimate targets. Table 1 lists the top six referrers in each category.

The filtered dataset includes over 106 million records, roughly representing someone clicking on a link from a search engine, email client, or social media site, and going to one of almost 7.18 million targets outside these three categories.

[...] Table 2 and crawling their subcategories recursively. Following the crawl, the list of news targets was filtered as follows.

1. Each URL was transformed to a canonical form and only the domain name was kept.
2. Domains falling into one of the predefined categories (search, social media, and email) were removed. URLs from popular blogging platforms, wiki platforms, and news aggregators were also removed (see Table 3).
3. An iterative filtering procedure was applied to remove targets of non-human traffic, such as from RSS clients, advertising, and content servers.
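Steps 1 and 2 of this filtering can be sketched as below. The exclusion sets are hypothetical examples standing in for the actual lists in Tables 2 and 3; the final iterative step (3) was manual and is not reproduced here:

```python
from urllib.parse import urlparse

# Hypothetical stand-ins for the paper's exclusion lists (Table 3).
PLATFORM_DOMAINS = {"blogspot.com", "wordpress.com", "wikipedia.org"}
CATEGORY_DOMAINS = {"google.com", "facebook.com", "twitter.com"}

def canonical_news_domain(url):
    """Step 1: canonicalize a crawled URL to a bare domain name.
    Step 2: return None for domains in the excluded categories/platforms."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    if host in CATEGORY_DOMAINS or host in PLATFORM_DOMAINS:
        return None
    return host
```

Applying this over the crawled URL list and deduplicating the surviving domains yields a candidate news-site list like the one described in the text.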

The above procedure resulted in nearly 3,500 news sites. We used this list to filter the targets in the click collection, yielding the news dataset used in our analysis.

To quantify the diversity of an information source s, we look at all targets reached from websites in category s and compute the Shannon entropy

H(s) = − Σ_{t ∈ T(s)} p_t log p_t

where T(s) is the set of target websites reached from referrer sites in s, and p_t is the fraction of clicks requesting pages in website t.

Entropy (Shannon, 1948) is a measure of uncertainty over a set of possible outcomes. It is maximized when all outcomes are equally likely, and minimized when only a single outcome is likely (indicating full certainty). Used over the set of domain probabilities as we have done above, the entropy gives the uncertainty in the websites that will be accessed given a category of referrers. By measuring diversity over a set of domains, our approach captures the intuition that visiting 10 pages (for example, news articles) from 10 different sites implies a more diverse exposure than visiting 10 pages from the same site. The implications of this assumption are further debated in the Discussion section. We considered an alternative method of measuring diversity based on the Gini coefficient (Sen, 1973), and found the results discussed below to be robust with respect to the choice of diversity measure.

The traffic volume in our click dataset varies significantly over time and across the three categories, as shown in Fig. 2(a). A similar pattern emerges for the dataset of news targets (see inset). These vast volume differences make it necessary to understand the relationship between traffic volume and the diversity of an information source. To do so, we measure the diversity over samples of increasing numbers of clicks.

From Fig. 2(b) we see that the diversity measurements indeed depend on volume, especially for small numbers of clicks; as the volume increases, the diversity tends to plateau. However, the dependence of diversity on the number of clicks is different for each category of traffic. Therefore, instead of normalizing each category of traffic by a separate normalization curve, we account for the dependence by using the same number of clicks. This makes our approach easier to generalize to more categories and datasets, since it does not require fitting a separate curve to each case. We compute the diversity over traffic samples of the same size (50,000 clicks per month for all targets, and 1,000 clicks per month for news targets) for each category in our analysis.
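One way to realize this fixed-size comparison is to average the entropy of repeated random samples of the same number of clicks. The sketch below samples with replacement; the trial count, seed, and sampling scheme are illustrative assumptions, not choices taken from the paper:

```python
import random
from collections import Counter
from math import log

def sampled_entropy(target_clicks, sample_size, trials=100, seed=42):
    """Estimate diversity at a fixed traffic volume by averaging the entropy
    of `trials` random samples of `sample_size` clicks (with replacement)."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        sample = rng.choices(target_clicks, k=sample_size)
        counts = Counter(sample)
        values.append(-sum((c / sample_size) * log(c / sample_size)
                           for c in counts.values()))
    return sum(values) / trials
```

Because every category is evaluated at the same sample size, the resulting entropies are directly comparable without fitting a category-specific normalization curve.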

In the second part of our analysis we make use of two auxiliary datasets to disentangle the relationship between collective diversity, as seen in the targets accessed by a community of users, and individual diversity, as seen in the targets accessed by a single user. From both datasets, we are able to recover a referring website, a target website, and an associated user identifier.

[...] period of three and a half years (see Fig. 3(b)). This empirical evidence suggests that social media expose the community to a narrower range of information sources, compared to a baseline of information seeking activities. Fig. 4 illustrates the top targets of traffic from search and social media in a typical week.

The diversity of targets reached via email also seems to be higher than that of social media; however, the difference is smaller and its statistical significance is weaker due to the larger noise in the data. The difference in entropy is larger and more significant for traffic from email sources to news targets.

While we wish to ultimately understand the biases experienced by individuals, the diversity measurements based on anonymous traffic data do not distinguish between users, and therefore they reveal a collective social bubble, as illustrated in Fig. 1(c,d). It is at first sight unclear whether the collective bubble implies individual bubbles, or tells us anything at all about individual exposure. The number of clicks per user, or even the number of users, could vary to produce different individual diversity patterns resulting in the same collective diversity. In theory, high collective diversity could be consistent with low individual diversity, and vice versa. Therefore we must investigate the relationship between collective and individual diversity measurements. To this end, we analyze the two auxiliary datasets where user information is preserved (see Methods). For both datasets, we measure the diversity for individual users, and collectively, disregarding user labels. The strong correlation between collective diversity and average user diversity (Fig. 5) suggests that our results relate not only to a collective bubble, but also to individual social bubbles, as illustrated in Fig. 1(a,b).
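The two quantities being compared can be sketched as follows. The function names and the toy click data are illustrative; note how the second toy example shows exactly the theoretical possibility mentioned above, where collective diversity is high but every individual's diversity is zero:

```python
from collections import Counter, defaultdict
from math import log

def entropy(counts):
    """Shannon entropy (natural log) of a Counter of clicks per website."""
    total = sum(counts.values())
    return -sum((c / total) * log(c / total) for c in counts.values())

def collective_and_individual(clicks):
    """clicks: list of (user_id, target_site) pairs.
    Returns (collective entropy ignoring users, mean per-user entropy)."""
    by_user = defaultdict(Counter)
    overall = Counter()
    for user, site in clicks:
        by_user[user][site] += 1
        overall[site] += 1
    mean_user = sum(entropy(c) for c in by_user.values()) / len(by_user)
    return entropy(overall), mean_user
```

Computing these two numbers per time window, and correlating them across windows, is the kind of comparison summarized in Fig. 5.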

We have presented evidence that the diversity of information reached through social media is significantly lower than through a search baseline. As the role of social media in supporting information diffusion grows, there is also an increased danger of reinforcing our collective filter bubble. A similar picture emerges when we specifically look at news traffic: the diversity of social media communication is significantly lower than that of search and interpersonal communication. Given the importance of news consumption to civic discourse, this finding is especially relevant to the filter bubble hypothesis.

Our results suggest that social bubbles exist at the individual level as well, although our evidence is based on the relationship between collective and individual diversity and is therefore indirect. Analysis of traffic data with (anonymized) user identifiers will be necessary to confirm this conclusively. In addition, [...]

These results also come with the caveat that in our analysis we do not try to quantify the diversity inside each domain. We are assuming that the diversity of content is higher across different domains than across the pages within a single domain. The problem of quantifying the diversity of the content inside a single domain is a significant research problem in its own right, and one that would greatly benefit this [...]

We are grateful to Mark Meiss for collecting the web traffic dataset used in this paper.

Data and materials availability: The web traffic dataset used in this paper is available to researchers.

Code to reproduce our analysis is at github.com/dimitargnikolov/web-traffic-iu.

[Fig. 2 caption, fragment] ... percentiles, respectively. The horizontal line and the hollow point inside each box mark the median and mean entropy, respectively. The filled points are outliers. The uncertainty was computed over data points representing the clicks that occurred over one calendar month. (b) Entropy as a function of time. We smooth the data by applying a running average over a three-month sliding window that moves in increments of one month. Error bars are negligibly small and thus omitted. The insets plot the entropy for news traffic (same scale if not shown).

[Fig. 4 caption, fragment] This illustration refers to a typical week, with entropy close to (within one standard deviation from) average. The area of each rectangle is proportional to the number of clicks to that target. While these websites reflect the sample of users from Indiana University as well as the time when the data was collected, these contexts apply to both categories of traffic. Therefore the higher concentration of social media traffic on a small number of targets is meaningful.