OSoMe: The IUNI observatory on social media

Abstract

The study of social phenomena is becoming increasingly reliant on big data from online social networks. Broad access to social media data, however, requires software development skills that not all researchers possess. In this note we present the IUNI Observatory on Social Media, an open analytics platform designed to facilitate computational social science. The system leverages a historical corpus of over 70 billion public messages from Twitter, for which data collection is ongoing. We illustrate a number of interactive, publicly available tools to retrieve, visualize, and analyze derived data from this collection. Preliminary evaluation shows that system performance scales well with the number and size of queries. The Observatory, now available at osome.iuni.iu.edu, is the result of a large, six-year collaborative effort coordinated by the Indiana University Network Science Institute.

These limitations point to a critical need for opening social media platforms to researchers in ways that are both respectful of user privacy requirements and aware of the needs of SBE researchers. In the absence of such systems, SBE researchers will have to increasingly rely on closed or opaque data sources, making it more difficult to reproduce and replicate findings, a practice of increasing concern given recent findings about replicability in the SBE sciences (Open Science Collaboration, 2015).

Our long-term goal is to enable SBE researchers and the general public to openly access relevant social media data. As a concrete milestone of our project, here we present

An important caveat about the use of these data for research is that possible sampling biases are unknown. When Twitter first made this stream available to the research community, it indicated that the stream contains a random 10% sample of all public tweets. However, no further information about the sampling method was disclosed.

The high-speed stream from which the data originates has a rate on the order of 10^6 to 10^8 tweets/day. Figure 1 illustrates the growth of the Twitter collection

Performing analytics at this scale presents specific challenges. The most obvious has to do with the design of a suitable architecture for processing such a large volume of data.

This requires a scalable storage substrate and efficient query mechanisms.

The core of our system is based on a distributed storage cluster composed of 10 compute nodes, each with 12 × 3TB disk drives, 2 × 146GB RAID-1 drives for the

The architecture is illustrated in Figure 2. The data collection system receives data
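As a back-of-the-envelope check on the cluster described above, the raw capacity of the data drives follows directly from the node count and the per-node drive configuration. The sketch below works through that arithmetic. The function names are illustrative, and the replication factor is an assumption: 3 is the HDFS default, but the text does not state which value the Observatory uses.

```python
def raw_capacity_tb(nodes: int, drives_per_node: int, drive_tb: float) -> float:
    """Total raw disk capacity in TB, before any replication or overhead."""
    return nodes * drives_per_node * drive_tb


def usable_capacity_tb(raw_tb: float, replication: int) -> float:
    """Capacity left for unique data when every block is stored `replication` times."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return raw_tb / replication


# Figures from the text: 10 nodes, each with 12 x 3TB data drives.
raw = raw_capacity_tb(nodes=10, drives_per_node=12, drive_tb=3.0)

# ASSUMPTION: replication factor of 3 (the HDFS default); not stated in the text.
usable = usable_capacity_tb(raw, replication=3)

print(f"raw: {raw:.0f} TB, usable at 3x replication: {usable:.0f} TB")
```

Under these assumptions the data drives provide 360 TB of raw storage. The usable figure depends entirely on the assumed replication factor and ignores file system and compression effects.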

The software is freely available (Grabowicz and Aiello, 2013). Finally, the Observatory provides access to raw data via a programmatic interface (API).

visualization parameters (Rafaeli, 1988). In the following, we give a brief overview of the available tools.

It is important to note that, in compliance with the Twitter terms of service (Twitter,

OSoMe provides two tools that allow users to explore diffusion and co-occurrence

The tool allows access to Twitter content, such as hashtags and user screen names.

This content is available both through the interactive interface itself, and as a download-

Manuscript to be reviewed
Computer Science

percentage of tweets that contain exact geolocation coordinates. Furthermore, as already discussed, this percentage has changed over time.

API

We expect that the majority of users of the Observatory will interact with its data primarily through the tools described above. However, since more advanced data needs are to be expected, we also provide a way to export the data for those who wish to create

with no particular deterioration for increasing loads (Figure 8).

To evaluate the scalability of the Hadoop-based analytics tools with increasing data size, we plot in Figure 9 the run time of queries submitted by users through OSoMe interactive tools, as a function of the number of tweets matching the query parameters. We observe a sublinear growth, suggesting that the system scales well with job size.

foster data-intensive research in the social, behavioral, and economic sciences.

We welcome feedback from researchers and other end users about usability and usefulness of the tools presented here. In the future, we plan to carry out user studies and tutorial workshops to gain feedback on effectiveness of the user interfaces, efficiency of the tools, and desirable extensions.

We encourage the research community to create new social media analytic tools by building upon our system. As an illustration, we created a mashup of the OSoMe API