Tweet-SCAN: An event discovery technique for geo-located tweets☆
Introduction
Twitter is one of the most popular social networks and microblogging sites offering location-based services that identify the geographical location of social content such as tweets. A tweet is a status message of up to 140 characters that responds to the question "What’s happening?" This update message is associated with a user and a posting time, and may carry some form of geographical location, among other metadata. In fact, [1] showed that one in five tweets is geo-located or has a location that can be inferred from user metadata. Given that about 500 million tweets are generated per day, understanding some physical-world behaviors from geo-located tweets now seems feasible.
Numerous research papers support the use of Twitter in a broad range of fields, from politics (Borge-Holthoefer et al. [2] studied the dynamics of the Spanish political movement called 15M) and epidemiology (Kim et al. [3] proposed to improve forecasting of human influenza infection) to seismology (Sakaki et al. [4] presented a detection and monitoring system to track earthquakes). As a matter of fact, we can view Twitter as a rich source of data generated by millions of distributed users acting as sensors that report what is happening right now worldwide.
An event happening in a specific location (such as a demonstration, a music concert, an accident or a street fight) will likely be reported on Twitter by means of geo-located tweets posted by users close to the event location. Nonetheless, these events are usually masked by tweets that do not contribute to any particular pattern and that can be considered noise for the event detection task. Therefore, the problem of event discovery in Location-based Social Networks (LBSNs), and specifically in Twitter, consists in uncovering and determining these events while excluding the undesired observations [5].
We propose to frame the event discovery problem as a clustering problem in which clusters are dense groups of tweets, posted by different users, that talk about an event happening near their location. Tweets not related to any event are unwanted, and we aim to group them together into a noise cluster. In a nutshell, the event discovery problem described here can be seen as an unsupervised machine learning problem that aims to group together similar event-related tweets.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise), proposed by Ester et al. [6], is a density-based clustering algorithm in which clusters are arbitrarily shaped regions with a higher density of spatial points. The algorithm defines three types of points depending on whether they belong to a dense, a sparse or an intermediate region: core points, noise points and border points, respectively. GDBSCAN (Generalized DBSCAN), proposed by Sander et al. [7], generalizes DBSCAN to use spatially extended objects instead of simple spatial points, and more advanced predicates beyond Euclidean proximity. This algorithm is a convenient framework in which to define a technique capable of separating clusters of event-related tweets from the rest.
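The three point types can be illustrated with a minimal sketch. It assumes 2-D Euclidean points, and the names classify_points, eps and min_pts are hypothetical choices for this illustration rather than notation taken from [6]. A point counts itself as part of its own neighborhood.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each 2-D point as 'core', 'border' or 'noise' in the DBSCAN sense."""
    def neighbors(i):
        # Indices of all points within eps of point i (the point itself included).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    neighborhoods = [neighbors(i) for i in range(len(points))]
    # Core points sit in dense regions: at least min_pts points within eps.
    core = {i for i, nb in enumerate(neighborhoods) if len(nb) >= min_pts}

    labels = []
    for i, nb in enumerate(neighborhoods):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            # Not dense itself, but within eps of a core point.
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Four points on a unit square, one point hanging off its side, one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (10.0, 10.0)]
labels = classify_points(points, eps=1.1, min_pts=3)
print(labels)  # -> ['core', 'core', 'core', 'core', 'border', 'noise']
```

This sketch only classifies points. A full DBSCAN run would additionally expand clusters from the core points through density-reachability.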
Therefore, we present Tweet-SCAN, a novel event discovery technique that adapts the DBSCAN algorithm (or, equivalently, particularizes GDBSCAN) to cope with Twitter objects, considering their spatial, temporal, textual and user dimensions. Tweet-SCAN takes the notion of spatially extended objects from GDBSCAN and implements independent neighborhood identification in each separate dimension to group close neighbors into a dense cluster, which is finally associated with an event.
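The idea of checking each dimension independently can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the Tweet fields, the thresholds eps3, min_pts and min_users, the haversine distance for space, and the half_l1 stand-in for the textual distance are all hypothetical. Only the existence of spatio-temporal parameters ϵ1 and ϵ2 comes from the text.

```python
import math
from dataclasses import dataclass

@dataclass
class Tweet:
    lat: float     # latitude in degrees
    lon: float     # longitude in degrees
    time: float    # posting time in seconds
    topics: tuple  # probability distribution over topics
    user: str

def haversine_m(a, b):
    """Great-circle distance in meters between two tweets."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(a.lat), math.radians(b.lat)
    dphi = p2 - p1
    dlmb = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))

def half_l1(p, q):
    """Stand-in textual distance (total variation), NOT the Jensen-Shannon distance."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))

def is_neighbor(a, b, eps1, eps2, eps3, text_dist=half_l1):
    """True only if a and b are close in space AND in time AND in text."""
    return (haversine_m(a, b) <= eps1
            and abs(a.time - b.time) <= eps2
            and text_dist(a.topics, b.topics) <= eps3)

def is_core(i, tweets, eps1, eps2, eps3, min_pts, min_users):
    """Assumed density test: enough neighbors, posted by enough distinct users."""
    nb = [t for t in tweets if is_neighbor(tweets[i], t, eps1, eps2, eps3)]
    return len(nb) >= min_pts and len({t.user for t in nb}) >= min_users

tweets = [
    Tweet(41.3851, 2.1734, 0, (0.9, 0.1), "u1"),
    Tweet(41.3852, 2.1735, 60, (0.8, 0.2), "u2"),
    Tweet(41.3850, 2.1733, 120, (0.85, 0.15), "u3"),
    Tweet(41.3851, 2.1734, 90, (0.1, 0.9), "u4"),   # same place and time, other topic
    Tweet(41.4036, 2.1744, 100, (0.9, 0.1), "u5"),  # same topic, roughly 2 km away
]
params = dict(eps1=250.0, eps2=1800.0, eps3=0.3)
dense = is_core(0, tweets, min_pts=3, min_users=3, **params)
off_topic = is_core(3, tweets, min_pts=3, min_users=3, **params)
```

With these values, the first tweet is a core object because three tweets from three different users agree in all dimensions, while the off-topic tweet is not, even though it was posted at the same place and time.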
The textual part of a tweet is modeled through a probabilistic topic model [8], the Hierarchical Dirichlet Process (HDP) [9], which can be seen as the nonparametric extension of Latent Dirichlet Allocation (LDA) [10]. This nonparametric topic model represents the textual dimension of each tweet as a Categorical probability distribution over topics. To assess the similarity of tweet messages, we propose to use the Jensen–Shannon distance [11], a proper and natural metric for probability distributions that outperforms other measures in terms of semantic similarity [12] and categorization accuracy [13].
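As a minimal sketch, the Jensen–Shannon distance between two topic distributions can be computed as the square root of the Jensen–Shannon divergence. The function names and the choice of base-2 logarithms (which bounds the distance in [0, 1]) are assumptions made for this illustration, and the topic vectors are invented values rather than HDP output.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence (base-2 logs)."""
    # The mixture m is positive wherever p or q is, so the KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding

# Invented topic distributions for three tweets.
concert = [0.7, 0.2, 0.1]
similar = [0.6, 0.3, 0.1]
unrelated = [0.05, 0.05, 0.9]

d_close = js_distance(concert, similar)
d_far = js_distance(concert, unrelated)
```

Tweets with similar topic mixtures obtain a smaller distance than tweets about unrelated topics, and two distributions with disjoint support reach the maximum distance of 1.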
The algorithm's capability to uncover events is assessed on a real data set composed of geo-located tweets from Barcelona during its local festivities, called “La Mercè”, in September of 2014 and 2015. This data set has been crawled through the Twitter Streaming API via a distributed system called Hermes [14]. It has been shown that the Twitter Streaming API returns all geo-located tweets within the bounding box, instead of a sample [15]. Furthermore, some tweets have been manually tagged and assigned to events based on our expert knowledge about the festivities. This tagging process allows us to quantitatively evaluate the algorithm and to interpret its parameters.
The rest of the paper is organized as follows. First, we present the background for this study in Section 2. Next, the Tweet-SCAN technique is described in detail in Section 3. Section 4 contains a descriptive analysis of both data sets from the “La Mercè” festivities. Then, in Section 5, we assess the discovery capabilities of Tweet-SCAN by studying different parameter settings. Finally, we present the main conclusions of this work and identify future challenges in Section 6.
Section snippets
Background
There is a broad literature on event discovery in Location-based Social Networks, which can be further classified based on the features used by the algorithm. Following the well-structured background study presented by Yuan et al. [16] with regard to geographical topic modeling, we attempt to classify the event discovery literature in a very similar manner. As stated above, we group the existing proposals in this area of research based on the combination of features used: time, content,
Tweet-SCAN technique
Tweet-SCAN is an event discovery technique for geo-located tweets which is based on the GDBSCAN clustering algorithm presented by Sander et al. [7]. In the following section, we describe Tweet-SCAN by first introducing the main elements of DBSCAN [6] and then generalizing them into GDBSCAN to make clustering of tweets possible. Properly defined GDBSCAN predicates ensure that the resulting Tweet-SCAN clusters match real-world events. Therefore, we define proper Tweet-SCAN predicates
“La Mercè” data sets
In order to evaluate Tweet-SCAN, we have collected data through the Twitter Streaming API via Hermes [14]. In particular, we have established a long-standing connection to Twitter's public stream which filters all tweets geo-located within the bounding box of Barcelona city. This long-standing connection was established during the local festivities of “La Mercè”,
Tweet-SCAN assessment
In this section, we assess Tweet-SCAN for the task of event discovery in “La Mercè”. In particular, we aim to find evidence of the benefits of considering the textual component. We also seek to give insight into the role of each parameter, as well as to provide a scheme for determining them.
To that end, we first introduce some clustering measures to evaluate Tweet-SCAN performance against the tagged data set. Then, we present a heuristic to determine its spatio-temporal parameters (ϵ1, ϵ2
Conclusions and future work
To the best of our knowledge, Tweet-SCAN provides a first step in using spatial, temporal, textual and user features for the purpose of uncovering real-world events in an unsupervised manner from Location-based Social Networks like Twitter. The formulation of Tweet-SCAN within the framework of DBSCAN allows events to be understood as density-connected sets of tweets, and allows similar heuristics to be used for determining some of the parameters.
The results of Tweet-SCAN point to the benefits of using text, when
Acknowledgments
This work is partially supported by Obra Social “la Caixa”, by the Spanish Ministry of Science and Innovation under contract TIN2015-65316, by the BSC-CNS Severo Ochoa programs (SEV2015-0493 and SEV-2011-00067), by the SGR program (2014-SGR-1051) of the Catalan Government and by the COR (TIN2012-38876-C02-01) project. We would also like to thank the reviewers for their constructive feedback.
References (28)
- et al. A study of retrospective and on-line event detection. Proceedings of the Twenty-first Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1998)
- Social media location intelligence: The next privacy battle-an ArcGIS add-in and analysis of geospatial data collected from twitter.com. Int. J. Geoinform. (2013)
- et al. Structural and dynamical patterns on online social networks: the Spanish May 15th movement as a case study. PloS One (2011)
- et al. Use of Hangeul Twitter to track and predict human influenza infection. PloS One (2013)
- et al. Earthquake shakes Twitter users: real-time event detection by social sensors. Proceedings of the Nineteenth International Conference on World Wide Web (2010)
- Tutorial on location-based social networks. Proceedings of the Twenty-first International Conference on World Wide Web (WWW) (2012)
- et al. A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 1996 International Conference on Knowledge Discovery and Data Mining (KDD) (1996)
- et al. Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Min. Knowl. Discov. (1998)
- Probabilistic topic models. Commun. ACM (2012)
- et al. Hierarchical Dirichlet processes. J. Am. Stat. Assoc. (2006)
- Latent Dirichlet allocation. J. Mach. Learn. Res.
- A new metric for probability distributions. IEEE Trans. Inf. Theory
- Comparing measures of semantic similarity
- Research on the categorization accuracy of different similarity measures on Chinese texts. Proceedings of the 2011 International Conference on Business Management and Electronic Information (BMEI)
☆ This paper has been recommended for acceptance by Eva Armengol.