Elsevier

Pattern Recognition Letters

Volume 93, 1 July 2017, Pages 58-68
Tweet-SCAN: An event discovery technique for geo-located tweets

https://doi.org/10.1016/j.patrec.2016.08.010

Highlights

  • We motivate event discovery from geo-located tweets in Twitter.

  • We propose to tackle this problem through density-based clustering with noise.

  • We formulate Tweet-SCAN within GDBSCAN to cope with Twitter objects.

  • We demonstrate how Tweet-SCAN is able to discover real-world events.

  • We show the benefits of considering the textual component of a geo-located tweet.

Abstract

Twitter has become one of the most popular Location-based Social Networks (LBSNs), bridging the physical and virtual worlds. Tweets, 140-character-long messages, are meant to answer the question What’s happening? Real-life occurrences and events (such as political protests, music concerts, natural disasters or terrorist acts) are usually reported through geo-located tweets by users on site. Separating event-related tweets from the rest is a challenging problem that requires exploiting several tweet features. With that in mind, we propose Tweet-SCAN, a novel event discovery technique based on the popular density-based clustering algorithm DBSCAN. Tweet-SCAN takes into account four main features of a tweet, namely content, time, location and user, to group together event-related tweets. The proposed technique models textual content through a probabilistic topic model called the Hierarchical Dirichlet Process and introduces the Jensen–Shannon distance for neighborhood identification in the textual dimension. We evaluate Tweet-SCAN on two real data sets of geo-located tweets posted during Barcelona’s local festivities in 2014 and 2015, for which some of the events were identified by domain experts beforehand. Through these tagged data sets, we assess Tweet-SCAN’s ability to discover events, justify the use of a textual component and highlight the effects of several parameters.

Introduction

Twitter is one of the most popular social networks and microblogging sites offering location-based services to identify the geographical location of social content, e.g. tweets. A tweet is a 140-character-long status message that responds to the question What’s happening? This update message is associated with a user and a posting time, and may carry some form of geographical location, among other metadata. In fact, [1] showed that one in five tweets is geo-located or has a location that can be inferred from user metadata. Given that about 500 million tweets are generated per day, understanding some physical-world behaviors from geo-located tweets now seems feasible.

Numerous research papers support the use of Twitter in a broad range of fields: in politics, Borge-Holthoefer et al. [2] studied the dynamics of the Spanish political movement called 15M; in epidemics, Kim et al. [3] proposed to improve forecasting of human influenza infection; and in seismology, Sakaki et al. [4] presented a detection and monitoring system to track earthquakes. We can therefore view Twitter as a rich source of data generated by millions of distributed users acting as sensors that report what is happening right now worldwide.

An event happening in a specific location (such as a demonstration, a music concert, an accident or a street fight) will likely be reported on Twitter by means of geo-located tweets posted by users close to the event location. Nonetheless, these events are usually masked by tweets that do not contribute to any particular pattern and that can be considered noise for the event detection task. Therefore, the problem of event discovery in Location-based Social Networks (LBSNs), and specifically in Twitter, consists of uncovering and determining these events while excluding the undesired observations [5].

We propose to frame event discovery as a clustering problem in which clusters are dense groups of tweets, posted by different users, that talk about an event happening near their location. Tweets not related to any event are unwanted, and we aim to group them together into a noise cluster. In a nutshell, the event discovery problem described here can be seen as an unsupervised machine learning problem that aims to group together similar event-related tweets.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise), proposed by Ester et al. [6], is a density-based clustering algorithm in which clusters are arbitrarily shaped regions with a higher density of spatial points. The algorithm defines three types of points depending on whether they belong to a dense, a sparse or an intermediate region: core points, noise points and border points, respectively. GDBSCAN (Generalized DBSCAN), proposed by Sander et al. [7], generalizes DBSCAN to use spatially extended objects instead of simple spatial points, and more advanced predicates beyond Euclidean proximity. This algorithm is a convenient framework for defining a technique capable of separating clusters of event-related tweets from the rest.
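The three point types can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `classify_points`, the parameter names `eps` and `min_pts`, the convention that a point counts as its own neighbor, and the toy coordinates are all assumptions made for illustration.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border' or 'noise' in the DBSCAN sense.

    Assumption: a point belongs to its own eps-neighborhood.
    """
    # eps-neighborhood of every point under Euclidean distance
    neigh = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    # core points have at least min_pts neighbors (dense region)
    core = {i for i, n in enumerate(neigh) if len(n) >= min_pts}
    labels = []
    for i, n in enumerate(neigh):
        if i in core:
            labels.append("core")
        elif any(j in core for j in n):
            labels.append("border")   # not dense itself, but next to a core point
        else:
            labels.append("noise")    # sparse region
    return labels

# Toy data (hypothetical): a tight square, one point on its fringe, one outlier.
points = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (0.1, 0.1), (0.3, 0.1), (5.0, 5.0)]
labels = classify_points(points, eps=0.25, min_pts=4)
print(labels)
```

The sketch compares every pair of points, which is quadratic in the number of points; a spatial index would normally replace the inner loop on real data.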

Therefore, we present Tweet-SCAN, a novel event discovery technique that adapts the DBSCAN algorithm (or, put differently, particularizes GDBSCAN) to cope with Twitter objects, considering their spatial, temporal, textual and user dimensions. Tweet-SCAN adopts the spatially extended objects of GDBSCAN and implements independent neighborhood identification in each dimension to group close neighbors into a dense cluster, which is finally associated with an event.
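As a rough illustration of independent neighborhood identification per dimension, the sketch below treats two tweets as neighbors only when they are close in space, in time and in text simultaneously. Everything in it is hypothetical: the `Tweet` fields, the threshold names `eps_space`, `eps_time` and `eps_text`, the projected metric coordinates, and the stand-in text distance `half_l1`. The user dimension is left out, since this excerpt does not say how it enters the predicate.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Tweet:
    x: float                  # projected coordinate in metres (assumed)
    y: float                  # projected coordinate in metres (assumed)
    t: float                  # posting time in seconds (assumed)
    topics: Sequence[float]   # distribution over topics
    user: str

def is_neighbor(a: Tweet, b: Tweet, eps_space: float, eps_time: float,
                eps_text: float,
                text_distance: Callable[[Sequence[float], Sequence[float]], float]) -> bool:
    """True only if every dimension is within its own threshold."""
    return (
        math.hypot(a.x - b.x, a.y - b.y) <= eps_space
        and abs(a.t - b.t) <= eps_time
        and text_distance(a.topics, b.topics) <= eps_text
    )

def half_l1(p, q):
    """Stand-in text distance (half the L1 distance), used only for this demo."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

a = Tweet(0.0, 0.0, 0.0, [0.9, 0.1], "u1")
b = Tweet(50.0, 0.0, 600.0, [0.8, 0.2], "u2")   # close in space, time and topic
c = Tweet(50.0, 0.0, 600.0, [0.1, 0.9], "u3")   # close in space and time only

near = is_neighbor(a, b, eps_space=100.0, eps_time=1800.0, eps_text=0.3, text_distance=half_l1)
far = is_neighbor(a, c, eps_space=100.0, eps_time=1800.0, eps_text=0.3, text_distance=half_l1)
print(near, far)
```

Passing the text distance in as a function keeps the predicate agnostic about the text model, so any distance between topic distributions can be plugged in.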

The textual part of a tweet is modeled through a probabilistic topic model [8] named the Hierarchical Dirichlet Process (HDP) [9], which can be seen as the nonparametric extension of Latent Dirichlet Allocation (LDA) [10]. This nonparametric topic model represents the textual dimension of each tweet as a Categorical probability distribution over topics. To assess the similarity of tweet messages, we propose to use the Jensen–Shannon distance [11], a proper and natural metric for probability distributions that outperforms other measures in terms of semantic similarity [12] and categorization accuracy [13].
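For two topic distributions P and Q, the Jensen–Shannon distance is commonly computed as the square root of the Jensen–Shannon divergence, that is, the average Kullback–Leibler divergence of P and Q from their midpoint M = (P + Q)/2. The sketch below assumes base-2 logarithms, which bound the distance to [0, 1]; the excerpt does not state which base or scaling the authors use, and the function name `js_distance` is illustrative.

```python
import math

def js_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions.

    Assumptions: p and q have equal length, each sums to 1, and
    logarithms are taken in base 2.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i == 0 contribute 0
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # guard against tiny negative values caused by rounding
    return math.sqrt(max(0.0, divergence))

same = js_distance([0.7, 0.3], [0.7, 0.3])       # identical topic mixtures
disjoint = js_distance([1.0, 0.0], [0.0, 1.0])   # no shared topic
partial = js_distance([0.9, 0.1], [0.5, 0.5])
print(same, disjoint, partial)
```

Unlike the Kullback–Leibler divergence, this quantity is symmetric and stays finite when one distribution assigns zero probability to a topic, which makes it convenient as a neighborhood threshold.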

The algorithm’s ability to uncover events is assessed on two real data sets composed of geo-located tweets from Barcelona during its local festivities, called “La Mercè”, in September of 2014 and 2015. These data sets have been crawled through the Twitter Streaming API via a distributed system called Hermes [14]. It has been shown that the Twitter Streaming API returns all geo-located tweets within the bounding box, instead of a sample [15]. Furthermore, some tweets have been manually tagged and assigned to events based on our expert knowledge of the festivities. This tagging process allows us to quantitatively evaluate the algorithm and to interpret its parameters.

The rest of the paper is organized as follows. First, we summarize the background for this study in Section 2. Next, the Tweet-SCAN technique is described in detail in Section 3. Section 4 contains a descriptive analysis of both data sets from the “La Mercè” festivities. Then, in Section 5, we assess Tweet-SCAN’s discovery capabilities by studying different parameter settings. Finally, we present the main conclusions of this work and identify future challenges in Section 6.

Section snippets

Background

There is a broad literature on event discovery in Location-based Social Networks, which can be further classified based on the features used by the algorithm. Following the well-structured background study presented by Yuan et al. [16] with regard to geographical topic modeling, we attempt to classify the event discovery literature in a very similar manner. As stated above, we group the existing proposals in this area of research based on the combination of features used: time, content,

Tweet-SCAN technique

Tweet-SCAN is an event discovery technique for geo-located tweets which is based on the GDBSCAN clustering algorithm presented by Sander et al. [7]. In the following section, we describe Tweet-SCAN by first introducing the main elements of DBSCAN [6] and then generalizing them into GDBSCAN to make clustering of tweets possible. Properly defined GDBSCAN predicates will enable the resulting Tweet-SCAN clusters to match real-world events. Therefore, we define proper Tweet-SCAN predicates

“La Mercè” data sets

In order to evaluate Tweet-SCAN, we have collected data through the Twitter Streaming API via Hermes [14]. In particular, we have established a long-standing connection to the Twitter public stream, which filters all tweets geo-located within the bounding box of Barcelona city. This long-standing connection was established during the local festivities of “La Mercè”,

Tweet-SCAN assessment

In this section, we assess Tweet-SCAN for the task of event discovery in “La Mercè”. In particular, we look for evidence of the benefits of considering the textual component. We also seek to give insights into the role of each parameter, as well as to provide a scheme for determining them.

To that end, we first introduce some clustering measures to evaluate Tweet-SCAN performance against the tagged data set. Then, we present a heuristic to determine its spatio-temporal parameters (ϵ1, ϵ2

Conclusions and future work

To the best of our knowledge, Tweet-SCAN provides a first step in using spatial, temporal, textual and user features to uncover real-world events in an unsupervised manner from Location-based Social Networks like Twitter. Formulating Tweet-SCAN within the framework of DBSCAN makes it possible to understand events as density-connected sets of tweets, as well as to use similar heuristics for determining some of the parameters.

The results of Tweet-SCAN point to the benefits of using text, when

Acknowledgments

This work is partially supported by Obra Social “la Caixa”, by the Spanish Ministry of Science and Innovation under contract TIN2015-65316, by BSC-CNS Severo Ochoa programs (SEV2015-0493 and SEV-2011-00067), by the SGR program (2014-SGR-1051) of the Catalan Government and by COR (TIN2012-38876-C02-01) project. We would also like to thank the reviewers for their constructive feedback.

References (28)

  • Y. Yang et al.

    A study of retrospective and on-line event detection

    Proceedings of the Twenty-first Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

    (1998)
  • C. Weidemann

    Social media location intelligence: The next privacy battle - an ArcGIS add-in and analysis of geospatial data collected from twitter.com

    Int. J. Geoinform.

    (2013)
  • J. Borge-Holthoefer et al.

    Structural and dynamical patterns on online social networks: the Spanish May 15th movement as a case study

    PloS One

    (2011)
  • E.-K. Kim et al.

    Use of Hangeul Twitter to track and predict human influenza infection

    PloS One

    (2013)
  • T. Sakaki et al.

    Earthquake shakes Twitter users: real-time event detection by social sensors

    Proceedings of the Nineteenth International Conference on World Wide Web

    (2010)
  • Y. Zheng

    Tutorial on location-based social networks

    Proceedings of the Twenty-first International Conference on World Wide Web (WWW)

    (2012)
  • M. Ester et al.

    A density-based algorithm for discovering clusters in large spatial databases with noise

    Proceedings of the 1996 International Conference on Knowledge Discovery and Data Mining (KDD)

    (1996)
  • J. Sander et al.

    Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications

    Data Min. Knowl. Discov.

    (1998)
  • D.M. Blei

    Probabilistic topic models

    Commun. ACM

    (2012)
  • Y.W. Teh et al.

    Hierarchical Dirichlet processes

    J. Am. Stat. Assoc.

    (2006)
  • D.M. Blei et al.

    Latent Dirichlet allocation

    J. Mach. Learn. Res.

    (2003)
  • D.M. Endres et al.

    A new metric for probability distributions

    IEEE Trans. Inf. Theory

    (2003)
  • N. Ljubešić et al.

    Comparing measures of semantic similarity

  • X. Li et al.

    Research on the categorization accuracy of different similarity measures on Chinese texts

    Proceedings of the 2011 International Conference on Business Management and Electronic Information (BMEI)

    (2011)

This paper has been recommended for acceptance by Eva Armengol.