Sub-Story Detection in Twitter with Hierarchical Dirichlet Processes

Social media has now become the de facto information source on real world events. The challenge, however, due to the high volume and velocity of social media streams, is how to follow all posts pertaining to a given event over time, a task referred to as story detection. Moreover, there are often several different stories pertaining to a given event, which we refer to as sub-stories; we refer to the corresponding task of their automatic detection as sub-story detection. This paper proposes hierarchical Dirichlet processes (HDP), a probabilistic topic model, as an effective method for automatic sub-story detection. HDP can learn sub-topics associated with sub-stories, which enables it to handle subtle variations in sub-stories. It is compared with state-of-the-art story detection approaches based on locality sensitive hashing and spectral clustering. We demonstrate the superior performance of HDP for sub-story detection on real world Twitter data sets using various evaluation measures. The ability of HDP to learn sub-topics helps it to recall the sub-stories with high precision. Another contribution of this paper is in demonstrating that the conversational structures within the Twitter stream can be used to improve sub-story detection performance significantly.


Introduction
Online social networks play a major role in generating and disseminating information. They provide a platform for people to voice their opinions and viewpoints. Moreover, social media provide mainstream media, governments, companies, and citizens the opportunity to obtain real time information about events happening around the world. The challenge, however, due to the high volume and velocity of social media streams, is how to follow all posts pertaining to a given event over time, a task referred to as story detection [Petrović et al., 2010].
Twitter is ideal for this task because it provides a large source of publicly available posts on major events, e.g. the 2014 Ferguson unrest and the 2011 London riots. In particular, we experiment with five different Twitter data sets. Three of these arise from rumour detection research [Zubiaga et al., 2016, Procter et al., 2013] and come annotated with events (the 2011 London riots, the 2014 Ferguson unrest, and the 2014 Ottawa shooting) and tweets grouped by sub-story within each of the events. The fourth public data set was created for first story detection (FSD) with Locality Sensitive Hashing (LSH), which is one of our baselines [Petrović et al., 2010]. The fifth public data set (FAcup [Aiello et al., 2013]) contains tweets about various events occurring during a football match. The properties of the data sets are discussed further in Section 5 and comparative evaluation results for all methods and data sets are detailed in Section 7.
We propose to use a mutual information based evaluation measure, adjusted mutual information (AMI) [Vinh et al., 2009], in addition to the standard precision-recall metrics. AMI avoids agreements by chance by considering the number of clusters produced. The experimental results establish the superior performance of HDP for the sub-story detection task. The approach can detect sub-story specific topics, which helps journalists and government agencies to monitor the evolution of new topics associated with a story. The runtime performance of HDP is comparable to that of established story detection approaches, so it can be used to perform real-time detection of sub-stories.
The main contributions of the paper are summarized as follows:
• Introduces the sub-story detection task and proposes a hierarchical Dirichlet process based approach to solve the problem of sub-story detection.
• Provides rigorous experimental comparison of the proposed sub-story detection approach with state-of-the-art story detection approaches, establishing the effectiveness of HDP for sub-story detection.
• Proposes a mutual information based metric for evaluating the performance of sub-story detection approaches.
• Demonstrates the usefulness of conversational structure in improving sub-story detection performance.

Related Work
A number of techniques have been employed for detecting and tracking stories in social media streams [Allan, 2002]. Story detection is typically done by extending traditional clustering algorithms to a streaming data setting [Aggarwal, 2014]. A comprehensive survey of the literature on story detection techniques in Twitter data is given in [Farzindar and Wael, 2015]. Clustering approaches were traditionally used to group data points with similar features [Jain et al., 1999]. Many of the classic clustering techniques such as K-means have been extended to a streaming data setting [Zhong, 2005, Aggarwal and Yu, 2010] and can be used to perform story detection [Aggarwal, 2014]. These algorithms aim to discover underlying groups within data by inferring a general representation to characterize the data in terms of a few key topics or stories.
Story detection in Twitter for a particular topic such as 'earthquake' is studied in [Sakaki et al., 2010]. Becker et al. [2011] use an online clustering algorithm to detect stories and distinguish real vs. non-real stories using a classification method. Twevent [Li et al., 2012] is a story detection approach which clusters bursty segments in Twitter data. A fast and efficient approach based on locality sensitive hashing (LSH) was first used in [Petrović et al., 2010] to detect the emergence of new stories (first story detection) in Twitter. Locality sensitive hashing reduced the computational complexity associated with nearest neighbor search and detected clusters of documents in constant space and time. Later, the authors extended this approach to counter lexical variations in documents by using paraphrases [Petrović et al., 2012]. An alternative approach to detect new events by storing the contents of already seen documents in a single hash table is proposed in [Wurzer et al., 2015b]. Further, LSH based techniques have also been developed to handle topic streams emerging in Twitter [Wurzer et al., 2015a]. Here, topics are also hashed into a bucket in addition to the tweets.
Topic models are also used to detect stories in Twitter, for instance latent Dirichlet allocation (LDA) [Diao et al., 2012] to detect trending topics. A nonparametric topic model based on Dirichlet process is used in [Wang et al., 2013] to detect newsworthy stories in Twitter, where topics are shared among tweets from consecutive time periods. Topicsketch [Xie et al., 2013] uses a novel sketch based topic model to detect bursty topics from millions of tweets.
A spectral clustering based approach is developed in [Preotiuc-Pietro et al., 2016] to address the task of story detection. The approach uses a mutual information based metric to represent the similarity matrix in spectral clustering. Real world events happen at different scales of time and space. Multi-scale event detection [Dong et al., 2015] aims to detect stories evolving at different pace and spanning different geographic locations by using the properties of wavelet transform. Supervised machine learning techniques such as support vector machines and logistic regression are used to detect events corresponding to specific topics such as those related to traffic [D'Andrea et al., 2015], lifestyle and wellness [Akbari et al., 2016] and uprisings [Boecking et al., 2015]. These event detection techniques will not be able to distinguish different sub-stories associated with a main story due to content overlap.
Whilst story detection has received considerable attention, less attention has been paid to the sub-story detection task. Aiello et al. [2013] discuss tasks similar to sub-story detection, such as finding important events happening over time in a main event such as a football match. They used standard story detection approaches to find events in their tasks. There exist approaches [Nichols et al., 2012, Zubiaga et al., 2012] which rely on tweet rates in an interval to detect major moments in a game. Chakrabarti and Punera [2011] use a modified hidden Markov model combining tweet rate and text features to summarize events in a game. Shen et al. [2013] use a time-content mixture model which effectively combines burstiness and cohesiveness to detect key moments in a story. Chierichetti et al. [2014] use non-textual features based on tweet rate and communication patterns among users to detect points in time where an important event happens within a story. An approach based on graph-of-words to represent a sequence of tweets was used in [Polykarpos et al., 2015] to detect important events happening within a football match. These approaches will not be effective for detecting sub-stories which overlap considerably in time and have low tweet rates.
There exists a hitherto unaddressed problem: finding sub-stories related to a particular real world event. These sub-stories may or may not correspond to real-world events (e.g. false rumours about the London riots [Procter et al., 2013] do not), they tend to overlap in time (i.e. tweets on more than one sub-story circulate simultaneously), and they share significant common vocabulary [Zubiaga et al., 2016]. As demonstrated in the rest of this paper, state-of-the-art approaches for story detection do not perform well on this type of task.

Research Objective
The following are our main research objectives:
1. Introduce the task of sub-story detection in Twitter. Sub-story detection differs from story detection, and we discuss the properties specific to sub-story detection which make it a harder task than story detection.
2. Propose hierarchical Dirichlet processes as an effective approach for sub-story detection. Unlike story detection approaches, HDP can learn sub-topics associated with sub-stories, which makes it particularly useful for modeling the sub-story detection task.
3. Verify experimentally the effectiveness of HDP for the sub-story detection task. We compare HDP with standard story detection approaches based on locality sensitive hashing and spectral clustering on real world Twitter data sets to establish the fitness of HDP for sub-story detection.
4. Show the usefulness of conversational structure in Twitter for improving the performance of the sub-story detection task. By considering conversational structure, reply tweets which do not share a topical similarity with their source tweet get clustered along with that source tweet.
5. Introduce the adjusted mutual information score as an effective metric to measure clustering performance in sub-story detection. Standard metrics based on precision typically favor clustering approaches which produce a large number of small clusters. Such clustering approaches are not useful in practice, and we propose to use AMI as an alternative metric which is less affected by this problem.

Problem Definition: Sub-Story Detection
This paper addresses the problem of detecting sub-stories as they emerge in social media streams. Automatic sub-story detection methods need to separate tweets related to different sub-stories into different clusters, even though they pertain to the same real-world event.
Sub-story detection differs from story detection in that sub-stories share some common vocabulary and the tweet rates for the sub-stories are comparatively low. Table 1 shows 8 major sub-stories related to the Ferguson unrest from one of our five data sets. All these sub-stories are related to the shooting of M. Brown by the Ferguson police and thus share words such as 'M. Brown', 'Ferguson', 'police' etc. Standard story detection approaches fail to produce good results in this setting, where vocabulary is shared across the sub-stories, because they rely on tweet similarity or overlapping words to cluster tweets. Sub-stories can have a considerably larger lifespan, overlap in time, and are set within a broader over-arching story that contains many thematically related sub-stories. The themes discussed in these sub-stories are referred to as sub-topics. For example, consider the temporal profile of sub-stories from the Ferguson data shown in Figure 1. We can observe that these sub-stories overlap in time and have relatively low tweet rates. This is mainly due to the fact that there are multiple conversations within a sub-story, each evolving at a different point in time. A sub-story which becomes active at some point in time can become dormant temporarily and then re-activate at a later time (for instance, consider the temporal behavior of sub-stories 1 and 3 in Figure 1).

Data Set Description
The core of our experiments is carried out on three sub-story annotated data sets (Ferguson unrest, Ottawa shooting and London riots). The first two are very recent and have been collected and human annotated as part of a rumour analysis project [Zubiaga et al., 2016], while the London riots one arose from an earlier qualitative social science analysis of related tweets [Procter et al., 2013]. All three data sets consist of tweets, grouped together into human annotated sub-stories related to the particular real-world event. Other tweets pertaining to the same event count as background data. We also consider two other publicly available data sets, FSD and FAcup, which have been created for story detection. Even though these data sets are not strictly suited to the sub-story detection task, we use them for comparative purposes, as they also exhibit vocabulary overlap across some sub-stories.
Next we describe these data sets in more detail.

Ferguson
This data set consists of tweets collected between August and September 2014, all related to the unrest that took place in Ferguson, USA [Zubiaga et al., 2016]. This data set is not only sub-story annotated, but also includes "reply-to" information, which connects subsets of tweets into conversational graphs. We refer to reply tweets as those that reply to a tweet present in the data set, and source tweets as all those which do not have such a "parent". In other words, the data set is regarded as a collection of conversational threads, each of which has a single source tweet at its root. As detailed in [Zubiaga et al., 2016], journalists manually categorised source tweets as belonging to one of 45 different sub-stories. A reply tweet is automatically assumed to belong to the sub-story to which its source tweet has been assigned (if any).
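The propagation of sub-story labels from source tweets to their replies can be sketched as follows. This is a minimal illustration, not the annotation tooling used for the data set: the function name `propagate_labels`, the dictionary layout, and the tweet identifiers are all hypothetical.

```python
def propagate_labels(parent_of, source_labels):
    """Give every tweet the sub-story label of the source tweet of its thread.

    parent_of: tweet id -> id of the tweet it replies to (None for source tweets).
    source_labels: source tweet id -> sub-story label (background sources are absent).
    """
    def root(tweet_id):
        # Follow reply-to links upwards until a source tweet is reached.
        while parent_of[tweet_id] is not None:
            tweet_id = parent_of[tweet_id]
        return tweet_id

    # Tweets in threads whose source is unlabeled get None (background).
    return {tweet_id: source_labels.get(root(tweet_id)) for tweet_id in parent_of}


# Hypothetical threads: s1 and s2 are source tweets, r1..r3 are replies.
parent_of = {"s1": None, "r1": "s1", "r2": "r1", "s2": None, "r3": "s2"}
source_labels = {"s1": "sub-story 1"}  # s2 is unlabeled, i.e. background
labels = propagate_labels(parent_of, source_labels)
```

The sketch assumes that every reply's parent is itself present in the data set, which matches the definition of reply tweets given above.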
After discounting sub-stories with fewer than 10 tweets, the final data set consists of 6,598 labeled tweets spread across 35 sub-stories and 18,650 tweets as background, i.e. not belonging to any of those sub-stories. Considering source tweets alone, there are 284 labeled source tweets and 899 background source tweets. Table 1 lists major sub-stories in the data [Zubiaga et al., 2016], illustrating how the sub-stories are very similar and have the Ferguson unrest as a common topic.

Ottawa
The Ottawa data set consists of tweets related to shootings at the parliament building in Ottawa during October 2014 [Zubiaga et al., 2016]. Similar to the Ferguson data, it also has a conversational structure including source and reply tweets. The data set consists of 6,414 tweets spread across 39 sub-stories and 5,975 tweets as background.
Considering source tweets alone, there are 462 labeled tweets and 439 background tweets. Some major sub-stories associated with the Ottawa shooting event are 'Soldier shot dead is Nathan Cirillo', 'Soldier shot at War Memorial has died', 'Suspected shooter is dead', etc. All these sub-stories have a common theme of shooting and death, which makes them ideal candidates for the sub-story detection task. With respect to temporal patterns, the evolution of sub-stories is very similar to the patterns observed in the Ferguson data.

London Riots
The London riots data set consists of 2.5 million tweets related to the riots that took place in London during August 2011 [Procter et al., 2013]. It includes 10,000 tweets that are labeled as belonging to 7 different sub-stories, all with a common background topic, the London riots. Table 2 provides a summary of the number of tweets in each sub-story in this data set. Unlike Ferguson and Ottawa, the conversational structure was not made available by the researchers.

First Story Detection
This is a publicly available story detection data set [Petrović et al., 2012] with approximately 2,400 tweets labeled as belonging to 27 stories, from the period June to September 2011. This is augmented with background tweets from the same period, to create a corpus of approximately 80,000 tweets. Originally this data set was created by Petrović et al. [2012] for evaluation of their first story detection (FSD) system. This FSD data set can be seen to represent the standard story detection task, in contrast to the sub-story task represented by the former three data sets. It should be noted, however, that there is some overlap of stories here as well, e.g. four of the stories are related to the London riots in 2011 and another four are about the deaths of celebrities. These commonalities make this data also applicable to sub-story detection, as well as enabling us to benchmark our methods on the related story detection problem.

FAcup
This is a publicly available data set [Aiello et al., 2013] with approximately 7,000 tweets belonging to 13 different sub-stories associated with a football match story. These tweets represent sub-stories such as goals, fouls etc. in the 2012 Football Association (FA) final match between Chelsea and Liverpool. This data set is augmented with approximately 20,000 tweets related to the same football match as background. Due to the shared common story (football game), sub-story tweets in this data set share a common vocabulary, which makes the data set useful for evaluating the proposed sub-story detection approach. However, it differs from sub-story data sets such as Ferguson and Ottawa in that its sub-stories are temporally separated.

Methods
The main sub-story detection method investigated in this paper uses hierarchical topic modeling. In particular, we experiment with hierarchical Dirichlet processes (HDP), a non-parametric Bayesian model, which can effectively model the sub-story detection task. HDP is also compared to two state-of-the-art story detection approaches: spectral clustering and locality sensitive hashing.

Hierarchical Dirichlet Process
Latent Dirichlet allocation (LDA) [Blei et al., 2003] and hierarchical Dirichlet processes (HDP) [Teh et al., 2006] have shown promising results in topic modeling due to their probabilistic interpretations. They model a document (i.e. tweet in our case) as a mixture of topics, where each topic has a distinct distribution over the words. These generative models can infer the latent topics associated with the tweets.
In this paper we propose to use HDP for sub-story detection, since it can model the hierarchical structure underlying the topic distribution. As argued above, in sub-story detection we need to find sub-topics associated with the main story (e.g. the Ferguson unrest), and HDP is developed specifically to handle such kinds of tasks. HDP achieves this by extending the Dirichlet process mixture model (DPMM) [Murphy, 2012] to a hierarchical setting.
In more detail, the DPMM considers a tweet as consisting of words generated by a mixture of topics. The mixture distribution is modeled using a non-parametric prior based on a Dirichlet process (DP) [Hjort et al., 2010]. A DP is parameterized by a base distribution $H$ and a concentration parameter $\alpha$ and is denoted as $DP(\alpha, H)$. The base measure specifies the a-priori distribution over some parameter space $\theta$ which is used to generate observed data.
In our case, $\theta$ represents the parameters of a multinomial distribution over the words $w$ in a tweet. A draw from $DP(\alpha, H)$ is a discrete probability measure $G$ providing a distribution over $\theta$. It can be represented as

$$G = \sum_{i=1}^{\infty} \pi_i \, \delta_{\theta_i},$$

where $\delta_{\theta_i}$ is the Kronecker delta function which gives a value of 1 when the parameter takes value $\theta_i$, $\theta_i$ is a draw from $H$ and $\pi_i$ is the probability mass associated with $\theta_i$. The sequence of values $\pi_i$ is obtained from $\alpha$ using a stick breaking process [Sethuraman, 1994]:

$$\beta_i \sim \mathrm{Beta}(1, \alpha), \qquad \pi_i = \beta_i \prod_{j=1}^{i-1} (1 - \beta_j).$$

The process ensures that $\pi$ represents a probability distribution, i.e. $\sum_{i=1}^{\infty} \pi_i = 1$, and is often represented as $\pi \sim GEM(\alpha)$. The concentration parameter $\alpha$ determines the probability mass $\pi_i$ associated with a topic through its role as a parameter of the Beta distribution. A draw from $G$ results in $\theta_i$ with probability $\pi_i$, with $\theta_i$ representing the parameters of a multinomial distribution associated with a topic $i$. Thus each topic $i$ occurs in a tweet with probability $\pi_i$. Modeling tweets independently as a DPMM does not allow topics to be shared across tweets, which is needed in our task.
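The stick breaking construction of the weights $\pi \sim GEM(\alpha)$ can be illustrated with a short sketch. It is a truncated approximation for illustration only (the true process has infinitely many sticks); the function name `stick_breaking`, the truncation level, and the value of the concentration parameter are assumptions made for this example.

```python
import random


def stick_breaking(alpha, num_sticks, rng):
    """Return the first `num_sticks` weights of a draw pi ~ GEM(alpha).

    Also returns the stick length left over after truncation.
    """
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for _ in range(num_sticks):
        beta = rng.betavariate(1.0, alpha)  # fraction taken from the remaining stick
        weights.append(beta * remaining)
        remaining *= 1.0 - beta
    return weights, remaining


rng = random.Random(0)
weights, leftover = stick_breaking(alpha=1.0, num_sticks=20, rng=rng)
```

A small $\alpha$ tends to put most of the mass on the first few topics, while a large $\alpha$ spreads it over many topics; the weights plus the leftover stick always sum to one.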
Hierarchical Dirichlet processes are developed to handle grouped data and share topics across the groups [Teh et al., 2006]. We use them to model the tweet as consisting of a set of topics and to share topics across multiple tweets. HDP achieves this by drawing a tweet-specific probability distribution $G_d$ for a tweet $d$ from $DP(\gamma, G_0)$, where $\gamma$ represents the concentration parameter and $G_0$ is the base distribution shared by all the tweets. The common base distribution $G_0$ is itself a draw from $DP(\alpha, H)$. The common base distribution has the form $G_0(\theta) = \sum_{i=1}^{\infty} \pi_{0i} \, \delta_{\theta_i}(\theta)$ and the tweet-specific distribution has the form $G_d(\theta) = \sum_{i=1}^{\infty} \pi_{di} \, \delta_{\theta_i}(\theta)$. Here, both the common base distribution and the tweet-specific distributions share the parameters $\theta_i$ (or the topics), with a tweet-specific mixture distribution $\pi_d$ over the topics. Thus, the tweets modeled using HDP share the topics but with different probabilities. The mixing proportions $\pi_d$ are generated as follows [Teh et al., 2006]:

$$\pi_0 \sim GEM(\alpha), \qquad \pi_d \sim DP(\gamma, \pi_0). \tag{2}$$

Figure 2 shows the graphical model representation of the HDP model. A word $w_{dn}$ in a tweet $d$ comes from a topic with parameter $\theta_{dn}$ drawn from the distribution $G_d$ associated with the tweet. The topics are shared across the tweets due to the hierarchical modeling of the DPMM. Since in sub-story detection tweets relate to the same real world event, HDP can model this effectively, coupled with the fact that individual tweets address different sub-topics (corresponding to the sub-stories). These sub-topics are characterized by words, and each word is associated with a probability indicating the importance of the word in representing the sub-topic. The identified sub-topics are used to cluster tweets based on the words common to the tweet and the sub-topics. For each tweet, we detect the common words and calculate a similarity score to a sub-topic by summing the probabilities associated with these words in representing the sub-topic. The tweet is assigned to the sub-topic with the maximum similarity score.
We use sub-topics for clustering the tweets as they can better discriminate the tweets associated with sub-stories.
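The assignment of a tweet to a sub-topic described above can be sketched as follows. The sub-topic word probabilities here are made up for illustration and are not output of the HDP model; the function name `assign_to_subtopic` is hypothetical, and counting each common word once per tweet is an assumption.

```python
def assign_to_subtopic(tweet_tokens, subtopics):
    """Return (index of the best sub-topic, list of similarity scores).

    subtopics: list of dicts mapping word -> probability under that sub-topic.
    The score of a sub-topic is the summed probability of the words it shares
    with the tweet; each shared word is counted once (an assumption).
    """
    unique_tokens = set(tweet_tokens)
    scores = [
        sum(prob for word, prob in topic.items() if word in unique_tokens)
        for topic in subtopics
    ]
    # On ties, the sub-topic with the lowest index is returned.
    best = max(range(len(subtopics)), key=lambda i: scores[i])
    return best, scores


# Hypothetical sub-topics sharing the word 'police', as sub-stories often do.
subtopics = [
    {"police": 0.30, "officer": 0.25, "name": 0.20},
    {"police": 0.30, "protest": 0.35, "curfew": 0.15},
]
best, scores = assign_to_subtopic(["police", "curfew", "tonight"], subtopics)
```

In this toy case the shared word 'police' contributes equally to both sub-topics, so the tweet is separated by the sub-topic specific word 'curfew'.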

Spectral Clustering
Clustering techniques have been used widely to detect stories in social data streams [Aggarwal, 2014]. Here, we discuss one based on spectral clustering using pointwise mutual information [Preotiuc-Pietro et al., 2013]. The spectral clustering (SC) algorithm [Shi and Malik, 2000] has been shown to achieve the state-of-the-art performance for tasks such as community detection in graphs [Smyth and White, 2005]. This method treats clustering as a graph partitioning problem. It projects the objects into a lower dimensional space by performing singular value decomposition of a similarity graph constructed over the objects. It then discovers clusters of objects which are maximally separated in this space using standard clustering techniques, such as k-means. A good spectral clustering relies on a good similarity graph which best reflects the connections between objects.
We apply spectral clustering to detect sub-stories in a stream of tweets [Preotiuc-Pietro et al., 2013]. The approach represents a similarity graph by constructing a matrix which captures the similarity over words that appear in the data. It uses normalized point-wise mutual information (NPMI) [Bouma, 2009] to capture the word similarity. NPMI measures the probability of co-occurrence of words in the same tweet. The idea is that if two words appear consistently in the same tweet, then they are indicative of the same story. For example, the co-occurrence of words such as 'Ferguson' and 'police' over a period of time indicates that there is a story related to the Ferguson police.
The NPMI measure between a pair of words $x$ and $y$ is calculated as

$$NPMI(x, y) = \frac{\log \dfrac{p(x, y)}{p(x)\, p(y)}}{-\log p(x, y)},$$

where $p(x)$ denotes the probability of occurrence of a word $x$ in a tweet, and $p(x, y)$ provides the probability of co-occurrence of words $x$ and $y$ in a tweet. We consider two words as co-occurring if they appear in the same tweet, which gives us a straightforward measure of co-occurrence frequencies. The NPMI measure takes values between −1 and 1, with positive values indicating a higher chance of co-occurrence and negative values indicating a lower chance of co-occurrence. The spectral clustering algorithm proceeds by filtering out less frequent words and constructing a similarity graph over words using the NPMI measure. It ignores all NPMI values less than a threshold and keeps the largest connected component from the resulting graph. Singular value decomposition is performed over a graph Laplacian constructed from this similarity graph to obtain a representation of words in a lower dimensional space. A k-means algorithm then finds clusters of words in this reduced space.
Each word cluster discovered by the spectral clustering algorithm represents a coherent topic. The words are associated with a score, which provides a measure of the importance of the word in representing the topic. For each tweet, a similarity score is computed with respect to each topic, using the co-occurrence scores of the words in the tweet. The tweets are then clustered by assigning them to the topic with the highest similarity score. Thus, tweets in the same cluster form a topically coherent cluster.
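As a minimal sketch of the NPMI computation, the following estimates $p(x)$ and $p(x, y)$ as the fraction of tweets containing the word or the word pair. The function name `npmi_scores` and the toy tweets are assumptions for illustration; the frequency filtering, thresholding, and graph construction steps are not shown.

```python
import math
from collections import Counter
from itertools import combinations


def npmi_scores(tweets):
    """Compute NPMI for every word pair that co-occurs in at least one tweet.

    tweets: list of token lists. Keys of the result are alphabetically
    ordered word pairs. Pairs that never co-occur (NPMI -> -1) are omitted.
    """
    n = len(tweets)
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in tweets:
        vocab = sorted(set(tokens))  # count each word once per tweet
        word_counts.update(vocab)
        pair_counts.update(combinations(vocab, 2))

    scores = {}
    for (x, y), count in pair_counts.items():
        if count == n:
            # Both words occur in every tweet: -log p(x, y) is 0, use the limit 1.
            scores[(x, y)] = 1.0
            continue
        p_xy = count / n
        p_x = word_counts[x] / n
        p_y = word_counts[y] / n
        scores[(x, y)] = math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)
    return scores


# Toy tweets, already tokenized (made up for illustration).
tweets = [
    ["ferguson", "police"],
    ["ferguson", "police"],
    ["ferguson", "protest"],
    ["curfew", "protest"],
]
scores = npmi_scores(tweets)
```

In this toy corpus, 'ferguson' and 'police' receive a positive score because they co-occur more often than chance would predict, while 'ferguson' and 'protest' receive a negative one.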

Locality Sensitive Hashing
The second state-of-the-art approach is locality sensitive hashing (LSH) [Rajaraman and Ullman, 2011], which was proposed originally for first story detection in Twitter [Petrović et al., 2010]. LSH finds nearest neighbour tweets in constant time and keeps only a constant number of tweets in memory.
The LSH approach uses the nearest neighbour algorithm to find the tweet closest to the incoming tweet. The computational overhead of finding the nearest neighbor is overcome using locality sensitive hashing. LSH maps incoming tweets to buckets using a hashing function which maps similar tweets to the same bucket. The method then finds the nearest neighbour to the incoming tweet by searching the bucket to which it has been mapped. This greatly reduces the search space. The approach clusters tweets based on the cosine similarity of the tweets which are hashed into the same bucket. It assigns an incoming tweet to the cluster of its nearest neighbour if the cosine similarity is greater than a particular threshold. Otherwise, it assigns the tweet to a new cluster.
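The thresholded nearest neighbour assignment can be sketched as below. For brevity, a brute-force search over all previously seen tweets stands in for the LSH bucket lookup, so this illustrates only the clustering rule and not the constant-time search; the function names, the sparse dictionary vectors, and the threshold value are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as term -> weight dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_stream(vectors, threshold):
    """Assign each incoming vector to its nearest neighbour's cluster or a new one."""
    labels = []
    next_label = 0
    for i, vec in enumerate(vectors):
        best_sim, best_j = -1.0, None
        for j in range(i):  # brute force stands in for the LSH bucket lookup
            sim = cosine(vec, vectors[j])
            if sim > best_sim:
                best_sim, best_j = sim, j
        if best_j is not None and best_sim > threshold:
            labels.append(labels[best_j])  # join the nearest neighbour's cluster
        else:
            labels.append(next_label)  # start a new cluster
            next_label += 1
    return labels


# Toy term-weight vectors (made up; real input would be tf-idf vectors).
vectors = [
    {"soldier": 1.0, "shot": 1.0},
    {"soldier": 1.0, "shot": 1.0, "memorial": 1.0},
    {"parliament": 1.0, "lockdown": 1.0},
]
labels = cluster_stream(vectors, threshold=0.5)
```

The first two tweets end up in the same cluster, while the third shares no terms with the earlier ones and opens a new cluster.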
In more detail, LSH uses a series of random hyperplanes sampled from a Gaussian distribution. These hyperplanes divide the space into subspaces, and similar tweets fall into the same subspace. We consider $k$ such hyperplanes. The number of hyperplanes $k$ can be considered as the number of bits per key in this hashing scheme. Let $u_i$, $i = 1 \ldots k$, represent the hyperplanes and $x$ be the tf-idf representation of the tweet. The hash value is a binary vector with $k$ bits. We set bit $i$ to 1 if $x \cdot u_i > 0$ and to 0 otherwise. The tweets falling in the same subspace have the same hash value in the hash table and are stored in a bucket of size $b$. The higher $k$ is, the fewer collisions there will be in the buckets, but more time will be needed to compute the hash values. However, increasing $k$ also decreases the probability of collision with the nearest neighbor, and hence multiple hash tables ($h$) are required to increase the chances of finding the nearest neighbor. Thus, a tweet is compared with the tweets belonging to the same bucket in multiple hash tables in order to find its nearest neighbor using cosine similarity. Nearest neighbor tweets with cosine similarity above a user-specified threshold form a cluster. This cluster represents tweets with some topical similarity, which helps one to detect stories evolving in Twitter.
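The random hyperplane hashing step can be illustrated with a short sketch. It uses dense toy vectors in place of tf-idf vectors, and the values of `k`, `h`, and the vector dimension are arbitrary choices for the example; bucket size limits and the nearest neighbour search within buckets are not shown.

```python
import random


def make_hyperplanes(k, dim, rng):
    """Sample k random hyperplanes (normal vectors) with Gaussian components."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]


def hash_key(x, hyperplanes):
    """Return the k-bit key of vector x: bit i is 1 if x . u_i > 0, else 0."""
    bits = []
    for u in hyperplanes:
        dot = sum(xi * ui for xi, ui in zip(x, u))
        bits.append("1" if dot > 0 else "0")
    return "".join(bits)


rng = random.Random(42)
k, dim, h = 8, 5, 3  # bits per key, vector dimension, number of hash tables
tables = [make_hyperplanes(k, dim, rng) for _ in range(h)]

x = [0.2, 0.0, 0.7, 0.1, 0.0]  # toy stand-in for a tf-idf vector
keys = [hash_key(x, planes) for planes in tables]

# A vector pointing in the same direction falls on the same side of every hyperplane.
scaled_keys = [hash_key([2.0 * xi for xi in x], planes) for planes in tables]
```

Each of the `h` tables uses its own set of hyperplanes, so a tweet receives one key per table, and two tweets are candidates for comparison if their keys match in at least one table.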

Experimental Results
This section reports on the comparative evaluation of HDP and all state-of-the-art baselines on the sub-story detection data sets. We follow a cluster-based approach, as it accounts for the varying popularity of sub-stories and the related user endorsement aspect. This also provides a fair comparison with LSH. Alongside this, we also consider an event extraction setting where major sub-stories detected by HDP and SC are described in terms of detected topics. Results are reported using the standard metrics of precision, recall, F-score and adjusted mutual information [Vinh et al., 2009]. The latter is included as it has certain advantages over the others with respect to cluster evaluation.
In particular, the experiments compare hierarchical Dirichlet processes (HDP), spectral clustering (SC) and locality sensitive hashing (LSH) on the data sets introduced in Section 5 (Ferguson, Ottawa, London riots, FSD, and FAcup).
The text of each tweet is pre-processed to remove unusual characters, tokens, and stop words, followed by stemming. In particular, the filtered tokens are: user mentions (tokens starting with @), hashtags (words starting with #) and URLs. The rationale behind hashtag removal is that hashtags often tend to refer to the shared real world event (e.g. #Ferguson, #Londonriots) and are thus shared across sub-stories.
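A minimal sketch of this kind of pre-processing is given below. The regular expressions, the tiny stop-word list, and the example tweet are assumptions made for illustration, and the stemming step is omitted because it would require an external stemmer.

```python
import re

# Small illustrative stop-word list; an assumption, not the list used in the paper.
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "to", "and", "at", "on"}

# Matches URLs, user mentions (@...) and hashtags (#...).
TOKEN_FILTER = re.compile(r"https?://\S+|[@#]\w+")


def preprocess(text):
    """Lower-case a tweet, drop URLs, mentions, hashtags and stop words."""
    text = TOKEN_FILTER.sub(" ", text.lower())
    tokens = re.findall(r"[a-z0-9']+", text)  # also drops unusual characters
    return [token for token in tokens if token not in STOP_WORDS]


# Made-up example tweet.
tokens = preprocess(
    "Police in #Ferguson confirm the name of the officer @user http://t.co/abc"
)
```

Note that the hashtag '#Ferguson' is removed entirely, in line with the rationale that event-level hashtags are shared across sub-stories and do not help to separate them.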

Method Comparison using Precision, Recall and F-score
Our first comparative evaluation experiment uses the standard information retrieval metrics of precision and recall [Manning et al., 2008]. Detected tweet clusters are evaluated against the gold standard tweets in the respective sub-stories. Since the approaches are unsupervised, the number of automatically discovered clusters does not always align to the sub-stories in the gold standard. Therefore, for each sub-story, we find the automatically produced cluster with the maximum overlap, in terms of number of tweets from that sub-story. It should be noted that multiple sub-stories may get aligned to a single cluster. In this case, precision measures how many of the retrieved tweets belong to the actual sub-story, while recall measures whether the system could retrieve all known tweets associated with the aligned sub-story. Performance is reported using micro-averaged precision and recall, due to the varying sizes of each sub-story (in terms of number of contained tweets).
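The alignment of sub-stories to clusters can be sketched as follows. The function name `align_substories` and the toy labels are hypothetical, and the sketch assumes that every tweet in the aligned cluster, including background tweets, counts towards the false positives; the paper does not spell out this detail in the passage above.

```python
from collections import Counter


def align_substories(gold, predicted):
    """Align each gold sub-story to the cluster it overlaps most and count errors.

    gold: tweet id -> sub-story label, or None for background tweets.
    predicted: tweet id -> cluster id.
    Returns sub-story -> dict with the aligned cluster and TP, FP, FN counts.
    """
    cluster_sizes = Counter(predicted.values())
    overlap = {}
    for tweet_id, story in gold.items():
        if story is None:
            continue
        overlap.setdefault(story, Counter())[predicted[tweet_id]] += 1

    counts = {}
    for story, per_cluster in overlap.items():
        cluster, tp = per_cluster.most_common(1)[0]  # cluster with maximum overlap
        story_size = sum(per_cluster.values())
        counts[story] = {
            "cluster": cluster,
            "tp": tp,
            "fp": cluster_sizes[cluster] - tp,  # other tweets in the aligned cluster
            "fn": story_size - tp,  # sub-story tweets placed elsewhere
        }
    return counts


# Toy example: two sub-stories (A, B) and one background tweet (id 6).
gold = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: None}
predicted = {1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 0}
counts = align_substories(gold, predicted)
```

Several sub-stories may be aligned to the same cluster under this rule, which is consistent with the remark above that the number of clusters need not match the number of gold sub-stories.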
More formally, let $N$ be the total number of known sub-stories in a given data set, and let $TP_i$, $FP_i$, and $FN_i$ be the true positives, false positives, and false negatives associated with a sub-story $i$. Then, micro-averaged precision and recall are calculated as:

$$P_{micro} = \frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} (TP_i + FP_i)}, \qquad R_{micro} = \frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} (TP_i + FN_i)}$$

We also report the micro-averaged F-score, which is the harmonic mean of micro-averaged precision and recall. Approaches with a high F-score are preferred.
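As a small worked example of micro-averaging, the sketch below pools per-sub-story counts before computing the ratios. The function name `micro_scores` and the counts are made up for illustration.

```python
def micro_scores(counts):
    """Micro-averaged precision, recall and F-score.

    counts: list of (TP_i, FP_i, FN_i) tuples, one per sub-story.
    """
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_score


# Two hypothetical sub-stories with (TP, FP, FN) counts.
precision, recall, f_score = micro_scores([(2, 1, 1), (2, 1, 0)])
```

Here the pooled counts are TP = 4, FP = 2 and FN = 1, giving a precision of 4/6, a recall of 4/5 and an F-score of 8/11. Because counts are pooled, large sub-stories influence the result more than small ones.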
The free HDP parameters (e.g. the concentration parameters) are learnt from the data using Gibbs sampling. We put an upper bound on the number of topics produced by HDP (k), which allows for a fair comparison with SC. The effective number of topics could be lower and is determined by the concentration parameter learnt from the data. The spectral clustering approach depends primarily on the parameter k, which determines the number of clusters in the data set. The approach is run by filtering out words with an NPMI threshold of 0.1 and with a word frequency threshold of 10. We perform experiments with different values of k for HDP and SC. Lastly, the LSH approach depends on the parameters k (number of bits), h (number of hash tables), and b (bucket size). The experiments are conducted with different values of these parameters. We present only the results obtained with the best two parameter settings (in terms of F-score) for each of the approaches.
As can be seen in Table 3, HDP is the best performing method on the Ferguson and Ottawa sub-story data sets. In particular, HDP's F-score is significantly better than the SC and LSH F-scores. With respect to precision, LSH performs best, while SC has the best recall. The low recall of LSH, however, is due to the fact that it generates a large number of very small tweet clusters, which is also the reason for its high precision. On the other hand, SC is not able to differentiate sufficiently between similar sub-stories and groups them together in a small number of very large clusters. While this increases recall, it leads to the observed low precision. In contrast, HDP can detect subtle differences in sub-stories, thanks to the sub-topics, which are then used to cluster the tweets. This leads to improved precision for HDP over SC and an ultimately higher F-score.
Table 4 reports the experimental results on the much larger London riots data set. The methods here are executed by partitioning the data set into 50 subpartitions with approximately 50,000 tweets in each. The table shows the number of clusters per partition for the HDP and SC approaches. As above, HDP has the best recall and F-score, while LSH has the highest precision.
Nevertheless, it should be noted that precision, recall and F-score are still very low on the Ferguson and Ottawa data sets, which is due to the presence of conversational threads within the sub-story clusters. As discussed in Section 1, some of the tweets within conversational threads tend to discuss completely unrelated topics. For instance, even if the source tweet mentions the sub-story explicitly, reply tweets within the thread often do not have significant word overlap with the source. Consequently, these reply tweets are not deemed topically similar to the source tweet and are assigned to a completely different cluster, which negatively impacts performance.

Conversational Structure Experiments
The effect of reply tweets in lowering the performance of the system is verified by conducting clustering experiments on the Ferguson and Ottawa data sets, using source tweets alone. As can be seen in Table 5, algorithm performance improves significantly, compared to the results reported in Table 3. In some cases, the improvement in performance is by an order of magnitude. Again we observe that HDP outperforms both LSH and SC, with similar precision and recall patterns as those observed in the full data sets.

The next experiment considers the sub-story assignment of entire conversational threads. The first step is to cluster only the source tweets, while reply tweets are assigned automatically to the cluster of their corresponding source tweet. This is a realistic setting on these data sets, and on unseen Twitter data in general, since the source-reply structure is readily available. Table 6 shows that considering conversational threads achieves an order of magnitude improvement in recall and F-score, as compared to those in Table 3. Again, we observe that LSH has better precision, while HDP has better recall, and ultimately HDP has the best F-score.
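The thread-level assignment step can be sketched as below. This is only an illustrative sketch: the function name `assign_threads` and the two input mappings are hypothetical, and any clustering method (HDP, SC or LSH) is assumed to have already produced the source tweet labels.

```python
def assign_threads(source_cluster, reply_to_source):
    """Propagate the cluster of each source tweet to its replies.

    source_cluster: dict mapping source tweet id -> cluster label,
                    obtained by clustering source tweets only
    reply_to_source: dict mapping reply tweet id -> id of the source
                     tweet that started its conversational thread
    """
    assignment = dict(source_cluster)
    for reply_id, source_id in reply_to_source.items():
        # Replies inherit the label of their source tweet, regardless of
        # how little vocabulary they share with it.
        if source_id in source_cluster:
            assignment[reply_id] = source_cluster[source_id]
    return assignment
```

Replies whose source tweet was not clustered are left unassigned in this sketch; how such cases are handled is a design choice not specified here.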
This experiment also considered an additional baseline, which clusters tweets using only the conversational structure. By design, this approach has a precision of 1. The aim here is to investigate whether the sub-story detection approaches can achieve better recall than this readily available baseline. This is indeed the case, as shown in Table 6. In particular, the simple baseline has a recall of 0.2545 and 0.1696 for Ferguson and Ottawa respectively, which is lower than the recall of the three other methods.
Table 6: Results of HDP, SC, and LSH on Ferguson and Ottawa data sets (considering conversational structure) for different parameter settings. Best results are indicated in bold letters.
In order to investigate variation in method performance across individual sub-stories, 3 major sub-stories are selected at random in the Ferguson and Ottawa data sets. As can be seen in Table 7, performance patterns for LSH, SC, and HDP remain unchanged, i.e. LSH has the best precision, while HDP has the best recall and F-score. The latter is able to find most of the tweets associated with sub-stories with good precision.

Performance on FSD data
The next experiment compares the methods on the publicly available FSD story detection data set (see Table 8). As before, LSH has very high precision but low recall. HDP and SC outperform LSH in recall, while HDP precision is better than that obtained for SC. Thus, again HDP has the highest F-score.
Similar to the sub-story data sets, LSH produces very small clusters, which split the tweets belonging to a particular story across multiple threads, resulting in low recall. In the FSD data, we observed that LSH produced around 1500 clusters in total, after ignoring clusters with fewer than 3 tweets. Spectral clustering, on the other hand, tends to cluster together tweets from related stories, resulting in few large story clusters. For example, some of the tweets from the two stories ('Death of Amy Winehouse' and 'Betty Ford dies') are put into the same cluster. In the case of 'Death of Amy Winehouse', the corresponding SC cluster has 821 tweets, with 0.59 precision and 0.67 recall. This is mainly due to SC clustering words rather than messages, and thus merging sub-stories sharing common vocabulary.
HDP provides a more balanced result with comparatively higher precision and recall. It is a more fine-grained approach which can distinguish subtle differences in various stories, due to the hierarchical modeling of the topics with some shared vocabulary. In the case of 'Death of Amy Winehouse', the corresponding HDP cluster has 660 tweets with 0.81 precision and 0.73 recall.
In conclusion, this experiment has demonstrated that HDP performs very well also on story detection data sets and tasks.

Performance on FAcup Data
The publicly available FAcup data set is used to study the behavior of the story detection approaches. The data exhibits properties similar to Ferguson and Ottawa since all the tweets belong to a common main event, i.e. a football match, which makes it a challenging task for the three methods. Table 9 compares the performance of HDP, SC, and LSH on the FAcup data. The results are similar to those reported in Table 3 on the sub-story detection data sets. Again, HDP outperforms LSH and SC, thanks to superior recall and F-score, while LSH maintains the best precision.

Discussion
The experiments presented above demonstrated that LSH generally produces a large number of clusters with high precision but low recall. For instance, on the London riots data set, it produced around 45,000 clusters. In contrast, HDP and SC achieve similar performance with only 2500 and 5000 clusters, respectively.
In general, LSH tends to create numerous very small clusters (mostly containing re-tweets), which explains its very high precision. On the other hand, SC tends to cluster together similar categories, which lowers precision. HDP distinguishes subtle topical differences, resulting in more balanced precision and recall. Another noteworthy observation is that, in general, increasing the number of clusters in HDP and SC leads to improved precision but at the cost of recall. Thus, depending on application needs, HDP and SC make it possible to trade off some recall for better precision.
With respect to the metrics used above, precision, recall and F-score do not penalize methods, such as LSH, which produce a large number of small clusters, and thus the corresponding F-score is often high due to their high precision. Such methods, however, are not useful in practice, as a user has to navigate a large number of clusters in search of important sub-stories. Therefore, in our final experiment we use adjusted mutual information (AMI) [Vinh et al., 2009], which takes cluster size and cluster numbers into account. The improvement in clustering quality due to HDP is also clearly visible with adjusted mutual information, which corrects for agreement by chance due to a larger number of clusters.

Adjusted Mutual Information Experiments
The information theoretic adjusted mutual information measure (AMI) [Vinh et al., 2009] is used to evaluate cluster quality. In prior work, information theoretic measures, such as mutual information, have been shown to be well suited to comparing the performance of clustering approaches [Banerjee et al., 2005, Meilǎ, 2005]. These measures are theoretically grounded and provide a better evaluation of cluster quality.
Mutual information (MI) between two clusterings $U = \{U_1, \ldots, U_R\}$ (true clustering of tweets) and $V = \{V_1, \ldots, V_C\}$ (generated clustering of tweets) quantifies the information shared between them and provides the reduction in uncertainty on U upon observing V. The MI score between U and V is computed as
$$MI(U, V) = \sum_{i=1}^{R} \sum_{j=1}^{C} p(i, j) \log \frac{p(i, j)}{p(i)\, p(j)}$$
Here, p(i) is the probability that tweets belong to cluster $U_i$, p(j) is the probability that tweets belong to cluster $V_j$, and p(i, j) is the probability that tweets belong to both clusters $U_i$ and $V_j$. When the clusterings are identical, the MI score takes a higher value, upper bounded by $\max\{H(U), H(V)\}$, where $H(U) = -\sum_{i=1}^{R} p(i) \log p(i)$ is the entropy of clustering U. If the two clusterings are unrelated, the MI score is close to zero. One can also use a normalized MI (NMI) score, which normalizes the MI score to be between zero and one.
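As a concrete illustration of the MI score, the following minimal sketch estimates the probabilities from label counts over the same set of tweets. The function names `entropy` and `mutual_information` and the list-of-labels input are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy H of a clustering given as a list of cluster labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """MI between clusterings u and v (lists of labels, one per tweet)."""
    n = len(u)
    count_u, count_v = Counter(u), Counter(v)
    count_uv = Counter(zip(u, v))  # joint counts n_ij
    # p(i, j) = n_ij / n, p(i) = a_i / n, p(j) = b_j / n
    return sum(
        (n_ij / n) * math.log(n_ij * n / (count_u[i] * count_v[j]))
        for (i, j), n_ij in count_uv.items()
    )
```

For two identical clusterings (up to a renaming of labels) the MI equals the entropy of either clustering, while placing every tweet in its own cluster also yields an MI equal to H(U), which is the weakness discussed next.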
The shortcoming of the MI and NMI scores, however, is that they do not correct for clusters that occur by chance. They do not have a constant baseline value, i.e. the average value obtained for a random clustering of the data [Vinh et al., 2009]. Consequently, these scores tend to be higher for results with larger number of clusters, or when the ratio of the total number of data points to number of clusters is small. In particular, they would produce a high score for an approach, which categorizes each tweet into a separate cluster.
Therefore, in our experiments we consider adjusted mutual information (AMI) [Vinh et al., 2009], which is corrected for chance by subtracting the expected mutual information score from both the numerator and denominator of the normalized mutual information score. The AMI score is calculated as follows
$$AMI(U, V) = \frac{MI(U, V) - E\{MI(U, V)\}}{\max\{H(U), H(V)\} - E\{MI(U, V)\}}$$
Table 10 and Table 11 provide the AMI scores obtained by HDP, SC and LSH on the different data sets. As can be seen, HDP has the best performance, as measured by its AMI score. In this case, we also note that SC demonstrated improved performance and tends to be better than LSH on most data sets. As expected, the AMI score penalizes the LSH algorithm, which produces a very large number of clusters, since the expected mutual information score grows as the number of clusters increases.

Topic detection
HDP and SC are topic based and can describe a cluster through key terms, whereas this is not the case for LSH. Therefore, in this section we examine topics within sub-stories, as identified automatically by these two methods.
In particular, Table 12 shows 5 major topics learnt by HDP and SC from the Ferguson data. We found that HDP learns topics corresponding to major stories in the Ferguson data set. For instance, Topic 1, Topic 2 and Topic 3 correspond to Story 1, Story 7 and Story 3 in the Ferguson data set, described in Table 1. The first two topics detected by SC correspond to Story 5 and Story 3 of the Ferguson data.

Runtime Efficiency
We study the runtime of different approaches on the data sets and check their practical usability.
Social networks such as Twitter provide real time information on various events happening around the world. Sub-story detection in Twitter provides news agencies and government organizations the ability to track the evolution of various stories associated with a main story. For instance, it helps journalists to detect various stories associated with U.S. presidential elections and governments to track stories arising around natural disasters such as earthquakes. The proposed approach based on HDP could detect accurately most of the sub-stories associated with a main story in real time. It will be useful for news agencies and governments to more accurately track the evolution of sub-stories and take appropriate remedial measures. The sub-topics learnt by HDP from Twitter help humans to understand the content of sub-stories without actually inspecting them. This also avoids the need to have a separate algorithm to summarize the contents of the sub-stories.
We take into account the conversational structure in Twitter, which allows our model to more accurately track the evolution of sub-stories. By observing the rate of growth of sub-stories, one could decide which of the sub-stories require immediate attention. This is particularly useful in applications such as rumour detection, where early detection of a rumour is important. Assigning the conversational tweets to the cluster of the source tweet serves another purpose in this scenario: the replies help in assessing the truthfulness of a rumour mentioned in the source tweet. For instance, the presence of words such as 'incorrect' and 'unbelievable' in the reply tweets often indicates that the topic mentioned in the source tweet is not true.
We provide a better measure to evaluate clustering quality in sub-story detection by using adjusted mutual information. We observed that standard story detection approaches such as LSH, when applied to the sub-story detection task, produced a large number of small, highly accurate clusters. Standard metrics based on precision favor such clustering approaches, but these may not be useful in practice. Though F-score considers recall as well, very high precision often leads to a good F-score. AMI takes into account the number of clusters produced by an approach and penalizes approaches which produce a large number of clusters. By correcting for agreement between clusterings due to chance, the AMI measure better reflects the clustering quality of the approaches. We proposed to use it for comparing the quality of clusters produced in the sub-story detection task. HDP performed far better than the other approaches in terms of AMI score, which makes it an ideal candidate for sub-story detection.

Conclusion
This paper first introduced the sub-story detection task, which differs from the previously studied story detection task. Secondly, we proposed a probabilistic topic model, hierarchical Dirichlet processes (HDP), for automatic sub-story detection. HDP performs hierarchical modeling of topics and is effective in modeling sub-stories by learning sub-topics associated with the common topic of the shared real-world event.
HDP performance was compared against spectral clustering and locality sensitive hashing on several sub-story detection and story detection data sets. In general, we found that SC provides good recall, while LSH provides good precision. HDP, on the other hand, was found to have balanced precision and recall and achieves the best F-scores on all data sets. This demonstrates that HDP can handle effectively the subtle differences in sub-stories, which leads to an improved clustering performance. The superior performance of HDP is substantiated further by evaluating cluster quality via adjusted mutual information.
Lastly, our experiments also demonstrated that considering the conversational structure of tweet threads significantly improved performance of the sub-story detection approaches.
Future work will address the task of automatic sub-story ranking, which will enable users, such as journalists or emergency responders, to identify and focus on the most important sub-stories within a large volume of tweets surrounding major world events.