Exploring Twitter communication dynamics with evolving community analysis

Online Social Networks (OSNs) have been widely adopted as a means of news dissemination, event reporting, opinion expression and discussion. As a result, news and events are constantly reported and discussed online through OSNs such as Twitter. However, the variety and scale of this information render manual analysis extremely cumbersome, and therefore creating a storyline for an event or news story is an effort-intensive task. The main challenge pertains to the magnitude of data to be analyzed. To this end, we propose a framework for ranking the resulting communities and their metadata on the basis of structural, contextual and evolutionary characteristics such as community centrality, textual entropy, persistence and stability. We apply the proposed framework to three Twitter datasets and demonstrate that the ensuing analysis enables the extraction of new insights with respect to influential user accounts, topics of discussion and emerging trends. These insights could primarily assist the work of social and political scientists and of journalists in their own storytelling, but also highlight the limitations of existing analysis methods and pose new research questions. To our knowledge, this study is the first to investigate the ranking of dynamic communities. In addition, our findings suggest future work regarding the determination of the general context of the communities based on structure and evolutionary behavior alone.


OSNs have become influential means of disseminating news, reporting events and posting ideas as well [...]

To this end, we present a framework for analyzing and ranking the community structure, interaction and evolution in graphs. We also define a set of different evolution scenarios, which our method can successfully identify. A community here is essentially a subgraph which represents a set of interacting users as they tweet and mention one another. The edges of the subgraph represent the mentions made between users. A dynamic community is formed by a temporal array of the aforementioned communities with the condition that they share common users Cazabet and Amblard (2014); Nguyen et al. (2014). Community evolution detection has been previously used to study the temporal structure of a network [...] metadata that a user has to scan through is immense. In our previous work Konstantinidis et al. (2013), we proposed an adaptive approach to discover communities at points in time of increasing interest, but also a size-based varying threshold for use in community evolution detection. Both were introduced in an attempt to discard trivial information and to implicitly reduce the available content. Although the amount of information was somewhat reduced, the extraction of information still remained a tedious task. Hence, to further facilitate the browsing of information that a user has to scan through in order to discover items of interest, we present a sorted version of the data, similarly to a search engine. The sorting of the data is [...]

The closest work to dynamic community ranking was recently presented by Lu and Brelsford (2014) in a behavioral analysis case study and as such it is used here as a baseline for comparison purposes. However, it should be mentioned that the ranking was not the primary aim of their research and that the communities were separately sorted by size, thus missing the notions of importance, temporal stability and content diversity which are employed in the proposed framework. To the best of our knowledge, this is the first time that structural, temporal, importance and contextual features are fused in a dynamic community ranking algorithm for a modern online social network application.

Although the overall problem is covered by the more general field of evolving network mining, it actually breaks down into many smaller issues that need to be faced. Table 1 presents the decomposition of the problem into these issues, together with popular available methods which can be used to overcome them, along with the techniques employed by Lu and Brelsford (2014) and the ones proposed by the TISCI framework which is presented here.

In this work, we consider the user activity in the form of mentioning posts, the communities to which the users belong, and most importantly, the evolutionary and significance characteristics of these communities and use them in the ranking process. The proposed analysis is carried out in three steps. In the [...]

Due to the lack of ground truth datasets for the evaluation of the proposed framework, we devised and propose a novel, context-based evaluation scheme which could serve as a basis for future work.

It is our belief that by studying the contents of discussions being made in groups and the evolution of these groups we can produce a better understanding of the users' and communities' behavior and can give deeper insights into unfolding large-scale events. Consequently, our main contributions can be summed up in the following:

• a novel ranking framework for dynamic communities based on temporal and contextual features;
• a context-based evaluation scheme aimed at overcoming the absence of ground truth datasets for the community discovery analysis;
• an empirical study on three Twitter datasets demonstrating the merits of the proposed framework.

An additional asset of the TISCI ranking method, which is the main contribution of this paper, is that it was created to work with any kind of community evolution detection method that retains discrete temporal information and that it is independent of the choice of the community detection algorithm applied to the individual timeslots.

Mining OSN interactions is a topic that has attracted considerable interest in recent years. Interaction analysis provides insights and solutions to a wide range of problems such as cyber-attacks Wei et al. [...] users but, contrary to our method, they do not take into account the communities created by these users or the evolution of these communities.

[...] the network. Although the network selected is quite large and the method is also very fast, the system was created in order to be applied to a mobile phone network, which renders it quite different from the networks studied in this paper. The collected data lack the topic of discussion and the content of the messages between users, so there is no way to discover the reason for which a community was transformed or the effect that the transformation really had on the topic of that community. Moreover, although the features of persistence and stability are mentioned in the paper, no effort was made to rank the communities.

Nonetheless, due to its speed and scalability advantages, in this paper we decided to employ and extend [...] in the similarity threshold. The analyzed data presented valuable information as to how to select the similarity threshold but no insight as to important communities, their users or specific topics.

Another dynamic community detection method used to extract trends was introduced by Cazabet et al. [...]

The work by Lin et al. (2008) bears some similarities in terms of motivation as they also want to gain insights into large-scale evolving networks. They do this by extracting themes (concepts) and associating them with users and activities (e.g. commenting) and then try to study their evolution. However, they provide no way of ranking the extracted themes, which is the focus of our work.

One of the main problems in detecting influential communities in temporal networks is that most [...] the ranking of the extracted communities, which is the focus of this paper.

A method for ranking communities, specifically quasi-cliques, was proposed by Xiao et al. (2007), in which they rank the respective cliques with respect to their betweenness centrality. However, they also do not take temporal measures into consideration and apply their method to a call graph from a telecom carrier and a collaboration network of co-authors, thus excluding the context of the messages.

The most recent work regarding the extraction of information using evolving communities was [...]

In this paper, we employ the standard graph notation G = (V, E, w), where G stands for the whole network; [...]

Manuscript to be reviewed Computer Science

Every node in the resulting graphs represents a Twitter user who communicated tweets in the datasets by mentioning or being mentioned. A mention, and thus a directed edge between two users, is formed when one of the two creates a reference to the other in his/her posted content via use of the @ symbol. The number of mentions between them forms the edge weight.
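As an illustration of the graph construction described above, the following is a minimal sketch that turns tweets into directed, weighted mention edges. The function name `build_mention_edges`, the `(author, text)` input format and the regular expression are assumptions made for this example and are not taken from the paper.

```python
import re
from collections import Counter

# Assumption: a mention is an "@" followed by word characters.
MENTION_RE = re.compile(r"@(\w+)")


def build_mention_edges(tweets):
    """Map (author, mentioned_user) -> number of mentions (the edge weight)."""
    edges = Counter()
    for author, text in tweets:
        for mentioned in MENTION_RE.findall(text):
            src, dst = author.lower(), mentioned.lower()
            if src != dst:  # skip self-mentions, which would form self-loops
                edges[(src, dst)] += 1
    return edges


# Hypothetical tweets, used only to exercise the sketch.
tweets = [
    ("alice", "@bob did you see this?"),
    ("alice", "thanks @bob and @carol"),
    ("bob", "@alice yes!"),
    ("carol", "talking to myself @carol"),
]
edges = build_mention_edges(tweets)
```

Each key of `edges` is a directed edge and each value is its weight, so the result can be loaded into any graph library that accepts weighted edge lists.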

In this paper, we represent a dynamic network as a sequence of graph snapshots $G_1, G_2, \ldots, G_n, \ldots$.

The objective is to detect and extract dynamic communities $T$ by identifying the communities $C$ that [...] sets $C$ and $T$ is that the former contains every static community in every available timeslot, whereas the latter contains sequences of communities that evolve through time. In both $C_{i,n}$ and $T_{i,n}$, $i$ is a counter of communities and dynamic communities respectively, while particularly in $T_{i,n}$, $n$ represents the timeslot of birth of the dynamic community. Figure 5 presents an example of the most frequent conditions that communities might experience: birth, death, irregular occurrences, merging and splitting, as well as growth and decay that register when a significant percentage of the community population is affected.

In the example of Figure 5, the behavior of six potential dynamic communities is studied over a period of three timeslots ($n-1$, $n$, $n+1$). Dynamic community $T_{1,n-x}$ originated from a previous timeslot [...] that of $T_{2,n-1}$ and $T_{3,n-x}$, in which two communities started up as weak and small but evolved through a merger into one very strong, large community that continues on to $n+2$. In this case it could be that two different groups of people witnessed the same event and began conversing on it separately. As time went by, connections were made between the two groups and in timeslot $n$ they finally merged into one. In fact, the community continued to grow, as shown in timeslot $n+1$. $T_{4,n-1}$ and $T_{n/a}$ were both created (community birth) in $n-1$ and both disappeared in $n$, differing in that $T_{4,n-1}$ reappears in $n+1$ (irregular occurrence) while $T_{n/a}$ does not, and thus a dynamic community is not registered. This is the main reason why a timeslot delay is introduced in the system, as will be described later; a search strictly between communities of consecutive timeslots would result in missing such re-occurrences.

To study the various lifecycle stages of a community, the main challenge pertains to the computational [...] if $C_{a1}$ does not appear in the second snapshot, $T_{a,1}$ is not updated; a split is registered if the community appears twice in the new timeslot, and a merger marker is assigned if two or more communities seem to have merged into one.

One of the problems community evolution detection processes face is the lack of consistency in the [...] performed on a daily basis, the trail will consist of seven timeslots in order to provide a week's depth).

Hence, if the evolution of a community is not detected in the immediately preceding timeslot, the system queries [...] If the similarity exceeds a matching threshold φ, the pair is matched and $C_{in}$ is added to the timeline of [...]

[...] which is a structural measure, mean-textual-entropy and unique-URL average which are contextual measures, and an integrity coefficient inspired by the "ship of Theseus" paradox.
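The matching step described earlier in this section (a similarity test against a threshold φ, with a trail of past timeslots searched for re-occurrences) can be sketched as follows. This is only an illustration under stated assumptions: Jaccard similarity is swapped in as the similarity measure, the default values of `phi` and `trail` are hypothetical, and splits and mergers are deliberately left out to keep the example short.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of user ids (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def track_communities(snapshots, phi=0.3, trail=3):
    """Link static communities into dynamic ones.

    snapshots: list (one entry per timeslot) of lists of sets of user ids.
    phi:       matching threshold (hypothetical default).
    trail:     number of past timeslots searched for a match (hypothetical).
    Returns timelines, each a list of (timeslot, members) pairs.
    """
    timelines = []
    for n, communities in enumerate(snapshots):
        for members in communities:
            best, best_sim = None, phi
            for timeline in timelines:
                last_slot, last_members = timeline[-1]
                # Skip timelines already extended in this timeslot and those
                # whose last appearance lies outside the trail.
                if last_slot == n or n - last_slot > trail:
                    continue
                sim = jaccard(members, last_members)
                if sim > best_sim:  # similarity must exceed the threshold
                    best, best_sim = timeline, sim
            if best is None:
                timelines.append([(n, members)])  # community birth
            else:
                best.append((n, members))
    # A community that never reappears is not registered as dynamic.
    return [t for t in timelines if len(t) > 1]


# Hypothetical snapshots: the second community skips timeslot 1.
snapshots = [
    [{"a", "b", "c"}, {"x", "y"}],
    [{"a", "b", "c", "d"}],
    [{"a", "b", "d"}, {"x", "y", "z"}],
]
timelines = track_communities(snapshots)
```

With `trail=1` the community that skips a timeslot is no longer recovered, which illustrates why searching only consecutive timeslots misses irregular occurrences.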

Persistence is defined as the characteristic of a dynamic community to make an appearance in as many timeslots as possible (i.e. overall appearances / total number of timeslots), and stability as the ability to appear in as many consecutive timeslots as possible disregarding the total number of appearances (i.e. overall consecutive appearances / total number of timeslots). [...] where δ is the impulse function, m represents the total number of timeslots, x, y are the labels of the oldest [...] and quality of links to a community in order to determine an estimate of how important that community is.
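The persistence and stability ratios defined in words above can be illustrated with a short sketch. The function names and the input format (a list of timeslot indices in which the dynamic community appears) are assumptions for this example. In particular, a "consecutive appearance" is taken here to mean that the community is present in a timeslot and also in the one immediately before it, which is one possible reading of the definition.

```python
def persistence(appearances, total_timeslots):
    """Overall appearances divided by the total number of timeslots."""
    return len(set(appearances)) / total_timeslots


def stability(appearances, total_timeslots):
    """Consecutive appearances divided by the total number of timeslots.

    Assumption: an appearance in timeslot n counts as consecutive when the
    community was also present in timeslot n - 1.
    """
    slots = set(appearances)
    consecutive = sum(1 for n in slots if n - 1 in slots)
    return consecutive / total_timeslots


# Hypothetical dynamic community observed over 10 timeslots.
slots = [0, 1, 2, 5, 7, 8]
p = persistence(slots, 10)  # six appearances out of ten timeslots
s = stability(slots, 10)    # three consecutive pairs: (0,1), (1,2), (7,8)
```

Two communities with the same persistence can therefore differ in stability, depending on how scattered their appearances are.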

The same measure is also applied to the users from every dynamic community, ranking them according to their own centrality and thus providing the most influential users per timeslot. There is however a [...] Centrality as it is used here is defined as: [...] where k is the number of communities per timeslot.

One of the measures that provides a sense of popularity is virality, which in the case of Twitter datasets [...]

The integrity measure employed is an extension of the ship of Theseus coefficient. The ship of Theseus, also known as Theseus's paradox, is a thought experiment that raises the question of whether an object which has had all its components replaced remains fundamentally the same object Rea (1995). We apply this theory to find out the transformation sustained by the dynamic community by calculating the number of consistent nodes within the communities, which represents the integrity and consistency of the dynamic community.
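To make the idea of counting consistent nodes concrete, here is a small sketch of one possible integrity coefficient. The exact formula is not recoverable from the text above, so the choice made here (the share of the oldest snapshot's members that are still present in the most recent one) is an assumption, as is the function name `integrity`.

```python
def integrity(snapshots):
    """Share of the oldest snapshot's members still present in the latest one.

    snapshots: list of sets of user ids, ordered from oldest to most recent.
    Assumption: this ratio stands in for the paper's integrity coefficient.
    """
    oldest, latest = snapshots[0], snapshots[-1]
    if not oldest:
        return 0.0
    return len(oldest & latest) / len(oldest)


# Hypothetical dynamic community that gradually replaces its members.
evolving = [{"a", "b", "c", "d"}, {"a", "b", "c", "e"}, {"a", "b", "f", "g"}]
score = integrity(evolving)          # two of four founders remain
replaced = integrity([{"a", "b"}, {"c", "d"}])  # every member replaced
```

A score of 1 means the founding members are all still present, while a score of 0 corresponds to the fully rebuilt ship in the thought experiment.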

Twitter datasets differ considerably from those of other online social networks since the user is restricted to 140 characters of text. Given this restriction, we assume that there is a connection between the entropy of tweeted words used in a community (discarding URLs, mentions, hashtags and stopwords), the effort the users put into posting those tweets, and the diversity of its content. Whether there is a discussion between the users or a presentation of different events, high textual entropy implies a broader range of information and therefore more useful results. An added bonus to this feature is that spam and empty tweets containing only hashtags or mentions, as is the case in URL attention seeking tweets, rank even lower than communities containing normal retweets. For the ranking we employ the mean textual diversity of the dynamic community. The textual diversity in a community $C_i$ is measured by Shannon's entropy $H$ of the text resulting from the tweets that appear in that community as follows:

$$H(C_i) = -\sum_{m=1}^{M} p(W_m) \log p(W_m)$$

where $p(W_m)$ is the probability of a word $W_m$ appearing in a community containing $M$ words and is computed as follows: [...]

The second contextual feature to be employed regards the references cited by the users via URLs in order for them to point out something they consider important or interesting. In fact, the URLs hold a lot more information than the single tweet and as such we also consider it useful for discovering content-rich [...] and thus we prefer a lower score equal to the number of dynamic communities to be considered.
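The fusion of per-feature rankings can be sketched as follows, assuming that RRF here denotes Reciprocal Rank Fusion, in which each item receives the sum of 1 / (k + rank) over all rankings. The function name, the example rankings and the community labels are hypothetical. The default k = 60 is the value commonly used in the RRF literature, while the example sets k to the number of communities considered, following the preference stated above as far as it can be read from the surviving text.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings into one.

    rankings: list of lists, each ordering item ids from best to worst.
    k:        smoothing constant; 60 is a common default (assumption here).
    Returns item ids sorted by decreasing fused score.
    """
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Ties are broken by item id to keep the output deterministic.
    return sorted(scores, key=lambda item: (-scores[item], str(item)))


# Hypothetical rankings of three dynamic communities by three features.
by_persistence = ["T1", "T2", "T3"]
by_centrality = ["T2", "T1", "T3"]
by_entropy = ["T2", "T3", "T1"]
fused = reciprocal_rank_fusion(
    [by_persistence, by_centrality, by_entropy], k=3
)
```

Because only ranks enter the score, features measured on very different scales can be combined without normalization.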

Despite its simplicity, the RRF has proven to perform better than many other methods such as the [...] Table 3 the lack of correlation between the features' ranks encourages us to employ this simple but useful method. The correlation was measured using the Spearman rank-order correlation coefficient.

When it comes to temporal interaction analysis, scalability is always an issue. The cost of the TISCI [...] Table 4.

Despite retrieving tweets for specific keywords only, the data collected was still too large for a single user to organize and index in order to extract useful information.

Here, the granularity selection of three hours was made based on the fact that there is a discrete but not wild change in activity. Employing a coarser granularity instead of an hourly one serves to reduce [...]

Table 6. Key findings from the US election dataset

    Finding                                                   URL (if available)
1   Women's voting motivational movement on Instagram         https://goo.gl/OKs17u
2   President Obama hoops with S. Pippen on election day      https://goo.gl/Ybg83Z
3   Iran and Russia among countries with messages for Obama   https://goo.gl/hgiaaN
4   Mainstream media tipped the scales in favor of Obama      foxNews (removed)
5   Anti Obama protests escalate at university                WashPost (removed)

[...] news but other smaller pieces of information that journalists seek out. The first one for example, which is also the most heavily populated, regards a movement of motivating women into voicing their opinion by urging them to post photos of their "best voting looks" but also pleading for Tony Rocha (radio producer) to use his influence for one of the nominees. The first static community alone includes 2,774 people, some of whom are @KaliHawk, @lenadunham, @AmmaAsante, @marieclaire and others. [...] Table 6.

One of the main anticipated characteristics of this particular set is that the news media, political analysts, politicians, even celebrities are heavily mentioned in the event of an election.

Greek Election datasets

The two Greek elections of 2015 were held on January the 25th and on September the 20th and the collection of corresponding tweets was made using Greek and English keywords, hashtags and user accounts of all major running parties and their leaders. [...] year low following the victory of the anti-austerity party but also that the markets managed to shake off the initial tremors created by it. Conspiracy tweets were also posted within a community mentioning operation Gladio, a NATO stay-behind operation during the Cold War that sought to ensure Communism was never able to gain a foothold in Europe, which then evolved into sending warnings to the Syriza party as Greece was supposedly being framed as an emerging hub for terrorists. A short list of non-mainstream findings is presented in Table 7.

One of the similarities between the two election datasets which is rather impressive lies in the almost identical structure of the two evolutions, as shown by the respective heatmaps in the Evaluation section. It is also worth mentioning that many influential users (e.g. @avgerinosx, @freedybruna) and politicians (e.g. @panoskammenos, @niknikolopoulos) who were extremely active in the first election were also present in the second one.

Prior to the framework application, the network data is preprocessed as follows. Initially, all interaction data is filtered by discarding any corrupt messages, tweets which do not contain any interaction information and all self-loops (self-mentions), since they most frequently correspond to accounts that are trying to manipulate their influence score on Twitter. The filtered data is then sampled, resulting in a sequence of activity-based snapshots. Figure 2 displays the mentioning activity of the four Twitter networks.
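The filtering and snapshot steps just described can be sketched as follows. The record layout `(timestamp, source_user, target_user)`, the function name `to_snapshots` and the fixed three-hour slot width are assumptions for this example. A record with a missing field is treated as corrupt or as carrying no interaction information.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def to_snapshots(mentions, start, slot=timedelta(hours=3)):
    """Filter mention records and group them into fixed-width timeslots.

    mentions: iterable of (timestamp, source_user, target_user) tuples.
    Returns {slot_index: [(source_user, target_user), ...]}.
    """
    snapshots = defaultdict(list)
    for record in mentions:
        if len(record) != 3 or not all(record):
            continue  # corrupt record or no interaction information
        timestamp, source, target = record
        if source == target:
            continue  # self-mention (self-loop)
        if timestamp < start:
            continue  # outside the observation window
        index = (timestamp - start) // slot  # integer timeslot index
        snapshots[index].append((source, target))
    return dict(snapshots)


# Hypothetical records, used only to exercise the sketch.
start = datetime(2020, 1, 1, 0, 0)
mentions = [
    (datetime(2020, 1, 1, 0, 30), "alice", "bob"),
    (datetime(2020, 1, 1, 2, 59), "bob", "bob"),      # self-loop, dropped
    (datetime(2020, 1, 1, 3, 0), "carol", "alice"),
    (datetime(2020, 1, 1, 4, 0), "dave", None),       # no mention, dropped
    (datetime(2020, 1, 1, 7, 15), "bob", "alice"),
]
result = to_snapshots(mentions, start)
```

Each value of `result` is the edge list of one graph snapshot, ready to be passed to a community detection algorithm of choice.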

The process which puts the greatest computational burden on the framework involves the evolution [...]

• the ten most influential users from each community which could provide the potential journalist/analyst with new users who are worth following

The color heatmap in the figure represents community size but can be adjusted to also give a comparative measure of centrality or density. By using this DyCCo containing framework, the user is provided [...]

Since it is immensely difficult to evaluate community importance based on the tweets themselves, we employed Amazon's Alexa service and the contained URLs of each static community to extract the category to which it belonged. Alexa requires a domain as input and returns a string of categories in a variety of languages to which it belongs. In order to avoid duplicates, categories in a language other than English were translated automatically using Google's translating service. Unfortunately, most of the domains, even popular ones, either returned a very vague category (e.g. internet) or none at all. Hence, manual domain categorization was also necessary in order to include the most popular domains. Specifically, the URLs were categorized using the following labels: television, video, photo, news, social networking, blog, conspiracy, personal sites, politics, shop, arts and spam. The dynamic communities of the three Twitter datasets combined contained 78,499 URLs which were reduced to 8,761 unique domains.
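The reduction of URLs to unique domains mentioned above can be sketched with the Python standard library. The function names and the example URLs are hypothetical, and stripping a leading "www." is an assumption about how domains were normalized. Shortened links would need to be expanded beforehand, which this sketch does not attempt.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Return the lower-cased host of a URL, or None if it has no host."""
    host = urlparse(url).hostname  # drops port and credentials, lower-cases
    if host is None:
        return None  # e.g. text without a scheme such as "example.com/page"
    return host[4:] if host.startswith("www.") else host


def count_domains(urls):
    """Count how many URLs point to each unique domain."""
    domains = Counter()
    for url in urls:
        domain = domain_of(url)
        if domain is not None:
            domains[domain] += 1
    return domains


# Hypothetical URLs, used only to exercise the sketch.
urls = [
    "https://www.example.com/a",
    "http://Example.com/b?x=1",
    "https://news.example.org:443/story",
    "not a url",
]
domains = count_domains(urls)
```

The keys of `domains` are the unique domains that would then be passed to a categorization service or labeled manually.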

A mere 2,987 of these domains were categorized either by Alexa or manually, but the overall sample [...] the proposed method to that of the baseline. Figures 11, 12 and 13 show the comparison between the two methods for all three datasets, in which the proposed method seems to retrieve more diverse communities in most timeslots. Table 8