1 Introduction

In the past few years, social networking platforms have provided Internet users with a completely new means of social interaction. There has been a major shift from classical methods of interaction and communication to virtual societies that cross borders, geographical locations, and ethnic boundaries. In the environment of a networking platform, each user plays the role of a social actor and gets involved by sharing content, expressing ideas on assorted topics, and taking part in discussions and social activities. As a result of these interactions, terabytes of social media data are generated every day on different platforms. The popularity of the platforms, the diversity of the topics discussed by the social actors, and the online nature of the interactions make social media data a very rich source for mining public opinion. This has attracted the attention of the private sector, the public sector, intelligence agencies, and individuals who are interested in investigating and tracking trends in global and local opinion.

Among the networking platforms launched in the past decade, Twitter is one of the most successful and popular. Twitter is a microblogging service that enables its users to publish and read short posts of up to 140 characters known as Tweets. The interface allows users to subscribe to the posts of people they are interested in following. Furthermore, users can group posts together by topic using hashtags (words or phrases prefixed with a '#' sign), use the '@' sign followed by a username to mention or reply to other users, and repost another user's Tweet using the Re-Tweet (RT) feature. Twitter has experienced rapid growth since its debut in 2006. The number of Tweets posted by Twitter users has reportedly increased from 5,000 per day in 2007 to about 65 million per day in 2010.

The immense number of Tweets posted every day provides a valuable source for knowledge discovery in general, e.g., a high potential for discovering social trends. In other words, the data can be studied from different perspectives by people with a wide variety of interests. A sociologist may seek the interaction patterns and social ties between social actors (Huberman et al. 2008), a politician can use social media to measure the popularity of a political party and predict election outcomes (Tumasjan et al. 2010; O'Connor et al. 2010), and a businessman may be interested in people's opinion about a brand (Jansen et al. 2009). Besides, the metadata attached to each Tweet, such as the location, language, and number of re-Tweets, can be used to assist and enhance the analysis of the Tweets.

In this paper, we present two separate analysis methods to group and summarize the Tweets pertaining to a specific subject. We first narrow down the scope of the analysis by considering only the Tweets that include terms from a specific set of keywords related to the subject of interest. Then, we analyze the Tweets according to their generation date and author. In both analyses, each Tweet is treated as a bag-of-words document. In the first analysis method, we group the Tweets based on their generation date such that the Tweets posted on the same date form a separate class of documents. We then find the most distinctive terms and the highest-ranked Tweet for each class. The result of the analysis is a history of the most important postings about the subject of study.

The second method is more focused on the authors of the Tweets. In this method, we first create a network of the users based on the similarity of their postings during the period of study. After constructing the network, we find the largest communities of people who published similar Tweets about the subject of study. We label each community with selected terms and Tweets extracted from the postings of its members. The selected content reflects the community’s main idea regarding the topic of study. The final outcome consists of groups of people who expressed similar ideas on the subject and an insight into their ideas.

In order to validate the methods, we performed a case study on the Tweets that refer to the 'Royal Bank of Canada'. Although the subject of the case study is a company name, our approach is generic enough to extract major ideas about a wide variety of topics, ranging from a singer's name to the name of a political party.

The rest of this article is organized as follows: In Sect. 2 we cover the literature related to social network platforms and social media analysis, including Twitter. In Sect. 3 we formally present our analysis methods. Section 4 includes the case study and the analysis of the results. Section 5 presents conclusions and future work.

2 Background and related work

Motivated by the rapid growth of social networking platforms during the past decade, many social network analysis (SNA) tools have been developed to facilitate the analysis of network data. Most tools provide features for importing and exporting network data, building and visualizing a network, and calculating commonly used metrics and statistics on a network, e.g., centrality measures, clustering coefficients, and shortest paths. In addition to these basic features, some tools support more advanced features such as network partitioning, clique detection, and graph isomorphism. Hansen et al. (2009) propose a process model for social analytics tools and try to formalize the major phases of the SNA process.

Pajek, ORA, and Gephi are amongst the most popular SNA tools that are available free for non-commercial use. NodeXL (Smith et al. 2009) is another SNA tool, implemented as a Microsoft Excel plugin. In addition to the tools, SNA libraries have been implemented to simplify loading, manipulating, and visualizing networks within computer programs. The igraph library (which provides packages in R, Python, Ruby, and C) and the Java Universal Network/Graph Framework (JUNG) are two examples of SNA libraries.

Aside from the SNA tools and libraries that normally support generic features for analyzing network data and calculating low-level statistics, solutions have been developed for automatic social media monitoring and report generation. TwitterMonitor (Mathioudakis and Koudas 2010) and Radian6 are two examples of such solutions. TwitterMonitor identifies emerging topics over the Twitter stream in real time, while Radian6 monitors social media data from several networking platforms and discovers the content related to a specific brand, company, or product name. Radian6 is also able to generate sentiment analysis reports that can be very useful from a company management perspective.

2.1 Social media analysis

Many articles in the area of social network analysis focus on social media content and study the potential of using social media for measuring actual social metrics and public opinion. Bansal et al. (2007) study the existence of keyword clusters in blog posts on the blogosphere (a collection of blogs and their interconnections). They define two keywords as correlated if a large number of bloggers use them together in their blog posts, and propose algorithms to find groups of highly correlated keywords. The possibility of extracting meaningful and high-quality content from community QA systems has also been studied by many researchers, e.g., (Bian et al. 2008; Agichtein et al. 2008). Community QA systems such as Yahoo! Answers are question-answering platforms that allow their users to post questions, answer other users' questions, and rank the answers. Mathioudakis et al. (2010) propose an algorithm for online identification of items (e.g., news, announcements) that attract a lot of attention in social media. Their definition of attention-gathering items incorporates the number of actions attracted, such as the total number of links, comments, or distinct linkers related to the item, as well as the length of the time period during which the actions occurred.

On the academic side, there is a significant amount of research on the dynamics of social networks. For instance, Huberman et al. (2008) investigate friendship relationships between Twitter users. They define user B as user A's friend if user A has directed at least two posts to user B using the '@' sign. They compare the friendship network with the network created from the follow relationships and show that the friendship network is sparser than the follower-followee network by an order of magnitude. Therefore, actual friendship relationships cannot be inferred from the network built from the follow relationships. Gilbert and Karahalios (2009) built a model for distinguishing between strong and weak ties between social actors. They focus on friendships on Facebook and define over 70 numeric metrics to model the strength of a relationship. Their metrics include 'Wall words exchanged', 'Appearances together in photo', 'Days since first communication', and 'Number of mutual friends', among others. Their proposed model is able to classify friends as weak or strong with more than 85% accuracy. Java et al. (2007) investigate the topological and geographical aspects of Twitter's social network built from the follow relationships. They also apply link analysis and community detection algorithms to the network to find authorities, hubs, and communities of users. Besides, they study user intentions at the community level by finding the most frequent terms in the tweets of each community.

Furthermore, a number of research groups have recently concentrated on the analysis of Twitter data and on the development of tools that are capable of automatically handling Twitter-related cases. For instance, Kumar et al. (2011) developed TweetTracker as an analysis tool for humanitarian and disaster relief. In addition, they discuss how to understand Twitter data with TweetXplorer (Morstatter et al. 2013) and how to identify relevant users to follow on Twitter during a crisis (Kumar et al. 2013). Also, Mendoza et al. (2010) studied trust and follow issues under crisis. Morstatter et al. (2013) compared data from Twitter's streaming API with Twitter's Firehose in order to determine whether the analyzed samples are good enough. Li et al. (2012) describe TEDAS, a Twitter-based event detection and analysis system. MacEachren et al. (2011) discuss Twitter from the perspective of analytics support for situational awareness. Finally, Purohit and Sheth (2013) describe Twitris v3, which is capable of handling the analysis, coordination, and actions related to citizen sensing.

Studies are ongoing to investigate the potential of social media for mining customer sentiment, predicting the future, and supporting business decision-making. Kaplan and Haenlein (2010) survey the possibility of utilizing social media for making business decisions. They study the concept of social media and provide advice for companies and firms that plan to benefit from social media in their business. Jansen et al. (2009) study micro-blogging as a form of electronic word-of-mouth (eWOM) for sharing consumer opinions concerning brands. They try to capture an insight into the characteristics of brand micro-blogging and the overall eWOM trends of brand micro-blogging. They also study the patterns of micro-blogging communications between companies and customers by analyzing the Tweets that mention brand names and the sentiments they convey. Asur and Huberman (2010) propose predictive models to forecast box-office revenue and Hollywood Stock Exchange prices for movies. They base their models on the rate of Tweets mentioning the movie name as well as the sentiments expressed in the Tweets regarding the movie.

2.2 Community detection

A community is a group of actors who share common characteristics within the network and hence differ, as a group, from the rest of the actor groups in the same network. For instance, a group of actors may share the same or similar political views in a network constructed based on political aspects of society, or a group of actors may form a community for sharing the same medical experience and skills in a health care environment. Thus, the domain of knowledge being investigated and analyzed dictates the target for community identification within the network. A community can be seen as somewhat analogous to a cluster or a maximal closed itemset in data mining terminology. Our research group has already reported some interesting research results on how to use frequent pattern mining to locate communities in a network (Adnan et al. 2010); we also studied how to identify and analyze calling communities using machine learning techniques (Kianmehr and Alhajj 2009). Further, we utilized the community detection concept to develop a personalized Web search platform (Shafiq et al. 2009). We also reported interesting results on the use of community detection to identify molecules within the body that could be further investigated as disease biomarkers (Naji et al. 2011).

Actually, community detection in social networks has been an area of active research. Several algorithms have been proposed for finding communities, and a large body of knowledge has accumulated on the topic. Fortunato (2010) and Newman (2004) carried out surveys of community detection algorithms. According to the surveys, the algorithm of Girvan and Newman (2002), the Clique Percolation Method (CPM) (Derényi et al. 2005; Palla et al. 2005), and Modularity optimization algorithms (Newman and Girvan 2004) are among the most efficient and commonly used methods for finding communities within a network. The Girvan-Newman algorithm divides the network into communities (connected components in this case) by iteratively removing edges from the network. In each iteration, it finds the shortest paths between all pairs of nodes and removes the edge that falls on the highest number of shortest paths. CPM identifies overlapping communities by combining all k-cliques that can be reached from each other through a series of adjacent k-cliques, where two k-cliques are adjacent if they share k−1 nodes. Modularity optimization methods try to find a partition of the network into communities that maximizes the Modularity measure. Modularity measures the difference between the densities of intra- and inter-community links. In other words, a partition that encompasses a higher number of links within its communities and leaves a lower number of links between the communities receives a higher Modularity score.
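For concreteness, here is a minimal sketch of running the Girvan-Newman algorithm through the python-igraph bindings mentioned above; the Zachary karate-club graph is a standard built-in benchmark, not data from this study:

```python
# Girvan-Newman community detection: iteratively remove the edge with
# the highest betweenness, using python-igraph.
import igraph as ig

g = ig.Graph.Famous("Zachary")  # classic karate-club benchmark network

# community_edge_betweenness returns a dendrogram built by the repeated
# edge removals; as_clustering() cuts it where Modularity peaks.
dendrogram = g.community_edge_betweenness(directed=False)
communities = dendrogram.as_clustering()

print(len(communities))        # number of communities found
print(communities.membership)  # community index for each node
```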

The algorithm proposed by Blondel et al. (2008), known as the Louvain method, is a widely used Modularity optimization technique. The method consists of two major phases. In the first phase, it identifies communities by moving densely connected nodes into the same community. It then aggregates the nodes of each community and transforms the network into a new one whose nodes are the communities. These phases are repeated iteratively until no further improvement to the Modularity measure can be made. The method has shown good performance in a wide variety of applications. It is also highly scalable and is able to find communities in networks of millions of nodes and edges. Considering these desirable features, we utilize the Louvain method for finding communities in the analysis explained in Sect. 3.2.
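As a sketch, the Louvain method is exposed in python-igraph as community_multilevel; the toy weighted graph below is illustrative only:

```python
# Louvain community detection (Blondel et al. 2008) with python-igraph.
import igraph as ig

g = ig.Graph(
    edges=[(0, 1), (0, 2), (1, 2),   # first densely connected group
           (3, 4), (3, 5), (4, 5),   # second densely connected group
           (2, 3)],                  # single bridge between the groups
    directed=False,
)
g.es["weight"] = [1.0] * g.ecount()

# Greedy local moves followed by aggregation of communities into
# super-nodes, repeated until Modularity stops improving.
partition = g.community_multilevel(weights="weight")

print(partition.membership)  # community index per node, e.g. [0, 0, 0, 1, 1, 1]
print(partition.modularity)  # Modularity of the final partition
```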

2.3 The need for the proposed methods

In this work, we present two analytical methods capable of finding the main ideas expressed about a certain topic being discussed in social media. The first method groups similar ideas on a daily basis and finds the major ideas expressed by a large mass of people on each date. This gives us an impression of the evolution of people's opinion on the topic over time. The ideas can also be linked to certain events related to the topic. For instance, if the topic of study is a company name, a change in ideas may be caused by an announcement from the company. The second method considers the postings in relation to their authors and builds a network of authors based on the similarity of their postings. It then finds the communities of users who had similar postings and labels each community with key terms and a selected posting.

In some sense, our second method is similar to the work of Java et al. (2007), since they consider the network of Twitter users and try to find the intentions of each community using the most frequent terms in their postings. However, there are significant differences between the two approaches. First, Java et al. (2007) consider the user network built from follower-followee relationships, whereas we build the network based on the similarity of the users' postings. The reason we use similarities instead of follow or friend relationships is that such relationships do not necessarily imply similarity of ideas and views. In other words, two users may hold the same opinion about a specific topic without following each other, or a user and his/her followers may express conflicting views on a specific subject. Secondly, we use Mutual Information (MI) (Manning et al. 2008) instead of term frequency to find the key terms for each community. We argue that term frequency is not a true indicator of the importance of a term within a community, especially when we study communities that all have postings related to the same topic. For instance, if the topic of study is a company name, the terms corresponding to the company name and its variations would appear in postings from all communities with high frequency. However, such terms cannot be used to differentiate between the major ideas of the communities. Therefore, we need a feature selection method such as MI to find the most distinctive terms in the documents of each class. In this way we can clearly separate the various communities and analyze them for effective knowledge discovery.

3 Analysis methods

3.1 Day-to-day analysis

In this analysis, our goal is to identify the major ideas expressed in the social media data about a specific topic during a certain period of time. Besides, we want the identified ideas to be separated by date so that we can see the evolution of ideas over time and link the changes to events in the real world. Formally, given a set of posts $P$ related to topic $X$ published during time period $I$, we would like to find selected content $S_i$ for each date $i$ in $I$ such that $S_i$ gives us a summary of the social media content about $X$ on date $i$. To achieve this goal, we transform each post $p$ into a binary bag-of-words document $d_p$. Therefore, instead of $P$, we deal with a set of documents $D$. We partition $D$ based on the generation date of the documents and put all documents generated on date $i$ in a separate class of documents $c_i$. In order to find the key terms for each class of documents, we use mutual information (MI), a common feature selection measure. MI measures how much information the presence or absence of a term $t$ in a document $d$ contributes about the membership of $d$ in a class $c$. Equation 1 gives a formal definition of MI:

$$\text{MI}(U;C)=\sum_{e_t\in\{1,0\}}\sum_{e_c\in\{1,0\}} P(U=e_t, C=e_c)\,\log_{2}\frac{P(U=e_t, C=e_c)}{P(U=e_t)\,P(C=e_c)}$$
(1)

where $U$ is a random variable that takes the values $e_t = 1$ (the document contains term $t$) and $e_t = 0$ (the document does not contain $t$), and $C$ is a random variable that takes the values $e_c = 1$ (the document is in class $c$) and $e_c = 0$ (the document is not in class $c$). Equation 2 can be derived from Eq. 1 using maximum likelihood estimates (MLEs) of the probabilities:

$$\text{MI}(U;C) = \frac{N_{11}}{N}\log_{2}\frac{N\,N_{11}}{N_{1.}N_{.1}} + \frac{N_{01}}{N}\log_{2}\frac{N\,N_{01}}{N_{0.}N_{.1}} + \frac{N_{10}}{N}\log_{2}\frac{N\,N_{10}}{N_{1.}N_{.0}} + \frac{N_{00}}{N}\log_{2}\frac{N\,N_{00}}{N_{0.}N_{.0}}$$
(2)

where the $N$s are document counts whose subscripts indicate the values of $e_t$ and $e_c$. For instance, $N_{10}$ is the number of documents that contain $t$ ($e_t = 1$) and are not in $c$ ($e_c = 0$), $N_{1.} = N_{10} + N_{11}$, and $N = N_{00} + N_{01} + N_{10} + N_{11}$.

In order to find the key terms for each class, we use the Modified MI (MMI), which is the sum of the terms with the $N_{11}$ and $N_{00}$ coefficients in Eq. 2. We are not interested in the terms with the $N_{01}$ and $N_{10}$ coefficients because, for instance, it does not make sense to identify a term whose occurrence in a document significantly decreases the likelihood of membership in a class as a key term of that class. For each class, we calculate the MMI of all its terms and identify the terms with the highest scores as the key terms of the class.
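As an illustration, here is a minimal Python sketch of the MMI computation under the notation of Eq. 2; the helper names and the representation of documents as sets of terms are our own, not from the paper:

```python
import math
from collections import Counter

def mmi(n11, n10, n01, n00):
    """Modified MI: only the N11 and N00 terms of Eq. 2 are kept."""
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00   # docs with / without the term
    n_1, n_0 = n11 + n01, n10 + n00   # docs inside / outside the class
    score = 0.0
    if n11 > 0:
        score += (n11 / n) * math.log2(n * n11 / (n1_ * n_1))
    if n00 > 0:
        score += (n00 / n) * math.log2(n * n00 / (n0_ * n_0))
    return score

def key_terms(classes, k=5):
    """classes: dict mapping class id -> list of documents, where each
    document is a set of terms (binary bag of words). Returns the k
    terms with the highest MMI per class."""
    n_total = sum(len(docs) for docs in classes.values())
    df_total = Counter(t for docs in classes.values()
                         for d in docs for t in d)
    result = {}
    for c, docs in classes.items():
        df_class = Counter(t for d in docs for t in d)
        scores = {}
        for t, n11 in df_class.items():
            n01 = len(docs) - n11          # in class, term absent
            n10 = df_total[t] - n11        # out of class, term present
            n00 = n_total - n11 - n01 - n10
            scores[t] = mmi(n11, n10, n01, n00)
        result[c] = sorted(scores, key=scores.get, reverse=True)[:k]
    return result
```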

Although key terms provide an insight into the main ideas expressed on each date, a selected document that shows the terms in their context provides better intuition. To find the selected document, we first rank the documents based on an importance measure. Equation 3 defines the score of a document:

$$S(d) = \frac{\sum_{t\in d}\text{MMI}(t,c)}{|d|}$$
(3)

where $c$ is the class of $d$, and $|d|$ is the number of terms in $d$. We select the document with the highest score in each class as the selected document of the class.
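A short sketch of this ranking step, reusing hypothetical per-class MMI scores such as those computed inside the key_terms helper above:

```python
def select_document(docs, mmi_scores):
    """docs: documents (sets of terms) of one class; mmi_scores: dict
    mapping term -> MMI(t, c) for that class. Returns the document
    with the highest average MMI over its terms (Eq. 3)."""
    def score(d):
        return sum(mmi_scores.get(t, 0.0) for t in d) / len(d) if d else 0.0
    return max(docs, key=score)
```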

3.2 Network-based analysis

While the day-to-day analysis is based on the temporal aspect of postings, the network-based analysis focuses on the authors of posts. In the network-based analysis we build a network of authors based on the similarity of their posts. To measure the similarity of two posts, we first transform them into binary bag-of-words documents and then calculate the Jaccard similarity between the documents. Equation 4 defines the Jaccard similarity between documents $d_1$ and $d_2$:

$$J(d_1, d_2) = \frac{\left|d_1\cap d_2\right|}{\left|d_1\cup d_2\right|}$$
(4)
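In code, with documents represented as sets of terms, this is a one-liner (a minimal sketch):

```python
def jaccard(d1: set, d2: set) -> float:
    """Jaccard similarity between two binary bag-of-words documents (Eq. 4)."""
    if not d1 and not d2:
        return 0.0  # guard against division by zero for empty documents
    return len(d1 & d2) / len(d1 | d2)
```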

In the network of authors, we establish a link between authors $A$ and $B$ if the similarity of at least one post from $A$ and one post from $B$ is greater than a threshold $T_j$. In this case, the following equation gives the weight of the link between $A$ and $B$:

$$W(AB) = \sum_{d_1\in D_A, d_2\in D_B } F(J(d_1, d_2), T_j)$$
(5)

where $D_A$ is the set of documents published by author $A$, and $F(m,n)$ returns $m$ if $m > n$, and zero otherwise.
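A sketch of the edge-weight computation of Eq. 5, reusing the jaccard helper above; the default threshold value of 0.5 is illustrative, as the paper does not fix $T_j$ here:

```python
def edge_weight(docs_a, docs_b, t_j=0.5):
    """Weight of the link between two authors per Eq. 5: the sum of all
    pairwise Jaccard similarities exceeding the threshold T_j.
    A link is created only when the resulting weight is positive."""
    weight = 0.0
    for d1 in docs_a:       # documents of author A
        for d2 in docs_b:   # documents of author B
            s = jaccard(d1, d2)
            if s > t_j:
                weight += s
    return weight
```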

After constructing the network, we proceed to the community detection phase. We apply the Louvain method to find communities of authors in the network. Since a link between two nodes reflects the similarity between the documents published by the corresponding authors, communities of densely connected nodes correspond to groups of authors who had similar postings. To get a better insight into the main ideas conveyed by each group of authors, we consider the documents published by each group as a separate class of documents and extract the key terms and selected document using the same approach as in the day-to-day analysis.

3.3 Scalability of the methods

In order to study the scalability of the proposed methods, we assume that all the posts, each associated with its author, generation date, and unique ID, are stored in a database. For the day-to-day analysis, we need one scan of the database to build an index of classes, documents, and terms. To perform the MMI calculations efficiently, the index should keep track of the documents in each class, the terms in each class, the number of occurrences of each term in the documents of a class, the total number of occurrences of each term, and the total number of documents. The size of the index depends largely on the number of terms in the corpus, as terms are the only non-numeric values that must be kept in memory. However, building the index would not normally raise a memory limitation issue, since it is very unlikely that the number of unique terms in postings related to a specific subject exceeds a few million. After finding the key terms and selected documents for each class, we retrieve the full text of the selected documents from the database using the document IDs.
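A minimal sketch of the single-pass index construction described above; the row layout and field names are our assumption, and any database cursor yielding these fields would do:

```python
from collections import Counter, defaultdict

def build_index(rows):
    """One scan over (doc_id, date, terms) rows from the database.
    Keeps in memory only the counts needed for the MMI computation;
    full post texts stay in the database and are fetched by ID later."""
    class_docs = defaultdict(list)   # date -> [doc_id, ...]
    class_df = defaultdict(Counter)  # date -> term -> doc frequency in class
    total_df = Counter()             # term -> doc frequency over all classes
    n_docs = 0
    for doc_id, date, terms in rows:
        terms = set(terms)           # binary bag of words: presence only
        class_docs[date].append(doc_id)
        class_df[date].update(terms)
        total_df.update(terms)
        n_docs += 1
    return class_docs, class_df, total_df, n_docs
```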

The network construction phase of the second analysis poses the major scalability challenge. To build the network we need to perform $O(N^2)$ similarity calculations, which can significantly increase the run-time of the method. To resolve this issue, we define a time interval parameter $H$ and compare each document only with the documents created within $H$ hours of its generation date. The interval parameter improves the running time of the program by reducing the number of comparisons. It also resolves the potential memory limitation by reducing the amount of memory required for the data structures involved in the similarity calculations. Furthermore, it enhances the performance of the network-based analysis by incorporating the temporal aspect of the documents. Once we construct the network, scalability is not a significant issue, as the Louvain method is able to find communities in networks of millions of nodes and edges. Once the communities are identified, the selected content for the communities can be found using the same index structure as in the day-to-day analysis.
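One simple way to realize the $H$-hour window, assuming the documents are sorted by timestamp, is a sliding scan that stops as soon as a document falls outside the window; this sketch is our own realization, not necessarily the implementation used in the study:

```python
from datetime import timedelta

def candidate_pairs(docs, h_hours=48):
    """docs: list of (author, timestamp, terms) tuples sorted by timestamp.
    Yields only document pairs created within h_hours of each other,
    avoiding the full O(N^2) comparison."""
    window = timedelta(hours=h_hours)
    for i, (a1, t1, d1) in enumerate(docs):
        for a2, t2, d2 in docs[i + 1:]:
            if t2 - t1 > window:
                break  # docs are sorted, so later ones are even further away
            if a1 != a2:   # no self-links between an author's own posts
                yield (a1, d1), (a2, d2)
```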

4 Case study

4.1 The dataset

The data used in this study were collected from Twitter's free API during April 2011. The dataset contains 7,275 Tweets posted by 4,731 distinct Twitter users from April 8 to 27, 2011. All the Tweets in the dataset mention the Royal Bank of Canada (RBC) by including either 'RBC' or both 'Royal Bank' and 'Canada'.

4.2 Day-to-day analysis

Table 1 lists the number of Tweets and the key terms for each date in the period of study. According to the results reported in the table, there is a peak in the number of Tweets on April 12. Table 2 shows the selected Tweet for each date. Based on the selected Tweet, we can see that the peak on April 12 was caused by a note to investors about iPad sales from a managing director at RBC Capital Markets.

Table 1 Day-to-day analysis result, number of Tweets and key terms of the day
Table 2 Day-to-day analysis result, selected Tweet of the day

4.3 Network-based analysis

We first build the network of users based on the similarity of the Tweets using $H = 48$. Then we refine the network by removing isolated nodes. Isolated nodes are like outliers: they correspond to users whose Tweets are not similar to anyone else's Tweets. The refined network has 2,481 nodes that are grouped into 411 communities by the community detection algorithm. To avoid clutter, we only consider the top 20 largest communities, which encompass 1,199 nodes. Figure 1 shows the top communities in different colors. Table 3 lists the size and density of the communities.

Fig. 1 Network-based analysis, top 20 largest communities shown in different colors

Table 3 Network-based analysis, size and density of the top communities

Tables 4 and 5, respectively, list the key terms and selected Tweet of each community.

Table 4 Network-based analysis, key terms of each community
Table 5 Network-based analysis, selected Tweet of each community

4.4 Discussion of the results

Taking a closer look at the results reveals major conflicts between the results reported by the two methods. For instance, the selected Tweet of $C_6$ does not appear among the selected Tweets of Table 2, and the selected Tweet of the 10th is not listed as any community's selected Tweet. If we count the number of Tweets containing the corresponding key terms, we find that there are 53 Tweets containing all of the terms 'Toronto', 'manager', and 'senior', while there are only 18 Tweets containing 'young', 'percent', and 'home'. However, the first set of Tweets was posted sporadically from the 13th to the 26th, while the second set was published in a three-day period from the 9th to the 11th, 14 of which were posted on the 10th. In other words, the selected Tweet of the 10th represents a small burst of Tweets corresponding to the release of a survey by RBC. Although the event did not attract much attention from the users (only 18 Tweets posted), it was identified as the selected content of the day because of the absence of another trend in the small number of Tweets (193) posted on the 10th. In contrast, the selected Tweet of $C_6$ represents a major trend in the Tweets that was overlooked by the day-to-day analysis because the Tweets that created the trend were posted over a period of several days. Notice that the density of a community gives us a clue about the dispersion of its Tweets over time. Normally, low-density communities correspond to trends that occurred over a relatively long period of time, while high-density communities are formed by sudden bursts of Tweets.

Figure 2 shows the activity history of the communities and individual users during the period of the analysis. An interesting observation from this figure is the high rate of activity of the users in communities $C_2$ and $C_3$ on the 12th. This observation reveals another deficiency of the day-to-day analysis: one selected Tweet per day can only reflect one trend per day. That is why the selected post of $C_2$ is neglected by the first analysis, although Fig. 1 shows that $C_2$ is a separate large and dense community with a limited number of connections to $C_3$.

Fig. 2 Activity history of communities and individual users

In conclusion, the day-to-day analysis is a good approach for summarizing the documents on a daily basis and observing the evolution of ideas over time. However, when the goal is to gain a deep insight into the major ideas expressed over a period of time, the network-based analysis is more effective. In addition, the network-based method generates a meaningful division of users into communities, and each community reveals an interesting perspective shared mainly by its members.

5 Conclusion and future work

In this paper, we presented two analysis methods for summarizing a high volume of social media data about a specific subject. We investigated the scalability of the methods and, to evaluate them, conducted a case study on a dataset of Tweets that mention the Royal Bank of Canada. We also compared the results produced by the two methods and explained the discrepancies. The study confirmed the viability of the proposed analysis methods and demonstrated their effectiveness in producing good summarizations as guidance for an effective decision-making process. For future work, we plan to study the evolution of the network of Tweet authors. Our goal is to find out how authors move between communities, how new authors join the network, and which users lose interest in the subject and become inactive over time. We want to investigate how decision makers benefit from the outcome of the study, how new trends arising from the decisions taken would affect the communities, and whether the communities would shrink or grow. We also want to study the effect of the reaction to social media back on social media, i.e., how Tweeters react to positive or negative reactions from decision makers: do all Tweets within the same community share the same trend in their reaction, or would the communities shuffle in response to the decisions from management? Finally, we consider the possibility of using metadata, e.g., location, hashtags, and retweets, to enhance the analysis.