1 Introduction

In the past few years, social networking platforms have provided Internet users with a completely new means of social interaction. There has been a major shift from classical methods of interaction and communication to virtual societies that cross borders, geographical locations, and ethnic boundaries. In the environment of a networking platform, each user plays the role of a social actor and gets involved by sharing content, expressing ideas on assorted topics, and taking part in discussions and social activities. As a result of these interactions, terabytes of social media data are generated every day on different platforms. The popularity of the platforms, the diversity of the topics discussed by the social actors, and the online nature of the interactions make social media data a very rich source for mining public opinion. This has attracted the attention of the private sector, the public sector, intelligence agencies, and individuals who are interested in investigating and tracking trends in global and local opinion.

Among the networking platforms launched in the past decade, Twitter is one of the most successful and popular. Twitter is a microblogging service that enables its users to publish and read short posts of up to 140 characters known as Tweets. The interface allows users to subscribe to the posts of people they are interested in following. Furthermore, users can group posts together by topic using hashtags (words or phrases prefixed with a '#' sign), use the '@' sign followed by a username to mention or reply to other users, and repost another user's Tweet using the Re-Tweet (RT) feature. Twitter has experienced rapid growth since its debut in 2006. The number of Tweets posted by Twitter users has reportedly increased from 5,000 per day in 2007 to about 65 million per day in 2010.

The immense number of Tweets posted every day provides a valuable source for knowledge discovery in general, e.g., a high potential for discovering social trends. In other words, the data can be studied from different perspectives by people with a wide variety of interests. A sociologist may seek the interaction patterns and social ties between social actors (Huberman et al. 2008), a politician can use social media to measure the popularity of a political party and predict election outcomes (Tumasjan et al. 2010; O'Connor et al. 2010), and a businessman may be interested in people's opinion about a brand (Jansen et al. 2009). Besides, the metadata attached to each Tweet, such as the location, language, and number of re-Tweets, can be used to assist and enhance the analysis of the Tweets.

In this paper, we present two separate analysis methods to group and summarize the Tweets pertaining to a specific subject. We first narrow down the scope of the analysis by considering only the Tweets that include terms from a specific set of keywords related to the subject of interest. Then, we analyze the Tweets according to their generation date and author. In both analyses, each Tweet is treated as a bag-of-words document. In the first analysis method, we group the Tweets based on their generation date such that the Tweets posted on the same date form a separate class of documents. We then find the most distinctive terms and the highest-ranked Tweet for each class. The result of the analysis is a history of the most important postings about the subject of study.

The second method is more focused on the authors of the Tweets. In this method, we first create a network of the users based on the similarity of their postings during the period of study. After constructing the network, we find the largest communities of people who published similar Tweets about the subject of study. We label each community with selected terms and Tweets extracted from the postings of its members. The selected content reflects the community’s main idea regarding the topic of study. The final outcome consists of groups of people who expressed similar ideas on the subject and an insight into their ideas.

In order to validate the methods, we performed a case study on the Tweets that refer to the 'Royal Bank of Canada'. Although the subject of the case study is a company name, our approach is generic enough to extract major ideas about a wide variety of topics, ranging from a singer's name to the name of a political party.

The rest of this article is organized as follows: In Sect. 2 we cover the literature related to social network platforms and social media analysis, including Twitter. In Sect. 3 we formally present our analysis methods. Section 4 includes the case study and the analysis of the results. Section 5 presents conclusions and future work.

2 Background and related work

Motivated by the rapid growth of social networking platforms during the past decade, many social network analysis (SNA) tools have been developed to facilitate the analysis of network data. Most tools provide features for importing and exporting network data, building and visualizing a network, and calculating commonly used metrics and statistics on a network, e.g., centrality measures, clustering coefficients, and shortest paths. In addition to these basic features, some tools support more advanced features such as network partitioning, clique detection, and graph isomorphism. Hansen et al. (2009) propose a process model for social analytics tools and try to formalize the major phases of the SNA process.

Pajek, ORA, and Gephi are amongst the most popular SNA tools that are available free for non-commercial use. NodeXL (Smith et al. 2009) is another SNA tool, implemented as a Microsoft Excel plugin. In addition to the tools, SNA libraries have been implemented to simplify loading, manipulating, and visualizing networks within computer programs. The igraph library (which provides packages in R, Python, Ruby, and C) and the Java Universal Network/Graph Framework (JUNG) are two examples of SNA libraries.

Aside from the SNA tools and libraries that normally support generic features for analyzing network data and calculating low-level statistics, solutions have been developed for automatic social media monitoring and report generation. TwitterMonitor (Mathioudakis and Koudas 2010) and Radian6 are two examples of such solutions. TwitterMonitor identifies emerging topics over the Twitter stream in real time, while Radian6 monitors social media data from several networking platforms and discovers the content related to a specific brand, company, or product name. Radian6 is also able to generate sentiment analysis reports that can be very useful from a company management perspective.

2.1 Social media analysis

Many articles in the area of social network analysis focus on social media content and study the potential of using social media for measuring actual social metrics and public opinion. Bansal et al. (2007) study the existence of keyword clusters in blog posts on the blogosphere (a collection of blogs and their interconnections). They define two keywords as correlated if a large number of bloggers use them together in their blog posts, and propose algorithms to find groups of highly correlated keywords. The possibility of extracting meaningful and high-quality content from community QA systems has also been studied by many researchers, e.g., (Bian et al. 2008; Agichtein et al. 2008). Community QA systems such as Yahoo! Answers are question-answering platforms that allow their users to post questions, answer other users' questions, and rank the answers. Mathioudakis et al. (2010) propose an algorithm for online identification of items (e.g., news, announcements) that attract a lot of attention in social media. Their definition of attention-gathering items incorporates the number of actions attracted, such as the total number of links, comments, or distinct linkers related to the item, as well as the length of the time period during which the actions occurred.

On the academic side, there is a significant amount of research on the dynamics of social networks. For instance, Huberman et al. (2008) investigate friendship relationships between Twitter users. They define user B as user A's friend if user A has directed at least two posts to user B using the '@' sign. They compare the friendship network with the network created from the follow relationships and show that the friendship network is sparser than the follower-followee network by an order of magnitude. Therefore, actual friendship relationships cannot be inferred from the network built from the follow relationships. Gilbert and Karahalios (2009) built a model for distinguishing between strong and weak ties between social actors. They focus on friendships on Facebook and define over 70 numeric metrics to model the strength of a relationship. Their metrics include 'Wall words exchanged', 'Appearances together in photo', 'Days since first communication', and 'Number of mutual friends', among others. Their proposed model is able to classify friends as weak or strong with more than 85% accuracy. Java et al. (2007) investigate the topological and geographical aspects of Twitter's social network built from the follow relationships. They also apply link analysis and community detection algorithms to the network to find authorities, hubs, and communities of users. Besides, they study user intentions at the community level by finding the most frequent terms in the tweets of each community.

Furthermore, a number of research groups have recently concentrated on the analysis of Twitter data and on the development of tools that are capable of automatically handling Twitter-related cases. For instance, Kumar et al. (2011) developed TweetTracker as an analysis tool for humanitarian and disaster relief. In addition, they discuss how to understand Twitter data with TweetXplorer (Morstatter et al. 2013) and how to identify relevant users to follow on Twitter during a crisis (Kumar et al. 2013). Also, Mendoza et al. (2010) studied trust and follow issues under crisis. Morstatter et al. (2013) compared data from Twitter's streaming API with Twitter's Firehose in order to determine whether the analyzed samples are good enough. Li et al. (2012) describe TEDAS, a Twitter-based event detection and analysis system. MacEachren et al. (2011) discuss Twitter from the perspective of analytics support for situational awareness. Finally, Purohit and Sheth (2013) describe Twitris v3, which is capable of handling the analysis, coordination, and actions related to citizen sensing.

Studies are ongoing to investigate the potential of social media for mining customer sentiment, predicting the future, and supporting business decision-making. Kaplan and Haenlein (2010) survey the possibility of utilizing social media for making business decisions. They study the concept of social media and provide advice for companies and firms that plan to benefit from social media in their business. Jansen et al. (2009) study micro-blogging as a form of electronic word-of-mouth (eWOM) for sharing consumer opinions concerning brands. They try to capture an insight into the characteristics of brand micro-blogging and the overall eWOM trends of brand micro-blogging. They also study the patterns of micro-blogging communications between companies and customers by analyzing the Tweets that mention brand names and the sentiments they convey. Asur and Huberman (2010) propose predictive models to forecast box-office revenue and Hollywood Stock Exchange prices for movies. They base their models on the rate of Tweets mentioning the movie name as well as the sentiments expressed in the Tweets regarding the movie.

2.2 Community detection

A community is a group of actors who share common characteristics within the network and hence differ, as a group, from the rest of the actor groups in the same network. For instance, a group of actors may share the same or similar political views in a network constructed based on political aspects of society, or a group of actors may form a community for sharing the same medical experience and skills in a health care environment. Thus, the domain of knowledge being investigated and analyzed dictates the target for community identification within the network. A community can be seen as somewhat analogous to a cluster or a maximal closed itemset in data mining terminology. Our research group has already reported some interesting research results on how to use frequent pattern mining to locate communities in a network (Adnan et al. 2010); we also studied how to identify and analyze calling communities using machine learning techniques (Kianmehr and Alhajj 2009). Further, we utilized the community detection concept to develop a personalized Web search platform (Shafiq et al. 2009). We also reported interesting results on the use of community detection to identify molecules within the body that could be further investigated as disease biomarkers (Naji et al. 2011).

Actually, community detection in social networks has been an area of active research. Several algorithms have been proposed for finding communities, and a large body of knowledge has accumulated on the topic. Fortunato (2010) and Newman (2004) carried out surveys of community detection algorithms. According to the surveys, the algorithm of Girvan and Newman (2002), the Clique Percolation Method (CPM) (Derényi et al. 2005; Palla et al. 2005), and Modularity optimization algorithms (Newman and Girvan 2004) are among the most efficient and commonly used methods for finding communities within a network. The Girvan-Newman algorithm divides the network into communities (connected components in this case) by iteratively removing edges from the network. In each iteration, it finds the shortest paths between all pairs of nodes and removes the edge that falls on the highest number of shortest paths. CPM identifies overlapping communities by combining all k-cliques that can be reached from each other through a series of adjacent k-cliques, where two k-cliques are adjacent if they share k−1 nodes. Modularity optimization methods try to find a partition of the network into communities that maximizes the Modularity measure. Modularity measures the difference between the densities of intra- and inter-community links. In other words, a partition that encompasses a higher number of links within its communities and leaves a lower number of links between the communities receives a higher Modularity score.
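For concreteness, here is a minimal sketch of running the Girvan-Newman algorithm through the python-igraph bindings mentioned above; the Zachary karate-club graph is a standard built-in benchmark, not data from this study:

```python
# Girvan-Newman community detection: iteratively remove the edge with
# the highest betweenness, using python-igraph.
import igraph as ig

g = ig.Graph.Famous("Zachary")  # classic karate-club benchmark network

# community_edge_betweenness returns a dendrogram built by the repeated
# edge removals; as_clustering() cuts it where Modularity peaks.
dendrogram = g.community_edge_betweenness(directed=False)
communities = dendrogram.as_clustering()

print(len(communities))        # number of communities found
print(communities.membership)  # community index for each node
```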

The algorithm proposed by Blondel et al. (2008), known as the Louvain method, is a widely used Modularity optimization technique. The method consists of two major phases. In the first phase, it identifies communities by moving densely connected nodes into the same community. It then aggregates the nodes of each community and transforms the network into a new one whose nodes are the communities. These phases are repeated iteratively until no further improvement to the Modularity measure can be made. The method has shown good performance in a wide variety of applications. It is also highly scalable and is able to find communities in networks of millions of nodes and edges. Considering these desirable features, we utilize the Louvain method for finding communities in the analysis explained in Sect. 3.2.
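As a sketch, the Louvain method is exposed in python-igraph as community_multilevel; the toy weighted graph below is illustrative only:

```python
# Louvain community detection (Blondel et al. 2008) with python-igraph.
import igraph as ig

g = ig.Graph(
    edges=[(0, 1), (0, 2), (1, 2),   # first densely connected group
           (3, 4), (3, 5), (4, 5),   # second densely connected group
           (2, 3)],                  # single bridge between the groups
    directed=False,
)
g.es["weight"] = [1.0] * g.ecount()

# Greedy local moves followed by aggregation of communities into
# super-nodes, repeated until Modularity stops improving.
partition = g.community_multilevel(weights="weight")

print(partition.membership)  # community index per node, e.g. [0, 0, 0, 1, 1, 1]
print(partition.modularity)  # Modularity of the final partition
```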

2.3 The need for the proposed methods

In this work, we present two analytical methods capable of finding the main ideas expressed about a certain topic being discussed in social media. The first method groups similar ideas on a daily basis and finds the major ideas expressed by a large mass of people on each date. This gives us an impression of the evolution of people's opinion on the topic over time. The ideas can also be linked to certain events related to the topic. For instance, if the topic of study is a company name, a change in ideas may be caused by an announcement from the company. The second method considers the postings in relation to their authors and builds a network of authors based on the similarity of their postings. It then finds the communities of users who had similar postings and labels each community with key terms and a selected posting.

In some sense, our second method is similar to the work of Java et al. (2007), since they consider the network of Twitter users and try to find the intentions of each community using the most frequent terms in their postings. However, there are significant differences between the two approaches. First, Java et al. (2007) consider the user network built from follower-followee relationships, whereas we build the network based on the similarity of the users' postings. The reason we use similarities instead of follow or friend relationships is that such relationships do not necessarily imply similarity of ideas and views. In other words, two users may hold the same opinion about a specific topic without following each other, or a user and his/her followers may express conflicting views on a specific subject. Secondly, we use Mutual Information (MI) (Manning et al. 2008) instead of term frequency to find the key terms for each community. We argue that term frequency is not a true indicator of the importance of a term within a community, especially when we study communities that all have postings related to the same topic. For instance, if the topic of study is a company name, the terms corresponding to the company name and its variations would appear in postings from all communities with high frequency. However, such terms cannot be used to differentiate between the major ideas of the communities. Therefore, we need a feature selection method such as MI to find the most distinctive terms in the documents of each class. In this way we can clearly separate the various communities and analyze them for effective knowledge discovery.

3 Analysis methods

3.1 Day-to-day analysis

In this analysis, our goal is to identify the major ideas expressed in the social media data about a specific topic during a certain period of time. Besides, we want the identified ideas to be separated by date so that we can see the evolution of ideas over time and link the changes to events in the real world. Formally, given a set of posts $P$ related to topic $X$ published during time period $I$, we would like to find selected content $S_i$ for each date $i$ in $I$ such that $S_i$ gives us a summary of the social media content about $X$ on date $i$. To achieve this goal, we transform each post $p$ into a binary bag-of-words document $d_p$. Therefore, instead of $P$, we deal with a set of documents $D$. We partition $D$ based on the generation date of the documents and put all documents generated on date $i$ in a separate class of documents $c_i$. In order to find the key terms for each class of documents, we use mutual information (MI), a common feature selection measure. MI measures how much information the presence or absence of a term $t$ in a document $d$ contributes about the membership of $d$ in a class $c$. Equation 1 gives a formal definition of MI:

$$\text{MI}(U;C)=\sum_{e_t\in\{1,0\}}\sum_{e_c\in\{1,0\}} P(U=e_t, C=e_c)\,\log_{2}\frac{P(U=e_t, C=e_c)}{P(U=e_t)\,P(C=e_c)}$$
(1)

where $U$ is a random variable that takes the values $e_t = 1$ (the document contains term $t$) and $e_t = 0$ (the document does not contain $t$), and $C$ is a random variable that takes the values $e_c = 1$ (the document is in class $c$) and $e_c = 0$ (the document is not in class $c$). Equation 2 can be derived from Eq. 1 using maximum likelihood estimates (MLEs) of the probabilities:

$$\text{MI}(U;C) = \frac{N_{11}}{N}\log_{2}\frac{N\,N_{11}}{N_{1.}N_{.1}} + \frac{N_{01}}{N}\log_{2}\frac{N\,N_{01}}{N_{0.}N_{.1}} + \frac{N_{10}}{N}\log_{2}\frac{N\,N_{10}}{N_{1.}N_{.0}} + \frac{N_{00}}{N}\log_{2}\frac{N\,N_{00}}{N_{0.}N_{.0}}$$
(2)

where the $N$s are document counts whose subscripts indicate the values of $e_t$ and $e_c$. For instance, $N_{10}$ is the number of documents that contain $t$ ($e_t = 1$) and are not in $c$ ($e_c = 0$), $N_{1.} = N_{10} + N_{11}$, and $N = N_{00} + N_{01} + N_{10} + N_{11}$.

In order to find the key terms for each class, we use the Modified MI (MMI), which is the sum of the terms with the $N_{11}$ and $N_{00}$ coefficients in Eq. 2. We are not interested in the terms with the $N_{01}$ and $N_{10}$ coefficients because, for instance, it does not make sense to identify a term whose occurrence in a document significantly decreases the likelihood of membership in a class as a key term of that class. For each class, we calculate the MMI of all its terms and identify the terms with the highest scores as the key terms of the class.
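As an illustration, here is a minimal Python sketch of the MMI computation under the notation of Eq. 2; the helper names and the representation of documents as sets of terms are our own, not from the paper:

```python
import math
from collections import Counter

def mmi(n11, n10, n01, n00):
    """Modified MI: only the N11 and N00 terms of Eq. 2 are kept."""
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00   # docs with / without the term
    n_1, n_0 = n11 + n01, n10 + n00   # docs inside / outside the class
    score = 0.0
    if n11 > 0:
        score += (n11 / n) * math.log2(n * n11 / (n1_ * n_1))
    if n00 > 0:
        score += (n00 / n) * math.log2(n * n00 / (n0_ * n_0))
    return score

def key_terms(classes, k=5):
    """classes: dict mapping class id -> list of documents, where each
    document is a set of terms (binary bag of words). Returns the k
    terms with the highest MMI per class."""
    n_total = sum(len(docs) for docs in classes.values())
    df_total = Counter(t for docs in classes.values()
                         for d in docs for t in d)
    result = {}
    for c, docs in classes.items():
        df_class = Counter(t for d in docs for t in d)
        scores = {}
        for t, n11 in df_class.items():
            n01 = len(docs) - n11          # in class, term absent
            n10 = df_total[t] - n11        # out of class, term present
            n00 = n_total - n11 - n01 - n10
            scores[t] = mmi(n11, n10, n01, n00)
        result[c] = sorted(scores, key=scores.get, reverse=True)[:k]
    return result
```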

Although key terms provide an insight into the main ideas expressed on each date, a selected document that shows the terms in their context provides better intuition. To find the selected document, we first rank the documents based on an importance measure. Equation 3 defines the score of a document:

$$S(d) = \frac{\sum_{t\in d}\text{MMI}(t,c)}{|d|}$$
(3)

where $c$ is the class of $d$, and $|d|$ is the number of terms in $d$. We select the document with the highest score in each class as the selected document of the class.
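A short sketch of this ranking step, reusing hypothetical per-class MMI scores such as those computed inside the key_terms helper above:

```python
def select_document(docs, mmi_scores):
    """docs: documents (sets of terms) of one class; mmi_scores: dict
    mapping term -> MMI(t, c) for that class. Returns the document
    with the highest average MMI over its terms (Eq. 3)."""
    def score(d):
        return sum(mmi_scores.get(t, 0.0) for t in d) / len(d) if d else 0.0
    return max(docs, key=score)
```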

3.2 Network-based analysis

While the day-to-day analysis is based on the temporal aspect of postings, the network-based analysis focuses on the authors of posts. In the network-based analysis we build a network of authors based on the similarity of their posts. To measure the similarity of two posts, we first transform them into binary bag-of-words documents and then calculate the Jaccard similarity between the documents. Equation 4 defines the Jaccard similarity between documents $d_1$ and $d_2$:

$$J(d_1, d_2) = \frac{\left|d_1\cap d_2\right|}{\left|d_1\cup d_2\right|}$$
(4)
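In code, with documents represented as sets of terms, this is a one-liner (a minimal sketch):

```python
def jaccard(d1: set, d2: set) -> float:
    """Jaccard similarity between two binary bag-of-words documents (Eq. 4)."""
    if not d1 and not d2:
        return 0.0  # guard against division by zero for empty documents
    return len(d1 & d2) / len(d1 | d2)
```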

In the network of authors, we establish a link between authors $A$ and $B$ if the similarity of at least one post from $A$ and one post from $B$ is greater than a threshold $T_j$. In this case, the following equation gives the weight of the link between $A$ and $B$:

$$W(AB) = \sum_{d_1\in D_A, d_2\in D_B } F(J(d_1, d_2), T_j)$$
(5)

where $D_A$ is the set of documents published by author $A$, and $F(m,n)$ returns $m$ if $m > n$, and zero otherwise.
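A sketch of the edge-weight computation of Eq. 5, reusing the jaccard helper above; the default threshold value of 0.5 is illustrative, as the paper does not fix $T_j$ here:

```python
def edge_weight(docs_a, docs_b, t_j=0.5):
    """Weight of the link between two authors per Eq. 5: the sum of all
    pairwise Jaccard similarities exceeding the threshold T_j.
    A link is created only when the resulting weight is positive."""
    weight = 0.0
    for d1 in docs_a:       # documents of author A
        for d2 in docs_b:   # documents of author B
            s = jaccard(d1, d2)
            if s > t_j:
                weight += s
    return weight
```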

After constructing the network, we proceed to the community detection phase. We apply the Louvain method to find communities of authors in the network. Since a link between two nodes reflects the similarity between the documents published by the corresponding authors, communities of densely connected nodes correspond to groups of authors who had similar postings. To get a better insight into the main ideas conveyed by each group of authors, we consider the documents published by each group as a separate class of documents and extract the key terms and selected document using the same approach as in the day-to-day analysis.

3.3 Scalability of the methods

In order to study the scalability of the proposed methods, we assume that all the posts, each associated with its author, generation date, and unique ID, are stored in a database. For the day-to-day analysis, we need one scan of the database to build an index of classes, documents, and terms. To perform the MMI calculations efficiently, the index should keep track of the documents in each class, the terms in each class, the number of occurrences of each term in the documents of a class, the total number of occurrences of each term, and the total number of documents. The size of the index depends largely on the number of terms in the corpus, as terms are the only non-numeric values that must be kept in memory. However, building the index would not normally raise a memory limitation issue, since it is very unlikely that the number of unique terms in postings related to a specific subject exceeds a few million. After finding the key terms and selected documents for each class, we retrieve the full text of the selected documents from the database using the document IDs.
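A minimal sketch of the single-pass index construction described above; the row layout and field names are our assumption, and any database cursor yielding these fields would do:

```python
from collections import Counter, defaultdict

def build_index(rows):
    """One scan over (doc_id, date, terms) rows from the database.
    Keeps in memory only the counts needed for the MMI computation;
    full post texts stay in the database and are fetched by ID later."""
    class_docs = defaultdict(list)   # date -> [doc_id, ...]
    class_df = defaultdict(Counter)  # date -> term -> doc frequency in class
    total_df = Counter()             # term -> doc frequency over all classes
    n_docs = 0
    for doc_id, date, terms in rows:
        terms = set(terms)           # binary bag of words: presence only
        class_docs[date].append(doc_id)
        class_df[date].update(terms)
        total_df.update(terms)
        n_docs += 1
    return class_docs, class_df, total_df, n_docs
```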

The network construction phase of the second analysis poses the major scalability challenge. To build the network we need to perform $O(N^2)$ similarity calculations, which can significantly increase the run-time of the method. To resolve this issue, we define a time interval parameter $H$ and compare each document only with the documents created within $H$ hours of its generation date. The interval parameter improves the running time of the program by reducing the number of comparisons. It also resolves the potential memory limitation by reducing the amount of memory required for the data structures involved in the similarity calculations. Furthermore, it enhances the performance of the network-based analysis by incorporating the temporal aspect of the documents. Once we construct the network, scalability is not a significant issue, as the Louvain method is able to find communities in networks of millions of nodes and edges. Once the communities are identified, the selected content for the communities can be found using the same index structure as in the day-to-day analysis.
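One simple way to realize the $H$-hour window, assuming the documents are sorted by timestamp, is a sliding scan that stops as soon as a document falls outside the window; this sketch is our own realization, not necessarily the implementation used in the study:

```python
from datetime import timedelta

def candidate_pairs(docs, h_hours=48):
    """docs: list of (author, timestamp, terms) tuples sorted by timestamp.
    Yields only document pairs created within h_hours of each other,
    avoiding the full O(N^2) comparison."""
    window = timedelta(hours=h_hours)
    for i, (a1, t1, d1) in enumerate(docs):
        for a2, t2, d2 in docs[i + 1:]:
            if t2 - t1 > window:
                break  # docs are sorted, so later ones are even further away
            if a1 != a2:   # no self-links between an author's own posts
                yield (a1, d1), (a2, d2)
```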

4 Case study

4.1 The dataset

The data used in this study were collected from Twitter's free API during April 2011. The dataset contains 7,275 Tweets posted by 4,731 distinct Twitter users from April 8 to 27, 2011. All the Tweets in the dataset mention the Royal Bank of Canada (RBC) by including either 'RBC' or both 'Royal Bank' and 'Canada'.

4.2 Day-to-day analysis

Table 1 lists the number of Tweets and the key terms for each date in the period of study. According to the results reported in the table, there is a peak in the number of Tweets on April 12. Table 2 shows the selected Tweet for each date. Based on the selected Tweet, we can see that the peak on April 12 was caused by a note to investors about iPad sales from a managing director at RBC Capital Markets.

Table 1 Day-to-day analysis result, number of Tweets and key terms of the day
Table 2 Day-to-day analysis result, selected Tweet of the day

4.3 Network-based analysis

We first build the network of users based on the similarity of the Tweets using $H = 48$. Then we refine the network by removing isolated nodes. Isolated nodes are like outliers: they correspond to users whose Tweets are not similar to anyone else's Tweets. The refined network has 2,481 nodes that are grouped into 411 communities by the community detection algorithm. To avoid clutter, we only consider the top 20 largest communities, which encompass 1,199 nodes. Figure 1 shows the top communities in different colors. Table 3 lists the size and density of the communities.

Fig. 1 Network-based analysis, top 20 largest communities shown in different colors

Table 3 Network-based analysis, size and density of the top communities

Tables 4 and 5, respectively, list the key terms and selected Tweet of each community.

Table 4 Network-based analysis, key terms of each community
Table 5 Network-based analysis, selected Tweet of each community

4.4 Discussion of the results

Taking a closer look at the results reveals major conflicts between the results reported by the two methods. For instance, the selected Tweet of $C_6$ does not appear among the selected Tweets of Table 2, and the selected Tweet of the 10th is not listed as any community's selected Tweet. If we count the number of Tweets containing the corresponding key terms, we find that there are 53 Tweets containing all of the terms 'Toronto', 'manager', and 'senior', while there are only 18 Tweets containing 'young', 'percent', and 'home'. However, the first set of Tweets was posted sporadically from the 13th to the 26th, while the second set was published in a three-day period from the 9th to the 11th, 14 of which were posted on the 10th. In other words, the selected Tweet of the 10th represents a small burst of Tweets corresponding to the release of a survey by RBC. Although the event did not attract much attention from the users (only 18 Tweets posted), it was identified as the selected content of the day because of the absence of another trend in the small number of Tweets (193) posted on the 10th. In contrast, the selected Tweet of $C_6$ represents a major trend in the Tweets that was overlooked by the day-to-day analysis because the Tweets that created the trend were posted over a period of several days. Notice that the density of a community gives us a clue about the dispersion of its Tweets over time. Normally, low-density communities correspond to trends that occurred over a relatively long period of time, while high-density communities are formed by sudden bursts of Tweets.

Figure 2 shows the activity history of the communities and individual users during the period of the analysis. An interesting observation from this figure is the high rate of activity of the users in communities $C_2$ and $C_3$ on the 12th. This observation reveals another deficiency of the day-to-day analysis: one selected Tweet per day can only reflect one trend per day. That is why the selected post of $C_2$ is neglected by the first analysis, although Fig. 1 shows that $C_2$ is a separate large and dense community with a limited number of connections to $C_3$.

Fig. 2 Activity history of communities and individual users

In conclusion, the day-to-day analysis is a good approach for summarizing the documents on a daily basis and observing the evolution of ideas over time. However, when the goal is to gain a deep insight into the major ideas expressed over a period of time, the network-based analysis is more effective. In addition, the network-based method generates a meaningful division of users into communities, and each community reveals an interesting perspective shared mainly by its members.

5 Conclusion and future work

In this paper, we presented two analysis methods for summarizing a high volume of social media data about a specific subject. We investigated the scalability of the methods and, to evaluate them, conducted a case study on a dataset of Tweets that mention the Royal Bank of Canada. We also compared the results produced by the two methods and explained the discrepancies. The study confirmed the viability of the proposed analysis methods and demonstrated their effectiveness in producing good summarizations as guidance for an effective decision-making process. For future work, we plan to study the evolution of the network of Tweet authors. Our goal is to find out how authors move between communities, how new authors join the network, and which users lose interest in the subject and become inactive over time. We want to investigate how decision makers benefit from the outcome of the study, how new trends arising from the decisions taken would affect the communities, and whether the communities would shrink or grow. We also want to study the effect of the reaction to social media back on social media, i.e., how Tweeters react to positive or negative reactions from decision makers: do all Tweets within the same community share the same trend in their reaction, or would the communities shuffle in response to the decisions from management? Finally, we consider the possibility of using metadata, e.g., location, hashtags, and retweets, to enhance the analysis.