1 Introduction

Social data collected on a large scale have advanced our understanding of society and ourselves during this century [1–3]. Drastic changes in information technology enable us to collect various types of communication data, such as records of mobile phone and face-to-face contacts. Hidden patterns of human activity are being uncovered, and these results are beginning to be used to solve real social problems.

Web-based data attract many scientists, particularly because they are easy to collect on a large scale and reflect various real social phenomena such as political, market, financial, and daily events. Examples include Facebook, the world’s largest social networking service [4–6]; the Google and Yahoo! search engines [7–9]; Wikipedia, a multilingual online encyclopaedia [10–14]; YouTube, an online video site [15]; and Twitter, a microblogging platform on which users can post messages of up to 140 characters [16–22].

Because large amounts of diverse social media data are available, studies of social media have increased substantially. In particular, as demand for new applications grows, many researchers have focused on the connections between real-world phenomena and social media. For example, Mandel et al. examined emotional changes in 65,000 tweets over the half month when Hurricane Irene approached the U.S. and found that the increase in the level of concern depended on region and gender [23]. However, little is known about long-term emotional changes in social media.

There are many ways to detect emotions in texts. Examples include Linguistic Inquiry and Word Count (LIWC) [24], the Positive and Negative Affect Schedule (PANAS) [25], Affective Norms for English Words (ANEW) [26], and the Profile of Mood States (POMS) [27, 28]. Furthermore, emoticons (a combination of ‘emotion’ and ‘icon’) such as ‘:)’ and WordNet, a thesaurus, have been used to detect emotions [29–32]. However, the resulting polarity (i.e., positive or negative) is known to differ depending on the method [33]. Additionally, most of these methods are designed for English; there is no well-established dictionary for detecting emotions in Japanese texts.

In this paper, we use more than 3.2 billion Japanese blog posts covering the 6 years from 1 November 2007 as typical Japanese texts. Here we simply use the full set of Japanese adjectives to detect emotions. Our observation period includes the Great East Japan Earthquake in 2011, which is said to have changed the social mood qualitatively. First, we describe our data and methods in Sect. 26.2. Next, we compare the relative frequencies of adjectives before and after the quake and draw co-occurrence networks of the adjectives to visualize social emotions in Sect. 26.3. A summary and discussion are given in Sect. 26.4.

2 Data and Methods

2.1 Data

We studied Japanese blog data from 1 November 2007 to 31 October 2013 (2192 days), collected with an Internet service called ‘Kuchikomi@kakaricho’ (footnote 1). This fee-based service provides its own web site and an application programming interface (API). The API returns the daily number of blogs in which a given target word appears during a given period. A blog is counted if it contains a post in which the target word occurs at least once; thus, if one blog post includes a target word multiple times, the API counts it once. The API also provides a spam filter with three levels (weak, middle and strong), depending on the desired spam detection accuracy. Here, we use the middle level of the spam filter to collect data. The full API database contains more than 3.2 billion blog posts from 38 million accounts.

Using the API, we search for the adjectives listed in the dictionary of MeCab (footnote 2), a Japanese morphological analyzer, in their original surface forms. Because Japanese adjectives conjugate in various ways, we search only for the original (dictionary) form for simplicity; there are 1741 adjectives in total. Subsequently, we merge the time series of adjectives that have the same pronunciation and meaning. Because Japanese uses three character sets (Hiragana, Katakana and Kanji, i.e. Chinese characters) instead of an alphabet, people tend to write the same word, with the same pronunciation and meaning, in different surface forms. This procedure yields 839 adjectives. Finally, we removed adjectives with extremely low (fewer than 10 times per day) or extremely high (more than 100,000 times per day) frequency, resulting in a total of 550 adjective time series.
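The merging and filtering steps described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the `canonical_of` mapping, and the use of the mean daily count as the filtering criterion are hypothetical choices, since the text does not specify how the thresholds were applied or how surface forms were matched.

```python
def merge_surface_forms(series_by_form, canonical_of):
    """Sum the daily counts of surface forms that share a pronunciation and meaning.

    series_by_form: dict mapping a surface form to its list of daily blog counts.
    canonical_of:   dict mapping a surface form to a canonical key (assumed to be
                    prepared by hand or from a dictionary; hypothetical input).
    """
    merged = {}
    for form, counts in series_by_form.items():
        key = canonical_of.get(form, form)  # unmapped forms stand for themselves
        if key not in merged:
            merged[key] = list(counts)
        else:
            merged[key] = [a + b for a, b in zip(merged[key], counts)]
    return merged


def filter_by_mean_daily_count(merged, low=10, high=100_000):
    """Keep words whose mean daily count lies in [low, high].

    Using the mean over the whole period is an assumption made for this sketch.
    """
    kept = {}
    for word, counts in merged.items():
        mean = sum(counts) / len(counts)
        if low <= mean <= high:
            kept[word] = counts
    return kept


# Toy data: two spellings of one adjective, one rare word, one very common word.
series = {
    "かわいい": [5, 7, 9],
    "可愛い": [10, 10, 10],
    "rare_word": [1, 2, 3],
    "common_word": [200_000, 200_000, 200_000],
}
canonical = {"可愛い": "かわいい"}
merged = merge_surface_forms(series, canonical)
kept = filter_by_mean_daily_count(merged)
```

Note that simply summing the counts can count a blog twice if it uses two spellings of the same word on the same day; whether that matters depends on how the counts are obtained.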

2.2 z-Test for the Equality of Two Proportions

In this research, we apply a z-test for the equality of two proportions to determine whether the occurrence \(x_{i}(t)\) of adjective \(i\) differs between two periods \(T_{0}\) and \(T_{1}\). The test statistic \(z_{i}\) is calculated as follows:

$$\displaystyle{ z_{i} = \frac{\left \vert \frac{x_{i}(T_{0})} {X(T_{0})} -\frac{x_{i}(T_{1})} {X(T_{1})} \right \vert } {\sqrt{X_{i } \left (1 - X_{i } \right )\left [ \frac{1} {X(T_{0})} + \frac{1} {X(T_{1})}\right ]}} }$$
(26.1)

where \(X(t)\) is the total number of blogs at time \(t\) and the pooled proportion \(X_{i}\) is calculated as follows:

$$\displaystyle{ X_{i} = \frac{x_{i}(T_{0}) + x_{i}(T_{1})} {X(T_{0}) + X(T_{1})} }$$
(26.2)

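Under the null hypothesis that the two proportions are equal, the statistic (before taking the absolute value) approximately follows a standard normal distribution. Therefore, we can calculate a p-value from \(z_{i}\) to test whether the proportion of an adjective differs between the two periods.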
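As a minimal sketch, the test can be written with the Python standard library as below. It assumes the standard pooled two-proportion z-test, in which the standard error uses the total blog counts \(X(T_{0})\) and \(X(T_{1})\), and a two-sided p-value; the function name and the toy counts are hypothetical.

```python
import math


def two_proportion_z(x0, n0, x1, n1):
    """Pooled two-proportion z-test.

    x0, x1: number of blogs containing the adjective in periods T0 and T1.
    n0, n1: total number of blogs in periods T0 and T1.
    Returns (z, p), where z is the absolute z-score and p the two-sided p-value.
    """
    p0 = x0 / n0
    p1 = x1 / n1
    pooled = (x0 + x1) / (n0 + n1)  # X_i in Eq. (26.2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n0 + 1.0 / n1))
    z = abs(p0 - p1) / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(z)).
    p_value = math.erfc(z / math.sqrt(2.0))
    return z, p_value


# Toy numbers, not taken from the data set.
z_same, p_same = two_proportion_z(50, 1000, 50, 1000)
z_diff, p_diff = two_proportion_z(120, 10_000, 80, 10_000)
```

With the very large blog counts used here, even small differences in proportion can be significant, so the direction and size of the change should be read alongside the p-value.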

2.3 Partial Correlation Coefficient

To visualize the change in social emotions at the time of the quake, we construct adjective co-occurrence networks for the pre- and post-quake periods. First, we calculate Pearson’s linear correlation coefficient \(r_{ij}\) between \(y_{i}(t)\) and \(y_{j}(t)\) for adjectives \(i\) and \(j\), where \(y_{i}(t)\) is the day-to-day difference of the normalized time series,

$$\displaystyle{ y_{i}(t) = \frac{x_{i}(t + 1)} {X(t + 1)} -\frac{x_{i}(t)} {X(t)}. }$$
(26.3)

Pearson’s linear correlation coefficient \(r_{ij}\) between adjectives \(i\) and \(j\) is calculated as follows:

$$\displaystyle{ r_{ij} = \frac{\frac{1} {T}\sum _{t}\left (y_{i}(t) -\langle y_{i}\rangle \right )\left (y_{j}(t) -\langle y_{j}\rangle \right )} {\sigma _{i}\sigma _{j}} }$$
(26.4)

where \(\sigma _{i}\) and \(\sigma _{j}\) are the standard deviations of \(y_{i}(t)\) and \(y_{j}(t)\), respectively. Here, we use the three weeks before and after the quake for the comparison, giving \(T = 21\) data points per period. A positive correlation between adjectives \(i\) and \(j\) is significant at the 0.01 significance level if \(r_{ij}\) is greater than 0.5487 for \(T = 21\).
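The differencing and correlation steps can be sketched as follows, using only the standard library. The function names are hypothetical. The helper `critical_r` uses the standard relation between a critical t-value and a critical correlation, \(r_{c} = t_{c}/\sqrt{t_{c}^{2} + \mathrm{df}}\); the t-value 2.8609 (two-sided 0.01 level, 19 degrees of freedom) is taken from a t-table and is an input assumed for this sketch.

```python
import math


def differenced_share(x, X):
    """y(t) = x(t+1)/X(t+1) - x(t)/X(t), as in Eq. (26.3)."""
    shares = [xi / Xi for xi, Xi in zip(x, X)]
    return [later - earlier for earlier, later in zip(shares, shares[1:])]


def pearson(u, v):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def critical_r(t_crit, df):
    """Critical correlation corresponding to a critical t-value with df degrees of freedom."""
    return t_crit / math.sqrt(t_crit ** 2 + df)


# Toy series, not taken from the data set.
y = differenced_share([1, 2, 4], [10, 10, 10])
r_up = pearson([1, 2, 3], [2, 4, 6])
r_down = pearson([1, 2, 3], [3, 2, 1])
r_c = critical_r(2.8609, 19)  # T = 21 points, df = T - 2
```

The normalization factors cancel between the covariance and the two standard deviations, so the sums can be left unnormalized in `pearson`.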

To extract the more essential links from the co-occurrence network, we use the partial correlation coefficient \(r_{ij}^{k}\), defined as follows:

$$\displaystyle{ r_{ij}^{k} = \frac{r_{ij} - r_{ik} \cdot r_{jk}} {\sqrt{1 - r_{ik }^{2}}\sqrt{1 - r_{jk }^{2}}}. }$$
(26.5)

\(r_{ij}^{k}\) is the partial correlation between \(i\) and \(j\) with \(k\) held fixed. It is equivalent to the correlation between the residuals of \(y_{i}(t)\) and \(y_{j}(t)\) after removing the linear dependence of each on \(y_{k}(t)\).

Normally, the partial correlation \(D_{ij}\) with the effects of all other adjectives removed can be calculated by inverting the correlation matrix. However, this is not possible here, because we have only \(T = 21\) data points for each time series but 550 adjectives, so the correlation matrix is singular. Therefore, for each pair of adjectives \(i\) and \(j\) we calculate \(D_{ij}\) by removing the effect of each third adjective \(k\) one at a time and checking the following condition:

$$\displaystyle{ D_{ij} = \left \{\begin{array}{@{}l@{\quad }l@{}} \max r_{ij}^{k}\quad &(\mbox{ if $\forall k,r_{ij}^{k} \geq D_{c}$}) \\ 0 \quad &(\text{otherwise}), \end{array} \right. }$$
(26.6)

where \(D_{c} = 0.5613\) corresponds to the 0.01 significance level for the partial correlation. To emphasize significant partial correlations, we use the maximum value of \(r_{ij}^{k}\) in this research.
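A minimal sketch of the filter in Eq. (26.6) is given below. It assumes the correlation coefficients are available as a symmetric list-of-lists matrix `r`; the function names and the toy matrix are hypothetical.

```python
import math


def partial_corr(r_ij, r_ik, r_jk):
    """First-order partial correlation of i and j given k, as in Eq. (26.5).

    Raises ZeroDivisionError if |r_ik| or |r_jk| equals 1.
    """
    return (r_ij - r_ik * r_jk) / math.sqrt((1.0 - r_ik ** 2) * (1.0 - r_jk ** 2))


def filtered_link(i, j, r, d_c=0.5613):
    """D_ij from Eq. (26.6).

    Returns the maximum of r_ij^k over all third nodes k if every r_ij^k
    is at least d_c, and 0.0 otherwise.
    """
    values = [
        partial_corr(r[i][j], r[i][k], r[j][k])
        for k in range(len(r))
        if k != i and k != j
    ]
    if values and all(v >= d_c for v in values):
        return max(values)
    return 0.0


# Toy 3x3 correlation matrix: nodes 0 and 1 are strongly correlated,
# node 2 is only weakly correlated with both.
r = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
d_01 = filtered_link(0, 1, r)
d_02 = filtered_link(0, 2, r)
```

For 550 adjectives this check costs on the order of \(550^{3}\) evaluations of `partial_corr`, which is still feasible but is better vectorized in practice.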

3 Results

3.1 Adjectives in Emergency and Normal Periods

We focus on relative changes in adjectives before and after the quake by calculating \(z_{i}\), as shown in Eq. (26.1). Here, we define the pre-quake period \(T_{0}\) as the three weeks before the quake, from 18 February to 10 March 2011. Similarly, we define the post-quake period \(T_{1}\) as the three weeks after the quake, from 12 March to 1 April 2011.

According to the calculation of \(z_{i}\) for all 550 adjectives, 72 adjectives increased and 74 adjectives decreased significantly at the 0.01 significance level. Tables 26.1 and 26.2 show the top 10 increased and decreased adjectives, respectively. Figure 26.1 shows examples of increased and decreased adjectives.

Fig. 26.1

Daily number of blogs including ‘brand-new’, ‘lonely’, and ‘impatient’ (from top to bottom). The daily total number of blogs \(X(t)\) is shown at the bottom. Solid lines are for 2011 and dashed lines are for 2010, for comparison. Note that the sudden increase around 20 March 2010 for ‘brand-new’ is caused by spam blogs; we confirmed that it disappears when we search with the strong level of the spam filter

Table 26.1 Adjectives that increased significantly (p < 0.01) after the quake
Table 26.2 Adjectives that decreased significantly (p < 0.01) after the quake

We found that adjectives such as ‘impatient’, which express users’ feelings of frustration, increased considerably (Table 26.1, #1, 4, 5, 6). These adjectives have different surface forms but similar meanings, e.g. the frustrated feeling of wanting to help others but being unable to do so.

The usage of words such as ‘heartless’ and ‘shameless’ also increased significantly (Table 26.1, #7, 8, 10). Some people behaved selfishly, buying up food and bottled water despite the severe shortages after the quake. Blog posts therefore included complaints about this behavior, which increased the use of these adjectives.

On the other hand, ‘brand-new’, ‘earlier’, and ‘cannot wait’ in Table 26.2 (#1, 2, 5) decreased significantly. These words are often used in a positive sense in anticipation of new seasons such as spring and of new goods such as movies. In fact, many companies canceled or postponed releases of new products and events. For example, the opening ceremony of the Kyushu Shinkansen was canceled, the release of the iPad 2 (a popular tablet device) was postponed (footnote 3), and many scheduled movie releases were canceled or postponed (footnote 4). As a result, people had fewer occasions to use such words.

Furthermore, we found that adjectives related to taste, such as ‘salty-sweet’ and ‘bitter’, decreased significantly (Table 26.2, #6, 8, 9). These decreases may reflect the so-called ‘self-restraint mood’, in which people refrained from holding outdoor parties, such as the annual cherry blossom viewing parties, and from going to restaurants. Consequently, words regarding taste, e.g. in restaurant and cooking reviews, may have decreased. Compared with the increased adjectives, these decreased adjectives seem to be related more to social activities than to emotions. Therefore, estimating economic conditions from adjectives is an interesting topic for future work.

3.2 Adjective Changes in the Co-occurrence Network

We constructed co-occurrence networks of adjectives for the pre-quake period \(T_{0}\) and the post-quake period \(T_{1}\). The node size indicates the relative frequency of a word compared with the entire period, and the color corresponds to the community to which the node belongs. Communities are determined by maximizing the modularity \(Q\), defined as follows:

$$\displaystyle{ Q = \frac{1} {2M}\sum _{i,j=1}^{N}\left (A_{ ij} -\frac{k_{i}k_{j}} {2M} \right )\delta (c_{i},c_{j}), }$$
(26.7)

where \(N = 550\) and \(M\) are the total numbers of nodes and links in the network, respectively. \(A_{ij}\) is the weight of the link between nodes \(i\) and \(j\); here the weight is the correlation coefficient \(r_{ij}\) calculated in Eq. (26.4). \(k_{i} =\sum _{j}A_{ij}\) is the sum of the weights of the links attached to node \(i\), and \(c_{i}\) is the community to which \(i\) belongs. \(\delta (c_{i},c_{j})\) is 1 if \(i\) and \(j\) belong to the same community (\(c_{i} = c_{j}\)) and 0 otherwise. In this paper, we maximize \(Q\) for the undirected weighted network using the software Gephi (footnote 5; version 0.8.2), which detects communities with the algorithm introduced by Blondel et al. [34].
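To make Eq. (26.7) concrete, the sketch below evaluates \(Q\) for a given partition of a small weighted network. It does not perform the maximization carried out by Gephi, and it assumes the usual weighted convention in which \(2M\) is the total weight \(\sum _{i,j}A_{ij}\); the function name and the toy network are hypothetical.

```python
def modularity(A, communities):
    """Modularity Q of a partition for a symmetric weight matrix A.

    A:           list of lists, A[i][j] is the weight of the link between i and j.
    communities: list where communities[i] is the community label of node i.
    Assumes 2M equals the sum of all entries of A (weighted convention).
    """
    n = len(A)
    k = [sum(row) for row in A]  # weighted degree of each node
    two_m = sum(k)
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                q += A[i][j] - k[i] * k[j] / two_m
    return q / two_m


# Toy network: two disconnected pairs of nodes, 0-1 and 2-3.
A = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
q_split = modularity(A, [0, 0, 1, 1])   # each pair is its own community
q_single = modularity(A, [0, 0, 0, 0])  # everything in one community
```

Splitting the toy network into its two pairs gives a higher \(Q\) than placing all nodes in one community, which is the kind of comparison that a modularity-maximizing algorithm makes repeatedly.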

There are 3354 links among the 550 nodes during \(T_{0}\) using the ordinary correlation coefficient \(r_{ij}\) (Fig. 26.2, left) and 61 links using the partial correlation coefficient \(D_{ij}\) (Fig. 26.2, right). As expected, the node sizes are nearly the same, because there was no major news during the pre-quake period \(T_{0}\), and these networks show no special properties. There are 14 communities in the network. The largest community contains 16.36 % of the nodes and the second largest 14.55 %.

Fig. 26.2

Correlation networks of adjectives before the quake, \(T_{0}\). (Left) Links are drawn on the basis of the correlation coefficient from Eq. (26.4). (Right) Links are drawn on the basis of the partial correlation coefficient from Eq. (26.6). Nodes are colored by their community and sized by their relative appearances \(\langle x_{i}(T_{0})\rangle /\langle x_{i}\rangle\), where \(\langle x_{i}\rangle\) is the mean over the entire period and \(\langle x_{i}(T_{0})\rangle\) is the mean over the period \(T_{0}\)

In contrast, there are two major communities in the post-quake period \(T_{1}\). There are 6125 links among the 550 nodes in \(T_{1}\) using \(r_{ij}\) (Fig. 26.3, top) and 73 links using the partial correlation coefficient (Fig. 26.3, bottom). There are five communities in the network. The largest community contains 26.57 % of the nodes and the second largest 21.59 %. Thus, more nodes are grouped into the same communities than in the \(T_{0}\) period. One of these communities corresponds to the adjectives that increased significantly, as shown in the previous section.

Fig. 26.3

Correlation networks of adjectives after the quake, \(T_{1}\). (Top) Links are drawn on the basis of the correlation coefficient from Eq. (26.4). (Bottom) Links are drawn on the basis of the partial correlation coefficient from Eq. (26.6). There is one large group (red) consisting of nodes whose frequency increased soon after the quake. Nodes are colored by their community and sized by their relative appearances \(\langle x_{i}(T_{1})\rangle /\langle x_{i}\rangle\), where \(\langle x_{i}(T_{1})\rangle\) is the mean over the \(T_{1}\) period

4 Summary and Discussion

We observed social emotions in the Japanese blog space during an emergency period, specifically before and after the Great East Japan Earthquake. We focused on the relative changes in adjective frequencies. We found that feelings of frustration, such as wanting to help others but having no means to do so, increased considerably. Accordingly, adjectives such as ‘impatient’, ‘sorry’, and ‘frustrating’ increased.

To visualize the change in social mood around the quake, we constructed adjective co-occurrence networks. We found a clear topological difference between the pre- and post-quake periods, including one large community of increased adjectives in the post-quake networks. This result suggests that people tended to share similar emotions in the post-quake period.

Our results were derived from a limited case study of Japanese social media during the 3.11 Earthquake. Nevertheless, they have novelty and potential, because before the advent of social media it was not possible to accurately record short messages from ordinary people during emergencies.

Rumors and false information are said to have spread via mobile phones and social media during the quake [35]. Moreover, psychologists have observed, on the basis of experiments with limited numbers of subjects, that rumors are more likely to diffuse in emergency situations [36] and when people feel anxious [37]. Investigating social emotions during emergency periods therefore has the potential to provide useful warnings regarding the diffusion of false information.

We expect that data assimilation, a combination of agent-based simulation and real data analysis, will help prevent potential secondary disasters such as false information diffusion and riots during emergencies. The Internet era has even been said to foster ‘digital wildfires in a hyperconnected world’ [38], similar to the spread of real wildfires. Our research sheds light on this universal problem and could help issue warning signals of potential digital wildfires. We hope that our results can be applied to prevent such problems in the future.