
1 Introduction

With the rapid development of the Internet and information technology, social networks such as Weibo, Facebook, Flickr, and Twitter have grown quickly and have become one of the main platforms through which people spread information. Identifying the nodes that have great influence on information dissemination helps in analyzing how information spreads and evolves in social networks. Finding the guiders or pushers of online public opinion is of great significance for controlling and guiding that opinion, cracking down on network information crimes, and realizing viral marketing and word-of-mouth communication.

There are two main families of methods for evaluating the importance of nodes: methods based on centrality and methods based on information dissemination scale. Centrality-based methods evaluate a node from the network structure alone; the centrality value expresses how central the node is in the network. These methods mainly include Degree Centrality [1], Betweenness Centrality [2], Closeness Centrality [3], Eigenvector Centrality [4], etc. This kind of method is suitable for finding structurally important nodes, but not for evaluating the influence of nodes.

Methods based on information dissemination scale use an information dissemination model to simulate the dissemination process, calculate the dissemination scale of each node, and select the nodes with great influence. A greedy algorithm for calculating the propagation scale of nodes was first proposed by Kempe et al. [5]; it is very time consuming and only suitable for small networks. To reduce the computational complexity, Leskovec et al. [6] proposed the CELF algorithm, which exploits the submodularity of influence diffusion to avoid redundant calculation of the activation range. References [7,8,9,10,11] used heuristic strategies instead of Monte Carlo simulation to estimate the propagation scale and improve time efficiency. This kind of method finds influential nodes by directly measuring their information dissemination scale, but it is not suitable for evaluating the importance of nodes that disseminate influence indirectly.

The methods of Cao [15], Wang [16], Shang [17], Zhang [18] and other teams assume that communities are independent of each other. After dividing the network into communities, they find the node with the greatest local influence in each community, and then find the node with the greatest influence in the whole network. Tulu et al. [19] divided the network into communities and then computed a node's Shannon entropy, from the number of its neighbors inside and outside its community, as the node's importance. Zhao [20] measured the importance of a node by the number of communities it is connected to after dividing the network into communities. These methods either ignore the association between communities, or do not consider the relationship between a node and the different communities, or do not consider the influence of the communities themselves.

To deal with the above problems, we propose a node importance evaluation algorithm based on community influence (abbreviated as the IEBoCI algorithm). Its basic assumption is that the stronger the ability of the communities connected to a node to disseminate information, and the greater the influence of the node on those communities, the higher the importance of the node. The algorithm first calculates the activation probability of each node to other nodes; second, it divides the network into communities based on the LPA algorithm; third, it calculates the influence of each community and the influence degree of each node on its connected communities; finally, it calculates the importance of each node by combining the influence of the community itself with the influence of the node on the community.

2 IEBoCI Algorithm Framework

If a person has many friends in different communities of a social network, this person does not necessarily disseminate a large amount of important information directly, but can disseminate information indirectly within those communities through contacts with their members. Such a person therefore has a wide influence on information dissemination and plays a more important role in the network.

Based on this assumption, we believe that the influence of a node is related to the number and quality of the communities it is connected to. The more communities a node is connected to, the greater its influence and the higher its importance. The influence of a node is also related to the influence of the connected communities. Moreover, different nodes influence the same community to different degrees. If a node has little influence on a community, it is difficult for the node to influence the nodes in that community, and hard to spread information further through it. Therefore, evaluating the importance of a node requires considering both the influence of the community itself and the influence degree of the node on the community.

3 Algorithm Steps

A social network is a complex network. In this paper it is denoted as a directed network G = (N, E), where N is the set of nodes and E is the set of directed edges. The IEBoCI algorithm proposed in this paper operates on directed networks. The algorithm flow is shown in Fig. 1. The steps are as follows:

Fig. 1. Algorithm flow chart

  1. Calculate the activation probability of each node for its reachable nodes based on the information propagation model; this probability is used to divide communities and to calculate the influence range of communities and nodes.

  2. Divide the network based on the label propagation algorithm to obtain the community structure of the network.

  3. Calculate the influence range of each community according to the activation probabilities of the nodes.

  4. Calculate the expected number of nodes in each community that a node activates, and from it obtain the influence degree of the node on the community.

  5. Calculate the importance of each node by combining the results of the third and fourth steps.

3.1 Node Activation Probability Calculation

This paper calculates the information dissemination scale of nodes and communities based on the independent cascade model (IC model). The IC model is a probabilistic model: for every pair of neighboring nodes vi and vj in network G there is a probability p(vi, vj) ∈ [0, 1], randomly assigned, that the active node vi directly activates its neighbor vj. For non-adjacent nodes vi and vj, if vj is unreachable from vi, the probability of vi activating vj is 0, that is, p(vi, vj) = 0. If vj is reachable from vi, the probability that vi activates vj along a path is the product of the direct activation probabilities on each edge of the path [22]. If there are m paths from vi to vj and one of them is Path(vi, vj)x = <vi = v1, v2, …, vj = vk>, the probability Pp(vi, vj)x that node vi activates node vj along this path is calculated as follows:

$$ Pp(v_{i} ,v_{j} )_{x} = \prod\limits_{u = 1}^{k - 1} {p(v_{u} ,v_{u + 1} )} $$
(1)

where Pp(vi, vj)x is the probability that vi activates vj through Path(vi, vj)x and p(vu, vu+1) is the probability that node vu directly activates its neighbor vu+1. If different paths give different activation probabilities, the maximum over all m paths is taken as the probability p(vi, vj) that node vi activates the non-adjacent reachable node vj.
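The maximum over all paths in Eq. (1) does not require enumerating paths. Because every edge probability lies in [0, 1], a Dijkstra-style search that always expands the node with the highest probability found so far yields the maximum path product. The following is a minimal sketch under that observation, not the authors' implementation; the function name `activation_probabilities` and the nested-dictionary edge format are assumptions made for illustration.

```python
import heapq


def activation_probabilities(edges, source):
    """Return {target: p(source, target)} for every node reachable from source.

    edges: dict mapping node -> {out_neighbor: direct activation probability}.
    Adjacent targets keep their direct edge probability; non-adjacent targets
    get the maximum path product of Eq. (1).
    """
    best = {source: 1.0}
    heap = [(-1.0, source)]  # max-heap emulated with negated probabilities
    while heap:
        neg_p, u = heapq.heappop(heap)
        p_u = -neg_p
        if p_u < best.get(u, 0.0):
            continue  # stale heap entry, a better path was already found
        for v, p_uv in edges.get(u, {}).items():
            candidate = p_u * p_uv
            if candidate > best.get(v, 0.0):
                best[v] = candidate
                heapq.heappush(heap, (-candidate, v))
    # The text defines p for adjacent nodes as the edge probability itself.
    for v, p_direct in edges.get(source, {}).items():
        if v != source:
            best[v] = p_direct
    best.pop(source)
    return best
```

For example, with edges a→b (0.5), b→c (0.4), a→c (0.1) and c→d (0.5), node d is reached from a with probability 0.5 × 0.4 × 0.5 = 0.1 via b, which exceeds the 0.05 obtained via the direct edge a→c.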

3.2 Community Detection

We divide the network into communities based on the label propagation algorithm (LPA) to obtain the set of communities in the network. The original LPA is designed for undirected, unweighted networks, whereas the social network constructed in this paper is directed. Therefore, when updating the label of node vj, only the in-neighbors of vj are considered, and the label with the highest total activation probability among them is selected. The steps are as follows:

  • step1: Label initialization: each node in the network is assigned a unique label l, which represents the community in which the node is located;

  • step2: Determining the node order for asynchronous label updating: calculate the degree of each node, and sort the nodes in descending order of degree;

  • step3: Updating the labels of nodes: following the order from step2, the labels are updated one node at a time, and the label of node vj is set to the label with the maximum sum of activation probabilities among its in-neighbors. The label updating formula is as follows:

$$ l_{{v_{j} }} = \arg \mathop {\hbox{max} }\limits_{l} \sum\limits_{{i \in IN(v_{j} )}} {p(v_{i} ,v_{j} )\delta (l_{{v_{i} }} ,l)} $$
(2)

\( l_{{v_{j} }} \) represents the label of node vj to be updated, \( l_{{v_{i} }} \) represents the label of node vi, IN(vj) represents the set of nodes with out-edges to node vj, p(vi, vj) represents the activation probability from node vi to node vj, and \( \delta (l_{{v_{i} }} ,l) \) is the Kronecker delta function, which equals 1 when \( l_{{v_{i} }} = l \) and 0 otherwise. When more than one label attains the maximum sum of activation probabilities, one of them is selected at random as the new label of the node.

  • step4: Termination judgment: check whether the label of every node in the network is a label with the largest sum of activation probabilities among its in-neighbors. If not, step3 is repeated; if so, the calculation terminates, and nodes with the same label belong to the same community.
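The four steps and Eq. (2) can be sketched as follows. This is an illustrative sketch rather than the authors' code: the name `weighted_lpa`, the `in_probs` and `degree` dictionaries, the `max_rounds` safety cap, and the choice to keep a node's current label when it is already among the best labels are all assumptions added here.

```python
import random
from collections import defaultdict


def weighted_lpa(nodes, in_probs, degree, max_rounds=100, seed=0):
    """Directed, probability-weighted label propagation following Eq. (2).

    nodes:    iterable of node ids.
    in_probs: dict mapping node vj -> {in_neighbor vi: p(vi, vj)}.
    degree:   dict mapping node -> degree, used only to fix the update order.
    Returns a dict mapping node -> community label.
    """
    rng = random.Random(seed)
    labels = {v: v for v in nodes}  # step1: unique initial labels
    order = sorted(nodes, key=lambda v: degree[v], reverse=True)  # step2
    for _ in range(max_rounds):  # cap is an assumption to guarantee termination
        changed = False
        for vj in order:  # step3: asynchronous update
            score = defaultdict(float)
            for vi, p in in_probs.get(vj, {}).items():
                score[labels[vi]] += p
            if not score:
                continue  # no in-neighbors: the node keeps its own label
            top = max(score.values())
            winners = [lab for lab, s in score.items() if s == top]
            if labels[vj] in winners:
                continue  # already a maximal label (assumed tie handling)
            labels[vj] = rng.choice(winners)  # random tie-break among maxima
            changed = True
        if not changed:  # step4: every label is maximal, so stop
            break
    return labels
```

Nodes that share a label in the returned dictionary form one community.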

3.3 Evaluation of Influence of Communities

In this paper, the expected number of network nodes activated by a community is taken as the influence of the community. The steps are as follows: first, the activation probabilities between nodes calculated in Sect. 3.1 are used to calculate the joint activation probability of all nodes in the community for each node in the network; then the expected number of nodes activated by the community is calculated from the joint activation probabilities to obtain the influence of the community.

In the independent cascade model, whether a node activates another node and whether other nodes activate the node are independent events, so the joint activation probability Ps(Cl, vj) [23] of the community Cl to a node vj in the network is calculated according to the probability multiplication of the independent events, and the calculation formula is:

$$ Ps(C_{l} ,v_{j} ) = 1 - \prod\limits_{{v_{i} \in C_{l} }} {(1 - p(v_{i} ,v_{j} ))} $$
(3)

With the joint activation probability Ps(Cl, vj), the influence scale expectation of the community EXPs(Cl) is calculated as follows:

$$ EXPs(C_{l} ) = \sum\limits_{{v_{j} \in N}} {Ps(C_{l} ,v_{j} )} $$
(4)

Where N is the set of all nodes in the network.
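Eqs. (3) and (4) translate directly into a double loop. The sketch below is one way to write it, assuming activation probabilities are stored in a dictionary keyed by (source, target) pairs and that missing pairs, including a node paired with itself, count as probability 0; the function name `community_influence` is hypothetical.

```python
def community_influence(community, all_nodes, p):
    """Expected number of nodes activated by a community, EXPs(C_l) of Eq. (4).

    community: iterable of member nodes of C_l.
    all_nodes: iterable of all nodes N in the network.
    p:         dict mapping (vi, vj) -> activation probability p(vi, vj);
               missing pairs are treated as 0 (an assumption of this sketch).
    """
    members = list(community)
    total = 0.0
    for vj in all_nodes:
        # Probability that no member of the community activates vj.
        miss = 1.0
        for vi in members:
            miss *= 1.0 - p.get((vi, vj), 0.0)
        total += 1.0 - miss  # Ps(C_l, vj) of Eq. (3)
    return total
```

For a community {a, b} with p(a, c) = p(b, c) = 0.5, the joint probability of activating c is 1 − 0.5 × 0.5 = 0.75, which is higher than either member achieves alone.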

3.4 Evaluation of Influence of Nodes on Communities

The expected number of nodes in a community activated by a node indicates how many nodes in the community the node can successfully activate, while the number of nodes in the community indicates its total scale. The greater the proportion of the community's nodes that a node can activate, the greater the influence of the node on the community. Therefore, this paper takes the ratio of the expected number of community nodes activated by a node to the number of nodes in the community as the influence degree of the node on the community.

Restricting the nodes activated by node vi to the community Cl gives the expected number \( EXPn(v_{i} ,C_{l} ) \) of nodes in Cl activated by vi. It equals the sum of the probabilities that each node in Cl is successfully activated by vi, and is calculated as follows:

$$ EXPn(v_{i} ,C_{l} ) = \sum\limits_{{v_{j} \in C_{l} }} {p(v_{i} ,v_{j} )} $$
(5)

Where Cl represents the set of all nodes in the community Cl.

\( EXPn(v_{i} ,C_{l} ) \) represents the expected number of nodes that node vi can activate in community Cl. The ratio of this expectation to the total number of nodes n(Cl) in community Cl is the influence degree \( Inf(v_{i} ,C_{l} ) \) of node vi on community Cl. The formula is:

$$ Inf(v_{i} ,C_{l} ) = \frac{{EXPn(v_{i} ,C_{l} )}}{{n(C_{l} )}} $$
(6)
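As a small worked sketch of Eqs. (5) and (6), assuming the same (source, target) probability dictionary as an illustrative data layout and a hypothetical function name:

```python
def influence_on_community(vi, community, p):
    """Influence degree Inf(vi, C_l) of Eq. (6).

    vi:        the node being evaluated.
    community: collection of member nodes of C_l (must be non-empty).
    p:         dict mapping (vi, vj) -> activation probability; missing
               pairs are treated as 0 (an assumption of this sketch).
    """
    members = list(community)
    # Eq. (5): expected number of members of C_l activated by vi.
    expected_activated = sum(p.get((vi, vj), 0.0) for vj in members)
    # Eq. (6): normalize by the community size n(C_l).
    return expected_activated / len(members)
```

If node a activates b with probability 0.5 and c with probability 0.25 in a three-node community {b, c, d}, then EXPn = 0.75 and Inf = 0.75 / 3 = 0.25.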

3.5 Node Importance Evaluation

Node vi can spread influence indirectly through the communities it is directly connected to. We sum the influence of the communities that node vi can use in this way, each weighted by the influence degree of vi on that community, to obtain the importance I(vi) of node vi:

$$ I(v_{i} ) = \sum\limits_{{C_{l} \in Com(v_{i} )}} {Inf(v_{i} ,C_{l} )EXPs(C_{l} )} $$
(7)

Where Com(vi) represents the set of communities in which node vi and its out-neighbors are located.
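Putting Eqs. (3) to (7) together, the importance of a single node can be sketched as below. The data layout (a label per node, a member list per label, an out-neighbor map, and a pairwise probability dictionary with missing pairs treated as 0) and the name `node_importance` are assumptions for illustration, not the authors' implementation.

```python
from math import prod


def node_importance(vi, out_neighbors, label, communities, all_nodes, p):
    """Importance I(vi) of Eq. (7).

    out_neighbors: dict mapping node -> iterable of out-neighbors.
    label:         dict mapping node -> community label.
    communities:   dict mapping community label -> list of member nodes.
    all_nodes:     list of all nodes N.
    p:             dict mapping (vi, vj) -> activation probability.
    """
    # Com(vi): communities containing vi or one of its out-neighbors.
    connected = {label[vi]} | {label[u] for u in out_neighbors.get(vi, ())}
    importance = 0.0
    for lab in connected:
        members = communities[lab]
        # Eqs. (5)-(6): influence degree of vi on this community.
        inf = sum(p.get((vi, vj), 0.0) for vj in members) / len(members)
        # Eqs. (3)-(4): expected number of nodes the community activates.
        exps = sum(
            1.0 - prod(1.0 - p.get((u, vj), 0.0) for u in members)
            for vj in all_nodes
        )
        importance += inf * exps
    return importance
```

A node that reaches a large, influential community with high probability therefore scores higher than a node with many weak links to small communities.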

4 Experimental Results and Discussions

4.1 Experimental Data and Initial Setup

The experimental data used in this paper are commonly used public social network data sets downloaded from the Internet. The name, scale and description of each network are shown in Table 1.

Table 1. Basic information of datasets

The edges in the Facebook data set represent friend relationships, so the resulting network is undirected. In this experiment, each edge is converted into two directed edges to turn the undirected network into a directed one. The edges in the mail data set represent communication relationships: each communication forms a directed edge from the sender node to the receiver node.

To simulate the dissemination of information on the network, this paper uses node vi as the initially activated node and takes the number of nodes that vi can affect as a measure of its information dissemination capability, referred to as the influence scale. The commonly used independent cascade model (IC model) [21] is adopted to simulate the influence propagation process and calculate the influence scale of nodes.

The activation probability p between adjacent nodes is randomly assigned a value between 0 and 1 when the network is constructed from a data set. Because the independent cascade model is stochastic, different runs may give different influence scales. Therefore, for each node taken as the initially activated node, the influence scale is calculated 50 times and the arithmetic mean is taken as the final result.
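The averaging procedure described above can be sketched as a small Monte Carlo simulation of the IC model. The function names, the nested-dictionary edge format, the fixed random seed, and the decision to count activated nodes without the seed node itself are assumptions of this sketch; the paper does not state whether the seed is counted.

```python
import random


def simulate_ic(edges, seed_node, rng):
    """Run one independent-cascade simulation and return the number of
    nodes activated besides the seed node.

    edges: dict mapping node -> {out_neighbor: activation probability}.
    """
    active = {seed_node}
    frontier = [seed_node]
    while frontier:
        next_frontier = []
        for u in frontier:
            # Each newly active node gets one chance per inactive neighbor.
            for v, p_uv in edges.get(u, {}).items():
                if v not in active and rng.random() < p_uv:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active) - 1


def influence_scale(edges, seed_node, runs=50, seed=0):
    """Arithmetic mean of the cascade size over `runs` simulations."""
    rng = random.Random(seed)
    return sum(simulate_ic(edges, seed_node, rng) for _ in range(runs)) / runs
```

Calling `influence_scale(edges, v)` for every node v gives the per-node influence scale against which importance values can be compared.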

4.2 The Division and Influence of Communities

After the networks are constructed, the two data sets are divided into communities according to the activation probabilities between nodes, using the modified LPA algorithm described in Sect. 3.2. The division results are shown in Tables 2 and 3.

Table 2. Communities of Facebook data sets
Table 3. Communities of mail data sets

The community detection results show that the structural characteristics of the Facebook and mail networks are quite different. Facebook has 4039 nodes divided into 46 communities; the number of communities is small and the communities are large, which indicates that the network is closely connected. The mail network has 1866 nodes, which are divided into 532 communities. The number of communities is large, but few of them are large in scale; most contain only 1 or 2 nodes, which shows that the network is sparse.

After community detection, the influence of each community is calculated according to the method proposed in this paper; the results are shown in Fig. 2. Figure 2(a) shows the influence of communities in the Facebook data. The number of communities is small, and the size of communities (the number of members) varies greatly. Overall, the larger the community, the greater its influence. Once the influence of a community reaches a certain level (more than 90% of all nodes affected), further increases become less and less pronounced, which is expected because few nodes remain to be activated. Figure 2(b) shows the influence of communities in the E-Mail data. In general, the same trend holds: the larger the community, the greater its influence. Because the network contains a large number of 1-node and 2-node communities, the points in the lower left corner of the plot are relatively dense, and communities of the same size do not necessarily have the same influence.

Fig. 2. Influence of communities

4.3 The Importance of Node

Using the method proposed in this paper, the importance I and the influence scale of all nodes in the data sets are calculated and compared with a method that measures node importance indirectly through communities, namely the number of directly connected communities (V-community [20]). The out-degree (Degreeout) and betweenness of the nodes are also reported for reference.

Distribution of Node Importance.

After calculating the importance of nodes, we counted the number of nodes at each level of importance in the Facebook and E-Mail data sets, using bins of width 100 for Facebook and bins of width 20 for E-Mail. The resulting distributions are shown in Fig. 3. The two distributions differ: the distribution for Facebook is positively skewed, while the distribution for mail follows a power law. The difference stems from the different characteristics of the two data sets. The Facebook network is a directed network transformed from an undirected one, so every node has an out-degree greater than 0 and can transmit information to other nodes. The mail network is constructed from communication relationships, and many of its nodes receive mail but never send it; these nodes have an out-degree of 0 and do not disseminate information outward. The large number of such nodes produces a large number of low-importance nodes, so the distribution of importance follows a power law.

Fig. 3. Node importance statistics

Comparison of Node Importance and Number of Directly Connected Communities.

The distribution of the influence scale of nodes in the two data sets is shown in Fig. 4.

Fig. 4. Node influence scale statistics

In Fig. 4, (a) counts the nodes of the Facebook data by influence scale with bins of width 100, and (b) counts the nodes of the E-Mail data by influence scale with bins of width 20.

Most of the nodes in the Facebook dataset have a large scale of influence, with 1746 nodes in the [3900, 4000) interval and 1649 nodes in the [3800, 3900) interval. Facebook dataset has 4,039 nodes, of which three-quarters can affect 95% of the network. This result is also related to the close connection of nodes in the data set. The density of the network is high, the average outdegree of nodes is also large, and the influence spread range of most nodes is large.

The influence scale of most nodes in the E-Mail data set is very small: 921 nodes fall in the [0, 20) interval. This result reflects the fact that nodes in this data set are not closely connected. There are 1866 nodes in the data set, of which 800 have an out-degree of 0. These nodes cannot transmit information outward, so nodes with a low influence scale account for the majority.

The relationship between node out-degree, betweenness and node influence scale is shown in Fig. 5. In each chart, the X axis represents the node out-degree or betweenness for the corresponding data set, and the Y axis represents the node influence scale. Figures 5(a) and 5(b) show almost no correlation between node influence scale and node out-degree. Figures 5(c) and 5(d) likewise show almost no correlation between node influence scale and node betweenness.

Fig. 5. Relationship between outdegree, betweenness and node influence scale

The relationship between the number of directly connected communities (V-community), node importance and node influence scale is shown in Fig. 6. In each chart, the X axis represents V-community or node importance for the corresponding data set, and the Y axis represents the node influence scale. Figures 6(a) and 6(b) show a certain correlation between the influence scale of nodes and the number of communities they are directly connected to. Nodes connected to many communities have a larger influence scale, but nodes with a large influence scale are not necessarily connected to many communities. Figures 6(c) and 6(d) show a strong correlation between node influence scale and node importance. The X axis in Fig. 6(c) spans a large range and the points are dense on the left, so for ease of observation the part of the distribution with importance between 0 and 5000 is shown separately in Fig. 7. As Fig. 7 shows, nodes with low importance may still have a high influence scale, whereas nodes with high importance consistently have a high influence scale. In Fig. 6(d), the correlation between node influence scale and node importance is more obvious, and the points are essentially scattered between two oblique lines passing through the origin (marked in the figure). This strong correlation shows that the high-importance nodes found by this method have a high influence scale.

Fig. 6. Relationship between V-community, node importance and node influence scale

Fig. 7. Node importance and node influence scale in Facebook dataset

The data of the top ten nodes in Facebook data set are shown in Table 4, and the data of the top ten nodes in mail data set are shown in Table 5.

Table 4. Top ten nodes in Facebook data set in node importance
Table 5. Top ten nodes of node importance in E-Mail data set

Tables 4 and 5 show that nodes with high I have a very high influence scale. For these nodes, however, the Degreeout, the Betweenness and the V-community are not necessarily high, and some are even very low.

In conclusion, the influence of nodes in these networks shows almost no correlation with structural features such as node degree and betweenness. The method in this paper evaluates the importance of nodes based on community influence. Nodes with higher importance have greater influence and can promote information dissemination, and the method can find nodes whose structural characteristics are not prominent but whose actual influence is large.

5 Conclusion

Identifying nodes that have greater influence on the dissemination of information is an active research topic in social network analysis. However, few existing algorithms for evaluating node importance study the influence of communities on information dissemination. In this paper, a node importance evaluation algorithm based on community influence is proposed. After the network is divided into communities, the influence of each community and the influence of nodes on communities are evaluated, and the importance of each node is obtained by combining the two. The experimental results show that the nodes rated as highly important by the algorithm have high influence in the network and can promote the dissemination of information, and that the algorithm can find nodes with high influence but without prominent structural characteristics.

As a next step, we plan to incorporate content features and user behavior features of social networks into the computation of node importance.