An Influence Maximization Algorithm Based on the Influence Propagation Range of Nodes

The influence maximization problem in a social network G is to find k seed nodes with the maximum influence, that is, a seed set S whose influence spreads more widely in G than that of any other node set of the same size. The influence of a node is usually estimated with the Independent Cascade (IC) model, using a large number of Monte Carlo simulations to approximate it. An approximation ratio of (1 − 1/e) is obtained when the number of Monte Carlo simulations is 10000 and the probability of propagation is very small. In this paper, we show that the propagation range of the influence of a node set is limited in the IC model, and we find that the influence of a node spreads only as far as its t′-th neighbors. We therefore propose a greedy algorithm based on an improved IC model that considers influence only within the t′-th neighbors of a node. Finally, we perform experiments on 10 real social networks and achieve favorable results.


Introduction
The influence maximization problem in a social network G is to find a seed set S of size k, k = |S|, with the maximum influence. With the development of social networks, this problem has attracted many researchers and has been applied in many fields. Its most classic application is viral marketing, the process of marketing products to acquaintances through the "word of mouth" effect [Guille, Hacid, Favre et al. (2013); Goldenberg, Libai and Muller (2001)] in a social network. Customers are more inclined to accept products recommended by acquaintances than by strangers [Hill, Provost and Volinsky (2006); Sadovykh, Sundaram and Piramuthu (2015); Schmitt, Skiera and Van den Bulte (2011)]. In recent years, social software such as Facebook, Twitter, and Weibo has come to exist on almost everyone's mobile phone and computer. Social networks are therefore widely used by commercial companies to promote products: a company chooses some initial users, that is, the seed set S, so as to obtain the maximum commercial value from marketing its products. The influence maximization problem was first proposed by Kempe et al. [Kempe, Kleinberg and Tardos (2003)], who proved that selecting the k seeds with the maximum influence is NP-hard and proposed a simple greedy algorithm to approximate the optimal solution. When the simple greedy algorithm uses the IC model to simulate the influence of nodes, the probability of propagation is assumed to be very small; we therefore conclude that the range of influence of a node is limited. Moreover, to find the seed set S, the simple greedy algorithm uses the IC model to simulate the influence of almost all nodes in every iteration, so its time complexity is very high.
In order to reduce the time complexity of the simple greedy algorithm, Leskovec et al. [Leskovec, Krause, Guestrin et al. (2007)] proposed the CELF algorithm for selecting the most influential seed set. CELF uses a "lazy-forward" optimization strategy, which exploits the fact that the marginal benefit of a node in the current iteration is never greater than its marginal benefit in the previous iteration. CELF therefore reduces the number of influence calculations and was reported to be up to 700 times faster than the simple greedy algorithm. Chen et al. [Chen, Wang and Yang (2009)] proposed the DegreeDiscountIC algorithm, whose time complexity is much lower than that of the simple greedy algorithm. Chen et al. [Chen, Wang and Wang (2010)] proposed the PMIA algorithm, which builds a local arborescence structure for each node and uses it to calculate the influence of the node. PMIA offers high accuracy and low time complexity, but because it must build a local arborescence structure for every node, its space complexity is high. Jung et al. [Jung, Heo and Chen (2012)] proposed the IRIE algorithm, where IR is the influence ranking algorithm and IE is the influence estimation algorithm; IRIE is an effective algorithm for finding the seed set with the maximum influence. Xia et al. [Xia, Song, Jing et al. (2018)] showed, by constructing a double-layer network and using Markov chain theory, that the scale of disease spreading can be reduced by increasing the rate of spreading. Wang et al. [Wang, Ju, Gao et al. (2018)] presented a novel coverage control algorithm based on particle swarm optimization to improve the coverage rate and reduce consumption. Li et al. [Li, Li, Chen et al. (2018)] constructed an interest graph built by Gaussian graphical modeling to select seeds.
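The lazy-forward idea behind CELF can be illustrated with a short sketch. This is not the implementation of Leskovec et al.; it is a minimal illustration assuming a caller-supplied, monotone and submodular function `spread(S)` (a hypothetical name) that returns the estimated influence of a node set. Stale marginal gains are kept in a priority queue and recomputed only when a node reaches the top of the queue.

```python
import heapq


def celf(nodes, k, spread):
    """Lazy-forward greedy selection (illustrative sketch, not the original code).

    spread(S) is a hypothetical callable returning the influence of node set S;
    it is assumed to be monotone and submodular.
    Returns the chosen seeds and the number of spread evaluations performed.
    """
    # Heap entries: (-marginal_gain, tie_breaker, node, seed_count_when_computed)
    heap = [(-spread({v}), i, v, 0) for i, v in enumerate(nodes)]
    heapq.heapify(heap)
    evaluations = len(heap)
    seeds, current = [], 0.0
    while heap and len(seeds) < k:
        neg_gain, i, v, stamp = heapq.heappop(heap)
        if stamp == len(seeds):
            # The gain is up to date for the current seed set, so v is the best node.
            seeds.append(v)
            current += -neg_gain
        else:
            # The gain is stale: recompute it and push the node back.
            gain = spread(set(seeds) | {v}) - current
            evaluations += 1
            heapq.heappush(heap, (-gain, i, v, len(seeds)))
    return seeds, evaluations
```

Because marginal gains can only shrink as the seed set grows, a node whose refreshed gain still tops the queue can be selected without re-evaluating any other node, which is where the saving over the simple greedy algorithm comes from.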
Previous studies have tried to reduce the time complexity of searching for the most influential seed set. In the influence maximization problem, the IC model is used to simulate influence propagation, and the simple greedy algorithm is used to obtain the seed set with the maximum influence. In general, simulating influence propagation with the IC model inside the simple greedy algorithm has high time complexity, because the influence of almost every node must be recalculated with the IC model in almost every iteration. Since the probability of propagation p is usually very small, we conclude that the range over which a node spreads its influence in each iteration is limited.
In this paper, we present a simple greedy algorithm based on an improved IC model. We found that the influence of a node simulated with the IC model always stops spreading at the t′-th neighbors of the node. We therefore limit the range of influence of a node simulated by the IC model; in other words, we confine the influence of the node to its t′-th neighbors to reduce the time complexity of the IC model, and our algorithm is adaptive [Zhang, Zheng and Xia (2018)]. We perform experiments on 10 real social networks, and the proposed algorithm obtains good results.
Description of the problem

Influence maximization problem
We define the social network as G = (V, E), where V is the set of user nodes in G and E is the set of relationships between user nodes, with n = |V| and m = |E|. We define the probability of propagation in G as p; that is, the weight of each edge in E is p. The influence maximization problem is to find the seed set S in G, with k = |S|, that has the maximum influence.
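Under this notation, the problem can be written compactly as the optimization below. The symbol S* for the optimal seed set is introduced here only for illustration, and σ(S) is assumed to denote the expected number of nodes activated by the seed set S (the operator `\operatorname*` requires the amsmath package).

```latex
S^{*} = \operatorname*{arg\,max}_{S \subseteq V,\; |S| = k} \sigma(S)
```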

Independent cascade model
We use the independent cascade model to simulate the influence of a node [Chen, Fan, Li et al. (2015); Liu, Cong, Xu et al. (2012); Goldenberg, Libai and Muller (2001)] and thus measure its ability to propagate influence. The model works as follows. Every node in the network G is in one of two states, active or inactive. A_i denotes the set of nodes activated at time i. In the initial phase, at time t = 0, A_0 = S: the nodes in the seed set S are active and all remaining nodes are inactive. At time t = i, for any edge (u, v) ∈ E with u ∈ A_{i−1} and v inactive, u attempts to activate v with the probability of propagation p. If the attempt succeeds, v joins A_i and can attempt to activate its own neighbors from time i + 1. If the attempt fails, u cannot try to activate v again at any later time. If node v has l newly activated neighbors, v is activated with probability 1 − (1 − p)^l. When A_i is empty, that is, when no node is activated at time t = i, the propagation process ends, and the number of nodes activated over the whole process is the influence of the seed set S [Kimura, Saito, Nakano et al. (2010); Page, Brin, Motwani et al. (1999)].
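The propagation process described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code; the adjacency-dictionary layout of `graph` and the function name `simulate_ic` are assumptions made for the example.

```python
import random


def simulate_ic(graph, seeds, p, rng=None):
    """Run one independent cascade and return the set of activated nodes.

    graph: dict mapping a node to a list of its out-neighbors (assumed layout).
    seeds: the initially active nodes, A_0 = S.
    p:     uniform propagation probability on every edge.
    """
    rng = rng or random.Random()
    active = set(seeds)
    frontier = set(seeds)  # A_{i-1}: nodes activated in the previous step
    while frontier:        # the process ends when no node was newly activated
        newly_active = set()
        for u in frontier:
            for v in graph.get(u, ()):
                # u gets exactly one attempt on each still-inactive neighbor v
                if v not in active and v not in newly_active and rng.random() < p:
                    newly_active.add(v)
        active |= newly_active
        frontier = newly_active
    return active
```

The influence of a seed set in a single run is `len(simulate_ic(graph, seeds, p))`; averaging this value over many runs gives the Monte Carlo estimate used by the greedy algorithm.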

Simple greedy algorithm
We define the influence function σ(v) as the influence of the node v; in this paper, its value is the number of nodes activated by v. We also define a function IC(·) that simulates the influence of a node set using the IC model; its value is the number of nodes activated by the node set. The key to the simple greedy algorithm is the submodular property [Cornuejols, Fisher and Nemhauser (1977); Williams (1990)]. Let f(·) be a function that maps a node set to a non-negative value. f(·) is submodular if, for any S ⊆ T ⊆ V and any node v ∉ T, f(S ∪ {v}) − f(S) ≥ f(T ∪ {v}) − f(T); that is, the marginal benefit of adding a node does not increase as the set grows. Relying on this property, when selecting the node with the maximum influence we add to the seed set the node with the largest marginal benefit in the current iteration. The steps of the simple greedy algorithm are shown in Tab. 1.

Table 1: Simple Greedy Algorithm
Algorithm 1: Simple Greedy Algorithm
Input: social network G = (V, E), number of iterations R, and the number of seed nodes k
Output: seed set S
(1) Initializing seed set S = ∅
(2) for i = 1 : k
(3)   for each node v ∈ V\S
(4)     s_v = 0
(5)     for j = 1 : R
(6)       s_v = s_v + IC(S ∪ {v})
(7)     end for
(8)     σ(v) = s_v / R
(9)   end for
(10)  S = S ∪ {v | v = argmax(σ(v))}
(11) end for
(12) Output seed set S
The input of Algorithm 1 is the social network G = (V, E), the number of Monte Carlo simulations R, and the size of the seed set k. The output is the seed set S with the maximum influence. In Step (1), the seed set S is initialized to the empty set, S = ∅. Steps (2) to (11) form the loop that finds the k seed nodes, with Step (2) as the outer loop. Step (3) traverses each node that is not in the seed set S. In Step (4), s_v, which stores the accumulated influence of the set S ∪ {v}, is initialized to 0. In Steps (5) to (7), Monte Carlo simulation is used to simulate the influence of the node, and the number of iterations is R. In Step (6), IC(S ∪ {v}) is the influence of the set S ∪ {v} under the IC model, that is, the number of nodes activated by S ∪ {v}; this value is added to s_v, so that s_v accumulates the influence of S ∪ {v} over the R iterations. In Step (8), the influence of the set S ∪ {v} is approximated by averaging s_v over the R iterations. Step (10) finds the node v that maximizes σ(v) in the current iteration and adds it to the seed set. In Step (12), the algorithm outputs the k seed nodes, that is, the seed set S.
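The steps described above can be sketched in Python as follows. The sketch is illustrative only: `spread` is a hypothetical caller-supplied callable that stands in for one run of IC(·) and returns the number of nodes activated by a node set, and the function name `greedy_seed_selection` is an assumption of the example.

```python
def greedy_seed_selection(nodes, k, R, spread):
    """Select k seeds by the largest Monte Carlo estimate of sigma (sketch).

    nodes:  iterable of candidate nodes (V).
    R:      number of Monte Carlo simulations per candidate.
    spread: hypothetical callable; spread(node_set) returns the number of
            nodes activated in ONE simulation run started from node_set.
    """
    nodes = list(nodes)
    seeds = []
    for _ in range(k):
        best_node, best_sigma = None, float("-inf")
        for v in nodes:
            if v in seeds:
                continue
            s_v = 0
            for _ in range(R):
                s_v += spread(set(seeds) | {v})  # one run of IC(S ∪ {v})
            sigma_v = s_v / R                    # average over the R runs
            if sigma_v > best_sigma:
                best_node, best_sigma = v, sigma_v
        if best_node is None:                    # fewer than k nodes available
            break
        seeds.append(best_node)
    return seeds
```

The three nested loops make the cost visible: every one of the k rounds runs R simulations for each remaining node, which is why the simple greedy algorithm is expensive on large networks.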

Detailed process of the algorithm
While studying the IC model and the simple greedy algorithm, we found that the simple greedy algorithm needs to calculate the influence of every node except the seed nodes in every iteration, so its time complexity is very high. Generally speaking, the influence maximization problem sets the probability of propagation between nodes to a small value. We therefore conclude that the range of influence of a node in the social network is extremely limited, and we improve the traditional IC model by limiting this range: in the improved IC model, the influence of a node is confined to its t′-th neighbors, which greatly reduces the time complexity of the IC model. Because the scope of influence is limited, we set the number of Monte Carlo simulations R to 100 to avoid unnecessary iterations. The detailed steps of the improved IC model are shown in Tab. 2.
Algorithm 2: Improved IC Model
Input: social network G = (V, E), node set to be simulated NS, propagation probability p, the range of propagation t′
Output: activated node set A
(1) Initializing A = NS, A_now = NS, t = 1
(2) while (t ≤ t′ and A_now ≠ ∅)
(3)   A_now uses the principle of the IC model to activate its neighbors, and the set of activated neighbors is temp
(4)   A_now = temp
(5)   A = A ∪ temp
(6)   t = t + 1
(7) end while
(8) Output A
The input of Algorithm 2 is the social network G, the node set NS to be simulated, the probability of propagation p, and the range of propagation t′. The output is the node set A activated by NS. In Step (1), A is the set of nodes that have been activated, with initial value A = NS; A_now is the set of nodes activated in the current iteration, with initial value A_now = NS; and t is the iteration counter, with initial value 1. Steps (2) to (7) form the iterative framework in which NS activates the remaining nodes. The iteration terminates if t > t′ or A_now == ∅: NS affects its neighbors successively from the 1st to the t′-th and at most reaches its t′-th neighbors, and if NS has not yet reached its t′-th neighbors but the set of nodes activated in the current iteration is empty, A_now == ∅, the iteration also terminates. In Step (3), A_now uses the principle of the IC model to activate its neighbors, and the activated neighbors are added to the set temp; in the first iteration, A_now is equal to NS. In Step (4), temp is assigned to A_now, so that A_now holds the nodes that attempt activations in the next iteration. In Step (5), the nodes activated in the current iteration are added to the set of activated nodes, A = A ∪ temp. In Step (6), the iteration counter is incremented by one. In Step (8), Algorithm 2 ends and outputs the activated node set A.
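A minimal Python sketch of the improved IC model follows. It mirrors the steps described above under two assumptions that are not part of the paper: the graph is given as a dictionary of out-neighbor lists, and a single uniform probability p is used on every edge. The names `improved_ic` and `t_max` (standing for t′) are hypothetical.

```python
import random


def improved_ic(graph, node_set, p, t_max, rng=None):
    """Depth-limited independent cascade (illustrative sketch).

    graph:    dict mapping a node to a list of its out-neighbors (assumed layout).
    node_set: the node set NS whose influence is simulated.
    p:        uniform propagation probability.
    t_max:    the range of propagation t'; influence stops at the t'-th neighbors.
    """
    rng = rng or random.Random()
    activated = set(node_set)  # A
    current = set(node_set)    # A_now
    t = 1
    while t <= t_max and current:
        temp = set()
        for u in current:
            for v in graph.get(u, ()):
                if v not in activated and v not in temp and rng.random() < p:
                    temp.add(v)
        current = temp         # these nodes attempt activations in the next round
        activated |= temp      # A = A ∪ temp
        t += 1
    return activated
```

Compared with an unrestricted cascade, the only change is the extra bound `t <= t_max` in the loop condition, so each simulation touches at most the nodes within t′ hops of NS.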
Building on the simple greedy algorithm, we propose a simple greedy algorithm based on the improved IC model. The number of Monte Carlo simulations is R = 100, and the probability of propagation p is 1/degree_u; that is, when the node v tries to activate the node u, the activation probability must lie in (0, 1/degree_u]. The simple greedy algorithm based on the improved IC model is shown in Tab. 3.
Algorithm 3: Simple Greedy Based On Improved IC Model
Input: social network G = (V, E), number of iterations R, propagation probability p, the range of propagation t′, and the number of seed nodes k
Output: seed set S
(1) Initializing seed set S = ∅
(2) for i = 1 : k
(3)   for each node v ∈ V\S
(4)     s_v = 0
(5)     for j = 1 : R
(6)       s_v = s_v + Improve_IC(G, S ∪ {v}, p, t′)
(7)     end for
(8)     σ(v) = s_v / R
(9)   end for
(10)  S = S ∪ {v | v = argmax(σ(v))}
(11) end for
(12) Output seed set S
The input of Algorithm 3 is the social network G = (V, E), the number of Monte Carlo simulations R, the propagation probability p, the range of propagation t′, and the number of seed nodes k. The output of Algorithm 3 is the seed set S with the maximum influence. In Step (1), the seed set S is initialized to the empty set, S = ∅. Steps (2) to (11) form the loop that finds the k seed nodes, with Step (2) as the outer loop. Step (3) traverses all nodes except those in the seed set S. In Step (4), s_v, which stores the cumulative influence of the set S ∪ {v}, is initialized to 0. In Steps (5) to (7), Monte Carlo simulation is used to simulate the influence of the node, and the number of iterations is R. In Step (6), Improve_IC(G, S ∪ {v}, p, t′) simulates the influence of the set S ∪ {v} with the improved IC model; its value is the number of nodes activated by S ∪ {v}, and this value is added to s_v, so that s_v accumulates the influence of S ∪ {v} over the R iterations. In Step (8), the influence of the set S ∪ {v} is approximated by averaging s_v over the R iterations. In Step (10), the node v that maximizes σ(v) in the current iteration is added to the seed set. In Step (12), Algorithm 3 outputs the seed set S.
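An end-to-end sketch of the greedy selection with the depth-limited cascade is given below. It is a hedged illustration, not the authors' implementation. It assumes a directed adjacency dictionary `out_nbrs`, integer or otherwise sortable node labels, and interprets the probability 1/degree_u as one over the in-degree of the target node; the function names and the fixed random seed are likewise assumptions of the example.

```python
import random


def limited_cascade_size(in_deg, out_nbrs, seeds, t_max, rng):
    """One depth-limited cascade; edge (u, v) fires with probability 1 / in_deg[v]."""
    active, frontier, t = set(seeds), set(seeds), 1
    while t <= t_max and frontier:
        nxt = set()
        for u in frontier:
            for v in out_nbrs.get(u, ()):
                if v not in active and v not in nxt and rng.random() < 1.0 / in_deg[v]:
                    nxt.add(v)
        active |= nxt
        frontier, t = nxt, t + 1
    return len(active)


def greedy_improved_ic(out_nbrs, k, R=100, t_max=3, seed=0):
    """Greedy seed selection using the depth-limited cascade (sketch)."""
    rng = random.Random(seed)
    nodes = sorted(set(out_nbrs) | {v for vs in out_nbrs.values() for v in vs})
    in_deg = {v: 0 for v in nodes}
    for vs in out_nbrs.values():
        for v in vs:
            in_deg[v] += 1          # every reachable target has in_deg >= 1
    S = []
    for _ in range(min(k, len(nodes))):
        sigma = {}
        for v in nodes:
            if v in S:
                continue
            s_v = sum(limited_cascade_size(in_deg, out_nbrs, S + [v], t_max, rng)
                      for _ in range(R))
            sigma[v] = s_v / R      # Monte Carlo estimate of the spread of S ∪ {v}
        S.append(max(sigma, key=sigma.get))
    return S
```

For an undirected network, listing each edge in both directions in `out_nbrs` makes the in-degree equal to the ordinary degree. The defaults R = 100 and t_max = 3 follow the values used in the text.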

Experimental data set
The experiments were performed on 5 undirected and 5 directed social network datasets. The five undirected datasets are wiki-Vote [Leskovec, Huttenlocher and Kleinberg (2010)], facebook-combined [Mcauley and Leskovec (2012)], CA-GrQc [Leskovec, Kleinberg and Faloutsos (2007)], CA-HepTh [Gehrke, Ginsparg and Kleinberg (2003)], and CA-HepPh [Gehrke, Ginsparg and Kleinberg (2003)]. wiki-Vote contains the voting history of the Wikipedia community administrator elections; an edge between two nodes indicates a vote cast by a user or an administrator for an administrator. facebook-combined contains anonymized Facebook data, and an edge between two nodes represents the affiliation between users. CA-GrQc, CA-HepTh, and CA-HepPh are collaboration network datasets that record scientific collaborations between the authors of papers: if author i and author j published a paper together, the graph contains an undirected edge between i and j. The five directed datasets are email-Eu-core [Yin, Benson, Leskovec et al. (2017)], Political blogs [Adamic and Glance (2005)], soc-hamsterster [Dünker and Kunegis (2015)], rt-bahrain [Rossi and Ahmed (2015)], and soc-advogato [Massa, Salvetti and Tomasoni (2009)]. email-Eu-core is generated from the anonymized email data of a large European research institution; if i sent at least one email to j, there is an edge (i, j) in the network. Political blogs records the front-page hyperlinks between blogs in the context of the US election; a node represents a blog and an edge represents a hyperlink between two blogs. soc-hamsterster contains the social and family relationships between users of the hamsterster.com website. rt-bahrain is derived from Twitter's social and political portal data, and an edge indicates that a user sent a tweet. soc-advogato is the Advogato trust network; a node is an Advogato user, and a directed edge indicates a trust relationship. The topological attributes of all datasets are shown in Tab. 4, where n is the total number of nodes, m is the total number of edges, d_max is the maximum degree, d is the average degree, r is the assortativity coefficient, C is the clustering coefficient, and D is the network density.
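Several of these attributes can be computed directly from an edge list. The sketch below covers n, m, d_max, the average degree, and the density for an undirected simple graph, using the standard definitions d = 2m/n and D = 2m/(n(n − 1)); these formulas are assumed here, since the text does not state them, and the function name is hypothetical. The assortativity and clustering coefficients are omitted.

```python
def basic_graph_stats(edges):
    """Basic topological attributes of an undirected simple graph (sketch).

    edges: iterable of (u, v) pairs; self-loops and duplicate edges are ignored.
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    n = len(adj)
    if n < 2:
        raise ValueError("need at least one edge between two distinct nodes")
    m = sum(len(nbrs) for nbrs in adj.values()) // 2  # each edge is counted twice
    return {
        "n": n,
        "m": m,
        "d_max": max(len(nbrs) for nbrs in adj.values()),
        "d_avg": 2 * m / n,
        "density": 2 * m / (n * (n - 1)),
    }
```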

The range of node influence and algorithm analysis
We believe that, when the IC model is used to simulate the influence propagation of node sets, the propagation range of the influence of a node is limited, because the probability of propagation between nodes is always small. We therefore vary t′ from 1 to 10 and use Algorithm 3 to obtain a seed set of size 50 for each value. We then use the IC model to simulate the influence of each seed set selected by Algorithm 3 and observe the result. We experimented on 5 directed graphs and 5 undirected graphs. Fig. 1(a) shows the result for email-Eu-core; the horizontal axis is the value of t′ passed to Algorithm 3 and the vertical axis is the influence of the seed set. For example, t′ = 1 means that Algorithm 3 is run with t′ = 1 to find 50 seed nodes, whose influence is then simulated with the IC model. The influence of the seed set becomes stable at t′ = 3; there are some fluctuations after t′ = 6, but they are not large. We therefore conclude that t′ = 3 is sufficient for Algorithm 3 to obtain good results on email-Eu-core. Figs. 1(b) and 1(c) show the results for Political blogs and soc-hamsterster: the seed set has the maximum influence at t′ = 4, and there is no difference between the influence at t′ = 3 and t′ = 4. Fig. 1(d) shows the result for rt-bahrain, where the seed set has the maximum influence at t′ = 3. Fig. 1(e) shows the result for soc-advogato: the seed set has the maximum influence at t′ = 5; however, there is no difference between the influence at t′ = 3 and t′ = 4. Fig. 1(f) shows the result for wiki-Vote, where the seed set has the maximum influence at t′ = 9. Figs. 1(g) and 1(h) show the results for facebook-combined and CA-GrQc: the seed set has the maximum influence at t′ = 7, but the value is not very different from that at t′ = 3. Fig. 1(i) shows the result for CA-HepTh, where the seed set has the maximum influence at t′ = 3. Fig. 1(j) shows the result for CA-HepPh: the seed set has the maximum influence at t′ = 8, but there is little difference between the influence at t′ = 3 and t′ = 8. The experiments show that most datasets achieve good results with Algorithm 3 at t′ = 3, so we set t′ = 3 when using Algorithm 3.
We next compare Algorithm 3 with the Closeness [29], PageRank [30], Degree, and Random algorithms. The comparison results are shown in Fig. 2. Figs. 2(a) and 2(e) show the results for email-Eu-core and soc-advogato; the horizontal axis is the number of seed nodes and the vertical axis is the influence of the seed set. The influence achieved by Algorithm 3 is better than that of the other algorithms, whose influences are similar to one another. Figs. 2(b) and 2(c) show the results for Political blogs and soc-hamsterster: the influence achieved by Algorithm 3 is good, and among the other algorithms PageRank performs slightly worse than the rest. Fig. 2(d) shows the result for rt-bahrain, where Algorithm 3 is slightly better than the Closeness and Degree methods. Figs. 2(f), 2(i), and 2(j) show the results for wiki-Vote, CA-HepTh, and CA-HepPh: Algorithm 3 works better than the other algorithms, which have similar influence. Figs. 2(g) and 2(h) show the results for facebook-combined and CA-GrQc: the influence of every algorithm except Algorithm 3 fluctuates more strongly, and the influence achieved by Algorithm 3 is better than that of the other algorithms.
In this paper, we showed that, under the traditional IC model with a small transmission probability, the range of influence of a node is limited. The experiments indicate that the influence of the seed set reaches no farther than the 3rd neighbors. Therefore, when we use the IC model to simulate the influence of a node, we confine the influence to the 3rd neighbors, which reduces the time complexity. The seed set selected by Algorithm 3 showed good results.