On the Selection of Information Sources for Gossip Spreading

Information diffusion is efficient via gossip or rumor spreading in many of the next generation networks. It is of great importance to select some seed nodes as information sources in a network so as to maximize the gossip spreading. In this paper, we deal with the issue of the selection of information sources, which are initially informed nodes (i.e., seed nodes) in a network, for pull-based gossip protocol. We prove that the gossip spreading maximization problem (GSMP) is NP-hard. We establish a temporal mapping of the gossip spreading process using virtual coupon collectors by leveraging the concept of temporal network, further prove that the gossip spreading process has the property of submodularity, and consequently propose a greedy algorithm for selecting the information sources, which yields a suboptimal solution within 1 - 1 / e of the optimal value for GSMP. Experiments are carried out to study the spreading performance, illustrating the significant superiority of the greedy algorithm over heuristic and random algorithms.


Introduction
Information dissemination through networks is ubiquitous in the modern world [1][2][3][4][5][6], and gossip or rumor spreading is an efficient way of information diffusion: imagine that a rumor arises in a town and spreads like an epidemic among the whole population [7]. There are two atomic types of gossip protocols: in "pull," an uninformed node requests a message it does not yet possess from a randomly chosen neighbor, and in "push," an informed node sends its message to a randomly chosen neighbor. Gossip-based algorithms are simple, robust, flexible, and scalable and hence are promising for many of the next generation networks [8]. Existing applications are numerous, such as consensus and averaging problems in sensor networks [9, 10], ad hoc message routing [11], peer-to-peer (P2P) file distribution [12], and information dissemination in social networks [13].
Most existing analytical works on gossip spreading have dealt with (high-probability) upper bounds on the completion time, disregarding the choice of information sources [7, 8, 12, 13, 14, 15]. Given a static connected network graph, the completion time of a gossip protocol is the first time at which all the nodes are informed. Recently, the selection of information sources for gossip spreading has been treated in our previous work [16]. However, in [16], only the push-based gossip protocol was considered, and the complexity of the gossip spreading maximization problem was left open.
In this paper, we consider the problem of selecting information sources for gossip spreading, focusing on the pull-based gossip protocol. The information sources are initially informed nodes (i.e., seed nodes) in a network. We ask the following question: given a general network graph, the budgeted size of a seed set, and a constrained deadline, how should we pick the elements of the seed set, which will be endowed with an identical message to be propagated to the rest of the network, such that the expected number of nodes informed within the deadline is maximized? The gossip spreading maximization problem (GSMP) arises in many scenarios, especially when popular content is demanded by a group of nodes. In sensor networks, one needs to decide the deployment of key sensors, which detect emergencies and issue alarms together with the other hearing-and-forwarding sensors, so as to maximize the alarmed area as quickly as possible. In P2P networks, one needs to decide the best choice of seeds so as to maximize the file distribution within the delay tolerance. Besides, similar research topics can be found in viral marketing (a.k.a. influence maximization), detection of disease outbreaks, and opportunistic cellular traffic offloading [17][18][19].
Our contributions are as follows. (1) We prove that GSMP is NP-hard, which means that suboptimal solutions with performance guarantees should be sought, because polynomial-time algorithms have not yet been discovered for the class of NP-hard problems [20]. (2) We establish a temporal mapping of the pull-based gossip spreading process using virtual coupon collectors by leveraging the concept of temporal network, and we prove the submodularity of gossip spreading based on this temporal mapping method. Consequently, we propose a greedy algorithm for GSMP that yields a solution within $(1 - 1/e)$ of the optimal value. (3) We carry out extensive experiments to study the spreading performance, demonstrating that the greedy algorithm significantly outperforms heuristic and random algorithms.
Our work differs from previous works in several aspects. First, we deal with GSMP by selecting influential information sources, which differs from the analytical works on high-probability completion time (see, e.g., [8]). Second, we study GSMP in the area of gossip spreading and treat it using the coupon collecting and temporal mapping method, while the influence maximization problem is related to social influence diffusion models and is treated with the coin flipping and equivalent view method (see, e.g., [17]). Third, beyond our previous work [16], we focus on the "pull" model, rigorously prove the NP-hardness of GSMP, and leverage the new temporal mapping method to analyze the submodularity of gossip spreading.
The rest of this paper is organized as follows. Section 2 describes the pull-based gossip protocol and formulates the gossip spreading maximization problem. We analyze the complexity of GSMP and show its NP-hardness in Section 3. We recognize the submodularity in gossip spreading via a temporal mapping method and propose a suboptimal solution with performance guarantee for GSMP in Section 4. We carry out simulation experiments to study the spreading performance of our approach in Section 5. Finally, Section 6 concludes this paper.

System Model and Problem Formulation
In general, a directed network graph $G = (V, E)$ consists of a set of nodes $V$ and a set of directed edges $E$. Each node is either informed or uninformed, and a node is informed if and only if it possesses its desired message. For any pair of nodes $u, v \in V$, $u$ may send its message to $v$ if and only if the edge $(u, v) \in E$.
Time is slotted, and a message can be transferred from a sender to a receiver within one time slot, which is called a round throughout this paper. Regardless of the content that the information flow over the underlying network carries, we focus on only one message in our model.

Gossip Protocol.
The type of gossip spreading considered in this paper can be called multiple sources with single message [16]: initially the nodes in a seed set $S$ become informed with an identical message, and all the other nodes wish to receive a copy of that message. Any uninformed node $v$ can pull the message from one informed neighbor, which is connected to $v$ by an incoming edge in $E$. That informed neighbor can be either a seed node in $S$ or an initially uninformed node that has already obtained the message.
In each round, the nodes in the network $G$ contact their neighbors in the following manner: each of the uninformed nodes picks a partner uniformly at random from the set of all its neighbors (connected by incoming edges), oblivious of any state, history, or other nodes' choices. Once a partner is chosen, the uninformed node pulls its desired message from the chosen partner.
In any given round, an uninformed node can pull from only one partner; it becomes informed if its chosen partner already possesses the desired message and remains uninformed otherwise. Once a node gets informed in some round, it stays informed forever. All communications are assumed to be error-free, with error control coding and protocol overhead encapsulated by the physical-layer design, which are not considered herein.
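The protocol above can be sketched as a short simulation. This is only an illustrative sketch, not code from the paper: the function name `pull_gossip`, the dictionary `in_neighbors`, and the synchronous-update convention (a partner must have been informed before the current round starts) are assumptions made here for concreteness.

```python
import random

def pull_gossip(in_neighbors, seeds, deadline, rng):
    """Simulate pull-based gossip and return the set of nodes informed
    within `deadline` rounds.

    in_neighbors[v] lists the nodes u with an edge (u, v), i.e. the
    nodes that v may pull from (hypothetical data layout).
    """
    informed = set(seeds)
    for _ in range(deadline):
        newly = set()
        for v, nbrs in in_neighbors.items():
            if v in informed or not nbrs:
                continue
            partner = rng.choice(nbrs)   # uniform choice, oblivious of state
            if partner in informed:      # partner informed before this round
                newly.add(v)
        informed |= newly                # informed nodes stay informed
    return informed

# Directed path 0 -> 1 -> 2: every node has at most one in-neighbor,
# so the outcome is deterministic.
path = {0: [], 1: [0], 2: [1]}
after_one = pull_gossip(path, seeds={0}, deadline=1, rng=random.Random(0))
after_two = pull_gossip(path, seeds={0}, deadline=2, rng=random.Random(0))
```

On this path, node 2 cannot be informed in round 1 because its only partner, node 1, becomes informed in that same round.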

Problem Formulation.
Consider the sample space $\Omega$, where each sample specifies one possible realization of the gossip spreading process. Let $\omega$ denote one sample in $\Omega$, and let $\Pr[\omega]$ denote its occurrence probability. We are interested in the case where the gossip spreading process runs until a constrained deadline. Given a seed set $S$ of $k$ nodes and a deadline of $T$ rounds, the number of informed nodes under one sample $\omega \in \Omega$ by deadline $T$ is $\sigma_T(S \mid \omega)$. So the expected number of informed nodes (within $T$ rounds) is
$$\sigma_T(S) = \sum_{\omega \in \Omega} \Pr[\omega]\, \sigma_T(S \mid \omega). \tag{1}$$
Given a directed network graph $G = (V, E)$, the budgeted size $k$ of a seed set $S$, and a constrained deadline $T$, we wish to select the seed nodes in $S$ such that $\sigma_T(S)$ is maximized. This is called the gossip spreading maximization problem (GSMP) and is formally given by
$$\max_{S \subseteq V} \sigma_T(S) \quad \text{subject to } |S| \leqslant k. \tag{2}$$
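To make the objective concrete, the expectation in (1) can be approximated by averaging over simulated realizations, and (2) can be solved by exhaustive search on very small graphs. The sketch below is illustrative only; the names `simulate`, `estimate_sigma`, and `brute_force_gsmp` and the parameter `runs` are assumptions introduced here, and the exhaustive search is exponential in the seed budget, so it is useful only as a reference on toy instances.

```python
import itertools
import random

def simulate(in_neighbors, seeds, deadline, rng):
    """One realization of pull-based gossip; returns the number of informed nodes."""
    informed = set(seeds)
    for _ in range(deadline):
        newly = {v for v, nbrs in in_neighbors.items()
                 if v not in informed and nbrs and rng.choice(nbrs) in informed}
        informed |= newly
    return len(informed)

def estimate_sigma(in_neighbors, seeds, deadline, runs, rng):
    """Monte Carlo estimate of the expected number of informed nodes."""
    total = sum(simulate(in_neighbors, seeds, deadline, rng) for _ in range(runs))
    return total / runs

def brute_force_gsmp(in_neighbors, k, deadline, runs, rng):
    """Exhaustively search all seed sets of size k (toy instances only)."""
    candidates = itertools.combinations(sorted(in_neighbors), k)
    return max(candidates,
               key=lambda s: estimate_sigma(in_neighbors, s, deadline, runs, rng))

# Star graph: leaves 1..3 can only pull from the center 0.
star = {0: [], 1: [0], 2: [0], 3: [0]}
rng = random.Random(1)
sigma_center = estimate_sigma(star, (0,), deadline=1, runs=20, rng=rng)
sigma_leaf = estimate_sigma(star, (1,), deadline=1, runs=20, rng=rng)
best = brute_force_gsmp(star, k=1, deadline=1, runs=20, rng=rng)
```

On the star, seeding the center informs every leaf in one round, whereas seeding a leaf informs nobody else, so the search returns the center.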

Complexity Analysis
GSMP belongs to the field of stochastic programming, and we show that it is NP-hard in this section.

Preliminary.
First of all, we consider the decision version of GSMP.
Problem 1 (gossip spreading decision problem). Given a network graph $G = (V, E)$, a constrained deadline $T$, and a utility quota $Q$, we wish to determine whether there exist $k$ of the nodes for the seed set $S$ such that the expected number of informed nodes $\sigma_T(S)$ is at least $Q$.
Let an instance of Problem 1 be denoted by $(V, E, T, k, Q)$. We see that $(V, E, T, k, Q)$ belongs to the class NP [20], since it can be validated in polynomial time given any solution of $k$ seed nodes. In order to argue the NP-hardness of GSMP, we will show that the following problem can be reduced to its decision version (i.e., $(V, E, T, k, Q)$).
Problem 2 (partial set cover problem). Given a ground set $U = \{u_1, u_2, \ldots, u_n\}$ and a collection $\mathcal{C} = \{C_1, C_2, \ldots, C_m\}$ of subsets of $U$, we wish to determine whether there exist $h$ of the subsets such that the cardinality of their union is at least $q$.
Let an instance of Problem 2 be denoted by $(U, \mathcal{C}, h, q)$. As the partial set cover problem generalizes the NP-complete set cover problem [20], it must be NP-complete.

GSMP is NP-Hard.
Next we show that GSMP is NP-hard using a reduction from the NP-complete $(U, \mathcal{C}, h, q)$ to $(V, E, T, k, Q)$. Note that a bipartite graph is leveraged in the following proof; similar techniques have been widely used in the complexity analysis literature, such as [17, 21, 22].

Theorem 3. The gossip spreading maximization problem (GSMP) for the pull-based gossip protocol is NP-hard.
Proof. Consider an arbitrary instance of the partial set cover problem $(U, \mathcal{C}, h, q)$ with $n$ elements $U = \{u_1, u_2, \ldots, u_n\}$ and $m$ subsets $\mathcal{C} = \{C_1, C_2, \ldots, C_m\}$ of $U$, and construct a directed bipartite graph $G^* = (V^*, E^*)$ as follows. The node set $V^*$ contains $m + n$ nodes, in which a node $v^C_i$ ($i = 1, \ldots, m$) corresponds to a subset $C_i$ and a node $v^U_j$ ($j = 1, \ldots, n$) corresponds to an element $u_j$. Each $v^C_i$ is called a subset node and each $v^U_j$ is called an element node hereafter. There is a directed edge from a subset node $v^C_i$ to an element node $v^U_j$ if $u_j \in C_i$; for example, see Figure 1. In the following, we will see that solving an arbitrary instance $(U, \mathcal{C}, h, q)$ of the partial set cover problem is equivalent to solving a special-case instance $(V^*, E^*, \infty, h, h + q)$ of the gossip spreading decision problem, and we assume $h < m$ without loss of generality.
If we can find $h$ of the subsets in $\mathcal{C}$ such that the cardinality of their union is at least $q$, then we will show that $h$ nodes can be found in $V^*$ for the seed set $S$ such that $\sigma_\infty(S) \geq h + q$ with the deadline $T$ being infinity. For each of these $h$ selected subsets for $(U, \mathcal{C}, h, q)$, we use the corresponding subset node in $V^*$ as a seed node; eventually, at least $q$ element nodes can pull the desired message from their subset nodes via gossip spreading given the infinite deadline; that is, $\sigma_\infty(S) \geq h + q$.
Conversely, if we can find $h$ of the nodes in $V^*$ for the seed set $S$ such that $\sigma_\infty(S) \geq h + q$, then we will show that $h$ subsets can be found in $\mathcal{C}$ such that the cardinality of their union is at least $q$. For each node $v$ of these $h$ selected seed nodes for $(V^*, E^*, \infty, h, h + q)$, if $v$ is an element node, then we replace it with a subset node as follows. If the subset node $v^C$ that points to $v$ either has already been selected as a seed node or has already replaced another seed node, then we replace $v$ with any other available subset node; otherwise, we replace $v$ with this subset node $v^C$. After all of those possible replacements, we have obtained $h$ subset nodes in $V^*$ as seed nodes, and $\sigma_\infty(S) \geq h + q$ is clearly still satisfied. Therefore, at least $q$ elements can be covered using $h$ subsets in $\mathcal{C}$, which correspond exactly to these $h$ subset nodes in $V^*$.
In total, if the gossip spreading decision problem can be solved, then the partial set cover problem can be solved as well; that is, the decision version of GSMP is at least as hard as the NP-complete partial set cover problem.
Remark 4. The above argument can also be applied to analyze the complexity of GSMP under the "push" model. Since GSMP is NP-hard and polynomial-time algorithms have not yet been discovered for the class of NP-hard problems [20], we should seek suboptimal solutions with performance guarantees.
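The construction used in the proof can be illustrated on a toy instance. The sketch below is an illustration under assumptions, not the authors' code: node labels such as `("C", i)` and `("U", e)` and the helper names are hypothetical. It relies on the fact that, in the bipartite graph, an element node is informed with probability 1 under an infinite deadline exactly when it is a seed or at least one of its in-neighbors is a seed.

```python
import itertools

def build_bipartite(ground, subsets):
    """Build in-neighbor lists of the bipartite graph: subset node -> element node."""
    in_neighbors = {("C", i): [] for i in range(len(subsets))}
    for e in ground:
        in_neighbors[("U", e)] = [("C", i) for i, c in enumerate(subsets) if e in c]
    return in_neighbors

def sigma_infinity(in_neighbors, seeds):
    """Number of informed nodes with an infinite deadline (valid for this
    one-hop bipartite graph only)."""
    seeds = set(seeds)
    pulled = {v for v, nbrs in in_neighbors.items()
              if v not in seeds and any(u in seeds for u in nbrs)}
    return len(seeds) + len(pulled)

def partial_cover_exists(subsets, h, q):
    """Brute-force answer to the partial set cover instance."""
    return any(len(set().union(*combo)) >= q
               for combo in itertools.combinations(subsets, h))

def gossip_decision(in_neighbors, h, quota):
    """Brute-force answer to the gossip spreading decision instance."""
    return any(sigma_infinity(in_neighbors, combo) >= quota
               for combo in itertools.combinations(in_neighbors, h))

ground = [1, 2, 3, 4]
subsets = [{1, 2}, {2, 3}, {4}]
graph = build_bipartite(ground, subsets)
spread = sigma_infinity(graph, [("C", 0), ("C", 1)])
```

Seeding the subset nodes for {1, 2} and {2, 3} covers three elements, so the spread is h + q = 2 + 3 = 5, and the two brute-force decisions agree on this instance for every quota.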

Submodularity and Greedy Algorithm
In this section, we establish a temporal mapping of the gossip spreading process using virtual coupon collectors by leveraging the concept of temporal network [23]. This treatment provides a tractable way to recognize the submodularity in gossip spreading and leads to a greedy algorithm which yields a solution to GSMP within $(1 - 1/e)$ of the optimal value.

Preliminary.
Before the analysis, we introduce the preliminaries on the temporal network, the shortest time-respecting path, and the live diffusion path.
A temporal network embodies the information of when events occur in dynamic systems [23]. For the case of gossip spreading, the edge between any two interacting nodes is endowed with the contact times at which these two nodes share the message. For example, in Figure 2, the weights on a directed edge indicate the rounds in which its tail node sends data to its head node (e.g., $t = 6, 7, 11$). The key is to consider the causality constraints of the time sequences of nodes' contacts [23]; for example, as illustrated in Figure 2, data cannot be relayed from one node to another through an intermediate node, even if both pairs of nodes have contacts, when the contacts on the second hop all occur before those on the first hop.
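Consider a directed temporal graph $G^\top = (V, E^\top, W^\top)$ for the gossip spreading process over a network $G = (V, E)$, where $W^\top$ is the set of weights on the set of directed edges $E^\top$, indicating the time information of nodes' contacts. According to the causality constraints of time sequences, a time-respecting path $P^\top_{uv}$ from $u$ to $v$ is given by
$$\langle P^\top_{uv} : v_0 = u, v_1, \ldots, v_l = v \rangle \ \text{with weights}\ w_{v_0, v_1}, w_{v_1, v_2}, \ldots, w_{v_{l-1}, v_l}, \tag{3}$$
where the weight $w_{v_i, v_{i+1}}$ is the time at which $v_i$ sends data to $v_{i+1}$, and the weights of successive edges on the path $P^\top_{uv}$ must be strictly increasing; that is,
$$w_{v_0, v_1} < w_{v_1, v_2} < \cdots < w_{v_{l-1}, v_l}.$$
Let $t_u$ be the informed time of node $u$, that is, the first time at which $u$ becomes informed; then for $P^\top_{uv}$ defined in (3), its length (i.e., the distance) is defined as
$$|P^\top_{uv}| = w_{v_{l-1}, v_l} - t_u.$$
In particular, the shortest time-respecting path $P^{\top *}_{uv}$ from $u$ to $v$ is given by
$$P^{\top *}_{uv} = \arg\min_{P^\top_{uv}} |P^\top_{uv}|.$$
Given a node set $S$, we say a node $v$ is reachable if either $v \in S$ or there exists a time-respecting path from one node in $S$ to $v$. The distance from $S$ to $v$ is
$$d(S, v) = \min_{u \in S} d(u, v),$$
in which $d(u, v)$ is the length of the shortest time-respecting path from the node $u$ to the node $v$, and $d(v, v) \equiv 0$. For an unreachable node $v$, $d(S, v) = \infty$.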
For an ordinary path $\langle P_{uv} : v_0 = u, v_1, \ldots, v_l = v \rangle$ on the network $G$, we say it is a live diffusion path if $u \in S$ becomes informed initially in round $t = 0$, $v_{i+1}$ succeeds in pulling its desired message from its neighbor $v_i$ for each $0 \leq i \leq l - 1$, and the considered node $v$ is finally informed. Note that $v$ is reachable from $S$ via the seed node $u$.
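The notions of time-respecting path and distance can be made concrete with a small earliest-arrival computation. This is a minimal sketch under assumptions: the temporal graph is stored as a hypothetical dictionary `contacts` mapping each directed edge to its list of contact times, seeds are taken to be informed at time 0, and the function name `earliest_arrival` is introduced here for illustration.

```python
import math

def earliest_arrival(contacts, seeds):
    """Length of the shortest time-respecting path from the seed set to every node.

    contacts[(u, v)] is the list of rounds in which u sends data to v.
    Unreachable nodes keep the distance math.inf.
    """
    arrival = {}
    for u, v in contacts:
        arrival.setdefault(u, math.inf)
        arrival.setdefault(v, math.inf)
    for s in seeds:
        arrival[s] = 0
    # Scan contacts in chronological order; a contact at time t is usable
    # only if its sender was reached strictly before t (causality).
    events = sorted(((t, u, v) for (u, v), times in contacts.items() for t in times),
                    key=lambda event: event[0])
    for t, u, v in events:
        if arrival[u] < t < arrival[v]:
            arrival[v] = t
    return arrival

# The b -> c contacts all precede the a -> b contacts, so c is unreachable from a.
blocked = earliest_arrival({("a", "b"): [6, 7, 11], ("b", "c"): [2, 4]}, seeds=["a"])
# With a later b -> c contact at time 8, c is reached at time 8.
open_path = earliest_arrival({("a", "b"): [6, 7, 11], ("b", "c"): [2, 8]}, seeds=["a"])
```

The first example mirrors the causality constraint discussed above: contacts exist on both hops, yet no time-respecting path exists because the second hop's contacts occur too early.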

A Temporal Mapping.
In the following, we establish a temporal mapping of the pull-based gossip spreading process by constructing a directed temporal graph. Note that all the weights on each edge in the temporal graph are absolute times, measured from the start of the gossip spreading process, and that the temporal mapping method used in this paper is different from the equivalent view method used in our previous work [16]. These two methods are not simply coupled, and the "pull" model brings new ingredients to GSMP.
Consider an arbitrary node $v$, which attempts to pull its desired message from its neighbors in each round from the beginning of round $t = 1$, as long as it is uninformed. Note that the pulling process of $v$ from its neighbors $N_v$ is exactly a coupon collecting process [24]; we denote this process by $\mathrm{CC}(v)$. In $\mathrm{CC}(v)$, $v$ has $|N_v|$ different coupons to collect, and in each round one of these coupons is collected uniformly and independently at random with replacement. For the node $v$, let $X_t(v)$ denote the stochastic process indicating the coupon collected in its $\mathrm{CC}(v)$ in round $t$ ($t \geq 1$). The event that a certain coupon $u$ is collected in round $t$ means that the corresponding neighbor $u = X_t(v)$ is pulled from in round $t$ by the node $v$. Note that in the above described $\mathrm{CC}(v)$, we do not care whether the message-pulling node $v$ already possesses the message or not.
Given the constrained deadline $T$, for each node $v$, we independently run $\mathrm{CC}(v)$ until the deadline $T$ is reached and record the time stamp of every collected coupon each time $v$ collects it. Therefore, the set $\mathcal{T}_{u,v}$ of time stamps for a neighbor $u$ which has been contacted by $v$ can be written as
$$\mathcal{T}_{u,v} = \{\, t : X_t(v) = u,\ 1 \leq t \leq T \,\}.$$
After all the coupon collecting processes $\{\mathrm{CC}(v), v \in V\}$ are completed, a directed temporal graph $G^\top = (V, E^\top, W^\top)$ is thus constructed; for example, see Figure 3. In $G^\top$, the set $E^\top$ of directed edges contains just the incoming edges from those nodes in $X_t(v)$ with $1 \leq t \leq T$ for each $v \in V$, and the set $W^\top$ of edge weights is given by
$$W^\top = \{\, \mathcal{T}_{u,v} : (u, v) \in E^\top \,\}.$$
Leveraging the temporal graph $G^\top$ constructed as above, we have Theorem 5. Note that the informed time $t_v$ of a reachable node $v$ is the first time at which $v$ becomes informed. In addition, the above-assumed pulling process $\mathrm{CC}(v)$ of a node $v$ is no longer effective after it becomes informed; that is, the attempts of $v$ to pull its desired message from its neighbors will no longer take place in the actual gossip spreading process.
Theorem 5. Given a directed network $G = (V, E)$, an arbitrary seed set $S$ of nodes, and a constrained deadline $T$, the expectation of the informed time of each reachable node is equal to the expectation of the length of the shortest time-respecting path (i.e., the distance) from $S$ to the considered node on $G^\top = (V, E^\top, W^\top)$.
Proof. For each node $v \in V$, consider its $\mathrm{CC}(v)$. By the memoryless property of the pull-based gossip spreading process, $\mathrm{CC}(v)$ can be started at the very beginning of the spreading process and run until a certain time (i.e., the deadline $T$), even after $v$ becomes informed. The attempts of $v$ to pull its desired message from its neighbors are no longer effective after its informed time $t = t_v$. Consequently, for each node $v$, we can let $\mathrm{CC}(v)$ run from the very beginning, independently of the coupon collecting processes of all the other nodes.
With all the coupon collecting processes $\{\mathrm{CC}(v), v \in V\}$ run until the deadline $T$ is reached, their results are recorded and can later be used to reveal the (absolute) time stamps of the events in which $v$ succeeds in pulling its desired message from its neighbors for the first time within $T$ rounds. Therefore, a temporal graph $G^\top = (V, E^\top, W^\top)$ can be constructed from $\{\mathrm{CC}(v), v \in V\}$, containing all the information about one sample realization of the spreading process over the network $G = (V, E)$ within $T$ rounds.
In particular, given an arbitrary seed set $S$ of nodes, the informed time $t_v$ at which each reachable node $v \in V$ becomes informed for the first time is equal to the length of the shortest time-respecting path from $S$ to $v$ on $G^\top$, and thus their expectations are also equal by taking expectations over all possible realizations of the gossip spreading process within $T$ rounds.
Remark 6. For any sample realization of the gossip spreading process, each of the resulting live diffusion paths from $S$ to the other reachable nodes on $G$ is equivalent to the shortest time-respecting path from $S$ to the considered node on $G^\top$.
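The temporal mapping can be checked numerically on a single sample. In the sketch below, which is illustrative and rests on assumed names (`draw_coupons`, `informed_times_by_simulation`, `informed_times_by_temporal_graph`), every node's coupon collector is run up front for all rounds, regardless of its state. The informed times obtained by replaying the protocol round by round are then compared with the earliest-arrival times on the temporal graph built from the same draws.

```python
import math
import random

def draw_coupons(in_neighbors, deadline, rng):
    """coupons[v][t - 1] is the neighbor v contacts in round t, drawn for every
    round up to the deadline whether or not v is already informed."""
    return {v: [rng.choice(nbrs) for _ in range(deadline)] if nbrs else []
            for v, nbrs in in_neighbors.items()}

def informed_times_by_simulation(coupons, seeds, deadline):
    """Replay the pull protocol round by round using the recorded draws."""
    time = {v: math.inf for v in coupons}
    for s in seeds:
        time[s] = 0
    for t in range(1, deadline + 1):
        for v, draws in coupons.items():
            if draws and time[v] == math.inf and time[draws[t - 1]] < t:
                time[v] = t
    return time

def informed_times_by_temporal_graph(coupons, seeds):
    """Shortest time-respecting distances on the temporal graph whose edge
    (u, v) carries the rounds in which v drew the coupon u."""
    time = {v: math.inf for v in coupons}
    for s in seeds:
        time[s] = 0
    events = sorted(((t, u, v) for v, draws in coupons.items()
                     for t, u in enumerate(draws, start=1)),
                    key=lambda event: event[0])
    for t, u, v in events:
        if time[u] < t < time[v]:
            time[v] = t
    return time

ring = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
coupons = draw_coupons(ring, deadline=6, rng=random.Random(7))
direct = informed_times_by_simulation(coupons, seeds=[0], deadline=6)
mapped = informed_times_by_temporal_graph(coupons, seeds=[0])
```

For every sample the two dictionaries coincide, which is the per-sample statement behind Theorem 5.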

Submodularity in Gossip Spreading.
The following arguments lead to a greedy algorithm that yields a solution within $(1 - 1/e)$ of the optimal value for GSMP. Given a finite ground set $U = \{u_1, u_2, \ldots, u_n\}$ of $n$ elements and an arbitrary function $f(\cdot) : 2^U \to \mathbb{R}$, $f(\cdot)$ maps subsets of $U$ to real numbers. Formally, $f(\cdot)$ is called a submodular function if it satisfies
$$f(S_1 \cup \{v\}) - f(S_1) \geq f(S_2 \cup \{v\}) - f(S_2) \tag{9}$$
for all pairs of subsets $S_1 \subseteq S_2$ and all elements $v \in U \setminus S_2$ [25]. The quantity $f(S \cup \{v\}) - f(S)$ is called the marginal increase obtained by adding a new element $v$ to a given subset $S$. Besides, $f(\cdot)$ is called a monotone function if it satisfies
$$f(S \cup \{v\}) \geq f(S)$$
for all subsets $S \subseteq U$ and all elements $v \in U \setminus S$.
Leveraging the temporal mapping of the gossip spreading process established in Theorem 5, we have Theorem 7. Note that $\sigma_T(\cdot)$ is called the gossip spreading function, and $\sigma_T(S)$ is defined in (1) for any given seed set $S \subseteq V$ in the network $G = (V, E)$.
Theorem 7. Given a directed network $G = (V, E)$ and a constrained deadline $T$, the gossip spreading function $\sigma_T(\cdot)$ is submodular for the pull-based gossip protocol.
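Proof. Recall Theorem 5 and consider the sample space $\Omega$, where each sample specifies one possible realization of $\{\mathrm{CC}(v), v \in V\}$. Conditioned upon $\omega \in \Omega$, define $\sigma_T(S \mid \omega)$ as the number of informed nodes within $T$ rounds using $S$ as the seed set. Let $R(u, T)$ denote the set of nodes that are reachable from a node $u$ on $G^\top = (V, E^\top, W^\top)$ with the length of the shortest time-respecting path no larger than $T$. Therefore, $\sigma_T(S \mid \omega)$ is equal to the cardinality of the union $\bigcup_{u \in S} R(u, T)$; that is,
$$\sigma_T(S \mid \omega) = \Bigl|\bigcup_{u \in S} R(u, T)\Bigr|.$$
We now prove that the function $\sigma_T(\cdot \mid \omega)$ is submodular for each sample $\omega$, similar to [16, 17]. Let $S_1$ and $S_2$ denote two seed sets with $S_1 \subseteq S_2$. For a node $v$, consider the following quantity:
$$\sigma_T(S_1 \cup \{v\} \mid \omega) - \sigma_T(S_1 \mid \omega),$$
which is the number of elements in $R(v, T)$ that are not already in the union $\bigcup_{u \in S_1} R(u, T)$. Since $S_1 \subseteq S_2$, this union is contained in $\bigcup_{u \in S_2} R(u, T)$, so this number is at least as large as the number of elements in $R(v, T)$ that are not already in the union over $S_2$. Therefore, we have
$$\sigma_T(S_1 \cup \{v\} \mid \omega) - \sigma_T(S_1 \mid \omega) \geq \sigma_T(S_2 \cup \{v\} \mid \omega) - \sigma_T(S_2 \mid \omega). \tag{13}$$
According to the defining property of submodularity in (9), we see that $\sigma_T(\cdot \mid \omega)$ is submodular from (13). To complete the proof, we have
$$\sigma_T(S) = \sum_{\omega \in \Omega} \Pr[\omega]\, \sigma_T(S \mid \omega),$$
which means that within $T$ rounds the expected number of informed nodes is just the weighted average over all the sample realizations in $\Omega$ of the gossip spreading process. Since a nonnegative linear combination of submodular functions is still submodular [25], $\sigma_T(\cdot)$ is submodular.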
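The per-sample argument in the proof can be verified exhaustively on a small graph. The following sketch is an illustration under assumptions rather than part of the paper: the helper names (`reach_set`, `sigma_sample`, `is_submodular`) are hypothetical, and the check enumerates all pairs of nested seed sets, so it is feasible only for a handful of nodes.

```python
import itertools
import math
import random

def reach_set(coupons, seed, deadline):
    """Nodes whose shortest time-respecting distance from `seed` is at most
    `deadline`, for one recorded sample of coupon draws."""
    time = {v: math.inf for v in coupons}
    time[seed] = 0
    for t in range(1, deadline + 1):
        for v, draws in coupons.items():
            if draws and time[draws[t - 1]] < t < time[v]:
                time[v] = t
    return {v for v, tv in time.items() if tv <= deadline}

def sigma_sample(reach, seeds):
    """Per-sample spread: cardinality of the union of the seeds' reach sets."""
    covered = set()
    for u in seeds:
        covered |= reach[u]
    return len(covered)

def is_submodular(reach, nodes):
    """Check the diminishing-returns inequality for all S1 within S2 and all v outside S2."""
    subsets = [set(c) for r in range(len(nodes) + 1)
               for c in itertools.combinations(nodes, r)]
    for s1 in subsets:
        for s2 in subsets:
            if not s1 <= s2:
                continue
            for v in nodes:
                if v in s2:
                    continue
                gain1 = sigma_sample(reach, s1 | {v}) - sigma_sample(reach, s1)
                gain2 = sigma_sample(reach, s2 | {v}) - sigma_sample(reach, s2)
                if gain1 < gain2:
                    return False
    return True

ring = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
rng = random.Random(11)
deadline = 3
coupons = {v: [rng.choice(nbrs) for _ in range(deadline)] for v, nbrs in ring.items()}
reach = {u: reach_set(coupons, u, deadline) for u in ring}
check = is_submodular(reach, sorted(ring))
```

Because the per-sample spread is a coverage function of the reach sets, the check succeeds for any sample, in line with (13).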
Remark 8. Submodularity is a widely applied mathematical tool for tackling a class of nonconvex combinatorial optimization problems, such as social influence maximization, maximum facility location, sensor placement, and optimization design of cellular networks [17, 26, 27, 28].

Greedy Algorithm with Performance Guarantee.
The greedy algorithm, presented in Algorithm 1, selects each new element with the largest marginal increase in the gossip spreading function $\sigma_T(\cdot)$ until the seed set $S$ is filled with $k$ nodes. In Algorithm 1, $|\mathrm{Gossip}(S)|$ is the number of finally informed users when the gossip spreading process runs until the delay-tolerant deadline $T$ using the seed set $S$, and the Monte Carlo method is leveraged to evaluate the average value over $R$ repetitions. The greedy algorithm requires $O(k|V|R)$ simulation runs of the gossip spreading process. In addition, the lazy evaluation method in [18] can be leveraged to accelerate the greedy algorithm.
Algorithm 1 (only the opening lines are recoverable): Initialize $S \leftarrow \emptyset$ and $R$; for $i = 1 \to k$ do; for each node $v \in V \setminus S$ do ...
We invoke the following result from [25].
Theorem 9 (see [25]). For a nonnegative monotone submodular function $f(\cdot)$, let $S$ be the solution of the greedy algorithm and let $S^*$ be an arbitrary solution of the same size; then
$$f(S) \geq (1 - 1/e)\, f(S^*).$$
Remark 10. Since $\sigma_T(\cdot)$ is submodular and clearly nonnegative and monotone as well, the greedy algorithm yields a solution within $(1 - 1/e)$ of the optimal value for GSMP. Note that this $(1 - 1/e)$ factor is the best reported one for the class of submodular maximization problems, and the infeasibility of improving this factor is argued in [29].
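The greedy selection with Monte Carlo evaluation can be sketched as follows. This is a minimal sketch under assumptions, not a transcription of Algorithm 1: the function names `gossip_size` and `greedy_seeds`, the parameter `runs` (playing the role of the number of repetitions), and the tie-breaking by iteration order are all choices made here. Lazy evaluation is omitted for brevity.

```python
import random

def gossip_size(in_neighbors, seeds, deadline, rng):
    """One simulated run of pull-based gossip; returns the number of informed nodes."""
    informed = set(seeds)
    for _ in range(deadline):
        newly = {v for v, nbrs in in_neighbors.items()
                 if v not in informed and nbrs and rng.choice(nbrs) in informed}
        informed |= newly
    return len(informed)

def greedy_seeds(in_neighbors, k, deadline, runs, rng):
    """Pick k seeds, each time adding the node with the largest estimated spread.

    Maximizing the estimated spread of S + {v} is equivalent to maximizing the
    marginal increase, since the spread of S is common to all candidates.
    """
    seeds = []
    for _ in range(k):
        best_node, best_value = None, -1.0
        for v in in_neighbors:
            if v in seeds:
                continue
            total = sum(gossip_size(in_neighbors, seeds + [v], deadline, rng)
                        for _ in range(runs))
            value = total / runs
            if value > best_value:
                best_node, best_value = v, value
        seeds.append(best_node)
    return seeds

# Two disjoint stars: center 0 with leaves 1..3, center 4 with leaves 5..6.
two_stars = {0: [], 1: [0], 2: [0], 3: [0], 4: [], 5: [4], 6: [4]}
chosen = greedy_seeds(two_stars, k=2, deadline=1, runs=5, rng=random.Random(0))
```

On the two stars the spread is deterministic, so the sketch first picks the larger star's center and then the other center, rather than a second node inside the already covered star.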

Experiments
In this section, we carry out simulation experiments to study the spreading performance of the greedy algorithm. For comparison, we also implement two heuristic algorithms and a random algorithm. In total, there are $N = 500$ nodes in the network, and any pair of nodes is connected by two edges, one in each direction, if their Euclidean distance is no larger than the connectivity radius $r_c = \sqrt{(\log N)/N}$. Note that this $r_c$ guarantees that a random geometric graph of $N$ nodes is connected with high probability [30]. Besides, each hotspot is restricted to a rectangular region spanning a symmetric offset range around its hotspot center in each coordinate. In Figure 4, the network layout is illustrated.
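For readers who wish to reproduce a comparable topology, the connectivity rule can be sketched as below. This is an assumption-laden illustration: node positions are drawn uniformly on the unit square, whereas the experiments place nodes around hotspots whose parameters are not recoverable from the text, and the function names are hypothetical.

```python
import math
import random

def connectivity_radius(n):
    """Connectivity radius sqrt(log(n) / n) for a random geometric graph of n nodes."""
    return math.sqrt(math.log(n) / n)

def random_geometric_graph(n, rng):
    """Place n points uniformly on the unit square (no hotspots, by assumption)
    and connect every pair within the connectivity radius in both directions."""
    points = [(rng.random(), rng.random()) for _ in range(n)]
    radius = connectivity_radius(n)
    in_neighbors = {v: [] for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if math.dist(points[u], points[v]) <= radius:
                in_neighbors[v].append(u)  # edge (u, v): v may pull from u
                in_neighbors[u].append(v)  # edge (v, u): u may pull from v
    return points, in_neighbors

points, graph = random_geometric_graph(500, random.Random(42))
radius = connectivity_radius(500)
```

For 500 nodes the radius is roughly 0.11, so each node's neighborhood covers a small fraction of the unit square.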

Four Typical Social Networks.
We further implement our approach in four typical social networks to evaluate the spreading performance: the small-world network [31], the scale-free network [32], the scientific-collaboration network [33], and the autonomous-system network [34]. The main network statistics are presented in Table 1.

Heuristic and Random Algorithms.
Heuristic algorithms based on degree centrality and distance centrality are widely used in social network analysis [35], and we can select the seed nodes from the network using their degree centralities and distance centralities as the decision criteria.
The degree centrality $\deg_u$ of each node $u$ is equal to the number of its neighbors (connected by outgoing edges). The degree-centrality algorithm selects each new element with the largest degree centrality until the seed set $S$ is filled with $k$ nodes.
For the distance-centrality algorithm, we assume that the network is connected. The distance centrality $\mathrm{dist}_u$ of each node $u$ is equal to the sum of the distances from $u$ to all the other nodes in the network. The distance from $u$ to each of the other nodes is measured by the number of hops on the shortest path from $u$ to that node. The distance-centrality algorithm selects each new element with the smallest distance centrality until the seed set $S$ is filled with $k$ nodes. If the network is not connected, we can use the sum of the reciprocals of these distances, instead of the distances themselves, to measure the distance centrality.
The random algorithm is the baseline, selecting $k$ distinct seed nodes uniformly at random from all the network nodes.
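The two centrality-based heuristics can be sketched in a few lines. The sketch assumes a connected network stored as a hypothetical dictionary `out_neighbors`, and it breaks ties by node identifier, which is a choice made here and not specified in the text.

```python
from collections import deque

def degree_seeds(out_neighbors, k):
    """Pick the k nodes with the largest out-degree (ties broken by node id)."""
    return sorted(out_neighbors, key=lambda u: (-len(out_neighbors[u]), u))[:k]

def hop_distances(out_neighbors, source):
    """Breadth-first search: number of hops from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in out_neighbors[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def distance_seeds(out_neighbors, k):
    """Pick the k nodes with the smallest sum of hop distances to all other
    nodes (assumes every node is reachable from every other node)."""
    def centrality(u):
        return sum(hop_distances(out_neighbors, u).values())
    return sorted(out_neighbors, key=lambda u: (centrality(u), u))[:k]

# Undirected path 0 - 1 - 2 - 3 - 4, stored with edges in both directions.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
by_degree = degree_seeds(path, 2)
by_distance = distance_seeds(path, 3)
```

Even on this small path, both heuristics pick adjacent nodes near the middle, which hints at the clustering effect discussed in the experiments.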

Spreading Performance in Random Geometric Network with Hotspots.
We evaluate the spreading performance of the greedy algorithm using different deadlines and seed set sizes, and the results are presented in Figure 5. It is shown that the informed population (i.e., the expected number of informed nodes) grows larger as more seed nodes are selected. In addition, if the network nodes are willing to tolerate a longer deadline, more of them can successfully receive their desired information.
Next, we present the spreading performance of different seed-selecting algorithms in Figure 6.When the seed set size is 10, the greedy algorithm outperforms the degree-centrality algorithm by 76%, the distance-centrality algorithm by 426%, and the random algorithm by 37%.
The greedy algorithm, which leverages the dynamics of the gossip spreading process in the network, performs much better than the centrality-based heuristic algorithms, which rely only on the network's structural properties. In fact, many of the most central nodes (e.g., those with high degree centralities or low distance centralities) are clustered, and thus it is unnecessary to select all of them. In Figure 7, this clustering effect of both centrality-based algorithms is illustrated.
Besides, we see that the random algorithm outperforms both heuristic algorithms. The reason is that the underlying network is generated from the random geometric graph, so the random algorithm often selects seed nodes that are able to trigger a large gossip spreading in different regions over the $[0, 1]^2$ square.
From Figures 8, 9, 10, and 11, we see that a large portion of nodes in each network are informed when the seed set size is around 20 for the greedy algorithm. Besides, we see that the greedy algorithm significantly outperforms the random algorithm, especially in the scale-free network, the scientific-collaboration network, and the autonomous-system network.

Conclusions
In this paper, we have investigated the problem of selecting information sources for pull-based gossip spreading. We have proved the NP-hardness of the gossip spreading maximization problem and proposed a suboptimal solution (i.e., the greedy algorithm) for this problem to select seed nodes as information sources. A temporal mapping of the dynamic gossip spreading process has been established via virtual coupon collectors by leveraging the concept of temporal network to analyze this problem. In addition, the temporal mapping method helps to bridge graph-theoretic problems and submodularity and further leads to the greedy algorithm with performance guarantee. In the future, it would be interesting to leverage the methods developed in the treatment of GSMP to deal with gossip spreading problems that arise when implementing gossip-based algorithms in real-world networks.

Figure 1: Illustration of a directed bipartite graph $G^* = (V^*, E^*)$ constructed from an instance of the partial set cover problem.

Figure 2: Illustration of a temporal network with a timeline indicating the information of when events occur.

Figure 3: Illustration of the temporal mapping, in which the red and green paths are the live diffusion paths from the two seed nodes, respectively, and the black arrows indicate the edge directions. Left: the underlying network. Right: the constructed temporal graph $G^\top = (V, E^\top, W^\top)$.

Figure 4: Network layout of a random geometric network with hotspots on the $[0, 1]^2$ square.

Figure 5: Spreading performance of the greedy algorithm.

Figure 6: Spreading performance of the greedy, heuristic, and random algorithms.

Figure 7: Seed nodes picked by the greedy and heuristic algorithms.

Figure 8: Spreading performance in the small-world network.

Figure 9: Spreading performance in the scale-free network.

Table 1: Node and edge numbers of four social networks.