Non-submodular model for group profit maximization problem in social networks

In social networks, there exist many kinds of groups in which people may share the same interests, hobbies, or political orientation. Sometimes, group decisions are made by simple majority, which means that most of the users in the group reach an agreement, as in US presidential elections. A group is called activated if $\beta$ percent of its users are influenced. An enterprise gains income from all activated groups. Simultaneously, to propagate influence, the enterprise needs to pay an advertisement diffusion cost. The group profit maximization (GPM) problem aims to pick k seeds to maximize the expected profit, which is the benefit of the influenced groups minus the diffusion cost. GPM is proved to be NP-hard, and the objective function is proved to be neither submodular nor supermodular. An upper bound and a lower bound, each a difference of two submodular functions, are designed. We propose a submodular–modular algorithm (SMA) to optimize the difference of two submodular functions, and SMA is shown to converge to a local optimum. We present a randomized algorithm based on weighted group coverage maximization for GPM and apply the sandwich framework to obtain theoretical results. Our experiments verify the efficiency of our methods.

gain all votes in one state if he gets the majority of tickets in the state. For simplicity, the benefits from all activated groups are assumed to be measured by an economic indicator. In this paper, both the cost and the benefit are therefore treated as monetary.
Since groups play an important role not only in real-world society but also in online social networks (OSNs), an enterprise (such as a company) or a government may attempt to activate groups. For example, when a family makes a decision, usually only one decision is made to buy a product of some brand, according to the advertisements of different brands. Similarly, when a company needs to purchase computers for its employees, it may use the majority rule to decide which brand to buy. The employees may be influenced by different brands of computers, but only one brand is purchased. A group is said to be activated if a certain percentage of its members are influenced. Enterprises often draw support from OSN providers to diffuse their advertisements, so that all potential groups could be influenced. Zhu et al. [3] presented the group influence maximization problem in social networks, in which a group is said to be activated if $\beta$ percent of its members are activated. Enterprises gain income from all activated groups. Simultaneously, to propagate influence, an enterprise needs to pay an advertisement diffusion cost to the OSN provider, and this cost usually depends on the total number of hits on the advertisements. In this paper, we aim to pick k seed users to maximize the expected profit, which equals the benefit of the influenced groups minus the diffusion cost. This optimization problem is called the group profit maximization (GPM) problem. Given a social network $G = (V, E, P)$, P is the influence probability of each directed edge $(u, v)$, meaning that u activates v with probability $P_{(u,v)}$ after u becomes activated. $\beta$ is called the group activation threshold. For each activated group U, the benefit is $b(U) \ge 0$. Meanwhile, a diffusion cost $c(v) \ge 0$ is incurred if v is activated. An example is shown in Fig. 1. There are 8 nodes in this graph, and the influence probability of every edge equals 1. There are two groups:
$\mathcal{U} = \{U_1 = \{v_1, v_2, v_3, v_4\},\ U_2 = \{v_6, v_7, v_8\}\}$. The benefit of group $U_1$ is 20 and that of $U_2$ is 10, while the diffusion cost of each node is 2. Assume the activation threshold $\beta = 0.5$, which means that a group is activated if at least half of its nodes are activated. In Fig. 1(1), $v_3$ is chosen as the seed; then $\{v_2, v_3, v_4, v_5, v_6\}$ are activated, and only group $U_1$ is activated under the activation threshold 0.5. The total profit is therefore $b(U_1) - 5 \times 2 = 20 - 10 = 10$.
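As a minimal sketch, the arithmetic of this example can be checked in a few lines of Python. The function names `activated_groups` and `profit` are illustrative choices and not part of the original formulation; the activated node set is taken directly from the text rather than simulated.

```python
def activated_groups(groups, active, beta):
    """Return the names of groups in which at least a beta fraction of members is active."""
    return [name for name, members in groups.items()
            if len(members & active) >= beta * len(members)]


def profit(groups, benefit, cost, active, beta):
    """Benefit of the activated groups minus the diffusion cost of the activated nodes."""
    gain = sum(benefit[name] for name in activated_groups(groups, active, beta))
    spend = sum(cost[v] for v in active)
    return gain - spend


# Instance of Fig. 1 as described in the text (node i stands for v_i).
groups = {"U1": {1, 2, 3, 4}, "U2": {6, 7, 8}}
benefit = {"U1": 20, "U2": 10}
cost = {v: 2 for v in range(1, 9)}
active = {2, 3, 4, 5, 6}  # nodes activated when v_3 is the seed

print(activated_groups(groups, active, 0.5))
print(profit(groups, benefit, cost, active, 0.5))
```

Group $U_2$ stays inactive because only one of its three members is activated, which is below the threshold of one half.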
Optimizing the profit return in viral marketing has proved much more difficult than only maximizing the influence propagation [16], since the number of seeds picked yields a trade-off between the benefit and cost of viral marketing. Several recent publications studied profit maximization problems from the advertiser's point of view [16][17][18]. These works considered only the cost of seed selection, which is modular, so their profit metric is still submodular. Ref. [19] proposed a profit maximization problem that took into account the cost of information propagation, whose profit function could be decomposed into the difference of two submodular functions.
However, most of the existing methods are either too slow for billion-scale networks such as Facebook, Twitter, and the World Wide Web, or fail to retain the $(1 - 1/e - \epsilon)$-approximation guarantee. The sampling method is the bottleneck in solving influence maximization (IM). Borgs et al. [20] proposed a novel sampling method named reverse influence set (RIS), which can reduce the sampling complexity. Two-phase influence maximization (TIM)/TIM+ [21] and Influence Maximization via Martingales (IMM) [22] were introduced for solving the IM problem. Nguyen et al. [23] made a breakthrough and proposed the Dynamic-Stop-and-Stare Algorithm (D-SSA), which was much faster while guaranteeing the same approximation ratio. Zhu [13] presented a weighted RIS sampling method.
Since the objective function of GPM is non-submodular, as will be shown in the following section, the existing social IM methods cannot be applied to solve GPM. Schoenebeck [24] presented the 2-quasi-submodular function optimization problem, whose objective was non-submodular. Narasimhan and Bilmes [25] presented an approximation method for optimizing a submodular + supermodular function, which substituted the supermodular function by a modular function. Bach [26] proved that any non-submodular function could be decomposed as a difference of two submodular functions. Another approach, named the sandwich approximation strategy, was presented by [27]; it approximates the objective function by formulating a lower bound and an upper bound for it. More recent results can be found in [28,29].

Contributions
We summarize our contributions as follows:
1. Motivated by the group structure in social networks, the group profit maximization (GPM) problem is presented, which selects k seeds such that the expected profit is maximized.
2. We evaluate the challenges of GPM by analyzing its computational complexity. First, GPM is proved to be NP-hard under the IC model. Second, the objective function of GPM is shown to be neither submodular nor supermodular.
3. To obtain an approximate solution, we propose a lower bound and an upper bound for the objective function. We show that maximizing the lower bound and the upper bound is still NP-hard. Meanwhile, both bounds can be decomposed into a difference of submodular functions. We also present a submodular-modular algorithm to optimize the difference of submodular functions.
4. We propose a weighted group coverage maximization algorithm for solving GPM, and we formulate a sandwich approximation framework that preserves a theoretical guarantee. We verify our algorithms on real-world data sets.
This paper is organized as follows: first, we present the group profit maximization (GPM) problem; then, the proof of NP-hardness and properties of objective function will be given; third, we propose lower bound and upper bound, and present our algorithms; experiments are presented in the following section; and finally, the paper is concluded. Table 1 summarizes the symbols and their meaning.

Problem formulation
The independent cascade (IC) model is a widely applied information propagation model. The IC model is introduced first, and then the group profit maximization (GPM) problem is presented.

Independent cascade model [4]
Given a social network $G = (V, E, P)$, V is a set of users and E is a set of directed edges. For each edge $e = (u, v)$, $P_e$ is the weight on e, representing the activation probability ($0 \le P_e \le 1$). Specifically, u will attempt to activate v with activation probability $P_e$ after u is activated. Assume $S \subseteq V$ is the set of initial seed users. Let $S_t$ be the set of nodes activated in step t ($t = 0, 1, \ldots$). At the beginning, $S_0 = S$. The propagation process proceeds step by step as follows. At step t, each activated node $u \in S_t$ tries to activate each inactivated neighbor v with activation probability $P_{(u,v)}$. The IC model assumes that u has only one chance to activate its inactivated neighbor v. The process stops when no new node is activated.

Table 1 Symbols and their meanings
$G = (V, E, P)$: a social network with user set V and edge set E; P is the influence probability
$P_e$: the influence probability on edge e, where $0 \le P_e \le 1$
$\rho(S)$: the expected profit with seed set S
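The step-by-step propagation of the IC model described above can be sketched as follows. This is only an illustration: the function name `simulate_ic` and the adjacency format are our own assumptions, and the edge list contains just the four edges of Fig. 1 that the text mentions explicitly (the full figure may contain more).

```python
import random


def simulate_ic(out_edges, seeds, rng):
    """Run one independent-cascade diffusion and return the set of activated nodes.

    out_edges maps a node u to a list of (v, p) pairs, where p is the
    activation probability of the directed edge (u, v).
    """
    active = set(seeds)
    frontier = list(seeds)  # nodes activated in the previous step (S_t)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v, p in out_edges.get(u, []):
                # u gets exactly one chance to activate each inactive neighbor v
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


# Assumed partial edge list of Fig. 1: v3 -> v2, v3 -> v5, v2 -> v4, v5 -> v6,
# all with influence probability 1 (node i stands for v_i).
edges = {3: [(2, 1.0), (5, 1.0)], 2: [(4, 1.0)], 5: [(6, 1.0)]}
result = simulate_ic(edges, {3}, random.Random(0))
print(sorted(result))
```

Because a node enters the frontier only once, every edge is tried at most once, which matches the one-chance assumption of the model.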

Group profit maximization
Given an instance of GPM with directed graph $G = (V, E, P)$, a group U is a subset of V. Let $\mathcal{U}$ be the collection of groups, and let l be the total number of groups. $0 < \beta \le 1$ is the activation threshold: when $\beta$ percent of the users in a group are activated, the group is said to be activated. For each activated group U, there is a benefit $b(U) \ge 0$. Simultaneously, there is a diffusion cost $c(v) \ge 0$ for each activated user v. Now, a realization of the random graph is introduced, which helps to understand the IC model.
A realization g of G is generated as follows: (1) for each edge $e \in E(G)$, uniformly generate a random number r between 0 and 1; (2) edge e is kept in g if and only if $r \le P_e$. Let $\mathcal{G}$ represent the set of all realizations of G. Obviously, there are $2^{|E(G)|}$ sample graphs in $\mathcal{G}$. A realization g is generated with probability $P[g]$. Then, we have:
$$P[g] = \prod_{e \in E(g)} P_e \prod_{e \in E(G) \setminus E(g)} (1 - P_e).$$
Let $\mathcal{U}_g(S)$ represent the set of groups activated by the initial seed set S in g, and let $V_g(S)$ be the set of nodes activated by the initial seed set S in g. Now, the benefit of activated groups is:
$$\beta(S) = \sum_{g \in \mathcal{G}} P[g] \sum_{U \in \mathcal{U}_g(S)} b(U),$$
and the cost of activated nodes is:
$$\gamma(S) = \sum_{g \in \mathcal{G}} P[g] \sum_{v \in V_g(S)} c(v).$$
We define the profit as $\rho(S) = \beta(S) - \gamma(S)$. Group profit maximization (GPM) considers information propagation in a social network, and its objective is to select k seed users to maximize the profit $\rho(S)$:
$$\max_{S \subseteq V,\ |S| = k} \rho(S).$$
Figure 1 shows an example of the information diffusion process of GPM, where there are 8 nodes and the influence probability on each edge is 1. Let $\beta = 0.5$. At the beginning, $v_3$ is the seed. At the first time step, $v_2$ and $v_5$ are activated by $v_3$, as shown in Fig. 1(1). At the second time step, $v_4$ is activated by $v_2$ and $v_6$ is activated by $v_5$, as shown in Fig. 1(1). Finally, the activated node set is $\{v_2, v_3, v_4, v_5, v_6\}$. Since the activation threshold is $\beta = 0.5$, group $U_1$ is activated and $U_2$ is inactivated.

Properties of GPM
In this section, GPM will be proved to be NP-hard. The properties of the objective function ρ(·) will be discussed.

Hardness results
It is known that any generalization of an NP-hard problem is also NP-hard. Kempe et al. have proved that the influence maximization (IM) problem is NP-hard [4], and IM is a special case of GPM: each node is considered as a group, the benefit of each group is 1, there is no cost on any node, and $\beta = 1$. Hence, GPM is NP-hard.

Theorem 3.1 The group profit maximization problem is NP-hard.
For any instance of GPM, it is difficult to compute the objective $\rho(S)$ even for a fixed seed set S. The Monte Carlo method is commonly used to estimate $\rho(S)$. First, a large number of sample graphs of G are generated, and $\rho(S)$ is computed on each sample graph. The average of these values is then the estimate. Kempe et al. have proved that computing the objective of IM is #P-hard [4], and then, the following result is true: given a seed set S, computing $\rho(S)$ is #P-hard.
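The Monte Carlo procedure described above can be sketched as follows, under the assumption that each sample graph is drawn by keeping every edge independently with its probability. All function names and the data layout are illustrative; the small instance reuses the Fig. 1 values from the text together with an assumed partial edge list.

```python
import random
from collections import deque


def sample_realization(edges, rng):
    """Keep each edge (u, v, p) independently with probability p (one sample graph g)."""
    kept = {}
    for u, v, p in edges:
        if rng.random() < p:
            kept.setdefault(u, []).append(v)
    return kept


def reachable(adj, seeds):
    """Nodes reachable from the seed set in a realization, i.e. the activated nodes."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def estimate_profit(edges, groups, benefit, cost, beta, seeds, runs, rng):
    """Average profit (group benefit minus node cost) over `runs` sample graphs."""
    total = 0.0
    for _ in range(runs):
        active = reachable(sample_realization(edges, rng), seeds)
        gain = sum(benefit[name] for name, members in groups.items()
                   if len(members & active) >= beta * len(members))
        total += gain - sum(cost[v] for v in active)
    return total / runs


# Assumed partial edge list of Fig. 1 (probability 1 on every edge).
edges = [(3, 2, 1.0), (3, 5, 1.0), (2, 4, 1.0), (5, 6, 1.0)]
groups = {"U1": {1, 2, 3, 4}, "U2": {6, 7, 8}}
benefit = {"U1": 20, "U2": 10}
cost = {v: 2 for v in range(1, 9)}
estimate = estimate_profit(edges, groups, benefit, cost, 0.5, {3}, 100, random.Random(0))
print(estimate)
```

With all probabilities equal to 1 every sample graph is identical, so the estimate coincides with the exact profit of the example; for general probabilities the accuracy depends on the number of runs.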

Modularity of objective function
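A set function $f: 2^V \rightarrow \mathbb{R}$ is called submodular [30] if it holds that
$$f(A \cup \{v\}) - f(A) \ge f(B \cup \{v\}) - f(B)$$
for any subsets $A \subseteq B \subseteq V$ and $v \in V \setminus B$. On the other hand, if, for any subsets $A \subseteq B \subseteq V$ and $v \in V \setminus B$,
$$f(A \cup \{v\}) - f(A) \le f(B \cup \{v\}) - f(B),$$
then f is called supermodular. A monotone nondecreasing submodular function f with $f(\emptyset) = 0$ is called a polymatroid function. The greedy algorithm guarantees a $(1 - 1/e)$-approximation for the polymatroid maximization problem with a cardinality constraint [31]. Also, we have the following result for $\gamma$: $\gamma(\cdot)$ is monotone nondecreasing and submodular, with $\gamma(\emptyset) = 0$.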
Meanwhile, $\beta(\cdot)$ is neither submodular nor supermodular, although $\beta(\emptyset) = 0$ and $\beta(\cdot)$ is monotone nondecreasing.
Proof We prove this by a counterexample. When $b(U) = 1$ for any $U \in \mathcal{U}$, $\beta(S)$ is the expected number of eventually activated groups for the initial seed set S. Consider the instance of the GPM problem shown in Fig. 2, where there are 9 nodes and the influence probability of each edge is 1. There exist 4 groups. First, we prove that $\beta(\cdot)$ is not submodular. Let $A = \emptyset$, $B = \{v_3\}$, and $v_9 \in V \setminus B$. We have $\beta(A) = 0$ and $\beta(B) = 3$. Adding $v_9$ to A and B, we have $\beta(A \cup \{v_9\}) = 0$, since $v_9$ alone cannot activate any group, and $\beta(B \cup \{v_9\}) = 4$, since all groups are eventually activated. Thus, $\beta(A \cup \{v_9\}) - \beta(A) = 0 < 1 = \beta(B \cup \{v_9\}) - \beta(B)$, which violates submodularity. We also have the following corollary: the objective function $\rho(\cdot)$ of GPM is neither submodular nor supermodular.
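The same phenomenon can be reproduced on a much smaller instance by brute force. The sketch below does not use the instance of Fig. 2; it assumes a hypothetical graph with four isolated nodes forming a single group with $b(U) = 1$ and $\beta = 0.5$, and it checks the two defining inequalities over all pairs $A \subseteq B$. The helper names are our own.

```python
from itertools import combinations


def benefit(seeds, groups, beta):
    """Number of activated groups when exactly the nodes in `seeds` are active.

    Assumes a graph without edges and b(U) = 1 for every group.
    """
    return sum(1 for members in groups if len(members & seeds) >= beta * len(members))


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def violations(f, ground):
    """Return (submodularity violated, supermodularity violated) for f on `ground`."""
    sub_violated = super_violated = False
    for b_set in subsets(ground):
        for a_set in subsets(b_set):
            for v in ground - b_set:
                gain_a = f(a_set | {v}) - f(a_set)
                gain_b = f(b_set | {v}) - f(b_set)
                if gain_a < gain_b:
                    sub_violated = True
                if gain_a > gain_b:
                    super_violated = True
    return sub_violated, super_violated


# Hypothetical instance: one group of four isolated nodes, threshold 0.5.
ground = frozenset({"a", "b", "c", "d"})
groups = [frozenset({"a", "b", "c", "d"})]
check = violations(lambda s: benefit(s, groups, 0.5), ground)
print(check)
```

In this toy instance the second seed has a larger marginal gain than the first (submodularity fails), while the third seed has a smaller marginal gain than the second (supermodularity fails), which is the threshold effect behind the non-submodularity of group activation.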

Lower bound and upper bound
To optimize a non-submodular function is very hard. Lu et al. presented a sandwich approximation framework (SAF) [27]. SAF attempts to find a lower bound and upper bound for the original objective function. Now, we will design lower bound and upper bound for ρ(·) . Simultaneously, the properties of these two bounds will be analyzed.

The upper bound
A new set function $\overline{\beta}(\cdot)$ is defined which satisfies $\beta(S) \le \overline{\beta}(S)$. In this paper, we formulate the upper bound in two steps. First, a relaxed GPM (r-GPM) problem is generated by modifying the group activation rule: for r-GPM, a group is said to be activated if at least one node in the group is activated. Second, we add a super node for each group. The benefit $b(u)$ of this super node is defined as the benefit $b(U)$ of the corresponding group. Then, every node in the group is connected to this super node with influence probability 1. An example is shown in Fig. 3. W represents the super node set, and $E'$ represents the set of edges from nodes in V to super nodes in W. Next, a general weighted influence maximization (WIM) problem is defined as follows. $V \cup W$ is the node set and $E \cup E'$ is the edge set. $C \subseteq V$ is the set of candidate seed users. The node weight function f satisfies $f(v) = b(v)$ if $v \in W$ and $f(v) = 0$ if $v \in V$. $\overline{\beta}(S)$ is the expected weight of activated nodes for seed set S. Let $G = (V, C, E, P, f)$ be an instance of the general weighted IM problem, where C is the candidate seed set. We can prove that $\overline{\beta}(\cdot)$ is monotone and submodular, and that $\beta(S) \le \overline{\beta}(S)$. Define $\overline{\rho}(\cdot) = \overline{\beta}(\cdot) - \gamma(\cdot)$, and then, we have $\rho(S) \le \overline{\rho}(S)$ for any seed set S.
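To illustrate why the relaxed rule yields an upper bound, the following sketch compares the two activation rules on a fixed set of activated nodes. It is an illustration only: the function names are hypothetical, and the super-node construction is represented implicitly, since a super node reached with probability 1 is activated exactly when some member of its group is activated.

```python
from itertools import combinations


def group_benefit(groups, benefit, active, beta):
    """Original rule: a group counts if at least a beta fraction of its members is active."""
    return sum(benefit[g] for g, members in groups.items()
               if len(members & active) >= beta * len(members))


def relaxed_group_benefit(groups, benefit, active):
    """r-GPM rule: a group counts as soon as one of its members is active."""
    return sum(benefit[g] for g, members in groups.items() if members & active)


# Groups and benefits of the Fig. 1 example (node i stands for v_i).
groups = {"U1": {1, 2, 3, 4}, "U2": {6, 7, 8}}
benefit = {"U1": 20, "U2": 10}
active = {2, 3, 4, 5, 6}

print(group_benefit(groups, benefit, active, 0.5))
print(relaxed_group_benefit(groups, benefit, active))

# The relaxed benefit dominates the original one for every possible active set.
nodes = list(range(1, 9))
dominates = all(
    relaxed_group_benefit(groups, benefit, set(c)) >= group_benefit(groups, benefit, set(c), 0.5)
    for r in range(len(nodes) + 1) for c in combinations(nodes, r)
)
print(dominates)
```

Since the inequality holds on every realization, it also holds in expectation, which is the relation between $\beta$ and its upper bound stated above.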

The lower bound
In this subsection, a lower bound will be formulated. The idea is to keep some groups and delete the others. A group is kept if at least $\beta$ percent of its nodes can be activated simultaneously, which means that there must exist a node that connects to $\beta$ percent of the nodes in this group. An example is shown in Fig. 4. The activation threshold is $\beta = 0.5$. Since $v_1$ and $v_2$ each connect to 2 nodes in group U, group U is kept. A super node u related to group U is generated, and new directed edges $(v_1, u)$, $(v_2, u)$ are added with influence probabilities $p_{(v_1,u)} = p_1 p_2$ and $p_{(v_2,u)} = p_3 p_4$. The benefit of u is set to b, and the benefits of the other nodes are 0.
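The detailed process is as follows. Let $G = (V, E, P)$ be an instance of GPM. For a group $U_i$ with benefit $b_i$, let $H_i = \{v \in V \mid v \text{ links to at least } \beta \text{ percent of nodes in } U_i\}$. If $H_i \ne \emptyset$, a super node $u_i$ is generated and directed edges from the nodes in $H_i$ to $u_i$ are added, as in the example above; the benefit of $u_i$ is $b_i$, and the benefits of all other nodes are 0. Next, a general weighted influence maximization (WIM) problem can be generated. The node set is $V \cup W$, and the edge set is $E \cup E'$, where W is the set of super nodes and $E'$ is the set of all newly added edges. The candidate seed set is $C \subseteq V$. The node weight function f satisfies $f(v) = b_i$ if $v = u_i \in W$ and $f(v) = 0$ if $v \in V$. $\underline{\beta}(S)$ is the expected weight of activated nodes for seed set S. Let $G = (V, C, E, P, f)$ be the instance of the general WIM problem. $\underline{\beta}(\cdot)$ is monotone and submodular, and $\beta(S) \ge \underline{\beta}(S)$.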

Algorithm
Since computing the objective function of GPM is #P-hard, the reverse influence set (RIS) sampling method will be extended to estimate $\underline{\rho}(\cdot)$ and $\overline{\rho}(\cdot)$. Next, a submodular-modular algorithm will be proposed for solving the lower bound and upper bound problems. Then, we will propose a randomized algorithm which is based on a weighted group coverage maximization strategy. Finally, a sandwich approximation framework will be presented with theoretical analysis. We will apply the $(\epsilon, \delta)$-approximation method [32] to analyze our algorithm. The absolute error is $\epsilon$ and the confidence is $(1 - \delta)$. Let $\Upsilon = 4(e - 2)\ln(2/\delta)/\epsilon^2$ and $\Upsilon_1 = 1 + (1 + \epsilon)\Upsilon$; then, the Stopping Rule Algorithm [32] has been shown to be an $(\epsilon, \delta)$-approximation.
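For concreteness, the two constants and the stopping rule can be sketched as below. This follows our reading of the Stopping Rule Algorithm of [32] (keep sampling a random variable with values in [0, 1] until the running sum reaches $\Upsilon_1$, then return $\Upsilon_1$ divided by the number of samples); the function names are illustrative, and the sampler passed in is an assumption of the caller.

```python
import math


def upsilon(eps, delta):
    """Upsilon = 4 (e - 2) ln(2 / delta) / eps^2."""
    return 4 * (math.e - 2) * math.log(2 / delta) / eps ** 2


def stopping_rule_estimate(sample, eps, delta):
    """Estimate the mean of a random variable in [0, 1] with positive mean.

    `sample` is a zero-argument callable returning one realization.
    """
    threshold = 1 + (1 + eps) * upsilon(eps, delta)  # Upsilon_1
    total, n = 0.0, 0
    while total < threshold:
        total += sample()
        n += 1
    return threshold / n


print(upsilon(0.1, 0.1))
# Degenerate sanity check: a variable that is always 1 has mean 1.
print(stopping_rule_estimate(lambda: 1.0, 0.1, 0.1))
```

The number of samples adapts to the unknown mean: the smaller the mean, the more samples are drawn before the threshold is reached.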

Extended reverse influence set (RIS) sampling
In this section, we will present an extended version of the RIS sampling method. Let $G = (V, C, E, P, f)$ be a weighted directed graph representing a general weighted influence maximization problem, where C is the candidate seed set, P is the influence probability, and f is the node weight function. Assume S is the seed set. $\rho'(S) = \mathbb{E}\big[\sum_{v \text{ is activated}} f(v)\big]$ is the expected weighted number of activated nodes. The goal is to find k seed users in C that maximize $\rho'(S)$. Obviously, $\varphi(S)$ is submodular and monotone. Extended RIS generates a set $\mathcal{R}$ of random weighted reverse reachable (WRR) sets. A WRR set $R_j$ is formulated as follows [13].

Definition 5.1 (Weighted reverse reachable (WRR) set) Given $G = (V, C, E, P, f)$, a random WRR set $R_j$ is generated from G by (1) selecting a random node $v \in V$; (2) generating a sample graph g from G; (3) returning $R_j$ as the set of nodes that can reach v in g.

Let S be the seed set. Let $Cov_{\mathcal{R}}(S) = \sum_{R_j \in \mathcal{R}} \min\{|S \cap R_j|, 1\}$ be the coverage number of set S and $WCov_{\mathcal{R}}(S) = \sum_{R_j \in \mathcal{R}} w(R_j) \cdot \min\{|S \cap R_j|, 1\}$ be the coverage weight. This weighted coverage of set S can be used to estimate $\rho'(S)$.
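The three steps of Definition 5.1 and the weighted coverage can be sketched as follows. Two points are assumptions of this sketch rather than statements of the method in [13]: the root is drawn uniformly from the node list, and the weight $w(R_j)$ of a WRR set is taken to be the weight $f(v)$ of its root. The reverse adjacency format and all function names are likewise illustrative.

```python
import random


def wrr_set(in_edges, root, rng):
    """Nodes that can reach `root` in one sampled graph, found by reverse search.

    in_edges maps v to a list of (u, p) pairs: edge (u, v) with probability p.
    Each incoming edge is sampled at most once, when its head is expanded.
    """
    reached = {root}
    stack = [root]
    while stack:
        v = stack.pop()
        for u, p in in_edges.get(v, []):
            if u not in reached and rng.random() < p:
                reached.add(u)
                stack.append(u)
    return reached


def weighted_coverage(seeds, samples):
    """WCov_R(S): total weight of the sampled sets that contain at least one seed."""
    return sum(w for w, rr in samples if seeds & rr)


def estimate_weighted_spread(in_edges, nodes, weight, seeds, n_samples, rng):
    """Assumption: uniform roots and w(R_j) = f(root), scaled by the number of nodes."""
    samples = []
    for _ in range(n_samples):
        root = rng.choice(nodes)
        samples.append((weight[root], wrr_set(in_edges, root, rng)))
    return len(nodes) * weighted_coverage(seeds, samples) / n_samples


# Reverse adjacency for the assumed edges v3->v2, v3->v5, v2->v4, v5->v6 (probability 1).
in_edges = {2: [(3, 1.0)], 5: [(3, 1.0)], 4: [(2, 1.0)], 6: [(5, 1.0)]}
nodes = list(range(1, 9))
weight = {v: v for v in nodes}  # hypothetical node weights f(v) = v
rng = random.Random(0)

# One WRR set per root: with probability-1 edges this enumerates all roots exactly.
exact_samples = [(weight[v], wrr_set(in_edges, v, rng)) for v in nodes]
print(weighted_coverage({3}, exact_samples))
print(estimate_weighted_spread(in_edges, nodes, weight, {3}, 2000, rng))
```

The reverse search only explores edges pointing into already reached nodes, which is what makes RIS-style sampling cheaper than simulating forward diffusions from every candidate seed set.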
Lemma 5.1 [13]. Given $G = (V, C, E, P, f)$, let $R_j$ be a random WRR set generated from G. For each seed set $S \subseteq C$, where $C \subseteq V$ is the candidate seed set: The estimation procedure for computing $\varphi(S)$ is given as Algorithm 1, which also preserves the following theoretical result.
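At first, a modular upper bound and lower bound will be presented for $\gamma(\cdot)$ according to [33]. Let $\gamma(j \mid S) = \gamma(S \cup \{j\}) - \gamma(S)$ denote the marginal value of j with respect to S. The following formulas are two modular upper bounds which are tight at the given set X:
$$m^1_X(S) = \gamma(X) - \sum_{j \in X \setminus S} \gamma(j \mid X \setminus \{j\}) + \sum_{j \in S \setminus X} \gamma(j \mid \emptyset),$$
$$m^2_X(S) = \gamma(X) - \sum_{j \in X \setminus S} \gamma(j \mid V \setminus \{j\}) + \sum_{j \in S \setminus X} \gamma(j \mid X).$$
For brevity, we use $m_X$ to refer to either one. A modular lower bound $h_X$ which is tight at a given set X is formulated as follows. Let $\pi$ be any permutation of V that places all the nodes of X at the front. Let $S^\pi_i = \{\pi(1), \pi(2), \ldots, \pi(i)\}$ be the chain constructed by this permutation, where $S^\pi_0 = \emptyset$ and $S^\pi_{|X|} = X$. Define:
$$h^\pi_X(\pi(i)) = \gamma(S^\pi_i) - \gamma(S^\pi_{i-1}).$$
Then $h^\pi_X(S) = \sum_{v \in S} h^\pi_X(v)$ is a lower bound for $\gamma(S)$, and it is tight at X: $h^\pi_X(S) \le \gamma(S)$ holds for any $S \subseteq V$, and in particular $h^\pi_X(X) = \gamma(X)$. The following results can be proved:
$\varphi(S) - m_X(S) \le \varphi(S) - \gamma(S) \le \varphi(S) - h^\pi_X(S)$, and these two bounds are differences of submodular and modular functions.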
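The modular bounds can be checked numerically on a small example. The sketch below assumes the standard form of the two modular upper bounds and of the permutation-based lower bound from [33]; the toy cost function (a unit-cost coverage function over hypothetical reachability sets) and all names are our own choices for illustration.

```python
from itertools import combinations

# Hypothetical toy instance: reach[v] is the set of nodes activated when v is a seed.
reach = {1: {1, 2}, 2: {2, 3}, 3: {3}, 4: {3, 4}}
V = frozenset(reach)


def gamma(s):
    """Cost of all nodes activated by seed set s (unit cost per node): a coverage function."""
    covered = set()
    for v in s:
        covered |= reach[v]
    return len(covered)


def gain(j, s):
    """Marginal cost gamma(j | s)."""
    return gamma(s | {j}) - gamma(s)


def m1(x, s):
    """First modular upper bound, tight at x."""
    return (gamma(x) - sum(gain(j, x - {j}) for j in x - s)
            + sum(gain(j, frozenset()) for j in s - x))


def m2(x, s):
    """Second modular upper bound, tight at x."""
    return (gamma(x) - sum(gain(j, V - {j}) for j in x - s)
            + sum(gain(j, x) for j in s - x))


def h(x, s):
    """Modular lower bound from a permutation that lists the nodes of x first."""
    order = sorted(x) + sorted(V - x)
    total, prefix = 0, frozenset()
    for v in order:
        if v in s:
            total += gain(v, prefix)
        prefix = prefix | {v}
    return total


def all_subsets(items):
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def bounds_hold():
    """Check tightness at x and h <= gamma <= min(m1, m2) for every pair (x, s)."""
    for x in all_subsets(V):
        if not (m1(x, x) == m2(x, x) == h(x, x) == gamma(x)):
            return False
        for s in all_subsets(V):
            if not (h(x, s) <= gamma(s) <= min(m1(x, s), m2(x, s))):
                return False
    return True


print(bounds_hold())
```

Replacing $\gamma$ by one of its modular upper bounds turns the difference into a submodular function minus a modular one, which is the quantity maximized in each iteration of the submodular-modular scheme.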
Using $\varphi(S) - m_X(S) \le \varphi(S) - \gamma(S)$, we can propose the submodular-modular algorithm. In each iteration, the maximization procedure is run for each of the two modular upper bounds, and the better solution is selected. Algorithm 2 can be proved to converge to a locally maximal solution.

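Theorem 5.3 Algorithm 2 monotonically increases the objective value. Furthermore, assuming that a local maximum of $\varphi(X) - m_{X^t}(X)$ is returned by the submodular maximization procedure, Algorithm 2 outputs a locally optimal solution.
Proof For either modular upper bound, we have:
$$\varphi(X^{t+1}) - \gamma(X^{t+1}) \ge \varphi(X^{t+1}) - m_{X^t}(X^{t+1}) \ge \varphi(X^t) - m_{X^t}(X^t) = \varphi(X^t) - \gamma(X^t).$$
To show that this algorithm converges to a local maximum, we assume that the submodular maximization procedure converges to a local maximum. Then, if the objective value does not increase in an iteration under both upper bounds, it implies that $\varphi(X^t) - m_{X^t}(X^t)$ is already a local optimum, in that (for both upper bounds) we have $\varphi(X^t) - m_{X^t}(X^t) \ge \varphi(X^t \cup \{j\}) - m_{X^t}(X^t \cup \{j\})$ for $j \notin X^t$ and $\varphi(X^t) - m_{X^t}(X^t) \ge \varphi(X^t \setminus \{j\}) - m_{X^t}(X^t \setminus \{j\})$ for $j \in X^t$. Note that $m^1_{X^t}(X^t \setminus \{j\}) = \gamma(X^t) - \gamma(j \mid X^t \setminus \{j\}) = \gamma(X^t \setminus \{j\})$ and $m^2_{X^t}(X^t \cup \{j\}) = \gamma(X^t) + \gamma(j \mid X^t) = \gamma(X^t \cup \{j\})$, and hence, if both modular upper bounds are at a local optimum, it implies $\varphi(X^t) - \gamma(X^t) \ge \varphi(X^t \cup \{j\}) - \gamma(X^t \cup \{j\})$ and $\varphi(X^t) - \gamma(X^t) \ge \varphi(X^t \setminus \{j\}) - \gamma(X^t \setminus \{j\})$. Hence, $X^t$ is a local optimum.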

Group coverage maximization algorithm
In this section, we will propose a weighted group coverage maximization algorithm for solving GPM. Let $\mathcal{U}$ be the set of groups. $\mathcal{U}(S)$ represents the set of groups which include at least one node of S, i.e., $\mathcal{U}(S) = \{U \in \mathcal{U} \mid U \cap S \ne \emptyset\}$. Then, $b(\mathcal{U}(S)) = \sum_{U \in \mathcal{U}(S)} b(U)$. Algorithm 3, shown below, selects the node with the maximum marginal gain at each step and runs in at most $O(knl)$ time. The greedy algorithm may give a better solution, but its running time is $O(kn\Gamma(nm + nl))$. We will compare several different strategies by experiments.

Algorithm 3 Weighted Group Coverage Maximization Algorithm (WGCMA)
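Input: An instance of GPM $G = (V, E, P)$, the number of seeds k. Output: a set of seed nodes, $S_k$.
1: $S_0 \leftarrow \emptyset$
2: for i = 1 to k do
3: $v^* \leftarrow \arg\max_{v \in V \setminus S_{i-1}} \big[ b(\mathcal{U}(S_{i-1} \cup \{v\})) - b(\mathcal{U}(S_{i-1})) \big]$
4: $S_i \leftarrow S_{i-1} \cup \{v^*\}$
5: end for
6: return $S_k$.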
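A straightforward Python rendering of this greedy selection is sketched below. It is a simplified illustration under our own naming (`covered_benefit`, `wgcma`), it recomputes the covered benefit from scratch for each candidate rather than using any incremental bookkeeping, and it breaks ties in favor of the node that appears first in the input list.

```python
def covered_benefit(seeds, groups, benefit):
    """b(U(S)): total benefit of the groups that contain at least one seed."""
    return sum(benefit[g] for g, members in groups.items() if members & seeds)


def wgcma(nodes, groups, benefit, k):
    """Greedily pick k seeds, each maximizing the marginal gain in covered benefit."""
    seeds = set()
    for _ in range(k):
        base = covered_benefit(seeds, groups, benefit)
        best, best_gain = None, -1
        for v in nodes:
            if v in seeds:
                continue
            gain = covered_benefit(seeds | {v}, groups, benefit) - base
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:  # fewer than k nodes available
            break
        seeds.add(best)
    return seeds


# Groups and benefits of the Fig. 1 example (node i stands for v_i).
groups = {"U1": {1, 2, 3, 4}, "U2": {6, 7, 8}}
benefit = {"U1": 20, "U2": 10}
chosen = wgcma(list(range(1, 9)), groups, benefit, 2)
print(sorted(chosen), covered_benefit(chosen, groups, benefit))
```

Note that this selection looks only at group membership, not at the diffusion process, which is why it is fast but may be outperformed by a simulation-based greedy strategy.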

Sandwich approximation framework
For GPM, we have formulated the lower bound and upper bound for ρ(·) . Algorithm 4 gives the sandwich approximation framework.

Algorithm 4 Sandwich Approximation Framework
Input: An instance of GPM $G = (V, E, P)$, $0 \le \epsilon, \delta \le 1$, and k. Output: a set of seed nodes, S.
1: Let $S_L$ be the output seed set of solving the lower bound $\underline{\rho}$ by the submodular-modular algorithm (Algorithm 2)
2: Let $S_Z$ be the output seed set of solving the upper bound $\overline{\rho}$ by the submodular-modular algorithm (Algorithm 2)
3: Let $S_A$ be the output seed set of solving $G = (V, E, P)$ by Algorithm 3
4: $S = \arg\max_{S_0 \in \{S_L, S_Z, S_A\}} EP(G, \epsilon, \delta, S_0)$ (by Algorithm 1)
5: return S

For the sandwich approximation framework, we can prove the following theoretical result.

Theorem 5.4 Let S be the seed set returned by Algorithm 4, and then, we have:
where $S^*_L$ is the optimal solution to maximize the lower bound problem, $S^*$ is the optimal solution of GPM, and $\alpha$ is the approximation ratio of Algorithm 2.
Proof Let $S^*_Z$ be the optimal solution to maximize the upper bound problem. Then, we have: and Let $S_{\max} = \arg\max_{S_0 \in \{S_L, S_Z, S_A\}} \rho(S_0)$, and then: It follows that:
Unfortunately, the performance of the sandwich framework depends on $\alpha$. Although we have proved that Algorithm 2 converges to a local optimum, the ratio $\alpha$ is still an open problem. According to Theorem 5.4, the difference between $\rho(S^*)$ and $\rho(S^*_L)$ has a great influence on the performance of Algorithm 4. Iyer and Bilmes [33] studied the minimization problem of the difference between submodular functions. Since the difference between $\rho(S^*)$ and $\rho(S^*_L)$ may be bounded, we have the following result.
Theorem 5.5 Let S * L be the optimal solution to maximize the lower bound problem and S * is the optimal solution of GPM, and then, we have:

Comparison with different heuristic strategies
We will compare the Sandwich Approximation Framework (SAF) with the Greedy Strategy (GS) proposed by Kempe [4] and with the Maximum Outdegree (MO) method, which chooses the k nodes with the largest outdegree. Algorithm 3, the Weighted Group Coverage Maximization Algorithm, is denoted MC for simplicity.

Experiments
To evaluate our algorithms, we test them on two datasets from [34,35]. The first dataset is the Facebook-like Forum Network, which was collected from the online community of Facebook. Users' activities in this forum are recorded in this dataset, which contains one-mode and two-mode data. There are 899 users, and the relationship between users is stored in the one-mode data. Besides the one-mode data, there are 522 topics, and the two-mode data contain the interest network of the 899 users and 522 topics. The users related to a topic are represented as a group. The second dataset is Newman's scientific collaboration network, which represents a co-authorship network. These data are based on preprints posted to the Condensed Matter section of the arXiv E-Print Archive from 1995 to 1999. The one-mode data indicate the relationship among the co-authors. The relation between an author and a paper is shown in the two-mode data. The authors related to the same paper are considered as a group. Table 2 shows the details of these two datasets.

Procedure
The instances are formulated from the above datasets. The basic graph is constructed from the one-mode dataset. The set of groups comes from the two-mode dataset. The benefit of a group is derived from the size of the group: in this paper, the benefit is defined as the size of the group multiplied by a factor of 10. The cost of each activated node is generated as a random number from 0 to 1. All programs are written in Python 3.6 and run on a Linux server with 16 CPUs and 256 GB RAM.
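The construction of benefits and costs can be sketched as follows. This is not the authors' experiment code: the function name `build_instance`, the `(user, group_id)` pair format for the two-mode data, and the sample values are assumptions made for illustration.

```python
import random


def build_instance(users, memberships, rng):
    """Build groups, benefits, and costs from two-mode (user, group_id) pairs.

    The benefit of a group is 10 times its size; the cost of each user is
    drawn uniformly at random from [0, 1).
    """
    groups = {}
    for user, gid in memberships:
        groups.setdefault(gid, set()).add(user)
    benefit = {gid: 10 * len(members) for gid, members in groups.items()}
    cost = {u: rng.random() for u in users}
    return groups, benefit, cost


# Hypothetical two-mode data: users 1 and 2 posted in topic "t1", user 2 in "t2".
users = [1, 2, 3]
memberships = [(1, "t1"), (2, "t1"), (2, "t2")]
groups, benefit, cost = build_instance(users, memberships, random.Random(0))
print(groups, benefit)
```

Passing a seeded `random.Random` instance keeps the generated costs reproducible across runs.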

Experimental results
In the comparison of the different seed selection strategies, the Greedy Strategy (GS) returns a comparatively higher benefit than the SAF and MO methods. The MO strategy initially gives a higher profit than MC. SAF outperforms MO as the number of seed nodes increases. Figures 5 and 6 show the experimental results. Figures 7 and 8 show the performance of SAF for datasets 1 and 2, respectively. The main results are as follows:

Profit increases with increase of seed number for fixed β
The experiments are carried out with three values of $\beta$: 0.5, 0.8, and 1. From the graphs, it can be observed that, for a given $\beta$, the profit increases with the number of seeds in the seed set. A small seed set is able to activate only a few groups, which results in a small profit. As the size of the seed set increases, a larger number of groups is likely to be activated, so the profit increases with the number of seeds.

Profit decreases with increase in β
The experiments are carried out with three values of $\beta$: 0.5, 0.8, and 1. As $\beta$ increases, the number of groups activated by a given seed set decreases, which in turn decreases the profit. Since $\beta$ is the determining factor for the activation of a group, fewer groups get activated as $\beta$ becomes larger. As a result, the profit generated with a lower $\beta$ is much higher than the profit generated with a higher $\beta$. The seed set activates the same nodes, but the number of activated groups decreases.

Gap of upper bound and lower bound
It is observed from the graphs of dataset 1 that the gap between the upper bound and the lower bound increases with $\beta$. This is due to the formulation of the upper bound in our problem and the size of the groups in dataset 1. In our experiments, the upper bound is fixed even as $\beta$ varies. As $\beta$ increases, the profit decreases, and since the upper bound is fixed, the gap between the upper bound and the lower bound becomes large. For dataset 2, however, the gap remains almost the same as $\beta$ increases, because the groups are smaller: the average group size is 3.7, compared with 14.6 in dataset 1.

Conclusion
This paper studied the profit maximization problem of information propagation in online social networks. Group activation was considered in this novel IM model. Each activated group yields a benefit, while an information diffusion cost is incurred for every activated user. The group profit maximization (GPM) problem looks for k seed users to propagate information such that the expected profit is maximized, where the profit combines the benefit of the activated groups and the cost of the activated users. GPM was proved to be NP-hard, and the objective set function was shown to be neither submodular nor supermodular. We proposed a weighted version of the group coverage maximization strategy for solving GPM. In addition, a sandwich approximation framework was presented with theoretical analysis. Finally, the experimental results showed that our proposed algorithms are effective and efficient. For future research, novel efficient methods for solving non-submodular optimization problems deserve attention.