Influence maximization in social media networks concerning dynamic user behaviors via reinforcement learning

This study examines the influence maximization (IM) problem via information cascades within random graphs, the topology of which dynamically changes due to the uncertainty of user behavior. This study leverages the discrete choice model (DCM) to calculate the probabilities of the existence of the directed arc between any two nodes. In this IM problem, the DCM provides a good description and prediction of user behavior in terms of following or not following a neighboring user. To find the maximal influence at the end of a finite-time horizon, this study models the IM problem by using multistage stochastic programming, which can help a decision-maker to select the optimal seed nodes by which to broadcast messages efficiently. Since computational complexity grows exponentially with network size and time horizon, the original model is not solvable within a reasonable time. This study then uses two different approaches by which to approximate the optimal decision: myopic two-stage stochastic programming and reinforcement learning via the Markov decision process. Computational experiments show that the reinforcement learning method outperforms the myopic two-stage stochastic programming method.

sequence is the seed node, and a spreading cascade is a directed tree that has as its root that first node. The tree captures the influence between nodes (with branches that represent who transmitted the information, and to whom), and it unfolds in the same order as the activation sequence. There are two typical information diffusion models, namely independent cascade [3] and linear threshold [4]. The differences between these two models are as follows:
• Independent cascade: if $x_{t-1}$ is the set of newly activated nodes (the seed nodes) at time step $t-1$, then, at each time step $t$, each node $i$ belonging to $x_{t-1}$ will infect each inactive neighbor $j$ with probability $p_{ij}$.
• Linear threshold: each node $i$ has a threshold $\theta_i$ in the interval [0, 1]; at each time step $t$, each inactive node $j$ becomes active if the total influence from all activated neighbors satisfies $\sum_{i \in H_{t-1}} b_{ij} > \theta_j$, where $H_{t-1}$ is the set of nodes activated at time $t-1$ or earlier.
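The independent cascade rule can be illustrated with a minimal simulation sketch. This is not the authors' implementation; the function name `independent_cascade`, the dictionary-based graph encoding, and the `rng` argument are assumptions made for illustration only.

```python
import random


def independent_cascade(out_neighbors, prob, seeds, rng):
    """Run one independent cascade and return the set of active nodes.

    out_neighbors: dict mapping node -> list of nodes it can infect
    prob: dict mapping (i, j) -> activation probability p_ij
    seeds: iterable of initially active nodes
    rng: random.Random instance (illustrative helper argument)
    """
    active = set(seeds)
    newly_active = list(active)
    while newly_active:
        next_wave = []
        for i in newly_active:
            for j in out_neighbors.get(i, []):
                # Each newly active node gets one chance per inactive neighbor.
                if j not in active and rng.random() < prob[(i, j)]:
                    active.add(j)
                    next_wave.append(j)
        newly_active = next_wave
    return active
```

With all probabilities equal to 1 the cascade reaches every node reachable from the seeds; with probability 0 on the first arc it stops at the seed.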
Our study is based on the assumption of the independent cascade model. Among many pioneering studies, the authors of [5] propose the expectation maximization algorithm to predict information diffusion probabilities in the independent cascade model. The authors of [6] apply the influence maximization problem with an independent cascade model in prevailing viral marketing. Furthermore, the authors of [7] showed for the first time that computing influence spread in the independent cascade model is NP-hard; these studies have led to the design of a new heuristic algorithm that can easily scale up, relative to the greedy algorithm proposed in [1]. The influence maximization problem involves finding the nodes for the initial injection of information so as to maximize influence in a given social network with diffusion probabilities. Our problem is a special case of the influence maximization problem. Our independent cascade model is based on the assumption that there are multiple seeds that broadcast information to the whole network. Figure 1 shows the difference between the multi-seed independent cascade model and a traditional single-seed model. In the single-seed model, node 1 is selected as the seed, and the minimal time needed to broadcast to the whole network is 2. In the multi-seed model, however, both nodes 1 and 2 are selected as seeds, and each node can receive the information within one time period. If every node in the network is a seed, the minimal broadcast time will be 0, but this is not an economical method (owing to the seed cost). We look to pinpoint an efficient means of seed selection that can balance seed cost and broadcast time.

Fig. 1 Information cascade based on seed type
Unlike previous research on the independent cascade model, we assume that information diffusion probabilities or network topology probabilities dynamically change according to user behavior. Among the studies of dynamic user behavior, [8] introduces the concept of behavior change support systems. Based on that work, the authors of [9] found ample evidence of the strong influence exerted by social interaction on people's behaviors. The authors of [10] conducted extensive statistical analysis on large-scale real data and found that the general form of exponential, Rayleigh, and Weibull distribution can effectively preserve the characteristics of behavioral dynamics. The networked Weibull regression model for behavioral dynamics modeling is found to significantly improve the interpretability and generality of traditional survival models in [10].
Cascading phenomena are typically characterized by a dynamic process of information propagation among the nodes of a network, where nodes can repost information after seeing it posted by their neighbors. Moreover, the content and value of information may affect not only the reach (or depth) of a cascade, but also the topology of the underlying network; this is due to effects whereby nodes may either sever their ties with neighboring nodes (where the transmitted information is deemed unreliable, malicious, or both) or form new ties with nodes that transmit "reliable" information. In an independent cascade, people observe the choices of others and make decisions based on these observations, while concurrently considering their own personal preferences.
This phenomenon arises frequently in the field of behavioral economics and other social sciences. One real-world example is viral marketing, in which people spread information about a product to other people in their social networks through an independent cascade, with the objective of promoting the product by leveraging existing social networks. A recent study of social networks [11] suggests that such processes may occur in a "bursty" fashion; that is, the patterns of network links change abruptly as a result of significant independent cascades. Thus, new information may create within a network a "burst" of node activations and edge activations/deactivations. In a decentralized autonomous network, agents or nodes act independently and behave according to their utility functions. To model their autonomous behaviors, we implement the concepts of discrete choice models, as drawn from behavioral economics [12,13].
Our contributions
Bearing in mind the endpoint of maximizing the influence of the information provider within a limited time, we model our problem as a seed selection problem of information spreading in dynamic networks that feature a random topology. In a social network, each user can have as many as three roles, namely source user, message sender (i.e., followee of neighbors), and message receiver (i.e., follower of neighbors). It is possible that one node can play these three different roles at different times. For example, Alice writes and posts a message on a social media network. At that moment, she is the source user. In this network, Alice and Bob are friends, which means Alice is following Bob and Bob is also following Alice. Bob sees Alice's message, and so Bob is a receiver of Alice's message. Bob likes this message and reposts it to his own followers. At this point Bob is also a message sender, given his "repost" action. Since Alice is also Bob's follower, she sees that her message is reposted, and at this point she becomes a message receiver with respect to the reposted message.
Generally, we can decompose our problem into two steps, as follows:
• Seed selection: this can be controlled by the information provider, who selects a proper set of initial seeds that will receive the designated information. In the previous example, Alice is the seed node.
• Information cascade: this includes two variables. One is the node activation status, which describes the process wherein the user receives a message from their followee. The other is the node repost decision, which is controlled by the message receiver. In our model, the repost decision depends on the user preference and the topic of the received message. In the previous example, Bob reposted the message because he likes it. However, if Bob dislikes this message, what will happen? Since the message is coming from Alice, Bob may think Alice has tastes different from his, and so he might unfollow Alice. The "unfollow" action will break the information flow from Alice to Bob, which leads to a change in the network topology.
In this study, we propose an information maximization model through independent cascades, with random graphs. The network size and node preferences are assumed to be given, while the friendship between any two users (i.e., arc connection) dynamically changes. Our model can help decision-makers choose the optimal action when they face an uncertain network topology. The stochastic formulation considers endogenous uncertainty, which is represented by the binary choice probability distribution of arc connection between any two nodes. To solve this problem, we design two problem-specific algorithms: one involves two-stage stochastic programming with a myopic policy, while the other involves reinforcement learning and the Markov decision process. We summarize the contributions of this study as follows:
• We introduce the discrete choice model in the information maximization problem, where the network topology dynamically changes during the independent cascading process.
• We develop practical algorithms to solve the multistage stochastic programming problem under endogenous uncertainty.
• To avoid directly dealing with large state spaces of node activation, we exploit the implicit Monte Carlo-based partially observable Markov decision process.
• We compare the results using two algorithms and various sample sizes.
The remainder of this paper is structured as follows. After having briefly described information maximization and the independent cascade problem in random graphs within a finite-time horizon, we provide in the "Mathematical models" section the original multistage stochastic programming models with several assumptions. In the "Solution approaches" section, we design two algorithms to solve this problem. The computational results are presented in the "Computational experiments on algorithms' convergence" section, while the "Conclusion" section provides concluding remarks.

Mathematical models
In a social network, information spreads based on user-to-user interactions. Initially, some nodes will carry the designated information after being selected as seed nodes. During an independent cascade, each node plays two roles, namely that of the message receiver, who is activated by a certain message from neighbors, and that of the message sender, who reposts the received message to their own neighbors. Information providers have several messages on hand, and they want to maximize their influence in a network. While the network users may have different preferences vis-à-vis the various messages, information providers face the problem of making the best selection of seed nodes (i.e., that which maximizes their influence).
In each period, the information provider will select the seed nodes by which to disseminate a certain message in the social network. Sometimes, it is the initial posting of a certain message, while sometimes it is a post repeated to increase network activity. Once the source user posts the message, the followers of the source user automatically receive the information. A follower makes decisions based on their preferences, with different types of decisions being made as users play multiple roles in the social network (i.e., simultaneously being a follower and a followee). Information always flows from the followee to the follower, and the track of information transmission has a major influence on the network topology, where user relationships or arc connections dynamically change due to user preferences and actions. Since the information maximization problem is subject to various uncertainties (e.g., network topology and user actions), we model this problem with stochastic programming, with the objective of maximizing the expected total influence within a finite-time horizon.

Problem description
To clearly demonstrate the information cascade process of our problem, we provide a simple example. Consider viral marketing in a random network $G(n, p)$, where a company wants to promote two products in a network that features an uncertain topology. To maximize its influence, the company wants to select certain nodes as influencers who will post the promotion message in the network. Figures 2 and 3 illustrate an example of the entire information cascade process in a four-node random network with transition probability $p = 0.5$. The symbols used in these figures are shown in Table 1. The network properties include the network size, transition probability, node preference, and initial activation status (Fig. 2). Figure 3 shows the dynamic network status for each information transition. The following symbols are used to explain the information cascade process. Assume there are two message topics (i.e., BLUE and GREEN) and that the information will cascade in the random network shown in Fig. 2a, which has four nodes and whose network topology dynamically changes with initial arc probability $p = 0.5$. Before seed selection, we know about node preference in terms of message topic (Fig. 2b). Some nodes already knew of the messages before the information cascade, so we say these nodes have been "pre-activated". In Fig. 2c, nodes (1) and (4) were pre-activated by message BLUE at the initial state.
Within a single period, the information cascade usually includes four steps, as follows: seed selection, message transmission (the node sends messages), node activation (the node receives messages), and network topology probability updating. When the message provider selects the seed, the message is broadcast by the seed node in the network, but it cannot guarantee that all the other network nodes will receive the message: only followers are able to receive the message from the message sender. Following information transmission, the network topology may change. There is a strong likelihood that the link from the followee will be broken if there is a mismatch between the received message and the follower's preference. This means some directed arcs will break down, even if there were connections in the most recent time period; this is due to the uncertainty of the topology. This uncertain topology is modeled by a discrete choice model with two alternatives.
At time $t = 0$, node (1) is selected as the seed node of message BLUE, and node (2) is selected as the seed node of message GREEN. These two nodes then broadcast messages in the network. The initial probability of the directed arc connection between any two nodes is 0.5. When message transmission occurs, the real topology will be one scenario among all possibilities (Fig. 3b). The arc from node (1) to node (3) is disconnected, as is that from node (2) to node (4); this means node (3) cannot receive message BLUE and node (4) cannot receive message GREEN. Since nodes (1) and (2) are seed nodes, they are activated by themselves. Node (2) is activated by message BLUE from node (1). Node (2) dislikes message BLUE, and this will break the friendship between nodes (1) and (2). We use utility to measure the friendship. When a node receives a message for the first time, we assume it has a double effect on the change in utility. We reduce the utility from node (1) to node (2), because this is the first time node (2) receives this message. Node (4) is also activated with message BLUE from node (1). Since node (4) likes this message and had not received this message in any previous time period, node (4) will become the new source node for message BLUE and will repost message BLUE in the network (Fig. 3f). Similar to the utility of the arc from node (1) to node (2), the utility from node (1) to node (4) will be increased by 2 due to effective message transmission. The topology probability of the directed arc connection at the next time period is updated by the change in utility. For example, the probability of a directed arc from node 1 to node 2 is updated as

$$\mathrm{Prob}(a^{t=1}_{12} = 1) = \mathrm{Prob}(a^{t=1}_{12} = 1 \mid a^{t=0}_{12} = 1) \cdot \mathrm{Prob}(a^{t=0}_{12} = 1) + \mathrm{Prob}(a^{t=1}_{12} = 1 \mid a^{t=0}_{12} = 0) \cdot \mathrm{Prob}(a^{t=0}_{12} = 0),$$

where $a^t_{12}$ is the directed arc connection status at time $t$ and $u^t_{12}$ is the utility at time $t$ if $a^t_{12} = 1$. The details of probability updating are explained in the "Policy improvement" subsection.

Mathematical formulation
We formulate the Independent Cascade within Random Graph (ICRG) problem by using a stochastic programming model. The authors of [14] introduce the modeling and solution methods used to find optimal decisions in problems that involve uncertain data. In our problem, the independent cascade process includes three decision variables: seed selection $x$, node activation $y$, and message transmission $z$; the uncertainty is the network topology. The notation is shown in Table 2.
The original stochastic programming model [SP] consists of objective function (1a) and constraints (1b) through (1i). In objective function (1a), the total influence has two parts: one is the seed cost $Q(x)$, and the other is the activation reward $R(y)$. Constraint (1b) shows that the probability of scenario $s$ depends on the probability of the arcs between any two nodes. The directed arc $a_{ij}$ from node $i$ to node $j$ is a random variable that follows the logit binary choice model with utility $U_{ij}$.
Utility $U_{ij}$ is a function that measures user friendship, or the strength of the arc connection, and includes two terms: observed utility $u_{ij}$ and unobserved utility $\varepsilon_{ij}$. The observed utility $u^{t,s}_{ij}$ at time $t$ and scenario $s$ is the cumulative impact from node $i$ to node $j$ over all message topics. The current directed arc $a^{t,s}_{ij}$ from node $i$ to node $j$ decides whether the impact happens, the sign of the impact is decided by the preference $b_{kj}$ of node $j$ for message $k$, and the amount of the impact is decided by the transmission decision $z^{t-1,s}_{ki}$ of message $k$ and node $i$ at the previous time period. The unobserved utility $\varepsilon^{t,s}_{ij}$ is assumed to have a logistic distribution.
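To make the arc model concrete, the sketch below computes a binary logit connection probability from an observed utility and combines two conditional probabilities by the law of total probability, as in the worked example for the arc from node 1 to node 2. It assumes the standard binary logit form with the "disconnect" alternative normalized to utility 0, which is consistent with the initial probability of 0.5 at zero utility; the function names and the two conditional probabilities passed in are illustrative, not taken from the paper.

```python
import math


def arc_probability(observed_utility):
    """Binary logit probability that an arc is connected.

    Assumes the alternative (disconnected) has utility 0, so that
    Prob(connected) = exp(u) / (1 + exp(u)).
    """
    if observed_utility >= 0:
        return 1.0 / (1.0 + math.exp(-observed_utility))
    # Numerically stable branch for negative utilities.
    e = math.exp(observed_utility)
    return e / (1.0 + e)


def next_arc_probability(p_prev, p_given_connected, p_given_disconnected):
    """Law of total probability over the arc status at the previous period."""
    return p_given_connected * p_prev + p_given_disconnected * (1.0 - p_prev)
```

For instance, `arc_probability(0.0)` gives 0.5, and a negative utility (a disliked message) pushes the probability below 0.5.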

Symbol            Definition
Indices and sets
Parameters
$a^{t,s}_{ij}$    The directed arc from node $i$ to node $j$
$b_{ki}$          The information preference of node $i$ with respect to message $k$
$c_{ki}$          The pre-activation, i.e., whether node $i$ has known message $k$ before the seed selection

Before the information cascade, there is no message transmission, and no node knows anything from the other nodes. Whether the arc is connected or disconnected, the observed utility is always 0.
At the initial time period $t = 0$, the seed nodes broadcast the message in the network, and some nodes may receive the message from a seed node.
From time $t = 1$ to the end of the time horizon $t = |T|$, in addition to the seed nodes, the other nodes that received the message are also involved in message transmission.
The total seed cost equals the number of seed nodes. The reward equals the weighted average of the final number of active nodes. Constraint (1c) shows that the activation reward depends on message weight, node preference, and node activation status $y$ at the end of the time horizon $t = |T|$. Constraint (1e) is the nonanticipativity constraint; in [15], the author designs an algorithm that uses the Lagrangian dual method to solve a stochastic programming model with nonanticipativity constraints. For constraint (1e), the scenario subset $S_t$ is defined over the scenario set $S$, where the number of directed arcs is $|I| \cdot (|I| - 1)$, the number of combinations of all arc statuses is $|A| = 2^{|I| \cdot (|I| - 1)}$, and the scenario set cardinality is $|S| = |A|^{|T|} = 2^{|I| \cdot (|I| - 1) \cdot |T|}$.
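The growth of the scenario set can be checked numerically. The short sketch below evaluates the cardinality formula $|S| = 2^{|I| \cdot (|I| - 1) \cdot |T|}$ for small networks; the function name is an illustrative choice.

```python
def scenario_count(num_nodes, num_periods):
    """Number of topology scenarios for a directed graph without self-loops."""
    num_arcs = num_nodes * (num_nodes - 1)   # |I| * (|I| - 1) directed arcs
    arc_states = 2 ** num_arcs               # |A|: every arc is on or off
    return arc_states ** num_periods         # |S| = |A|^{|T|}


if __name__ == "__main__":
    for nodes in (4, 7):
        for periods in (1, 2, 3):
            print(nodes, periods, scenario_count(nodes, periods))
```

Even the four-node example has 4096 possible topologies per period, which is why the full multistage model cannot be enumerated for realistic sizes.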
The information cascade process is limited by 4 constraints. Constraints (1f, 1g) define the initial node activation and transmission decision at time t = 0 . Constraints (1h, 1i) define the information diffusion rule from time t = 1 to the end t = |T |.
In constraint (1f), some nodes are active at the beginning because they already knew the message ($c_{ki}$) or they are selected as seeds ($x_{ki}$). So, at the initial time period $t = 0$, a node is inactive if and only if it did not know message $k$ before and it is not selected as a seed node. Due to the binary property, constraint (1f) can be linearized. The initial message transmission happens if and only if the node is selected as a seed node, as shown in constraint (1g).
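The "if and only if" statement for the initial activation can be read as $1 - y = (1 - c)(1 - x)$ for binary $c$, $x$, and $y$. This algebraic form is an assumption made here for illustration, not the paper's stated constraint. Under that assumption, the sketch below enumerates all binary cases and confirms that the product form agrees with the logical OR and with a linear characterization ($y \ge c$, $y \ge x$, $y \le c + x$).

```python
from itertools import product


def initial_activation(c, x):
    """Assumed form: inactive iff not pre-activated and not a seed."""
    return 1 - (1 - c) * (1 - x)


def satisfies_linear_form(y, c, x):
    """Linear inequalities that pin y to max(c, x) when all values are binary."""
    return y >= c and y >= x and y <= c + x


def check_all_cases():
    for c, x in product((0, 1), repeat=2):
        y = initial_activation(c, x)
        if y != max(c, x) or not satisfies_linear_form(y, c, x):
            return False
        # The other binary value must violate the linear form.
        if satisfies_linear_form(1 - y, c, x):
            return False
    return True
```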
Apart from seed selection, a node may also be activated by two causes from time $t = 1$ to the end $t = |T|$, as shown in constraint (1h). One is that once node $i$ was activated by message $k$ at the previous time period $t - 1$, it will remain active in the future. The other is that at least one of its followees transmitted message $k$ at the previous time period $t - 1$. Constraint (1h) can be linearized by a set of inequalities. Constraint (1h-L3) is based on the independent cascade assumption, which means the node will be activated ($y^{t,s}_{ki} = 1$) if a neighbor node ($a^{t,s}_{ji} = 1$) decides to transmit the message ($z^{t-1,s}_{kj} = 1$). For node $i$, we define the number of all its neighbors as the degree $DEG_i = \sum_{j \in I \setminus \{i\}} a_{ji}$. Since the receiver node will be activated once one of its neighbors transmits the message, constraint (1h-L3) for all neighbor nodes $j$ can be aggregated by the receiver node $i$.
Constraint (1h-L4) shows that the node stays inactive if all the possible activation causes fail. Constraint (1i) shows that node $i$ has two motivations to transmit message $k$: one is that node $i$ is selected as a seed; the other is that node $i$ is a newly active node of message $k$ and likes this message. Constraint (1i) can be linearized by a set of inequalities. Constraint (1i-L2) is based on the independent cascade assumption, which means the node is willing to transmit the message ($z^{t,s}_{ki} = 1$) if it likes this message ($b_{ki} = 1$), it was just activated ($y^{t,s}_{ki} = 1$), and it never knew this message before ($y^{t-1,s}_{ki} = 0$). Constraint (1i-L3) shows that the node decides not to transmit the message if all the transmission motivations are invalid. The computational complexity of this model is $O(2^{|K||I| \cdot \log |T|} |S| \cdot |T|)$. To reduce the complexity, we add an assumption on seed selection: the decision-maker is only allowed to select one seed node for each message within one time period. After adding this assumption as a constraint, the computational complexity is reduced to $O(|I|^{|K| \cdot \log |T|} |S| \cdot |T|)$, and the objective function (1a) can be simplified accordingly.
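As an illustration of the diffusion rules described for constraints (1h) and (1i), the following sketch advances the cascade by one period for a single message in one realized topology. It is a simplified reading of the rules in the text, not the mixed-integer formulation solved in the paper; the function name, the list-based encoding, and the choice to treat a newly selected seed as active are assumptions.

```python
def cascade_step(y_prev, z_prev, arcs, likes, seeds):
    """Advance node activation and transmission by one period.

    y_prev[i]: 1 if node i was active at the previous period
    z_prev[j]: 1 if node j transmitted at the previous period
    arcs[j][i]: 1 if the arc from followee j to follower i is connected now
    likes[i]:  1 if node i likes the message (preference b)
    seeds[i]:  1 if node i is selected as a seed in this period
    """
    n = len(y_prev)
    y = [0] * n
    z = [0] * n
    for i in range(n):
        # Activation: already active, selected as seed, or a connected
        # followee transmitted the message in the previous period.
        heard = any(arcs[j][i] and z_prev[j] for j in range(n) if j != i)
        y[i] = 1 if (y_prev[i] or seeds[i] or heard) else 0
        # Transmission: seed, or newly active node that likes the message.
        newly_active = y[i] == 1 and y_prev[i] == 0
        z[i] = 1 if (seeds[i] or (newly_active and likes[i])) else 0
    return y, z
```

In a three-node example where node 0 transmitted last period and only the arc from node 0 to node 1 is connected, node 1 becomes active and reposts only if it likes the message, while node 2 stays inactive.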

Solution approaches
Since the network topology changes dynamically, the decision-maker faces unstable node friendships. The uncertain directed arc connections cause the number of scenarios to grow exponentially with the network size $|I|$ and time horizon $|T|$. To handle the large number of scenarios, we use two approaches to solve the information cascade in random graph problem:
• Myopic policy: does not explicitly use any forecasted network topology and separates the multistage problem into several two-stage problems (MYSP) by discrete time.
• Reinforcement learning: reformulates the stochastic programming model as a Markov decision process (MDP).

Two-stage stochastic programming with myopic policy
Contrary to the original model, the myopic model focuses on the current network topology and ignores future changes to the arcs. The seed selection ($x_t$) is based only on the current user connections ($a_t$) and aims to find the local maximal influence on node activation in the next time period ($y_{t+1}$):

$$x_t = \arg\max R(y_{t+1}, a_t).$$

By using the myopic method, the multistage problem is decomposed into several two-stage problems. The first-stage variable is seed selection, and the second-stage variables are node activation and the node repost decision. The given parameters include the node preference, the probability of the current network, and the node repost decision of the previous time period. Since we select seeds to find the maximal expected influence at the current time period, the decision only happens within one time period. The time index and set can then be removed, and the node repost decision of the previous time period is added to the known parameters. The notation of the myopic model is shown in Table 3.
In the mathematical formulation of the myopic model, $\hat{y}_{ki}$ is the activation status using the decision of the previous seed selection $x_{kj}$, $\hat{c}_{ki}$ is the parameter of the previous myopic model, and $d_{ki}$ is the node repost decision using the decision of the previous seed selection $x_{kj}$. The parameter transition between two myopic models is shown in Fig. 4.
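The myopic idea of choosing the seed that maximizes next-period influence can be sketched for a single message as follows. This is a deliberately simplified stand-in: it uses a closed-form expectation under independent arcs instead of the scenario-based two-stage program solved with CPLEX in the paper, and it ignores seed cost, message weights, and node preference. All names (`expected_next_active`, `myopic_seed`, `rho`) are illustrative.

```python
def expected_next_active(seed, active, transmitters, rho):
    """Expected number of active nodes next period if `seed` is chosen.

    active[i]: 1 if node i is already active
    transmitters: nodes that repost this period (previous repost decisions)
    rho[j][i]: current probability that the arc from j to i is connected
    """
    senders = set(transmitters) | {seed}
    total = 0.0
    for i in range(len(active)):
        if active[i] or i == seed:
            total += 1.0
            continue
        # Node i stays inactive only if every sender's arc to i is missing.
        miss = 1.0
        for j in senders:
            if j != i:
                miss *= 1.0 - rho[j][i]
        total += 1.0 - miss
    return total


def myopic_seed(active, transmitters, rho):
    """Pick the seed with the largest expected one-step influence."""
    return max(range(len(active)),
               key=lambda s: expected_next_active(s, active, transmitters, rho))
```

Because the rule looks only one period ahead, it can favor a well-connected seed now even when a different seed would lead to a larger influence at the end of the horizon.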

Reinforcement learning with Markov decision process
Table 3 Notation of myopic two-stage stochastic programming model
Symbol            Definition
Indices and sets
Parameters
$a^{s}_{ij}$      The directed arc from node $i$ to node $j$
$b_{ki}$          The information preference of node $i$ with respect to message $k$
$c_{ki}$          The pre-activation, i.e., whether node $i$ has known message $k$ before the seed selection
$d_{ki}$          The node repost decision, i.e., whether node $i$ will repost message $k$ in the network
$w_k$             The influence weight of message $k$
Decision variable
$x_{ki}$          Binary variable, seed selection: whether node $i$ is selected as the seed node of message $k$ at time $t$
$y^{s}_{ki}$      Binary variable, node activation: whether node $i$ is activated by message $k$ at time $t$ and scenario $s$

Our problem can be defined as a Markov decision process (MDP) that describes how the information provider chooses a source user when facing the given information activation status of all users in the network. We use reinforcement learning to learn the policy based on state-action pairs $(s, a)$. The notation of the reinforcement learning with Markov decision process model is shown in Table 4. In general, an MDP is described by a 4-tuple $(S, A, P, R)$, whose elements are the states, actions, transition probabilities, and rewards. In our problem, these four terms are defined as below:
• $S$: the finite set of states, i.e., activation status, $s \in S$
• $A$: the finite set of actions, i.e., source user selection, $a \in A$
• $P$: the probability of transition from $s$ to $s'$ through action $a$, $P_a(s, s')$
• $R$: the expected reward of transition from $s$ to $s'$ through action $a$, i.e., weighted information influence, $R_a(s, s')$.
The probability function is unknown since the network topology is uncertain. The reward function is the weighted information influence. We introduce the Q-learning algorithm to compute optimal policies, which includes policy evaluation and policy improvement.

Policy evaluation
If we have a policy, the probabilities of the actions taken at each state are known. The MDP is then turned into a Markov chain (with rewards), and we can compute the expected total reward collected over time using this policy. For a given policy $\pi(s)$, the state-action value function $Q^{\pi}(s, a)$ is used to evaluate the policy value; in it, $\gamma$ is the discount factor and $\pi(s, a)$ is the probability of taking action $a$ at state $s$.
Consider a network with node size $|I| = 4$ and information size $|K| = 2$. The size of the state set is $|S| = 2^{|K| \cdot |I|} = 256$ and the size of the action set is $|A| = |I|^{|K|} = 16$. Given the initial state (no activation) $s$, the information provider has a trivial policy $\pi(s)$, in which each node has equal probability of being a seed.
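The state and action counts in this example can be reproduced with a few lines. The sketch assumes, as in the text, that a state assigns an activation bit to every (message, node) pair and that an action picks one seed node per message; the function names are illustrative.

```python
from itertools import product


def state_space_size(num_messages, num_nodes):
    """Each (message, node) pair is either activated or not."""
    return 2 ** (num_messages * num_nodes)


def enumerate_actions(num_messages, num_nodes):
    """One seed node per message: tuples (seed for message 0, seed for message 1, ...)."""
    return list(product(range(num_nodes), repeat=num_messages))


def uniform_policy(actions):
    """Trivial policy: every action is equally likely."""
    return {a: 1.0 / len(actions) for a in actions}
```

For 2 messages and 4 nodes this gives 256 states and 16 actions, each chosen with probability 1/16 under the trivial policy.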
We run several simulations of the independent cascade with random actions and discount factor $\gamma = 1$. The simulation uses the Monte Carlo method. Before the cascade starts, we generate a large number of pseudo-random uniform variables from the interval [0, 1], which are used to decide the network topology. If the value falls into the probability interval of $\pi(s, a)$, we take action $a$ when we meet state $s$. For example, suppose we have three options for state $s$, actions $a_1$, $a_2$, $a_3$, and the probabilities of taking these actions are $\pi(s, a_1) = 0.1$, $\pi(s, a_2) = 0.3$, $\pi(s, a_3) = 0.6$. Based on the definition of the Monte Carlo method, there are three probability intervals, [0, 0.1], (0.1, 0.4], and (0.4, 1], with respect to the different actions. When we draw the random number 0.5, it falls into the probability interval (0.4, 1], which means we take action $a_3$.
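The interval rule in this example amounts to sampling from the cumulative distribution of the policy. A minimal sketch, with an illustrative function name and zero-based action indices, is:

```python
from bisect import bisect_left
from itertools import accumulate


def select_action(u, probabilities):
    """Return the index of the action whose probability interval contains u.

    With probabilities [0.1, 0.3, 0.6] the intervals are
    [0, 0.1], (0.1, 0.4], (0.4, 1], matching the example in the text.
    """
    upper_bounds = list(accumulate(probabilities))
    index = bisect_left(upper_bounds, u)
    # Guard against rounding that leaves the last bound slightly below 1.
    return min(index, len(probabilities) - 1)
```

Here `select_action(0.5, [0.1, 0.3, 0.6])` returns index 2, i.e., action $a_3$.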
The average final influence of each action is shown in Table 5. Figure 5 shows the same policy applied in different states to calculate the expected total reward, that is, the total number of activated nodes at the end of the time horizon.
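Averaging the final influence over simulated cascades is a plain Monte Carlo estimate of the value of each action. The sketch below shows that averaging step only; `simulate` stands for a user-supplied cascade simulator returning the final weighted influence and is a hypothetical placeholder, not a function from the paper.

```python
import random


def monte_carlo_q(simulate, state, actions, num_samples, rng):
    """Estimate Q(state, action) as the mean final reward over simulated runs.

    simulate(state, action, rng) must return the final reward of one run.
    """
    q_values = {}
    for action in actions:
        total = 0.0
        for _ in range(num_samples):
            total += simulate(state, action, rng)
        q_values[action] = total / num_samples
    return q_values
```

A larger `num_samples` reduces the variance of each estimate, which is consistent with the sample-size effect reported in the computational experiments.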

Policy improvement
Based on the simulation result, we create a final reward (weighted total influence) list $Q(s, a)$ by state and action, which is used to improve the policy. $\pi(s, a)$ and $\pi'(s, a)$ are the old policy and the new policy. The action set $A$ is split into two subsets: $A_1$ is the set of all happened actions, and $A_0$ is the set of all unhappened actions. The new policy keeps the old value, $\pi'(s, a) = \pi(s, a)$, for all $a \in A_0$, and is updated for all $a \in A_1$; the update also involves $\min_{a \in A_1} Q(s, a)$ scaled by the stepsize. The stepsize is decided by the iteration number $itr$ and the policy improved value, $\sum_{s \in S} \sum_{a \in A} \left[ \pi_{itr}(s, a) \cdot Q^{\pi_{itr}}(s, a) - \pi_{itr-1}(s, a) \cdot Q^{\pi_{itr-1}}(s, a) \right]$.

Table 4 Notation of reinforcement learning with Markov decision process model
Symbol            Definition
Indices and sets
Parameters
$b_{ki}$          The information preference of node $i$ with respect to message $k$
$w_k$             The influence weight of message $k$
$\rho_{ij}$       The probability of arc connection from node $i$ to node $j$
Variable
$\sigma_{ki}$     The element of state matrix $s \in S$ in row $k$ and column $i$, i.e., the activation status of node $i$ by message $k$
$\alpha_{ki}$     The element of action matrix $a \in A$ in row $k$ and column $i$, i.e., the seed selection of node $i$ by message $k$

For the example in "Policy evaluation", the updated policy is shown in Table 6. The policy can also be summarized by information $k$ and user $i$.

Computational experiments on algorithms' convergence
Numerical experiments and results of the different algorithms for solving the information maximization problem are presented in this section. We randomly generate and test three data sets:
• data set (2,4) with 2 messages and 4 nodes
• data set (2,7) with 2 messages and 7 nodes
• data set (3,7) with 3 messages and 7 nodes (cannot converge within 24 h; the average computation time for each iteration is 60,000 s).
The algorithms are coded in C++ with Microsoft Visual Studio 2019 and linked with CPLEX 12.9. All the programs are run on the Microsoft Windows 10 Professional operating system with an Intel Xeon CPU E-2186 2.90 GHz and 32 GB RAM. Since the computation of data set (3,7) cannot converge within practical time, we only discuss the results of data set (2,4) and data set (2,7). All the computation results are shown in Table 7.
In Fig. 6, we show the total rewards as the number of iterations increases when implementing the algorithm of reinforcement learning with Markov decision process. In Fig. 6a, b, the results are based on experiments with 2 messages and 4 nodes, while a 7-node case is shown in Fig. 6c, d. In all the figures, the left sides show the convergence results using a sample size of 10,000 for the Monte Carlo simulation in the algorithm, and the right sides use a sample size of 1 million. It can be observed that a larger sample size converges faster and achieves policies with a higher objective value in a short amount of time. This is partially due to the fact that a smaller sample size provides lower accuracy in approximation.
In Fig. 7, we compare the two proposed algorithms, i.e., the two-stage stochastic programming with myopic policy (SP-MYOPIC) and the algorithm via reinforcement learning with Markov decision process (RL-MDP), using the different data sets. In both cases (2 messages plus 4 nodes versus 2 messages plus 7 nodes), we use a sample size of 1 million. The SP-MYOPIC approach's result is the straight, horizontal line in both sub-figures. Although it is faster to calculate and does not have convergence issues, it trails the RL-MDP method in terms of total rewards when a certain amount of computational time is provided.

Conclusion
In this study, we presented multistage stochastic mixed integer nonlinear programming models with endogenous uncertainty to examine influence maximization in social networks that feature a dynamic topology decided by users. We proposed two methods, each featuring a network structure based on user preference in a finite-time information cascade. One makes use of classic two-stage stochastic programming, while the other leverages reinforcement learning. Information networks generally comprise autonomous nodes that make decisions when forming links with other nodes and transmitting information. We used the discrete choice model to build the node preference distribution; additionally, we modeled dynamic changes to the network structure by using stochastic dynamic programming, which can be solved via the Markov decision process. Our models accurately describe and predict user behavior so as to ensure dynamic optimization under uncertainty; as such, they act as tools by which to analyze dynamic changes to network structure by controlling information flow, and can be used in the information maximization problem. The results of our computational experiments show that large sample sizes can provide better and more stable results when one implements the reinforcement-learning based approach, which performs better than the two-stage stochastic programming (i.e., myopic) approach.