Learning the Truth by Weakly Connected Agents in Social Networks Using Multi-Armed Bandit

This article studies social networks in which influential personalities collaborate among themselves to learn an underlying truth over time, but may have misled their followers into believing false information. Most existing works that study leader-follower relationships in a social network model the network as a graph, and apply non-Bayesian learning to train the weakly connected agents to learn the truth. Although this approach is popular, it has the limitation of assuming that the truth (otherwise called the true state) is time-invariant. This is not practical in social networks, where streams of information are released and updated every second, making the true state arbitrarily time-varying. Thus, this article improves on existing work by introducing online reinforcement learning into the graph theoretic framework. Specifically, the multi-armed bandit technique is applied. A multi-armed bandit algorithm is proposed and used to train the weakly connected agents to converge to the most stable state over time. On average, the weakly connected agents trained with the proposed algorithm converge 66% more slowly than strongly connected agents trained with the state-of-the-art algorithm. This is because weakly connected agents are difficult to train. However, the speed of convergence of these weakly connected agents can be improved by approximately 50% on average by fine-tuning the learning rate of the proposed algorithm. The sublinearity of the regret bound for the proposed algorithm is compared to the sublinearity of the regret bound for the state-of-the-art algorithm for strongly connected networks.


I. INTRODUCTION
The social network has grown over the years to become a platform for influential personalities to sell their beliefs to followers within their sphere of influence. Even if the beliefs of these influential personalities are wrong, it is easy for them to access first-hand information and cooperate among themselves to learn the truth over time. However, it is difficult for the misinformed followers to learn the truth, because they lack the ability to find it independently. They are subject to the dominating influence of these influential personalities. Take, for instance, the prevalent attitude among adolescents, who form strong loyalties to celebrities of their interest and inherit the enemies of these celebrities. In reality, most celebrity fights on social media are merely attention seeking. The rival celebrities may resolve their differences among themselves, but these adolescents endlessly engage in social media battles among themselves. This is a social learning problem. As a result of the exponential growth of social networks, it is imperative to study the behavior of humans in social networks, and to proffer solutions to the menace of negative social influence. There has been a lot of interest among researchers in studying the spread of negative information in social networks [1]-[4]. A study in [5] discussed the connection between social relationships among humans and health.
The associate editor coordinating the review of this manuscript and approving it for publication was Zeshui Xu.
Graph theory is commonly used for theoretical research on social networks. Influential personalities or agents form a strongly connected subnetwork, while their followers form a weakly connected subnetwork. Agents in the strongly connected subnetwork can easily communicate among themselves to learn the truth. On the other hand, agents in the weakly connected subnetwork are dominated by agents in the strongly connected subnetwork, whether or not they communicate among themselves. At least two common graph theoretic approaches are used to study the interaction process [6], [7] among agents in a graph network. The Bayesian method [6], [8]-[10] is the first approach, where agents rely on some prior beliefs and on Bayes' rule to update their beliefs. The non-Bayesian method [11] is the second approach, where each agent initially obtains an intermediate belief based on an observed private signal, using Bayes' rule. Then, the agent cooperates with its neighboring agents to update its belief. Agents asymptotically learn the truth (also known as the underlying true state) following this approach. The underlying true state is the unknown parameter of interest that the agents seek to uncover. Although each agent observes a private signal at each time instance, this private signal is only partially informative about the true state. No agent can directly observe the true state; hence, the agents rely on interactions among themselves to learn this true state over time. Thus, the true state is said to be underlying. The state-of-the-art non-Bayesian learning approaches use diffusion learning [12]. Diffusion learning performs well in situations where learning is continuous [13], [14].
There is a lot of interesting work on training weakly connected agents to learn the true state when the true state is time-invariant. The authors in [15]-[18] applied a graph theoretic approach and linear algebraic manipulations to a partitioned adjacency matrix resulting from the graph network. The weakly connected agents were able to converge to the true state, although with a belief probability less than one. This means that the weakly connected agents could still be controlled by the strongly connected agents, and this is a limitation of the work. The authors in [19] overcame this limitation by using a log-intermediate belief and updating the belief with an exponential function. The speed of convergence improved, and the weakly connected agents' belief probabilities reached one for the true state over time. In [20], the authors derived some closed-form expressions to determine how close both strongly connected agents and weakly connected agents get to their limiting points in terms of mean-square-error. The mean-square-error performance of weakly connected agents is determined by the mean-square-error of the strongly connected agents. In [21], the authors not only proposed a model that trained weakly connected agents to converge to the true state, but also studied the reverse problem of learning the network topology given that the weakly connected agents received some measurements from the strongly connected agents. One of their main results is that for topology learning to occur, the number of hypotheses, or states, must not be less than the number of strongly connected subnetworks. This means that if there are two strongly connected subnetworks in the graph network, then the number of states (which includes the true state) must be at least two.
Most existing literature that applied graph theory to study truth-learning in the social network assumed that the true state is time-invariant [11], [13], [14], [22]. The approach in this existing literature can be described as follows: At the initial time, all agents are doubtful of the true state. However, after cooperation among one another, the beliefs of the influential agents in the strongly connected subnetwork will converge to the true state over time. On the other hand, the agents in the weakly connected subnetwork will lose their own beliefs and will be subtly manipulated to accept the false beliefs of agents in the strongly connected subnetwork. A limitation of this social learning model is that it does not consider the dynamic nature of social networks. This is not practical, since streams of information are released and updated every second in social networks [23], [24]. Thus, the true state is arbitrarily time-varying. Hence, the analyses and results in this existing literature are not applicable to a dynamic setting. It has been shown in [25], [26] that convergence of agents to the true state is difficult to achieve when the true state is dynamic. For the setting discussed in this article, where the true state is arbitrarily time-varying, common existing approaches are therefore inadequate; hence, online (reinforcement) learning is introduced into the graph theoretic framework.
Online learning is an aspect of machine learning where agents receive information sequentially. Online learning has been shown to perform well in predicting the time-varying true state for strongly connected agents, unlike conventional social learning methods, which fail. For instance, [27] proposed an online learning approach that can help strongly connected agents predict the time-varying true state. However, the performance of the agents often comes with some regret. Minimizing this regret is the main goal of online learning. Online learning strategies have the flexibility to work with different forms of observed private signals. For instance, in most existing literature, the private signals observed by the agents are strictly linear; however, online learning can work for arbitrary signals [28]. There are many online learning strategies. The multi-armed bandit technique, which is an online learning strategy, has proven to be one of the most successful for social learning, even in difficult situations where cooperation among the agents is difficult [27], [29]. In such a setting, an agent may be denied feedback information from its neighboring agents, making consensus difficult.
Although the multi-armed bandit technique has proven to be quite effective for training strongly connected agents to learn the truth when the true state is arbitrarily time-varying [27], [30], it is yet to be applied to weakly connected agents. Thus, this article applies the multi-armed bandit technique to help weakly connected agents predict the time-varying true state. The regret incurred by these weakly connected agents is expected to be higher than the regret incurred by strongly connected agents. A preliminary study of this work can be found in [31].

A. RESEARCH CONTRIBUTIONS
The contributions of this article are as follows:
1) It studies the problem of negative influence from influential personalities to their followers in social networks. To address this problem, the social network is modeled as a graph network, where the influential personalities form strongly connected subnetworks, and their followers form weakly connected subnetworks.
2) It applies non-Bayesian learning and the multi-armed bandit technique, which is an online learning strategy, to help the agents in a weakly connected subnetwork learn a time-varying true state. Hence, a non-stochastic multi-armed bandit algorithm is proposed.
3) It provides simulation results that show that the weakly connected agents converge to the most stable state, despite the arbitrarily time-varying nature of the true state. The most stable state is the true state that appears to be the most stable among the sequence of arbitrarily time-varying true states over the time horizon. Also, the sublinearity of the proposed algorithm for the weakly connected subnetwork is compared with the sublinearity of the state-of-the-art algorithm for the strongly connected subnetwork already established in the literature.
The rest of the paper is organized as follows: Section II explains the system model in detail, Section III explains the proposed algorithm, Section IV gives some theoretical results, Section V discusses simulation results, Section VI concludes the findings, and Appendix A shows the proof of Theorem 1.
VOLUME 8, 2020

II. SYSTEM MODEL
Consider a graph network $G = (V, E)$, where $V$ represents the set of agents in the network with $|V| = N$, and $E$ is the set of edges. Let a pair of non-negative scalar weights $\{a_{jk}, a_{kj}\} \in E$ be assigned to the edges connecting agents $k \in V$ and $j \in V$. The edge weight $a_{jk}$ represents the weight assigned to the directed path from $j$ to $k$. Similarly, $a_{kj}$ represents the weight assigned to the directed path from $k$ to $j$. The network is said to be strongly connected if there exists a directed path in both directions connecting any two agents of a neighborhood, and there is at least one self-loop, i.e., $a_{kk} > 0$ for some agent $k$. The presence of at least one self-loop means that in the strongly connected subnetwork, there is at least one agent who uses its own opinions in its decision-making process. Such an agent is said to be self-conscious. Agents not connected by an edge have a weight of 0 for that direction. This implies that it is possible to have $a_{jk} > 0$, but $a_{kj} = 0$. Adopting the definition in [16], a weakly connected network is defined as a network that acts as a receiver only and can be dominated. The neighborhood of any agent, say agent $k$, is the set of agents connected to $k$. Denote the neighborhood of agent $k$ as $\mathcal{N}_k$. Note that agent $k$ is a member of its own neighborhood. The adjacency matrix of the graph is a square matrix whose elements are the weights of the edges linking any two agents. Denote the adjacency matrix as $A$. When the elements of each column of the adjacency matrix sum to one, the matrix is said to be left-stochastic, i.e.,
$$a_{jk} \ge 0, \qquad \sum_{j} a_{jk} = 1. \tag{1}$$
A graph network may be a combination of subnetworks, as shown in Fig. 1. The top two subnetworks are strongly connected, while the bottom two subnetworks are weakly connected. All subnetworks in Fig. 1 have directed arrows, but the directed arrows in each subnetwork are purposely omitted to avoid confusion with the directed arrows indicating domination. Also, while the strongly connected subnetworks have at least one self-loop, the weakly connected subnetworks have no self-loop. As is common in graph theory [12], [13], [16], a strongly connected subnetwork is left-stochastic and has a spectral radius of one, i.e., the magnitudes of its eigenvalues are bounded by one. A strongly connected subnetwork also follows the Perron-Frobenius theorem, and has a single eigenvalue at one, while the other eigenvalues are strictly inside the unit disc. On the other hand, a weakly connected subnetwork is not left-stochastic and does not include the strongly connected subnetwork in its neighborhood.
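The left-stochastic condition in (1) and the single eigenvalue at one can be checked numerically. The following is a minimal sketch, assuming a hypothetical 3-agent strongly connected subnetwork whose weights are invented for illustration; `A[j][k]` stores $a_{jk}$, and the function names are not taken from the article.

```python
# Hypothetical 3-agent strongly connected subnetwork; A[j][k] = a_jk is the
# weight on the directed edge from agent j to agent k. Values are illustrative.
A = [
    [0.5, 0.3, 0.0],
    [0.5, 0.0, 0.6],
    [0.0, 0.7, 0.4],
]


def is_left_stochastic(A, tol=1e-9):
    """Return True if all entries are non-negative and every column sums to one."""
    n = len(A)
    nonneg = all(A[j][k] >= 0 for j in range(n) for k in range(n))
    cols_ok = all(abs(sum(A[j][k] for j in range(n)) - 1.0) <= tol for k in range(n))
    return nonneg and cols_ok


def perron_vector(A, iters=500):
    """Power iteration: repeatedly apply A to a probability vector.

    For a left-stochastic matrix with a single eigenvalue at one, the iterate
    approaches the eigenvector v satisfying A v = v.
    """
    n = len(A)
    v = [1.0 / n] * n
    for _ in range(iters):
        v = [sum(A[j][k] * v[k] for k in range(n)) for j in range(n)]
    return v


v = perron_vector(A)
Av = [sum(A[j][k] * v[k] for k in range(len(A))) for j in range(len(A))]
```

Because the columns sum to one, each multiplication preserves the total mass of `v`, so the limit is itself a probability vector.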

B. DIFFUSION LEARNING
In diffusion learning, all agents start with a uniform prior belief over each state in the network. Let $\Theta = \{\theta_1, \dots, \theta_M\}$ be the set of all possible states that can be detected by a graph network, and let $\theta^*_t \in \Theta$ denote the unknown time-varying true state of the network at time $t$. Intuitively, $\Theta$ represents a bounded set of discrete information containing a time-varying truth. For the purpose of illustration only, $\theta^*_t$ could mean the price of a stock that fluctuates arbitrarily over time, and $\Theta$ could mean the set of all known stock prices, any of which could be $\theta^*_t$. Here, the strongly connected agents could mean a collection of wealthy entrepreneurs and stockbrokers. The weakly connected agents could mean some social media followers of these entrepreneurs who are novices in the stock market, but wish to invest in stock. Each agent in the entire graph network has some prior belief. The prior belief of agent $k \in V$, for instance, is denoted as $\mu_{k,0}(\theta) = \frac{1}{M}$ at time $t = 0$. Note that agent $k \in V$ in this context refers to any agent in the entire graph network. Each agent updates its belief at each time $t \ge 1$, first by observing a random private signal. Intuitively, this private signal represents a side observation that is not fully informative but is accessible to each agent. In the above illustration, the private signal could mean some rumors about the stock prices. The observed private signal of agent $k$ is denoted as $S_{k,t}$, and it is drawn from some known likelihood function $L_k(\cdot|\theta^*_t)$ that depends on the time-varying true state $\theta^*_t$. $S_{k,t}$ is a member of a finite state space $Z_{k,t}$. The private signals are independent over time and over all agents. These signals are not fully informative, which necessitates cooperation among the agents, i.e.,
$$S_{k,t} = \theta^*_t + n,$$
where $n \sim \mathcal{N}(0, 1)$. The observed private signal is a noisy version of the underlying time-varying true state.
Agent $k$ uses this observed private signal to compute the likelihood $L_k(S_{k,t}|\theta)$ over each state $\theta \in \Theta$ as follows:
$$L_k(S_{k,t}|\theta) = \frac{1}{\sqrt{2\pi\sigma^2_{k,t}}} \exp\left(-\frac{(S_{k,t} - \theta)^2}{2\sigma^2_{k,t}}\right),$$
where $\sigma^2_{k,t}$ is the variance of agent $k$ at time $t$. Then, the agent uses Bayes' rule to generate an intermediate belief $\psi_{k,t}(\theta)$, from which the intermediate probability $p_{k,t}(\theta)$ is formed using the exploration parameter $\gamma$. Agent $k$ cooperates with other agents in its neighborhood to compute a consensus probability $P_{k,t}(\theta)$, using the weight connections in the adjacency matrix. This is illustrated as follows:
$$P_{k,t}(\theta) = \sum_{j \in \mathcal{N}_k} a_{jk}\, p_{j,t}(\theta). \tag{6}$$
If $k$ is a strongly connected agent, then $a_{kk} \ge 0$ in (6), and if $k$ is a weakly connected agent, then $a_{kk} = 0$. Because there is at least one agent in the strongly connected subnetwork with a self-loop, there is at least one agent that will use a weighted portion of its own intermediate probability to compute the consensus probability according to (6). However, in the weakly connected subnetwork, since there is no self-loop, no agent uses its intermediate probability to compute its consensus probability. This implies that none of the weakly connected agents uses its own opinions in its decision-making process.
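The chain from private signal to consensus probability can be sketched in a few lines. This is only an illustration under stated assumptions: the likelihood is taken to be Gaussian, the Bayes step is assumed to have the standard form in which $\psi_{k,t}(\theta)$ is proportional to the prior times the likelihood, and the states, weights and function names are hypothetical.

```python
import math


def gaussian_likelihood(signal, theta, variance=1.0):
    """Gaussian density of `signal` under mean `theta` and the given variance."""
    return math.exp(-((signal - theta) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def bayes_update(prior, signal, states, variance=1.0):
    """Assumed Bayes step: psi(theta) is proportional to prior(theta) * L(signal | theta)."""
    weights = [p * gaussian_likelihood(signal, th, variance) for p, th in zip(prior, states)]
    total = sum(weights)
    return [w / total for w in weights]


def consensus(k, A, p):
    """Consensus probability of agent k: P_k(theta) = sum_j a_jk * p_j(theta)."""
    n_agents, n_states = len(p), len(p[0])
    return [sum(A[j][k] * p[j][m] for j in range(n_agents)) for m in range(n_states)]


states = [1.0, 2.0, 3.0]            # hypothetical numeric labels for theta_1..theta_3
prior = [1.0 / 3] * 3               # uniform prior belief, 1/M
psi = bayes_update(prior, signal=2.1, states=states)

# Agents 0 and 1 are strongly connected; agent 2 is weakly connected (a_22 = 0),
# so its consensus uses only what it receives from agents 0 and 1.
# Column 2 sums to 0.9, so P_weak sums to 0.9 as well.
A = [
    [0.6, 0.5, 0.5],
    [0.4, 0.5, 0.4],
    [0.0, 0.0, 0.0],
]
p = [psi, psi, prior]               # intermediate probabilities of the three agents
P_weak = consensus(2, A, p)
```

The weakly connected agent's own entry in `p` never enters `P_weak`, which mirrors the statement that such an agent does not use its own opinion.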

C. ONLINE LEARNING
The network is set up as an online learning problem that is best described as a game between the agents and an oblivious adversary. The goal of the adversary is to maximize the regret. Thus, the agents must make smart decisions to outwit the adversary and minimize regret. Before the game begins, the adversary fixes the loss $l_{k,t}(\theta) \in [0, 1]$ for each agent $k$ at each time $t$ and over all the states. The time-varying true state incurs no loss, i.e., $l_{k,t}(\theta^*_t) = 0$. The agents can minimize regret by accurately predicting the time-varying true state on most occasions, and thus incurring the minimum number of losses throughout the entire duration of the game. The agents' performance can be benchmarked against an oracle that has some knowledge of the game setting, and would prefer to stick to the state that incurs the lowest possible loss over the entire duration of the game. Therefore, each agent's expected regret $R(T)$ over the time horizon $T$ is the difference between the total expected loss incurred by the agent, which follows a randomized approach in predicting the time-varying true state, and the total expected loss of the oracle, which sticks to the best fixed state $\theta^{\bullet}$ for the entire duration of the game. Here, the filtration $\mathcal{F}_T = \sigma(S_{k,1}, \dots, S_{k,T}, l_{k,1}, \dots, l_{k,T}, \theta_1, \dots, \theta_T)$ represents the history of all observed private signals, states chosen and incurred losses.
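The regret described above can be evaluated directly once the loss sequence and the player's randomized strategy are known. The sketch below is illustrative only: the 0/1 loss pattern, the uniform player and the function name `expected_regret` are assumptions made for the example, not part of the article.

```python
import random


def expected_regret(losses, play_probs):
    """Expected loss of a randomized player minus the loss of the best fixed state.

    losses[t][m]     : loss of state m at round t, fixed in advance in [0, 1]
    play_probs[t][m] : probability with which the player picks state m at round t
    """
    n_states = len(losses[0])
    player = sum(sum(p * l for p, l in zip(probs, row)) for probs, row in zip(play_probs, losses))
    best_fixed = min(sum(row[m] for row in losses) for m in range(n_states))
    return player - best_fixed


random.seed(0)
T, M = 200, 5
# Hypothetical oblivious adversary: the true state has zero loss, all others loss 1.
true_states = [random.randrange(M) for _ in range(T)]
losses = [[0.0 if m == s else 1.0 for m in range(M)] for s in true_states]
uniform = [[1.0 / M] * M for _ in range(T)]
regret_uniform = expected_regret(losses, uniform)
```

With this loss pattern, the best fixed state is simply the state that is the true state most often, which matches the intuition behind the most stable state.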

III. PROPOSED ALGORITHM
The proposed algorithm is an adversarial multi-armed bandit algorithm designed to help weakly connected agents predict the time-varying true state. The parameters of the algorithm are the feedback graph, the learning rate $\eta > 0$ and the exploration parameter $\gamma \in (0, \frac{1}{2}]$. The inputs to the algorithm are the adjacency matrix $A$ and the number of states $M$. The output of the algorithm is the belief $\mu_{k,t}(\theta)$ $\forall \theta \in \Theta$. The operation of the algorithm is as follows: In step 0, each agent's belief $\mu_{k,0}(\theta)$ is initialized over the states $\theta \in \Theta$ as a uniform distribution. For each round of the algorithm, the following steps are executed: In step 1, the intermediate probability $p_{k,t}(\theta)$ is computed. This involves a trade-off between exploitation and exploration with the parameter $\gamma$. In exploitation, the algorithm sticks to the past belief of each agent about the states, while in exploration, the algorithm combines the likelihood of each agent over the states with the effect of domination from the strongly connected subnetwork. The trade-off is necessary to minimize regret. Thus, $p_{k,t}(\theta)$ is computed with the introduction of the domination number $\delta$ (see Definition 1). The exploration parameter is shared evenly between $\psi_{k,t}(\theta)$ and $\frac{1}{\delta}$. In step 2, the consensus probability is computed from (6). In step 3, a state is drawn at random according to the consensus probability distribution $P_{k,t}$. Loss is incurred for the chosen state. In step 4, the estimated loss over all the states is computed. This is important because the algorithm needs to update the belief over all the states but does not know the value of the losses for unchosen states at each time. This is typical of multi-armed bandit settings. This estimated loss is an unbiased estimate of the true loss in expectation, as shown in Lemma 1. In step 5, the algorithm updates the belief of each agent over all the states using an exponential function. Notice that the belief update equation is normalized to ensure that the sum of the beliefs over all the states is one. Step 6 ends an iteration of the algorithm. The algorithm repeats from step 1 until the time horizon is reached.

Proposed Algorithm: Online Diffusion Learning for Weakly Connected Network
Parameters: Feedback graph, learning rate $\eta > 0$. $V$ is the set of weakly connected agents and $E$ is the set of edges. Exploration parameter $\gamma \in (0, \frac{1}{2}]$.
Input: The adjacency matrix $A$, $|\Theta| = M$
Output: The belief $\mu_{k,t}(\theta)$ $\forall \theta \in \Theta$
Step 0: Initialize $\mu_{k,0}(\theta) = \frac{1}{M}$
For each round $t \in \{1, \dots, T\}$
Step 1: Compute the intermediate probability $p_{k,t}(\theta)$
Step 2: Compute $P_{k,t}(\theta) = \sum_{j \in \mathcal{N}_k} a_{jk}\, p_{j,t}(\theta)$, $P_{k,t} = (P_{k,t}(\theta_1), \dots, P_{k,t}(\theta_M))$
Step 3: Draw state $\theta_t \sim P_{k,t}$ and incur loss $l_{k,t}(\theta_t) \in [0, 1]$
Step 4: Compute the estimated loss $\hat{l}_{k,t}(\theta)$
Step 5: Update the belief $\mu_{k,t}(\theta)$
Step 6: end
The goal of the proposed algorithm is to converge to the most stable state from the arbitrary sequence of time-varying true states over the time horizon. The most stable state is the true state that appears most often in the arbitrary sequence $\theta^*_1, \dots, \theta^*_T$. The belief probability of all agents over the most stable state is expected to reach a value of 1 over time. Intuitively, the agents in a weakly connected subnetwork will converge to the state that appears to be the truth on most occasions.
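The round structure described in this section can be illustrated with a generic EXP3-style update. This is a sketch under explicit assumptions, not the article's exact Steps 1, 4 and 5: the mixing in Step 1 is replaced by plain uniform exploration, the loss estimate is the standard importance-weighted one, only a single agent is simulated (so the consensus probability equals the intermediate probability), and the adversary is hypothetical.

```python
import math
import random


def exp3_style_round(belief, loss_row, eta, gamma, rng):
    """One round of an EXP3-style update (a stand-in for Steps 1-5)."""
    m = len(belief)
    # Step 1 (assumed form): mix the past belief with uniform exploration.
    p = [(1.0 - gamma) * b + gamma / m for b in belief]
    # Step 2 is skipped: with a single agent the consensus probability equals p.
    # Step 3: draw a state and observe only its loss (bandit feedback).
    chosen = rng.choices(range(m), weights=p)[0]
    # Step 4 (assumed form): importance-weighted estimate, zero for unchosen states.
    est = [loss_row[i] / p[i] if i == chosen else 0.0 for i in range(m)]
    # Step 5: exponential update, normalized so the beliefs sum to one.
    w = [b * math.exp(-eta * e) for b, e in zip(belief, est)]
    total = sum(w)
    return [x / total for x in w]


def run(T=400, M=5, eta=0.1, gamma=0.1, seed=1):
    """Simulate T rounds and return the final belief over the M states."""
    rng = random.Random(seed)
    belief = [1.0 / M] * M                      # Step 0: uniform prior
    for _ in range(T):
        # Hypothetical adversary: state 0 is the true state about 60% of the time.
        true_state = 0 if rng.random() < 0.6 else rng.randrange(1, M)
        loss_row = [0.0 if i == true_state else 1.0 for i in range(M)]
        belief = exp3_style_round(belief, loss_row, eta, gamma, rng)
    return belief


final_belief = run()
```

In a network version, Step 2 would replace `p` by the weighted combination in (6) before the draw in Step 3.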

IV. THEORETICAL RESULTS
In this section, theoretical results are presented.
Definition 1: The weak domination number of a graph $G$ is denoted by $\delta(G)$, and is the smallest size of any subset $D \subseteq V$ that belongs to the strongly connected subnetwork and dominates the weakly connected subnetwork.
Remark: In standard graph theory, a weakly dominating set $D \subseteq V$ is the set of nodes that dominates the weakly connected subnetwork. Computing a minimum dominating set is NP-hard, but it can be efficiently approximated within a logarithmic factor using the greedy algorithm [32].
Assumption 1: The exploration over the subset $D$ and the intermediate belief $\psi_{k,t}(\theta)$ is assumed to be uniform.
This assumption is useful for the proof of Theorem 1.
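The greedy approximation mentioned in the remark after Definition 1 can be sketched as a set-cover heuristic: repeatedly pick the strongly connected agent that dominates the most weakly connected agents not yet covered. The edge list below is hypothetical and the function name is chosen for illustration.

```python
def greedy_weak_dominating_set(out_neighbors, targets):
    """Greedy set-cover heuristic: pick the node covering the most uncovered targets."""
    uncovered = set(targets)
    chosen = []
    while uncovered:
        # Iterating over sorted ids makes ties resolve to the smaller node id.
        best = max(sorted(out_neighbors), key=lambda v: len(out_neighbors[v] & uncovered))
        gain = out_neighbors[best] & uncovered
        if not gain:
            raise ValueError("some weakly connected agents cannot be dominated")
        chosen.append(best)
        uncovered -= gain
    return chosen


# Hypothetical one-way edges from strongly connected agents (1-5) to weakly
# connected agents (6-8); these are not the actual edges of the article's figure.
out_neighbors = {1: {6}, 2: {6, 7}, 3: set(), 4: {7, 8}, 5: {8}}
dominating = greedy_weak_dominating_set(out_neighbors, targets={6, 7, 8})
```

For this toy graph the greedy choice has size 2; in general the greedy result is only guaranteed to be within a logarithmic factor of the smallest dominating subset.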

Lemma 1:
The estimated loss $\hat{l}_{k,t}(\theta)$ is an unbiased estimate of the true loss $l_{k,t}(\theta)$ in expectation.
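As an illustration of what unbiasedness means here, consider the standard importance-weighted estimator used in adversarial bandit algorithms. This form is an assumption and may differ from the exact expression of Lemma 1; it divides the observed loss by the probability of the drawn state and sets the estimate to zero for every other state. Taking the expectation over the draw $\theta_t \sim P_{k,t}$ recovers the true loss:

```latex
\hat{l}_{k,t}(\theta) = \frac{l_{k,t}(\theta)}{P_{k,t}(\theta)}\, \mathbb{1}\{\theta_t = \theta\},
\qquad
\mathbb{E}_{\theta_t \sim P_{k,t}}\big[\hat{l}_{k,t}(\theta)\big]
= \sum_{\theta' \in \Theta} P_{k,t}(\theta')\, \frac{l_{k,t}(\theta)}{P_{k,t}(\theta)}\, \mathbb{1}\{\theta' = \theta\}
= l_{k,t}(\theta).
```

The identity requires $P_{k,t}(\theta) > 0$ for every state, which is one reason an exploration term is mixed into the sampling distribution.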

Theorem 1:
The upper bound on the expected regret of the proposed algorithm for the weakly connected network is $O\big((\delta \ln M)^{2}\, T^{2/3}\big)$ when $\gamma = \min\left\{\frac{(\delta \ln M)^{2}}{T^{1/3}}, \frac{1}{2}\right\}$, $\eta = \frac{\gamma^2}{\delta}$, and $T \ge \frac{M^3 \ln M}{\delta^2}$. Proof: See Appendix A. Remark: The regret bound for the proposed algorithm is worse than the regret bound for the strongly connected network obtained in [30], which is $O\big(\sqrt{\alpha T \ln M}\big)$, where $\alpha$ represents the independence number of the strongly connected graph. However, the regret bound of the proposed algorithm is comparable to the regret bound of EXP3.G for weakly observable graphs, which is of order $\tilde{O}(T^{2/3})$ [33], where the tilde symbol represents the removal of some constant parameters. In EXP3.G, the loss feedback is not strictly bandit as in the proposed algorithm, and each weakly connected agent is allowed to observe the losses of its neighbors. Also, the proposed regret bound is comparable to the regret bound of the Lazy Revealing Action algorithm, of order $\tilde{O}(T^{2/3})$, for the full information setting [34], where each weakly connected agent can observe the losses of all agents in the graph network. Despite the restrictions of the bandit setting, the proposed algorithm has a regret bound of the same order as the EXP3.G and Lazy Revealing Action algorithms, which operate under less restrictive settings. This is an advantage of the proposed algorithm.

V. SIMULATION RESULTS
The simulation uses the network in Fig. 2, consisting of two strongly connected subnetworks A and B and a weakly connected subnetwork C. The adjacency matrix for the weakly connected subnetwork is given as:
It can be seen from Fig. 2 that none of the weakly connected agents 6, 7 and 8 in the weakly connected subnetwork C has a self-loop. This means that for a weakly connected agent $k$ (which may be any of 6, 7 or 8), its self-loop weight is $a_{kk} = 0$. Since there is not a single self-loop in subnetwork C, none of the weakly connected agents uses its own opinion for decision making. A weakly connected subnetwork is known not to be left-stochastic; hence, the adjacency matrix $A$ can be formed such that $a_{kk} = 0$ and $\sum_{j} a_{jk} < 1$ with $a_{jk} \ge 0$. The domination number in Fig. 2 is 2, since two nodes from the strongly connected subnetworks are sufficient to influence the weakly connected subnetwork. For instance, node 2 and node 4 from the two strongly connected subnetworks in Fig. 2 are sufficient to send one-way information from both strongly connected subnetworks to the weakly connected subnetwork. It is not important to show the convergence of the strongly connected agents, as this is already shown in [30]. However, it is important to show that the weakly connected agents can converge to the most stable state, albeit at a slower rate than the strongly connected agents.
Assume there are five states, i.e., $\Theta = \{\theta_1, \dots, \theta_5\}$, any of which can be the true state at each time $t$, since the true state varies arbitrarily over time. This time-varying true state $\theta^*_t$ is randomly chosen using the randi function in MATLAB. If Algorithm 1 is implemented, each weakly connected agent will converge to the most stable state over time, despite the time-varying nature of the true state. The parameters used for the simulation are: $t = 1, \dots, 400$; $\eta = 0.2$; and $\gamma = 0.1$. The private signal observed by each agent $k$ is drawn from a time-varying Gaussian distribution $\mathcal{N}(\theta^*_t, 1)$, where $\theta^*_t$ represents the arbitrarily time-varying true state at time $t$, and the distribution is centered around this time-varying true state. The variance of the distribution is 1. Note that at each time $t$, each agent draws its observation from this distribution independently. At the start of the algorithm, the belief $\mu_{k,0}(\theta)$ is uniformly distributed over the states. This means that $\mu_{k,0}(\theta) = \frac{1}{5}$. The algorithm is iterated 50 times. The settings for the simulation parameters are very similar to those used in [30]. However, the algorithms in [30] are different from the proposed algorithm due to the presence of domination in the weakly connected subnetwork. Figs. 3-5 show how the weakly connected agents converge to the most stable state using Algorithm 1. Each of the figures has five plots. Four of those plots show how the beliefs of agents 6, 7 and 8 converge to zero for some states. However, the beliefs of agents 6, 7 and 8 converge to a value of 1 only for the most stable state. The most stable state appears to vary from iteration to iteration because the sequence of true states $\theta^*_1, \dots, \theta^*_T$ varies arbitrarily over the iterations.
Fig. 3 shows the convergence of the beliefs of agents 6, 7 and 8 at the 1st iteration when $\eta = 0.1$. It can be seen that the beliefs of these agents for states $\theta_2$, $\theta_3$, $\theta_4$ and $\theta_5$ go to zero over time. However, the beliefs of these weakly connected agents converge to $\theta_1$ with a value of 1 at $t = 161$. Thus, $\theta_1$ is the most stable state at the 1st iteration. Fig. 4 shows the convergence of the beliefs of agents 6, 7 and 8 at the 50th iteration with $\eta = 0.1$. It can be seen that the beliefs of these weakly connected agents for states $\theta_1$, $\theta_2$, $\theta_4$ and $\theta_5$ go to zero over time. However, the beliefs of these agents converge to $\theta_3$ with a value of 1 at time $t = 189$. Thus, $\theta_3$ is the most stable state at the 50th iteration. The most stable state at the 50th iteration differs from the most stable state at the 1st iteration. Hence, the most stable state varies arbitrarily over the number of iterations. The convergence of these weakly connected agents is slow compared to the convergence of strongly connected agents shown in [30]. We can compare the average speed of convergence for weakly connected agents, using the proposed algorithm, to the average speed of convergence for strongly connected agents in [30], using the state-of-the-art algorithm, over 50 iterations. On average, the weakly connected agents converge at time $t = 171$, while the strongly connected agents converge at time $t = 103$. This means that the weakly connected agents converge about 66% more slowly than the strongly connected agents.
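The 66% figure follows from the two average convergence times quoted above, taking the strongly connected agents as the baseline:

```latex
\frac{171 - 103}{103} = \frac{68}{103} \approx 0.66
```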
To show how important the learning rate is for fine-tuning Algorithm 1, a higher learning rate $\eta = 0.2$ is used in Fig. 5. At the 50th iteration, the beliefs of the weakly connected agents 6, 7 and 8 go to zero for states $\theta_1$, $\theta_2$, $\theta_4$ and $\theta_5$. However, the beliefs of these weakly connected agents converge to $\theta_3$ with a value of 1 at time $t = 124$. Here, the convergence is improved. Thus, fine-tuning the learning rate can improve the speed of convergence of the weakly connected agents. On average, the weakly connected agents converge at time $t = 120$. This is an improvement of approximately 50%. Fig. 6 shows the comparison of the sublinearity of the regret bound of Algorithm 1, which is given as $O\big((\delta \ln M)^{2}\, T^{2/3}\big)$, to the regret bound obtained in [30], which is $O\big(\sqrt{\alpha T \ln M}\big)$ for strongly connected agents. The sublinearity is defined as $\lim_{t \to \infty} \frac{R(t)}{t}$. It can be seen that the regret bound of Algorithm 1 decays slowly compared to that of the strongly connected agents in [30], when all parameters of the regret bound other than time are kept constant.
Fig. 6. Sublinear regret bound comparison between the algorithm designed for strongly connected agents in [30] and the proposed algorithm designed for weakly connected agents.
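The comparison of the two bounds can be reproduced numerically by dividing each bound by $t$. The sketch below ignores the constants hidden in the $O(\cdot)$ notation, takes $\delta = 2$ and $M = 5$ from the simulation setup, and uses a hypothetical independence number $\alpha = 2$; the function names are illustrative.

```python
import math


def weak_bound_rate(t, delta, M):
    """(delta ln M)^2 * t^(2/3) / t: per-round rate of the weakly connected bound."""
    return (delta * math.log(M)) ** 2 * t ** (2.0 / 3.0) / t


def strong_bound_rate(t, alpha, M):
    """sqrt(alpha * t * ln M) / t: per-round rate of the strongly connected bound."""
    return math.sqrt(alpha * t * math.log(M)) / t


delta, M = 2, 5      # values taken from the simulation setup
alpha = 2            # hypothetical independence number, chosen only for illustration
horizon = [10 ** e for e in range(1, 7)]
weak = [weak_bound_rate(t, delta, M) for t in horizon]
strong = [strong_bound_rate(t, alpha, M) for t in horizon]
ratio = [w / s for w, s in zip(weak, strong)]
```

Both rates shrink towards zero, but the weakly connected rate decays like $t^{-1/3}$ while the strongly connected rate decays like $t^{-1/2}$, so their ratio grows like $t^{1/6}$.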
We can compare the results in this article with the results obtained in [16] and [21], for the case where the true state is time-invariant. From Fig. 9 in [16], it can be seen that the weakly connected agents converge to the true state with a belief probability less than 1. This means that the weakly connected agents are still manipulated by the strongly connected agents. This model is not very good in practice, as the goal should be to train the weakly connected agents to learn the true state with a belief probability of 1. On the other hand, the authors in [21] showed that the weakly connected agents can learn the time-invariant true state with a belief probability of 1, as shown in Fig. 8 in [21]. The convergence in this article is comparable to what is obtained in [21], even though the true state is arbitrarily time-varying here.
This work can be extended to situations where the strongly connected agents have limited control over the weakly connected agents. Moreover, it will be interesting to improve the regret bound for the weakly connected agents in this article. A much more challenging problem arises in situations where the transfer of information between the strongly connected agents and the weakly connected agents goes both ways. Also, the algorithm may be trained on massive social media data that can be modeled as a massive graph network, instead of a graph network with a small number of nodes.

VI. CONCLUSION
In conclusion, this article studied leader-follower relationships in social networks, and proposed how followers can overcome manipulation by leaders (or influential personalities) in order to learn an arbitrarily time-varying truth. These influential personalities form strongly connected subnetworks, while their followers form weakly connected subnetworks. It has been shown in existing work that strongly connected agents can cooperate among themselves to learn this time-varying truth (otherwise called the time-varying true state). However, training weakly connected agents to learn the arbitrarily time-varying true state is yet to be investigated. Thus, this article focused on training weakly connected agents to converge to the most stable state over time from the arbitrary sequence of time-varying true states over the time horizon. To achieve this, a non-stochastic multi-armed bandit algorithm is proposed, and it is shown by simulation that the beliefs of weakly connected agents can converge to this most stable state. Also, it is shown that the most stable state varies randomly over the number of iterations. On average, the weakly connected agents trained with the proposed algorithm converge 66% more slowly than strongly connected agents using the state-of-the-art algorithm already established in the literature. This is because weakly connected agents are harder to train. Fine-tuning the learning rate of the proposed algorithm can improve the speed of convergence of the weakly connected agents by approximately 50% on average. Finally, the sublinearity of the regret bound for the proposed algorithm is compared with the sublinearity of the regret bound for the state-of-the-art algorithm for strongly connected agents.