PREDICTING MOOD BEHAVIORS OF PRISONER’S DILEMMA PLAYERS
SALSABEEL M. ABD EL-SALAM, ESSAM EL-SEIDY, NAGLAA M. REDA

Abstract: Over the last few years, increasing attention has been given to the model of the iterated three-player Prisoner's Dilemma game. In this work we study the competition between strategies with special behavior in this model, behavior similar to that of the Tit-for-Tat strategy. We focus on strategies with angry behavior and strategies with tolerant behavior. The results reveal the qualities of the winning strategy, from which we can infer the social traits of a person who follows it. A learning algorithm has been designed and a multi-agent system has been developed to obtain our results.


INTRODUCTION
Game theory is a toolbox built to help us understand phenomena that arise in the analysis of strategic interaction between players. Models of game theory represent many situations in our daily life [1,2].
In game theory there are many examples of coordination games, such as the Prisoner's Dilemma (PD), Hawk-Dove, matching pennies, and location games. PD is a famous example of a complex decision-making process in game theory. Each player has two choices: defect (D) or cooperate (C). The players move simultaneously, without knowledge of the other players' choices.
The iterated PD (IPD), on the other hand, studies the long-term behavior that allows cooperation to emerge among selfish, unrelated competitors and lets them coexist in the long run. At every repetition (round), each player's choice may depend on the outcome of the previous round. This leads players to develop their strategies according to their interactions in the previous rounds [11].
Earlier, Nowak et al. [12] studied different strategies in the two-player prisoner's dilemma and expressed them by finite state automata. Their tournaments between competing players showed that the best strategy was that of Pavlov. Recently, Essam and Karim [13] studied the Tit-for-Tat (TFT) strategy in the three-player prisoner's dilemma and found four strategies exhibiting the same behavior as TFT.
Maurice and Emilian [14] investigated the role of anticipation in cooperation rates and payoffs. They reproduced several characteristics of individual play and tested their model using multi-agent simulations of small societies. Shiheng and Fangzhen [15] studied invincible strategies, inspecting which strategies cannot be invaded by others. Their results showed that such a strategy is a catalyst for cooperation and wins in face-to-face play.
Furthermore, Bukowski and Miękisz [16] classified the three-player games with two strategies. They proved that there are many evolutionarily stable strategies (ESS), and that there are two pure strategies, as in the stag hunt game. Dominik, Jacek and Marcin [17] considered one pure and one mixed strategy and focused on the problem of equilibrium selection from a dynamical point of view, discussing stochastic adaptation dynamics in the three-player game.
Anurag and Deepak [18] studied the IPD problem from the perspective of zero-sum games. A zero-sum game is a mathematical representation of the case where the profits and the losses are exactly balanced. They explored how artificial intelligence and machine learning techniques can give an agent the capability of recognizing the opponent's intention.
In addition, Harper et al. [19] proved that strategies pre-trained with reinforcement learning techniques generally outperform human-designed strategies and maximize payoffs against a varied series of opponents. They also showed that strategies trained on the history of multiple rounds outperform memory-one strategies. Konstantinos [20] created an AI agent that uses reinforcement learning to discern an optimal strategy for the two-player iterated Prisoner's Dilemma game, using the Soar cognitive architecture.
In this article, we are interested in the iterated three-player model of the prisoner's dilemma game. We first introduce our problem and the basic concepts of machine learning needed for this study. Then we give a theoretical discussion to detect moody players, and propose a new algorithm for predicting the best strategy to win the game. Next, experimental results followed by a numerical analysis are provided. The article ends with a conclusion and some possible future work.

THE PROBLEM VS. MACHINE LEARNING
In this section, we summarize the fundamentals of the iterated Three-Player Prisoner's Dilemma game (3P-IPD). Then we give a concise definition of reinforcement learning and indicate how it relates to the IPD game.

3P-IPD PROBLEM
In [13], researchers give great attention to the prisoner's dilemma game involving three players in a two-population game. They suppose that two players coordinate to cooperate or defect together against the third player. Thus, a player's current choice may affect the opposing players' future behavior and payoffs. Consequently, to maximize the received payoffs, players use more complex strategies depending on the game history. More precisely, in each 3P-IPD round a player chooses his/her move based on the previous one saved in memory.
Each round has eight possible outcomes as the three players alternate between C and D. Because the game is symmetric, the outcomes reduce to six combinations, specified by the payoffs {P, L, T, S, K, R} and governed by the rule S < P < K < L < R < T. This leads to sixty-four strategies, denoted S0, S1, ..., S63. Each strategy Sk is designated by a vector (x0, x1, x2, x3, x4, x5), where xi is the probability that the player plays C; in general it can take any value between 0 and 1, but in our work we study pure strategies: xi is 1 if the player plays C and 0 otherwise. As an illustration, the strategy S49 is represented by the transition rule (1, 1, 0, 0, 0, 1). Each payoff symbol likewise corresponds to the binary representation of the players' choices, which produces the payoff matrix. For instance, consider the strategy S33 for player I against the strategies S32 and S49 for players II and III, respectively; all cases of interaction of these strategies then follow.
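The index-to-vector encoding described above can be sketched in a few lines. This is a Python illustration (the original work was coded in MATLAB); the helper name `strategy_vector` is ours, but the mapping is the one stated in the text, with x0 as the most significant bit so that S49 maps to (1, 1, 0, 0, 0, 1).

```python
def strategy_vector(k):
    """Return the pure-strategy vector (x0..x5) for S_k, 0 <= k <= 63.

    x0 is the most significant bit of k's 6-bit binary representation,
    matching the paper's example S49 -> (1, 1, 0, 0, 0, 1)."""
    assert 0 <= k <= 63
    return tuple((k >> (5 - i)) & 1 for i in range(6))

print(strategy_vector(49))  # (1, 1, 0, 0, 0, 1)
print(strategy_vector(33))  # (1, 0, 0, 0, 0, 1)
```

Because the encoding is just the 6-bit binary expansion of the index, the sixty-four pure strategies are exactly the indices 0 through 63.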
In infinitely repeated game, the payoff is computed as the mean payoff per round. Thus, for player I the payoff is R in the eighth case, while the payoff is (K+L+P)/3 in the seven other cases.
Accordingly, the payoff for player II equals R in the eighth case, and equals (T+L+P)/3 in the seven other cases. Finally, for the player III, the payoff is R in the eighth case, while it is (K+S+P)/3 in the seven other cases.
During our work, we follow the direct approach in [15].
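The long-run averages quoted above are easy to check numerically. The sketch below uses placeholder payoff values that satisfy the required ordering S &lt; P &lt; K &lt; L &lt; R &lt; T (an assumption for illustration, not values taken from this section).

```python
# Placeholder payoff values satisfying S < P < K < L < R < T.
S, P, K, L, R, T = 0, 1, 3, 5, 7, 9

# Eighth case: every player receives the mutual-cooperation payoff R.
payoff_I_eighth = R

# Seven other cases: each player's mean payoff per round over the cycle.
payoff_I_other = (K + L + P) / 3
payoff_II_other = (T + L + P) / 3
payoff_III_other = (K + S + P) / 3

print(payoff_I_other, payoff_II_other)  # 3.0 5.0
```

With these values, player II's average (T + L + P)/3 exceeds player I's (K + L + P)/3, as expected from T &gt; K.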

REINFORCEMENT LEARNING
Artificial Intelligence (AI) is a computer science field that enables machines to simulate human processes intelligently. Machine learning is a subfield of AI that can be defined as a set of general-purpose techniques for learning functional relationships from data without being explicitly programmed. Reinforcement learning is the branch in which an agent learns by trial and error, receiving a reward or penalty after each action; in the IPD setting, the payoff of each round naturally plays the role of the reward signal that shapes the agent's future choices.
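To make the connection concrete, here is a minimal Q-learning sketch for an IPD agent: the state is the previous joint outcome, the actions are C and D, and the round payoff is the reward. This is an illustrative textbook update rule, not the authors' implementation; the names `choose` and `update` and the parameter values are ours.

```python
import random

ACTIONS = ["C", "D"]
q = {}                      # Q-table: (state, action) -> estimated value
alpha, gamma, eps = 0.1, 0.9, 0.1   # learning rate, discount, exploration

def choose(state):
    """Epsilon-greedy action selection over the Q-table."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))

def update(state, action, reward, next_state):
    """Standard one-step Q-learning update."""
    best_next = max(q.get((next_state, a), 0.0) for a in ACTIONS)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

# One round: everyone cooperated, the agent played C and earned payoff 3.
update("CCC", "C", 3.0, "CCC")
print(round(q[("CCC", "C")], 3))  # 0.3
```

Repeated over many rounds, updates of this form let the agent's choice depend on the history of play, which is exactly the property that distinguishes IPD strategies from one-shot PD play.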

MOODY PLAYERS' DETECTION
In the competition of artificial intelligence algorithms, the TFT model was for many decades considered the optimal solution to the prisoner's dilemma. Its main idea is to select the move of the next round on the basis of the moves of the previous round [22]. In this section, we concentrate our study on specific strategies picked according to their behavior. The state machine diagram of every studied strategy is given in Fig. 1, classified into two groups. In our studies we also use the two strategies S63 and S0 to compete against the selected strategies.
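One natural generalization of TFT's "repeat what was done to you" rule to three players (an assumption for illustration, not a rule taken from Fig. 1) is to cooperate on the next round only if both opponents cooperated on the previous one:

```python
def tft3(opp1_last, opp2_last):
    """A three-player TFT variant: cooperate iff both opponents
    cooperated last round ('C'/'D' moves)."""
    return "C" if opp1_last == "C" and opp2_last == "C" else "D"

print(tft3("C", "C"))  # C
print(tft3("C", "D"))  # D
```

The strategies studied below differ precisely in how tolerant or angry this reaction rule is, e.g. whether a single defecting opponent already triggers D, and whether the switch to D is permanent.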

TEMPERED PLAYERS
In this subsection, we select the strategies S32, S33, S36 and S38 as TFT-like tempered players.
Our selection is based on the attitude we observe while tracing the game in Fig. 1A. We study their choices and perceive the following:

NATURAL-TEMPERED PLAYERS
In this subsection, we pick out the strategies S48, S49, S52 and S54 as TFT-like natural-tempered players. We choose these strategies based on what we notice while running the game in Fig. 1B. We study their moves and recognize the following:

S48: Fast-tempered and natural-tempered; it keeps playing C only if its two opponents both play C, or one of them plays C and the other plays D; otherwise it plays D forever.

S49: Naturally angry and jaded; it keeps playing C only if its two opponents both play C or both play D, or when one of them plays C and the other plays D; otherwise it moves from state C to state D.

S52: Naturally tempered and tolerant; it keeps playing C only if its two opponents both play C, or one of them plays C and the other plays D, while it moves from state C to state D if the two opponents play D.

S54: Naturally angry and quickly tolerant; it plays D if the two opponents play D; otherwise it plays C forever.
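The response rules above can be read directly off the strategy vectors. The sketch below assumes a particular ordering of the six symmetric outcomes, (own last move, number of cooperating opponents) in the order (C,2), (C,1), (C,0), (D,2), (D,1), (D,0); this ordering is our assumption for illustration, since the paper's index-to-outcome table is not reproduced here.

```python
# Assumed ordering of the six symmetric outcomes (see lead-in above).
OUTCOMES = [("C", 2), ("C", 1), ("C", 0), ("D", 2), ("D", 1), ("D", 0)]

def next_move(vector, own_last, n_opp_coop):
    """Next move for a pure strategy (x0..x5), given one's own last move
    and how many of the two opponents cooperated last round."""
    i = OUTCOMES.index((own_last, n_opp_coop))
    return "C" if vector[i] == 1 else "D"

# Index 54 = 110110 in binary, read as (x0..x5) under this assumption.
S54 = (1, 1, 0, 1, 1, 0)
print(next_move(S54, "C", 0))  # D: defects when both opponents defect
print(next_move(S54, "D", 1))  # C: returns to cooperation quickly
```

Under this reading, S54 defects only when both opponents defected, consistent with the "quickly tolerant" description above.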

PROPOSED PREDICTION ALGORITHM
Our objective was to design a strategic algorithm by combining computer science with game theory, guided by the algorithmic game theory of the previous work. We propose an intelligent algorithm to predict who shall win the game, and by which selected strategy, according to its behavior. Our algorithm considers all possible scenarios and detects the optimal strategy that fits the mood of the participants. The pseudocode of the proposed algorithm and the game simulator are explained in the following.
Our proposed learning algorithm, Predict-Best, predicts the strategy that is expected to invade the others. First, we declare the function Construct, which constructs the stochastic vector X for each distribution vector in the given set DV, guided by its corresponding regime in the set of regimes Rg. For invalid payoffs, Construct assigns the coefficients of X to zero. Second, we compute the average payoff values of the first agent against the third, when fixing the second, using equation (1) for the selected set of strategies Stg, and store them in a square two-dimensional matrix P. Third, we compare the evaluated payoffs to detect the maximum long-term expected payoff and return its strategy index as the best. If the two agents are equipollent, the algorithm returns the value -1.
We apply this algorithm to the different sets of candidate strategies presented above, chosen according to the agents' mood, and then conclude which strategies are invaded among the variant competitors for the rest of the game.
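The decision step of Predict-Best can be sketched as follows. The function below is our reading of the description above, not the authors' pseudocode: it assumes a precomputed matrix P where P[i][j] is the first agent's long-run average payoff when playing strategy i against strategy j (with the second agent fixed), and it returns -1 on a tie, as the text specifies.

```python
def predict_best(P):
    """Return the index of the strategy with the largest expected payoff
    (row sums of P), or -1 if the top candidates are equipollent."""
    totals = [sum(row) for row in P]
    best = max(totals)
    winners = [i for i, t in enumerate(totals) if t == best]
    return winners[0] if len(winners) == 1 else -1

# Toy 2x2 payoff matrix: strategy 1 has the larger row total (5 vs 4).
print(predict_best([[3, 1], [5, 0]]))  # 1
print(predict_best([[1, 1], [2, 0]]))  # -1 (tie: both rows total 2)
```

Summing each row treats every opponent strategy as equally likely; any other weighting of the columns would slot in at the `totals` line.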

Game-Sim algorithm
The Game-Sim algorithm simulates 3P-IPD as a multi-agent system, given the number of strategies, the number of rounds, the number of players, and the TFT learning matrix. We considered all N^3 combinations of the three participants' strategies. We adopt the BCD representation when referring to each strategy index. We also define a mapping between the decimal index of the round and the players' choices taken in the next round. For each triple of players' strategies p1, p2, p3, we inspect the players' choices for R rounds using the TFT learning mapping, calculate the probable payoff vectors P, identify the regimes Reg, deduce the transition matrix, and produce the final payoff distribution vector DVec(p1, p2, p3). To simplify the pseudocode, we defined the following functions.
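The outer loop of Game-Sim can be sketched as below. This is a Python skeleton of the control flow only; `play_round` is a hypothetical stand-in for the TFT learning mapping, and the real algorithm additionally accumulates payoffs, regimes, and the transition matrix per triple.

```python
from itertools import product

def play_round(moves):
    """Placeholder for the TFT learning mapping: here, each player
    simply repeats their previous move."""
    return moves

def game_sim(N, R):
    """Run R rounds for every one of the N^3 strategy triples and
    record the final joint moves of each triple."""
    results = {}
    for p1, p2, p3 in product(range(N), repeat=3):
        moves = ("C", "C", "C")      # initial choices for the triple
        for _ in range(R):
            moves = play_round(moves)
        results[(p1, p2, p3)] = moves
    return results

print(len(game_sim(8, 10)))  # 512 triples for N = 8
```

With the paper's N = 64 this loop visits 64^3 = 262144 triples, matching the simulation count reported in the experimental section.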

EXPERIMENTAL RESULTS
Our work has been coded and experimentally tested in MATLAB (matrix laboratory). We chose MATLAB because it is an excellent multi-paradigm numerical computing environment with a programming language well suited to our problem, especially when dealing with matrices.
For-loop iterations were implemented in parallel to improve the execution time.
For testing, we developed the whole game system, which artificially creates the individual players as intelligent agents. We consider all possible combinations of probabilistic TFT computer strategies.
Each triple of strategies competed for 1000 rounds (an interaction game session), because the game yields stable income results in the long term. The system evaluates the transition matrix and the payoff vector for the 262144 simulations. For each agent, the sequence of last choices and targets from the previous session was kept as experience to serve as initial values in the following session. Payoffs were also saved to help determine the regimes.
In the experiment, the parameters were set to S=0, P=1, K=3, L=5, R=7, T=9, satisfying the condition S < P < K < L < R < T. Each strategy has been mapped to the binary code of its index. The following tables present our results. Each tuple (x0, x1, x2, x3, x4, x5) is the stochastic vector for the first player, asserting the payoff value he/she will get, such that each row denotes the first player's strategy and each column denotes the third player's strategy. In Table 13 and Table 14, every column except the first lists the strategies that outcompeted the strategies in the first column. A pure strategy Sj invaded Si if the payoff for each player, calculated from the matrix by Nash equilibrium as in equation (2), favors Sj.
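The parameter choice and the index-to-binary mapping used in the experiments can be sanity-checked in a couple of lines (Python sketch of the MATLAB setup; the helper name `code` is ours):

```python
# Experimental payoff parameters; must satisfy S < P < K < L < R < T.
S, P, K, L, R, T = 0, 1, 3, 5, 7, 9
assert S < P < K < L < R < T

def code(k):
    """Binary code of a strategy index, zero-padded to 6 bits."""
    return format(k, "06b")

print(code(49))  # '110001'
```

The 6-bit padding matters: without it, low indices such as S1 would lose their leading zeros and the positional reading of (x0, ..., x5) would break.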

CONCLUSION AND FUTURE WORK
In this paper, we studied the behavior of strategies in the three-player prisoner's dilemma and examined how the players outcompete one another; Table 13 and Table 14 summarize the invading strategies.

Table 3. The payoffs for player I when player II is using S36
Table 4. The payoffs for player I when player II is using S38
Table 6. The payoffs for player I when player II is using S0
Table 7. The payoffs for player I when player II is using S48
Table 9. The payoffs for player I when player II is using S52
Table 10. The payoffs for player I when player II is using S54
Table 11. The payoffs for player I when player II is using S63
Table 12. The payoffs for player I when player II is using S0
Table 13. List of invading strategies for tempered players
Table 14. List of invading strategies for natural-tempered players

CONFLICT OF INTERESTS
The author(s) declare that there is no conflict of interests.