Abstract
The Iterated Prisoner's Dilemma (IPD), an important psychological and social experiment, treats the choice to cooperate or defect as an atomic action. We study the behaviors of online learning algorithms in the IPD game, investigating the full spectrum of reinforcement learning agents: multi-armed bandits, contextual bandits, and reinforcement learning. We evaluate them in an IPD tournament in which multiple agents compete in a sequential fashion. This allows us to analyze the dynamics of the policies learned by multiple self-interested, independent, reward-driven agents, and also to study how well these algorithms fit human behaviors. Results suggest that making decisions based only on the current situation performs worst in this kind of social dilemma game. We report multiple findings on online learning behaviors, together with clinical validations, as an effort to connect artificial intelligence algorithms with human behaviors and their abnormal states in neuropsychiatric conditions.
The data and code to reproduce all the empirical results are available at https://github.com/doerlbh/dilemmaRL.
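To make the setup concrete, the following is a minimal sketch of the kind of experiment the abstract describes: a tabular Q-learning agent whose state is the opponent's last move, playing iterated prisoner's dilemma against a fixed tit-for-tat strategy under the standard payoff matrix (T=5, R=3, P=1, S=0). All names here (`QLearner`, `run_match`, the hyperparameters) are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
import random

# Standard IPD payoff matrix: (my_reward, their_reward) for (my_move, their_move).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(history):
    """Cooperate on the first round, then mirror the learner's previous move."""
    return "C" if not history else history[-1][0]

class QLearner:
    """Tabular Q-learning agent conditioning only on the opponent's last move."""
    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.q = {(s, a): 0.0 for s in ("start", "C", "D") for a in ("C", "D")}
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)

    def act(self, state):
        # Epsilon-greedy action selection over the two atomic actions.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(("C", "D"))
        return max(("C", "D"), key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # One-step Q-learning backup.
        best_next = max(self.q[(next_state, a)] for a in ("C", "D"))
        self.q[(state, action)] += self.alpha * (
            reward + self.gamma * best_next - self.q[(state, action)])

def run_match(agent, rounds=500):
    """Play the learner against tit-for-tat; return the per-round history."""
    history = []          # list of (agent_move, opponent_move) pairs
    state = "start"
    for _ in range(rounds):
        a = agent.act(state)
        o = tit_for_tat(history)
        r, _ = PAYOFF[(a, o)]
        agent.update(state, a, r, next_state=o)
        history.append((a, o))
        state = o          # the only "context" is the opponent's latest move
    return history

history = run_match(QLearner())
agent_score = sum(PAYOFF[h][0] for h in history)
```

Extending this loop to a round-robin tournament over multiple agent types (bandit, contextual bandit, full RL) yields the sequential competition setting the paper evaluates.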
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Lin, B., Bouneffouf, D., Cecchi, G. (2022). Online Learning in Iterated Prisoner’s Dilemma to Mimic Human Behavior. In: Khanna, S., Cao, J., Bai, Q., Xu, G. (eds) PRICAI 2022: Trends in Artificial Intelligence. PRICAI 2022. Lecture Notes in Computer Science, vol 13631. Springer, Cham. https://doi.org/10.1007/978-3-031-20868-3_10
Print ISBN: 978-3-031-20867-6
Online ISBN: 978-3-031-20868-3