Don’t Predict Counterfactual Values, Predict Expected Values Instead

Authors

  • Jeremiasz Wołosiuk, Deepsolver
  • Maciej Świechowski, QED Software; Warsaw University of Technology
  • Jacek Mańdziuk, Warsaw University of Technology

DOI:

https://doi.org/10.1609/aaai.v37i4.25661

Keywords:

APP: Games, GTEP: Imperfect Information

Abstract

Counterfactual Regret Minimization algorithms are the most popular way of estimating the Nash Equilibrium in imperfect-information zero-sum games. In particular, DeepStack -- the state-of-the-art Poker bot -- employs the so-called Deep Counterfactual Value Network (DCVN) to learn the Counterfactual Values (CFVs) associated with various states in the game. Each CFV is the product of two factors: (1) the probability that the opponent would reach a given state in the game, which can be calculated explicitly from the input data, and (2) the expected value (EV) of the payoff in that state, which is a complex, hard-to-compute function of the input data. In this paper, we propose a simple yet powerful modification to the CFV estimation process: a deep neural network is used to estimate only the EV factor of each CFV. This new target setting significantly simplifies the learning problem and leads to much more accurate CFV estimation. A direct comparison of CFV prediction losses shows a significant accuracy improvement of the proposed approach (DEVN) over the original DCVN formulation (a relative improvement of 9.18-15.70% with card abstraction and 3.37-8.39% without it, depending on the particular setting). Furthermore, DEVN improves the theoretical lower bound of the error by 29.05-31.83% compared to the DCVN pipeline when card abstraction is applied. Additionally, DEVN achieves this with significantly smaller networks that are faster to run at inference time. While the proposed modification may seem rather technical, it in fact represents a fundamentally different approach to learning and estimating CFVs, since the distributions of the training signals differ significantly between DCVN and DEVN. The former estimates CFVs, which are biased by the probability of reaching a given game state, whereas the latter is trained to estimate EVs directly, regardless of the state probability. In effect, the learning signal of DEVN provides a better estimate of the true value of a given state, which allows for more accurate CFV estimation.
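The factorization described above can be sketched in code. This is an illustrative sketch only, not the paper's pipeline: the names (ev_model, own_range, opp_range) and the way the opponent's reach probability is obtained from the input range are assumptions made for illustration.

    # Sketch of a DEVN-style CFV reconstruction (assumed names and shapes):
    # the network predicts per-hand expected values (EVs), and CFVs are
    # recovered by multiplying with the opponent's reach probability,
    # which the abstract notes can be computed explicitly from the input data.
    import numpy as np

    def cfvs_from_expected_values(ev_model, public_state, own_range, opp_range):
        """Recover CFVs as CFV = P(opponent reaches state) * EV (illustrative)."""
        # Explicit factor: opponent reach probability, taken here as the mass of
        # the (unnormalized) opponent range supplied as input -- an assumption.
        opp_reach = opp_range.sum()
        # Learned factor: expected payoff per own hand, predicted by the network
        # from the public state and the normalized opponent range.
        evs = ev_model(public_state, own_range, opp_range / max(opp_reach, 1e-12))
        return opp_reach * evs

    # Toy usage with a stand-in "model" that returns zeros of the right shape.
    dummy_model = lambda state, r_own, r_opp: np.zeros_like(r_own)
    print(cfvs_from_expected_values(dummy_model, None, np.ones(10) / 10, np.ones(10) / 20))

In this setup only the EV factor has to be learned, which is the core of the paper's proposal; the reach probability is multiplied back in outside the network.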

Published

2023-06-26

How to Cite

Wołosiuk, J., Świechowski, M., & Mańdziuk, J. (2023). Don’t Predict Counterfactual Values, Predict Expected Values Instead. Proceedings of the AAAI Conference on Artificial Intelligence, 37(4), 5303-5311. https://doi.org/10.1609/aaai.v37i4.25661

Section

AAAI Technical Track on Domain(s) of Application