Risk Sensitivity under Partially Observable Markov Decision Processes

Many real-life decisions must be made in the face of risk that is due to uncertain information about the environment. Even facing the same environment, different people may behave differently due to their individual risk preferences. For instance, a risk-seeking gambler may overestimate the chance of favorable outcomes, or the amount of money to be won in those cases, and therefore prefer to gamble. In cognitive neuroscience, Bayesian inference is usually applied to model the objective perception of the unobservable state, under which risk-neutral decisions are made by solving a partially observable Markov decision process (POMDP). However, the subjective evaluation of such inferred state information, which leads to different individual risk preferences, and the underlying neurobiological process are still poorly understood. Hence, we derived a risk-sensitive POMDP method that models human choice behavior and response time in a simulated investment task. Our risk-sensitive POMDP model fits the experimental data considerably better than the risk-neutral model. The model's risk-sensitivity parameters explained subjects' individual risk preferences under state uncertainty at the decision time. Our results may pave the way for understanding human risk-sensitive choice under perceptual uncertainty using a unified quantitative POMDP framework.


Introduction
Many real-life decisions are made in the twilight of uncertainty, such as whether to invest in a risky asset. At least two types of uncertainty can impact the economic consequences of a choice and thus result in decision risk (Bach & Dolan, 2012): first, the uncertain consequences of the decision maker's choice, and second, the decision maker's uncertain knowledge about the state underlying the choice situation. While the neural mechanisms underlying economic risk processing are fairly well established (Niv, Edlund, Dayan, & O'Doherty, 2012), the risk preferences induced by perceptual uncertainty are less clear. In this study, we generalized recent computational work on risk-sensitive Markov decision processes (MDPs) (Shen, Tobia, Sommer, & Obermayer, 2014) to the POMDP case and empirically validated this theoretical framework as a behavioral model for human response times (RTs) and choices in a novel experiment in which human subjects performed a simulated investment game. The behavioral task was designed to flexibly manipulate individual risk preferences induced by perceptual uncertainty. We identified subject groups with similar risk preferences and demonstrated each group's belief updates about the unobservable states using the risk-sensitive POMDP model.

Method
Experimental Paradigm: Human Risk-sensitive Choice under Perceptual Uncertainty
55 participants (32 female, mean age 25.44 ± 4.7 years) performed a sequential decision task in which they imagined themselves as an investor in a simulated stock market. The market had two unobservable states, a "good" state with a high investment return and a "bad" state with a low investment return. The good (bad) state was indicated by the left (right) motion direction of the random dot kinematogram (RDK) (Britten, Shadlen, Newsome, & Movshon, 1993) stimulus, which consisted of 180 frames with 1/60 seconds per frame. The stimulus switched its direction once within a trial at a random time point, which induced uncertain economic consequences of the subject's actions. The strength of the perceived motion was controlled by the probability that a particular dot would be displaced in the signal direction, typically referred to as coherence. At each frame, the participant chose between two possible actions: "sell" the stock or "wait". Selling in the good state led to a reward of 2.5 units, whereas selling in the bad state led to a reward of 1 unit. The wait action allowed the participant to accumulate information at the expense of a small constant waiting cost per frame. The episode terminated immediately after the subject chose to sell, or the stock was automatically sold at the last frame if the subject waited until the end of the episode. As an optimal strategy, a participant who believed the market to be in the good state should sell the stock as quickly as possible. On the other hand, when the participant believed the market to be in the bad state, they could either wait for the market state to switch to the good state for a larger profit or sell immediately to avoid further waiting costs. However, the accumulated waiting costs could be higher than the profit of the good state if the switch happened too late.
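The wait/sell trade-off in the bad state can be made concrete with a short worked example. The per-frame cost value below is an illustrative assumption, not the paper's actual setting:

```python
# A worked example of the wait/sell trade-off in the bad state.
# The per-frame cost value is an illustrative assumption, not the
# paper's actual setting.
R_GOOD, R_BAD, COST = 2.5, 1.0, 0.01      # selling rewards, waiting cost/frame

# If the state switches after k frames of waiting, selling then yields
# R_GOOD - k * COST; selling immediately yields R_BAD. Waiting only pays
# off if the switch happens before the break-even frame:
break_even = (R_GOOD - R_BAD) / COST
print(break_even)   # 150.0 frames under these illustrative numbers
```

Under these numbers, a switch occurring later than 150 frames into the waiting period would make waiting strictly worse than selling immediately.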
The risk in this task arises as a consequence of deciding under perceptual uncertainty about the unobservable market state. Details of the 2 × 2 × 2 factorial experimental design, with factors order of stimulus states (good first, bad first), coherence of both motion states (high, low), and waiting cost (high, low), are shown in Figure 1.

Computational Modeling: Risk-sensitive POMDP
The POMDP representation of the investment game is given by a tuple (S, A, Ω, T, O, R) with the following components:
• S is the set of unobservable market states, S = {good, bad}.
• A is the set of actions available at each frame, A = {sell, wait}.
• Ω is the set of observations given by the noisy RDK stimulus.
• T is the state transition function; the market state switched once per trial at a random time point.
• O is the observation function, i.e. the probability of an observation given the underlying state, determined by the motion coherence.
• R is the reward function: selling yielded 2.5 units in the good state and 1 unit in the bad state, and waiting incurred a small constant cost per frame.
The RDK stimulus represented the noisy observations that the agent received from the POMDP environment (displayed in Figure 2). In each trial, the RDK stimulus lasted at most N = 180 time steps (frames).
The resulting belief-state MDP has a belief space B which is the set of all probability distributions over the state space S.
Bayesian inference is used to update the belief upon receiving each new observation o after taking action a:

b'(s') = O(o | s') Σ_{s∈S} T(s' | s, a) b(s) / Pr(o | b, a),

where Pr(o | b, a) = Σ_{s'} O(o | s') Σ_{s} T(s' | s, a) b(s) is a normalizing constant.
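For the two-state market, this belief update amounts to a predict-and-correct filtering step. A minimal sketch follows; the per-frame switch probability and the binary per-frame observation model are assumptions made for this example, not the paper's actual observation model:

```python
import numpy as np

# Illustrative two-state filtering sketch. The per-frame switch probability
# and the binary observation model are assumptions for this example, not
# the paper's actual observation model.
P_SWITCH = 1.0 / 180.0                       # per-frame state switch probability
T = np.array([[1 - P_SWITCH, P_SWITCH],      # T[s, s']; state 0 = good (left)
              [P_SWITCH, 1 - P_SWITCH]])     #           state 1 = bad (right)
COHERENCE = 0.2                              # fraction of coherently moving dots

def observation_likelihood(o):
    """P(o | s) for a binary per-frame motion sample.

    o = +1: net leftward evidence (favors 'good'); o = -1: rightward ('bad').
    """
    p_signal = 0.5 + COHERENCE / 2.0         # signal-direction sample probability
    if o == +1:
        return np.array([p_signal, 1 - p_signal])
    return np.array([1 - p_signal, p_signal])

def belief_update(b, o):
    """One Bayesian filtering step: predict through T, correct by the likelihood."""
    predicted = b @ T                        # prior over the next state
    posterior = observation_likelihood(o) * predicted
    return posterior / posterior.sum()       # normalize to a distribution

b = np.array([0.5, 0.5])                     # uninformative initial belief
for o in [+1, +1, -1, +1]:                   # short illustrative frame sequence
    b = belief_update(b, o)
```

After mostly leftward evidence, the belief shifts toward the good state while remaining a proper probability distribution.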

Risk-neutral model
A risk-neutral agent aims at maximizing the expected cumulative reward through a policy π:

J_N(π, b) = E_π[ Σ_{n=0}^{N-1} r_n | b_0 = b ],

where r_n denotes the reward received at decision epoch n. The optimal policy π* := arg max_{π∈Π} J_N(π, b) can be obtained using standard value iteration in the belief space B with an appropriate approximation method (e.g. Spaan, 2012).
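For the two-state task, value iteration over a discretized belief space can be sketched as follows. All constants (horizon, cost, switch probability, sample probability, grid resolution) are illustrative placeholders, not the paper's task parameters:

```python
import numpy as np

# Illustrative constants for a sketch of risk-neutral backward induction
# (not the paper's actual task parameters).
N, COST = 180, 0.005                      # horizon in frames, waiting cost
R_GOOD, R_BAD = 2.5, 1.0                  # selling rewards
P_SWITCH, P_SIG = 1.0 / N, 0.6            # switch prob., P(signal-direction sample)
GRID = np.linspace(0.0, 1.0, 201)         # discretized belief b = P(good)

def step(b, o):
    """Belief after one predict-and-correct step given observation o in {+1, -1}."""
    b_pred = b * (1 - P_SWITCH) + (1 - b) * P_SWITCH
    lg = P_SIG if o == +1 else 1 - P_SIG  # likelihood of o under 'good'
    return lg * b_pred / (lg * b_pred + (1 - lg) * (1 - b_pred))

def p_obs(b, o):
    """Marginal probability of observation o under belief b."""
    b_pred = b * (1 - P_SWITCH) + (1 - b) * P_SWITCH
    lg = P_SIG if o == +1 else 1 - P_SIG
    return lg * b_pred + (1 - lg) * (1 - b_pred)

sell_value = GRID * R_GOOD + (1 - GRID) * R_BAD   # immediate reward of selling
V = sell_value.copy()                             # V_N: forced sell at horizon
for n in range(N - 1, -1, -1):                    # backward value iteration
    nearest = lambda x: V[np.abs(GRID - x).argmin()]   # NN on the belief grid
    wait_value = np.array([
        -COST + sum(p_obs(b, o) * nearest(step(b, o)) for o in (+1, -1))
        for b in GRID])
    V = np.maximum(sell_value, wait_value)
```

The resulting V approximates the optimal value for every gridded belief; the optimal action at a belief is "wait" wherever the wait value exceeds the sell value.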

Risk-sensitive model
To incorporate risk-sensitivity into the model, the agent is endowed with an exponential utility function

U_λ(x) = (1/λ)(1 − e^{−λx}),

where λ controls the risk preferences. When λ < 0 (λ > 0), U is convex (concave) and therefore the agent will be risk-seeking (risk-averse). For risk-sensitive POMDPs, the state space is expanded by the agent's wealth w ∈ W, i.e. its cumulative reward at any given time (Bäuerle & Rieder, 2017; Marecki & Varakantham, 2010). The objective of the risk-sensitive agent is then given by:

J_N^U(π, b) = E_π[ U( Σ_{n=0}^{N-1} r_n ) | b_0 = b ].

Value iteration can be performed by recursively calculating V_n^U(b, w), starting at the final decision epoch N, where it must hold that

V_N^U(b, w) = Σ_{s∈S} b(s) U(w + r_s) for all b ∈ B.

Here, r_s ∈ {1, 2.5} is short-hand for the reward for selling in state s and c denotes the waiting cost. Accordingly, the optimal state-action values can be calculated via backward recursion:

Q_n^U(b, w, a) = Σ_{o∈Ω} Pr(o | b, a) V_{n+1}^U(b', w + R(b, a)), V_n^U(b, w) = max_{a∈A} Q_n^U(b, w, a),

where b' is the updated belief and R(b, a) is the expected reward under action a with respect to the current belief b; for the terminating sell action, Q_n^U(b, w, sell) = Σ_{s∈S} b(s) U(w + r_s).
An approximation to the optimal value function was obtained using grid-based approximation with nearest-neighbor interpolation (Hauskrecht, 2000). Exploratory analysis of a risk-sensitive agent's choice behavior showed that negative values of λ corresponded to quicker responses at the cost of higher state uncertainty at decision time, whereas positive values of λ induced longer evidence accumulation and thus longer RTs on average.
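The wealth-expanded backward recursion with grid-based nearest-neighbor approximation can be sketched as follows. All numerical settings (λ, horizon, cost, switch probability, sample probability, grid resolutions) are illustrative placeholders chosen to keep the sketch small, not the paper's fitted values:

```python
import numpy as np

# All values below are illustrative placeholders, not the paper's settings;
# the horizon and grids are kept small so the sketch runs quickly.
LAM = -2.0                                   # lambda < 0: risk-seeking agent
N, COST = 60, 0.005                          # horizon (frames), waiting cost
R_GOOD, R_BAD = 2.5, 1.0                     # rewards for selling
P_SWITCH, P_SIG = 1.0 / N, 0.6               # switch prob., P(signal sample)
B_GRID = np.linspace(0.0, 1.0, 51)           # belief grid over b = P(good)
W_GRID = np.linspace(-N * COST, 0.0, 25)     # wealth (accumulated cost) grid

def U(x, lam=LAM):
    """Exponential utility: concave (risk-averse) for lam > 0, convex for lam < 0."""
    return (1.0 - np.exp(-lam * x)) / lam

def step(b, o):
    """Updated belief after predicting through the switch and observing o."""
    b_pred = b * (1 - P_SWITCH) + (1 - b) * P_SWITCH
    lg = P_SIG if o == +1 else 1 - P_SIG     # likelihood of o under 'good'
    return lg * b_pred / (lg * b_pred + (1 - lg) * (1 - b_pred))

def p_obs(b, o):
    """Marginal probability of observation o under belief b."""
    b_pred = b * (1 - P_SWITCH) + (1 - b) * P_SWITCH
    lg = P_SIG if o == +1 else 1 - P_SIG
    return lg * b_pred + (1 - lg) * (1 - b_pred)

def sell_utility(b, w):
    """Expected utility of selling now: sum_s b(s) U(w + r_s)."""
    return b * U(w + R_GOOD) + (1 - b) * U(w + R_BAD)

# Terminal condition V_N^U(b, w), then backward recursion with NN lookups.
V = sell_utility(B_GRID[:, None], W_GRID[None, :])
for n in range(N - 1, -1, -1):
    V_next = V.copy()
    def nn(b, w):                            # nearest-neighbor interpolation
        return V_next[np.abs(B_GRID - b).argmin(), np.abs(W_GRID - w).argmin()]
    for i, b in enumerate(B_GRID):
        for j, w in enumerate(W_GRID):
            wait = sum(p_obs(b, o) * nn(step(b, o), w - COST) for o in (+1, -1))
            V[i, j] = max(sell_utility(b, w), wait)
```

Because the wait branch passes accumulated cost through the nonlinear utility, the same belief can map to different decisions at different wealth levels, which is what makes the wealth expansion necessary.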

Model-based Analysis
Both the risk-neutral and the risk-sensitive model were fitted to the subjects' behavioral RTs. We identified subject groups with similar risk preferences by applying k-means clustering to their RT quantiles. Goodness of fit was determined by measuring the similarity between the subjects' and the model agent's RT distributions based on the Euclidean distance between their quantiles. For sets of candidate values λ ∈ Λ and coh_low, coh_high ∈ C, we fit the parameters by the following procedure: let Θ := Λ × C × C denote the parameter space, and let q_low and q_high denote the vectorial representations of the RT distribution quantiles corresponding to the experimental conditions with low and high coherence, respectively.
The optimal set of parameters θ* is then given by

θ* = arg min_{θ∈Θ} ‖q_low^subjects − q_low^agent(λ, coh_low)‖ + ‖q_high^subjects − q_high^agent(λ, coh_high)‖.

The analysis was performed both at the group level and for individual subjects.
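This fitting procedure is an exhaustive grid search over Θ. A minimal sketch follows; the quantile levels and the `simulate` callback (standing in for running the fitted agent under one coherence condition) are assumptions for illustration:

```python
import numpy as np

# The quantile levels below are assumed for illustration; the paper does
# not specify them here.
QUANTS = np.linspace(0.1, 0.9, 9)

def rt_quantiles(rts):
    """Vector representation of an RT distribution by its quantiles."""
    return np.quantile(np.asarray(rts, dtype=float), QUANTS)

def fit_criterion(q_sub_low, q_sub_high, q_agent_low, q_agent_high):
    """Sum of Euclidean quantile distances across coherence conditions."""
    return (np.linalg.norm(q_sub_low - q_agent_low)
            + np.linalg.norm(q_sub_high - q_agent_high))

def grid_search(subject_rts, simulate, lambdas, coherences):
    """Exhaustive search over Theta = Lambda x C x C.

    `simulate(lam, coh)` is a hypothetical stand-in for running the
    risk-sensitive POMDP agent and returning its RTs for one condition.
    """
    q_low = rt_quantiles(subject_rts["low"])
    q_high = rt_quantiles(subject_rts["high"])
    best, best_theta = np.inf, None
    for lam in lambdas:                       # loop over the parameter grid
        for c_lo in coherences:
            for c_hi in coherences:
                score = fit_criterion(q_low, q_high,
                                      rt_quantiles(simulate(lam, c_lo)),
                                      rt_quantiles(simulate(lam, c_hi)))
                if score < best:
                    best, best_theta = score, (lam, c_lo, c_hi)
    return best_theta, best
```

With synthetic RTs generated by a known parameter setting, the search recovers that setting, which is a useful sanity check before fitting real data.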

Results
The group-wise cumulative RT distributions are shown in Figure 3 for each experimental condition. The clearly distinct group patterns supported the assumption that risk-sensitivity towards perceptual uncertainty guides subjects' choice behavior. For example, Group 1 preferred to accumulate a lot of evidence in order to reduce perceptual uncertainty at decision time. Conversely, Group 2 sold very quickly on average, thus avoiding waiting costs at the expense of higher perceptual uncertainty.
The group-wise goodness of fit and best-fitting risk-sensitivity parameters are shown in Table 1. The overall lower values of the fitting criterion for the risk-sensitive model showed that it explained the behavioral data better than the risk-neutral model. This was particularly evident for groups favoring rather extreme choice strategies, either accumulating a lot of evidence (Group 1) or selling very quickly (Group 2).
The subject groups who favored longer evidence accumulation showed a significantly higher fraction of sell actions in the good state (Figure 4, left). The corresponding risk-sensitive agents that were fitted to each group replicated this choice behavior closely (Figure 4, right).
Subject-wise fitting scores under the risk-neutral vs. the risk-sensitive model are visualized in Figure 5. The results from the individual-level analysis further demonstrated that the risk-sensitive model fitted the experimental data significantly better than the risk-neutral model.

Discussion
Our results provide evidence favoring risk-sensitive POMDPs over the risk-neutral model for modeling choice behavior. Risk preferences under perceptual uncertainty are reflected in the parameters of the best-fitting risk-sensitive POMDP model. Individual risk preferences were identified by their differential RT distributions. The RT distributions of the risk-sensitive model agents with varying parameters resembled the subjects' choice behavior, especially with respect to the policies of waiting long (accumulating evidence) or selling quickly (avoiding waiting costs). In summary, our study demonstrated that the concepts derived for risk-sensitive planning under economic uncertainty can be carried over to perceptual uncertainty at the within-trial level for describing behavioral RT. A follow-up study at the between-trial level with both behavioral and neuroimaging experiments could yield further insights into whether the neural correlates of risk-sensitive reinforcement learning (Niv et al., 2012; Shen et al., 2014) are also involved in the acquisition of risk-sensitive decision policies under perceptual uncertainty.