1 Introduction

Human decisions are often recorded, which produces interesting datasets. With the success of modern machine learning tools comes the hope of exploiting them to build useful decision helpers. One way to achieve this is imitation learning [1]. However, with this approach we will at best match human performance. Moreover, its performance depends heavily on the quality of the training dataset. In this work, we would like to avoid these limitations and therefore consider another framework: reinforcement learning (RL).

RL has achieved impressive results in a number of challenging areas, including games [2, 3], robotic control [4, 5] and even health care [6, 7]. In particular, offline RL is appealing for real-world applications, as it aims to train agents from a dataset of past interactions with the environment.

With the digitization of society, more and more features can be used to represent the environment. Unfortunately, classical RL algorithms cannot handle high-dimensional states. Although we could manually choose a feature subset, this choice is not straightforward and can have a large impact on performance. Therefore, it may be more practical to use RL algorithms capable of learning from high-dimensional states.

It is common to evaluate RL algorithms on deterministic environments (in the sense that the reward function is deterministic), such as the DeepMind Control suite [8]. However, many real-world environments are not deterministic but stochastic. In stochastic environments, the same action may lead to different outcomes due to randomness, which can make it difficult for the agent to distinguish the quality of different actions and states and thereby complicates the learning process. Because of this inherent randomness, stochastic environments demand a larger number of samples to accurately estimate value functions and policies, whereas offline learning has only a limited amount of data available for training. The combination of these factors makes offline RL in stochastic environments particularly challenging, and the performance of classical RL methods in such contexts remains unclear. As presented in Sect. 5.2, our experiments suggest that adding stochasticity to the environment significantly decreases the performance of these algorithms.

Based on these observations, we aim to develop a method able to train policies in offline settings with high-dimensional states while being robust to stochasticity. In this paper, we present Latent Offline Distributional Actor-Critic (LODAC), an algorithm designed to train policies in high-dimensional, stochastic environments within offline settings. The main idea is to encode the natural environment space into a smaller representation and to train the agent directly in this latent space. However, instead of considering the expected return, we use a risk measure. First, under some assumptions, we show that minimizing this risk measure in the latent space is equivalent to minimizing it directly in the natural state space. This theoretical result provides a natural framework: employ a latent variable model to encode the natural state into a latent space, and then train a policy on top of this latent space using a risk-sensitive RL algorithm. In the experimental part, we evaluate our algorithm on high-dimensional stochastic and deterministic datasets. To the best of our knowledge, we are the first to propose an algorithm for training policies in high-dimensional, stochastic and offline settings.

2 Related work

Before going further into our work, we present some related research.

As discussed and motivated in the previous section, this paper focuses on offline RL [9]. Offline RL is a specific approach to RL that aims to learn from past interactions with the environment. This framework holds promise for real-world applications as it enables the deployment of trained policies without further environment interaction. Therefore, it is not surprising that offline RL has received a lot of attention in recent years [10,11,12,13,14,15]. One of the main problems in offline RL is that the Q-function tends to be overly optimistic for out-of-distribution (OOD) state-action pairs, where observational data is limited [16]. Several approaches have been introduced to address this issue. While some studies extend importance sampling methods [17, 18], others propose model-based techniques [19, 20] or incorporate dataset re-weighting [21]. In our work, and following previous research [22,23,24], we build a conservative estimate of the Q-function for OOD state-action pairs.

Another challenge encountered in our work is training policies from high-dimensional states. Over the past years, several algorithms have been proposed to address policy training in such contexts. For instance, several methods have been developed to directly handle image inputs [25,26,27,28]. Acquiring a meaningful representation of the observation is crucial to tackle this type of problem, and this issue has been studied in previous works [29, 30]. Some researchers employ data augmentation techniques to facilitate the learning of good representations [31]. However, these methods are limited to scenarios involving image states. Hence, in our work, we adhere to other common practices [32, 33] and encode high-dimensional states into a latent space without relying on data augmentation. Furthermore, we integrate planning within this latent space, building upon studies that highlight its potential for performance enhancement [34,35,36]. A common approach is then to train policies using classical RL algorithms, such as Soft Actor-Critic (SAC) [4], on top of this compact representation [37, 38]. Unfortunately, these classical RL algorithms struggle to train agents in stochastic environments. Therefore, in our study, we employ a risk-sensitive RL method on top of this latent space.

Risk-sensitive RL is a specific approach to safe RL [39]. In safe RL, policies are trained to maximize performance while satisfying some safety constraints during training and/or deployment. In risk-sensitive RL, the focus is on minimizing a measure of the risk induced by the cumulative rewards, rather than maximizing the expected return. Risk-sensitive RL has gained attention in recent years [40,41,42,43]. Various risk measures may be considered, such as Exponential Utility [44], Cumulative Prospect Theory [45] or Conditional Value-at-Risk (CVaR) [46]. In this work, given the robust theoretical underpinnings and intuitive nature of Conditional Value-at-Risk [47, 48], we opt to focus on this operator. Furthermore, prior research suggests that taking Conditional Value-at-Risk into account, instead of the classical expectation, can reduce the performance gap between simulations and real-world applications [49]. CVaR is popular and has been extensively studied in the context of RL for many years [50,51,52]. For instance, [53] extend SAC to a distributional setting, enabling the training of risk-averse policies; however, this algorithm is not suited for offline RL. Other authors presented O-RAAC [54], an algorithm able to train risk-averse policies in an offline setting, where OOD errors are managed through an imitation learning component [55]. Finally, Ma et al. [56] presented a method for training risk-averse policies in an offline setting. In our work, we aim to extend this method to enable the training of policies in the context of high-dimensional states.

3 Preliminaries

In this section, we introduce notations and recall concepts we will use later.

Coherent risk measure Let \((\Omega , {\mathcal {F}}, {\mathbb {P}})\) be a probability space and \({\mathcal {L}}^{2}:= {\mathcal {L}}^{2}(\Omega , {\mathcal {F}}, {\mathbb {P}})\). A function \({\mathcal {R}} \text {: } {\mathcal {L}}^{2} \rightarrow (-\infty , +\infty ]\) is called a coherent risk measure [57] if

  1. \({\mathcal {R}}(C) = C \text { for all constants } C\).

  2. \({\mathcal {R}}((1 - \lambda )X + \lambda X') \le (1 - \lambda ) {\mathcal {R}}(X) + \lambda {\mathcal {R}}(X') \text { for } \lambda \in (0, 1)\).

  3. \({\mathcal {R}}(X) \le {\mathcal {R}}(X')\text { when } X \le X'\).

  4. \({\mathcal {R}}(X) \le 0 \text { when } \Vert X_{k} - X \Vert _{2} \rightarrow 0 \text { with } {\mathcal {R}}(X_{k}) \le 0\).

  5. \({\mathcal {R}}(\lambda X) = \lambda {\mathcal {R}}(X) \text { for } \lambda > 0\).

Coherent risk measures have some interesting properties. In particular, a risk measure is coherent if and only if there exists a risk envelope \({\mathcal {U}}\) such that

$$\begin{aligned} {\mathcal {R}}(X) = \sup _{\delta \in {\mathcal {U}}} {\mathbb {E}}_{p}[\delta X] \end{aligned}$$
(1)

[48, 58, 59]. A risk envelope is a nonempty, closed and convex subset of \({\mathcal {P}}:= \{ \delta \in {\mathcal {L}}^{2} \text { } | \text { } \delta \ge 0 \text {, } {\mathbb {E}}_{p}[\delta ] = 1 \}\).

The definition of a risk measure \({\mathcal {R}}\) might depend on a probability distribution p. In some cases, it is useful to specify which distribution we are working with; we then use the notation \({\mathcal {R}}_{p}\). There exist many different coherent risk measures, for example the Wang risk measure [60], the entropic Value-at-Risk [61] or Conditional Value-at-Risk [46, 62].

Conditional Value-at-Risk (CVaR\(_{\alpha }\)) with probability level \(\alpha \in (0, 1)\) is defined as

$$\begin{aligned}&\text {CVaR}_{\alpha }(X) \\&\quad :=\min _{t \in {\mathbb {R}}} \left\{ t + \frac{1}{1 - \alpha }{\mathbb {E}}_{p} \Bigl [ \max \{ 0, X-t \} \Bigr ] \right\} \end{aligned}$$
(2)

Moreover, the risk envelope associated with CVaR\(_{\alpha }\) can be written as \({\mathcal {U}} = \{ \delta \in {\mathcal {P}} \text { } | \text { } {\mathbb {E}}_{p}[\delta ] = 1 \text {, } 0 \le \delta \le \frac{1}{\alpha } \}\) [59, 63]. This rigorous definition may not be intuitive, but roughly speaking, CVaR\(_{\alpha }\) is the expected value of X over the upper tail of its distribution, that is, the average of the worst-case outcomes. In this paper, we use the classical definition of Conditional Value-at-Risk from the risk measure literature; in particular, X should be interpreted as a loss.
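To make this concrete, here is a minimal numerical sketch (our own illustration, not taken from the paper) that estimates CVaR\(_{\alpha }\) of a loss X from samples, both via the minimization formula (2) and by averaging the worst-case tail; the two estimates agree.

```python
# Minimal sketch: empirical CVaR of a loss X, assuming i.i.d. samples.
import numpy as np

def cvar_min_formula(x, alpha):
    # CVaR_alpha(X) = min_t { t + E[max(0, X - t)] / (1 - alpha) };
    # the minimizer is the alpha-quantile of X, so we plug it in directly.
    t = np.quantile(x, alpha)
    return t + np.maximum(x - t, 0.0).mean() / (1.0 - alpha)

def cvar_tail_average(x, alpha):
    # Average of the largest losses, i.e., the worst-case tail of X.
    tail = np.sort(x)[int(np.ceil(alpha * len(x))):]
    return tail.mean()

rng = np.random.default_rng(0)
losses = rng.normal(size=100_000)
print(cvar_min_formula(losses, 0.7), cvar_tail_average(losses, 0.7))  # approximately equal
```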

Offline RL We consider a Markov Decision Process (MDP) \((S, {\mathcal {A}}, p, r, \mu _{0}, \gamma )\), where S is the environment space, \({\mathcal {A}}\) the action space, r the reward distribution (\(r_{t} \sim r(\cdot | s_{t}, a_{t})\)) and p the transition probability distribution (\(s_{t+1} \sim p( \cdot | s_{t}, a_{t})\)). \(\mu _{0}\) is the initial state distribution and \(\gamma \in (0, 1)\) denotes the discount factor. To simplify notation, we define \(p(s_{0}):= \mu _{0}(s_{0})\) and write the MDP as \((S, {\mathcal {A}}, p, r, \gamma )\).

Actions are taken according to a policy \(\pi\) which depends on the environment state (i.e., \(a_{t} \sim \pi (\cdot | s_{t})\)). A sequence \(\tau = s_{0}, a_{0}, r_{0}, \ldots , s_{H}\) on the MDP \((S, {\mathcal {A}}, p, r, \gamma )\), with \(s_{i} \in S\) and \(a_{i} \in {\mathcal {A}}\), is called a trajectory. A trajectory of fixed length \(H \in {\mathbb {N}}\) is called an episode. Given a policy \(\pi\), a rollout from a state-action pair \((s, a) \in S \times {\mathcal {A}}\) is a random sequence \(\{ (s_{0}, a_{0}, r_{0}), (s_{1}, a_{1}, r_{1}), \ldots \}\) where \(s_{0}=s\), \(a_{0}=a\), \(s_{t+1} \sim p( \cdot | s_{t}, a_{t})\), \(r_{t} \sim r(\cdot | s_{t}, a_{t})\) and \(a_{t} \sim \pi (\cdot | s_{t})\). Given a policy \(\pi\) and a fixed length H, we obtain a trajectory distribution given by

$$\begin{aligned} p_{\pi }(\tau ) = p(s_{0}) \prod _{t=0}^{H-1} \pi (a_{t}| s_{t})p(s_{t+1} | s_{t}, a_{t})r(r_{t}|s_{t}, a_{t}) \end{aligned}$$

The goal of classical risk-neutral RL algorithms is to find a policy that maximizes the expected discounted return \({\mathbb {E}}_{\pi }[\sum _{t=0}^{H} \gamma ^{t}r_{t}]\) where H could be infinite. Equivalently, we can look for the policy that maximizes the Q-function, which is defined as \(Q_{\pi } \text {: } S \times {\mathcal {A}} \rightarrow {\mathbb {R}}\), \(Q_{\pi }(s_{t}, a_{t}):= {\mathbb {E}}_{\pi }[\sum _{t'=t}^{H} \gamma ^{t'-t} r_{t'}]\).

Instead of this classical objective function, other choices are possible, for example the maximum entropy RL objective \({\mathbb {E}}_{\pi }[ \sum _{t=0}^{H} r_{t} + {\mathcal {H}}(\pi ( \cdot | s_{t}))]\), where \({\mathcal {H}}\) denotes the entropy. This objective has an interesting connection with variational inference [64] and has shown impressive results in recent years [4, 19, 65].

In offline RL, we have access to a fixed dataset \({\mathcal {D}} = \{ (s_{t}, a_{t}, r_{t}, s_{t+1}) \}\), where \(s_{t} \in S\), \(a_{t} \in {\mathcal {A}}\), \(r_{t} \sim r(\cdot | s_{t}, a_{t})\) and \(s_{t+1} \sim p(\cdot | s_{t}, a_{t})\), and we aim to train policies without interacting with the environment. Such a dataset comes with the empirical behavior policy \(\pi _{\beta }(a | s):= \frac{\sum _{(s_{t}, a_{t}) \in {\mathcal {D}}} \mathbbm {1}_{\{ s_{t}=s, a_{t}=a \}}}{\sum _{s_{t} \in {\mathcal {D}}} \mathbbm {1}_{\{s_{t}=s \}}}.\)

Latent variable model There are different methods for learning directly from high-dimensional states [29, 65, 66]. In this work, we build upon the framework presented in Stochastic Latent Actor-Critic (SLAC) [38]. The main idea is to train a latent variable model to encode the natural MDP \((S, {\mathcal {A}}, p, r, \gamma )\) into a latent MDP \((Z, {\mathcal {A}}, q, r, \gamma )\) and to train policies directly in this latent space. To achieve this, the variational distribution \(q(z_{1:H}, a_{t+1:H} | s_{1:t}, a_{1:t})\) is factorized into a product of inference terms \(q(z_{i+1} | z_{i}, s_{i+1}, a_{i})\), latent dynamics terms \(q(z_{i+1} | z_{i}, a_{i})\) and policy terms \(\pi (a_{i} | s_{1:i}, a_{1:i-1})\) as follows

$$\begin{aligned}&q(z_{1:H}, a_{t+1:H} | s_{1:t}, a_{1:t}) \\&\quad =\prod _{i=0}^{t} q(z_{i+1} | z_{i}, s_{i+1}, a_{i})\prod _{i=t+1}^{H-1}q(z_{i+1} | z_{i}, a_{i}) \\&\quad \prod _{i=t+1}^{H-1}\pi (a_{i} | s_{1:i}, a_{1:i-1}) \end{aligned}$$
(3)

Using this factorization, the evidence lower bound (ELBO) [67] and the theoretical framework of [64], the following objective function for the latent variable model is derived

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_{z_{1:t}, a_{t+1:H} \sim q} \biggl [ \sum _{i=0}^{t} \log D(s_{i+1} | z_{i+1}) \\&- D_{KL}\Bigl ( q(z_{i+1} | z_{i}, s_{i+1}, a_{i}) || p(z_{i+1} | z_{i}, a_{i}) \Bigr ) \biggr ] \end{aligned} \end{aligned}$$
(4)

where \(D_{KL}\) is the Kullback–Leibler divergence and D is a decoder.

Distributional RL The aim of distributional RL is to learn the distribution of the discounted cumulative reward \(Z_{\pi }:= - \sum _{t=0}^{H} \gamma ^{t}r_{t}\), which is a random variable. A classical approach is to learn \(Z_{\pi }\) implicitly through its quantile function \(F^{-1}_{Z(s, a)} \text {: } [0, 1] \rightarrow {\mathbb {R}}\), defined as \(F^{-1}_{Z(s, a)}(y):= \inf \{ x \in {\mathbb {R}} \text { } | \text { } y \le F_{Z(s, a)}(x) \}\), where \(F_{Z(s, a)}\) is the cumulative distribution function of the random variable Z(s, a). A model \(Q_{\theta }(\eta , s, a)\) is used to approximate \(F^{-1}_{Z(s, a)}(\eta )\). For \((s, a, r, s') \sim {\mathcal {D}}\), \(a' \sim \pi (\cdot | s')\), \(\eta , \eta ' \sim \text {Uniform}[0, 1]\), we define \(\delta := r + \gamma Q_{\theta '}(\eta ', s', a') - Q_{\theta }(\eta , s, a)\). \(Q_{\theta }\) is trained using the Huber quantile regression loss at threshold k [68]

$$\begin{aligned}&{\mathcal {L}}_{k}(\delta , \eta ) \\&\quad := \left\{ \begin{array}{lll} | \eta - \mathbbm {1}_{ \{ \delta< 0 \} }| (\delta ^{2} / 2k) &{} \text { if } |\delta |< k \\ | \eta - \mathbbm {1}_{ \{ \delta < 0 \} } | (|\delta | - k/2) &{} \text { otherwise.} \\ \end{array} \right. \end{aligned}$$
(5)
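For illustration, the following PyTorch sketch (our own, not the authors' implementation) computes the Huber quantile regression loss of Eq. (5) for a batch of TD errors \(\delta\) and quantile fractions \(\eta\); the threshold k is the only free parameter.

```python
# Sketch of the Huber quantile regression loss L_k(delta, eta) from Eq. (5).
import torch

def huber_quantile_loss(delta: torch.Tensor, eta: torch.Tensor, k: float = 1.0) -> torch.Tensor:
    abs_delta = delta.abs()
    # Huber part: quadratic below the threshold k, linear above it.
    huber = torch.where(abs_delta < k, delta.pow(2) / (2 * k), abs_delta - k / 2)
    # Asymmetric quantile weight |eta - 1{delta < 0}|.
    weight = (eta - (delta.detach() < 0).float()).abs()
    return (weight * huber).mean()

# delta = r + gamma * Q_target(eta', s', a') - Q(eta, s, a), with eta, eta' ~ Uniform[0, 1]
delta, eta = torch.randn(32), torch.rand(32)
print(huber_quantile_loss(delta, eta, k=1.0))
```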

With this function \(F_{Z(s, a)}\), different risk measures can be computed, such as cumulative probability weighting (CPW) [45], the Wang measure [60] or Conditional Value-at-Risk. For example, the following equation [69] is used to compute CVaR\(_{\alpha }\)

$$\begin{aligned} \text {CVaR}_{\alpha }(X) = \frac{1}{1 - \alpha } \int _{\alpha }^{1} F^{-1}_{Z(s,a)}(\tau ) \textrm{d}\tau \end{aligned}$$
(6)

It is also possible to extend distributional RL to offline settings. For example, O-RAAC [54] decomposes the actor into two components: an imitation actor and a perturbation model. CODAC [56] extends DSAC [70] to offline settings. More precisely, the critic \(Q_{\theta }(\eta , s, a)\) is trained using the following loss function

$$\begin{aligned} \alpha {\mathbb {E}}_{\eta \sim U} \Biggl [&{\mathbb {E}}_{ s \sim {\mathcal {D}}} \biggl [ \log \sum _{a} \exp (Q_{\theta }(\eta , s, a)) \biggr ] \\&- {\mathbb {E}}_{ (s, a) \sim {\mathcal {D}}} \left[ Q_{\theta }(\eta , s, a) \right] \Biggr ] + {\mathcal {L}}_{k}(\delta , \eta ) \end{aligned}$$
(7)

where \(U = \text {Uniform}[0, 1]\). The first term of the equation is introduced to avoid overly optimistic estimates for OOD state-action pairs [22]. The second term (i.e., \({\mathcal {L}}_{k}(\delta , \eta )\)) is the classical objective function used to train the Q-function in a distributional setting.

4 Theoretical considerations

In this paper, we aim to train policies in high-dimensional, stochastic environments within offline settings. From a practical point of view, our idea is straightforward. Since training directly on high-dimensional states fails, and following previous works [19, 38], we encode our high-dimensional states into a more compact representation using \(\phi \text {: } S \rightarrow Z\), where \(Z = \phi (S)\). Then, using \(\phi\), we build an MDP in the latent space and train policies directly on top of this space.

However, from a theoretical point of view, this general idea is not entirely clear. Is this latent MDP always well-defined? If we find a policy that minimizes a risk measure in the latent space, does it also minimize the risk measure in the natural state space?

The first goal of this section is to rigorously construct an MDP in the latent space. Then, we show that, for certain coherent risk measures and under some assumptions, minimizing the risk measure in the latent space is equivalent to minimizing it in the natural space. In particular, we demonstrate that it is the case for Conditional Value-at-Risk.

4.1 Theoretical results

First, we make the following assumptions

  1. \(\forall a \in {\mathcal {A}}\), we have \(r(\cdot | s, a) = r(\cdot | s', a)\) if \(\phi (s) = \phi (s')\).

  2. We denote by \({\mathbb {P}}(\cdot | s_{t}, a_{t})\) the probability measure with probability density function \(p(\cdot | s_{t}, a_{t})\). We suppose that if \(s_{t}, s'_{t} \in S\) satisfy \(\phi (s_{t}) = \phi (s'_{t})\), then \({\mathbb {P}}(\cdot | s_{t}, a_{t}) = {\mathbb {P}}(\cdot | s_{t}', a_{t})\).

  3. We denote by \({\mathbb {Q}}(\cdot | z_{t}, a_{t})\) the probability image of \({\mathbb {P}}(\cdot | s_{t}, a_{t})\) by \(\phi\), where \(s_{t}\) is any element of \(\phi ^{-1}(z_{t})\). We suppose that \({\mathbb {Q}}(\cdot | z_{t}, a_{t})\) admits a probability density function \(q(\cdot | z_{t}, a_{t})\).

Under these assumptions, \(\phi\) induces an MDP \((Z, {\mathcal {A}}, q, r', \gamma )\), where the reward distribution \(r'\) is defined as \(r'(\cdot | z, a):= r(\cdot | s, a)\) for any \(s \in \phi ^{-1}(z)\). \(r'\) is well defined by surjectivity of \(\phi\) and by assumption (1), and \({\mathbb {Q}}\) is well defined by assumption (2).

Obviously, for a given policy \(\pi '\) on the MDP \((Z, {\mathcal {A}}, q, r', \gamma )\) and fixed length H, we have a trajectory distribution

$$\begin{aligned} q_{\pi '}(\tau ') = q(z_{0}) \prod _{t=0}^{H-1} \pi '(a_{t}| z_{t})q(z_{t+1} | z_{t}, a_{t})r'(r_{t}| z_{t}, a_{t}) \end{aligned}$$

Given a policy \(\pi\) and a fixed length H, we denote by \(\Omega\) the set of all trajectories on the MDP \((S, {\mathcal {A}}, p, r, \gamma )\), by \({\mathcal {F}}\) the \(\sigma\)-algebra generated by these trajectories and by \({\mathbb {P}}_{\pi }\) the probability measure with density \(p_{\pi }\). Following the same idea, given a policy \(\pi '\), we denote by \(\Omega '\) the set of trajectories of length H on the MDP \((Z, {\mathcal {A}}, q, r', \gamma )\), by \(\mathcal {F'}\) the \(\sigma\)-algebra generated by these trajectories and by \({\mathbb {Q}}_{\pi '}\) the probability measure with density \(q_{\pi '}\). \((\Omega , {\mathcal {F}}, {\mathbb {P}}_{\pi })\) and \((\Omega ', \mathcal {F'}, {\mathbb {Q}}_{\pi '})\) are probability spaces.

\(\phi \text {: } S \rightarrow Z\) induces a map

$$\begin{aligned} \Omega&\rightarrow \Omega ' \\ (s_{0}, a_{0}, r_{0}, \ldots , s_{H})&\mapsto (\phi (s_{0}), a_{0}, r_{0}, \ldots , \phi (s_{H})) \end{aligned}$$

With a slight abuse of notation, we write it \(\phi\).

We denote by \(\Pi\) the set of all policies \(\pi\) on the MDP \((S, {\mathcal {A}}, p, r, \gamma )\) which satisfy \(\pi (a | s) = \pi (a | s')\) whenever \(\phi (s) = \phi (s')\). If \(\pi \in \Pi\), then \(\pi\) induces a policy \(\pi '\) on the MDP \((Z, {\mathcal {A}}, q, r', \gamma )\) by taking \(\pi '(a | z):= \pi (a | s)\), where s is any element of \(\phi ^{-1}(z)\).

For any \(\pi \in \Pi\), we denote by \(\pi '\) the associated policy on the MDP \((Z, {\mathcal {A}}, q, r', \gamma )\) defined above, and we set \(\Pi ':= \{ \pi ' \text { } | \text { } \pi \in \Pi \}\). Conversely, if \(\pi ' \in \Pi '\), a policy \(\pi \in \Pi\) can be defined as \(\pi (a | s):= \pi '(a | \phi (s))\). In what follows, X denotes a random variable on \(\Omega\) and \(X'\) a random variable on \(\Omega '\).

With all these notations, we can introduce our first result.

Lemma 1

Let \(\pi \in \Pi\). Then, \({\mathbb {Q}}_{\pi '}\) is the probability image of \({\mathbb {P}}_{\pi }\) by \(\phi\).

Proof

Let \(B':= Z_{0} \times A_{0} \times R_{0} \times \ldots Z_{H} \in {\mathcal {F}}'\). We have

$$\begin{aligned}&{\mathbb {P}}_{\pi }(\phi ^{-1}(B')) \\&\quad = \int _{\phi ^{-1}(B')} p_{\pi }(\tau ) \textrm{d}\tau \\&\quad = \int _{\phi ^{-1}(Z_{0}) \times A_{0} \times R_{0} \times \ldots \phi ^{-1}(Z_{H})} p(s_{0})\pi (a_{0} | s_{0}) r(r_{0} | a_{0}, s_{0}) p(s_{1} | a_{0}, s_{0}) \ldots p(s_{H} | a_{H-1}, s_{H-1}) \textrm{d}s_{0}\textrm{d}a_{0}\textrm{d}r_{0}\textrm{d}s_{1} \ldots \textrm{d}s_{H} \\&\quad = \int _{\phi ^{-1}(Z_{0}) \times A_{0} \times R_{0} \times \ldots \phi ^{-1}(Z_{H})} p(s_{0})\pi '(a_{0} | \phi (s_{0})) r'(r_{0} | a_{0}, \phi (s_{0})) p(s_{1} | a_{0}, s_{0}) \ldots p(s_{H} | a_{H-1}, s_{H-1}) \textrm{d}s_{0}\textrm{d}a_{0}\textrm{d}r_{0}\textrm{d}s_{1} \ldots \textrm{d}s_{H} \end{aligned}$$

Then, denoting by \({\bar{s}}_{0}\) an arbitrary element of \(\phi ^{-1}(z_{0})\), we obtain

$$\begin{aligned}&{\mathbb {P}}_{\pi }(\phi ^{-1}(B')) \\&\quad = \int _{Z_{0} \times A_{0} \times R_{0} \times \ldots \phi ^{-1}(Z_{H})} q(z_{0})\pi '(a_{0} | z_{0})r'(r_{0} | a_{0}, z_{0}) \\&p(s_{1} | a_{0}, {\bar{s}}_{0}) \ldots p(s_{H} | a_{H-1}, s_{H-1}) \textrm{d}z_{0} \ldots \textrm{d}s_{1} \ldots \textrm{d}s_{H} \\ \end{aligned}$$

Iterating this process, we find

$$\begin{aligned}&{\mathbb {P}}_{\pi }(\phi ^{-1}(B')) \\&\quad = \int _{Z_{0} \times A_{0} \times R_{0} \times \ldots Z_{H}} q(z_{0})\pi '(a_{0} | z_{0})r'(r_{0} | a_{0}, z_{0}) \\&q(z_{1} | a_{0}, z_{0}) \ldots q(z_{H} | a_{H-1}, z_{H-1}) \textrm{d}z_{0}\textrm{d}a_{0}\textrm{d}r_{0} \ldots \textrm{d}z_{H} \\&\quad = \int _{B'} q_{\pi '}(\tau ') \textrm{d}\tau ' = {\mathbb {Q}}_{\pi '}(B') \end{aligned}$$

\(\square\)

This result is not surprising, but it has an interesting implication: if \(X' \circ \phi = X\), then \({\mathbb {E}}_{q_{\pi '}}[X'] = {\mathbb {E}}_{p_{\pi }}[X]\). We thus get the following result.

Corollary 1

Suppose \(X' \circ \phi = X\). Then, if a policy \(\pi _{\star }'\) satisfies

$$\begin{aligned} \pi _{\star }' = \text {argmin}_{\pi ' \in \Pi '} {\mathbb {E}}_{q_{\pi '}}[X'] \end{aligned}$$

its associated policy \(\pi _{\star }\) satisfies

$$\begin{aligned} \pi _{\star } = \text {argmin}_{\pi \in \Pi } {\mathbb {E}}_{p_{\pi }}[X] \end{aligned}$$

For a coherent risk measure \({\mathcal {R}}\), we write \({\mathcal {U}}\) for the risk envelope associated with \({\mathcal {R}}\) in \(\Omega\) and \(\mathcal {U'}\) for the corresponding risk envelope in \(\Omega '\). If for every \(\delta \in {\mathcal {U}}\) there exists \(\delta ' \in \mathcal {U'}\) such that \(\delta = \delta ' \circ \phi\), and for every \(\delta ' \in \mathcal {U'}\) there exists \(\delta \in {\mathcal {U}}\) with \(\delta = \delta ' \circ \phi\) almost everywhere, we write \({\mathcal {U}} = \mathcal {U'} \circ \phi\).

Proposition 1

Let \({\mathcal {R}}\) be a coherent risk measure. Suppose that \(X' \circ {\phi } = X\) and \({\mathcal {U}} = \mathcal {U'} \circ \phi\). Then, if a policy \(\pi _{\star }'\) satisfies

$$\begin{aligned} \pi _{\star }' = \text {argmin}_{\pi ' \in \Pi '} {\mathcal {R}}(X') \end{aligned}$$

its associated policy \(\pi _{\star }\) verifies

$$\begin{aligned} \pi _{\star } = \text {argmin}_{\pi \in \Pi } {\mathcal {R}}(X) \end{aligned}$$

Proof

Since \({\mathcal {U}} = \mathcal {U'} \circ \phi\), we have

$$\begin{aligned} \sup _{\delta ' \in \mathcal {U'}} {\mathbb {E}}_{q}[\delta 'X'] = \sup _{\delta \in {\mathcal {U}}} {\mathbb {E}}_{p}[\delta X] \end{aligned}$$

Thus

$$\begin{aligned} {\mathcal {R}}(X'):= \sup _{\delta ' \in \mathcal {U'}} {\mathbb {E}}_{q}[\delta 'X'] = \sup _{\delta \in {\mathcal {U}}} {\mathbb {E}}_{p}[\delta X] = {\mathcal {R}}(X) \end{aligned}$$

Now, let \(\pi _{\star }':= \text {argmin}_{\pi ' \in \Pi '} {\mathcal {R}}(X')\) and let \(\pi _{\star }\) be the associated policy. By contradiction, suppose there exists \(\pi _{1} \in \Pi\) with \({\mathcal {R}}_{\pi _{1}}(X) < {\mathcal {R}}_{\pi _{\star }}(X)\). By the above observation, we would then have \({\mathcal {R}}_{\pi _{1}'}(X') < {\mathcal {R}}_{\pi _{\star }'}(X')\), which contradicts the optimality of \(\pi _{\star }'\). \(\square\)

This result provides a useful criterion for proving the optimality of a method. For example, consider \({\mathcal {R}}(X) = {\mathbb {E}}_{p_{\pi }}[X]\), which is obviously a coherent risk measure. Its risk envelope is the singleton \(\{ \delta \equiv 1 \}\), and thus Corollary 1 also follows directly from the last proposition.

We also have the same result in the case where \({\mathcal {R}}(X) = \text {CVaR}_{\alpha }(X)\).

Proposition 2

Suppose that \(X' \circ {\phi } = X\). Then, if a policy \(\pi _{\star }'\) satisfies

$$\begin{aligned} \pi _{\star }' = \text {argmin}_{\pi ' \in \Pi '} \text {CVaR}_{\alpha }(X') \end{aligned}$$

its associated policy \(\pi _{\star }\) verifies

$$\begin{aligned} \pi _{\star } = \text {argmin}_{\pi \in \Pi } \text {CVaR}_{\alpha }(X) \end{aligned}$$

Proof

Recall that the risk envelope of CVaR\(_{\alpha }\) takes the form \({\mathcal {U}} = \{ \delta \text { } | \text { } 0 \le \delta \le \frac{1}{\alpha } \text {, } {\mathbb {E}}_{p}[\delta ] = 1 \}\). First, we show that for each \(\delta ' \in {\mathcal {U}}'\) there exists \(\delta \in {\mathcal {U}}\) such that \({\mathbb {E}}_{q}[\delta 'X'] = {\mathbb {E}}_{p}[\delta X]\). Then, we show that for each \(\delta \in {\mathcal {U}}\) there exists \(\delta ' \in {\mathcal {U}}'\) such that \({\mathbb {E}}_{p}[\delta X] = {\mathbb {E}}_{q}[\delta 'X']\).

  • Let \(\delta ' \in \mathcal {U'}\) and define \(\delta := \delta ' \circ \phi\). By construction, we have \(0 \le \delta \le \frac{1}{\alpha }\) and \({\mathbb {E}}_{p}[\delta ] = {\mathbb {E}}_{p}[\delta ' \circ \phi ] = {\mathbb {E}}_{q}[\delta '] = 1\). Thus, \(\delta \in {\mathcal {U}}\). By the same argument, \({\mathbb {E}}_{q}[\delta 'X'] = {\mathbb {E}}_{p}[\delta X]\).

  • Let \(\delta \in {\mathcal {U}}\). By construction, \(\delta p_{\pi }\) is a density function; let \(\tilde{{\mathbb {P}}}_{\pi }\) be its associated probability measure and let \(\tilde{{\mathbb {Q}}}_{\pi '}\) be the probability image of \(\tilde{{\mathbb {P}}}_{\pi }\) by \(\phi\). Remark that \(\tilde{{\mathbb {Q}}}_{\pi '}\) is absolutely continuous with respect to \({\mathbb {Q}}_{\pi '}\). Indeed, if A satisfies \({\mathbb {Q}}_{\pi '}(A) = 0\), we have

    $$\begin{aligned} \tilde{{\mathbb {Q}}}_{\pi '}(A)&= \tilde{{\mathbb {P}}}_{\pi }(\phi ^{-1}(A)) = \int _{\phi ^{-1}(A)} \delta (\tau )p_{\pi }(\tau )\textrm{d}\tau \\&\le \frac{1}{\alpha } {\mathbb {P}}_{\pi }(\phi ^{-1}(A)) = \frac{1}{\alpha } {\mathbb {Q}}_{\pi '}(A) = 0 \end{aligned}$$

    Thus by Radon-Nikodym, there exists \(\delta ' \ge 0\), such that for all \(B \in {\mathcal {F}}'\)

    $$\begin{aligned} \tilde{{\mathbb {Q}}}_{\pi '}(B) = \int _{B} \delta ' \textrm{d}{\mathbb {Q}}_{\pi '} = \int _{B} \delta '(\tau ') q_{\pi '}(\tau ') \textrm{d}\tau ' \end{aligned}$$

    We now show that \(\delta ' \in {\mathcal {U}}'\). We define \(C:= \{ \tau ' \in \Omega ' \text { } | \text { } \delta '(\tau ') > \frac{1}{\alpha } \}\). By contradiction, suppose that C has positive measure. On the one hand, we have

    $$\begin{aligned} \tilde{{\mathbb {Q}}}_{\pi '}(C)&= \int _{C} \delta '(\tau ')q_{\pi '}(\tau ')\textrm{d}\tau ' \\&> \frac{1}{\alpha } \int _{C} q_{\pi '}(\tau ')\textrm{d}\tau ' = \frac{1}{\alpha } {\mathbb {Q}}_{\pi '}(C) \end{aligned}$$

    And on the other hand, we have

    $$\begin{aligned} \tilde{{\mathbb {Q}}}_{\pi '}(C)&= \tilde{{\mathbb {P}}}_{\pi }(\phi ^{-1}(C)) \\&= \int _{\phi ^{-1}(C)} \delta (\tau ) p_{\pi }(\tau ) \textrm{d}\tau \\&\le \frac{1}{\alpha } {\mathbb {P}}_{\pi }(\phi ^{-1}(C)) = \frac{1}{\alpha } {\mathbb {Q}}_{\pi '}(C) \end{aligned}$$

    Therefore, C has measure zero and thus \(0 \le \delta ' \le \frac{1}{\alpha }\) almost everywhere. Then, since \(\delta 'q_{\pi '}\) is a probability density function, \(\delta ' \in {\mathcal {U}}'\).

    Finally, since \(\delta 'q_{\pi '}\) is the probability density function of the probability image of \(\tilde{{\mathbb {P}}}_{\pi }\) by \(\phi\), we obtain \({\mathbb {E}}_{\delta p_{\pi }}[X] = {\mathbb {E}}_{\delta ' q_{\pi '}}[X']\).

\(\square\)

In particular, for

$$\begin{aligned} X(\tau ) = \sum _{t=0}^{H} -r_{t} - \beta {\mathcal {H}}(\pi (\cdot | s_{t})) \end{aligned}$$

and

$$\begin{aligned} X'(\tau '):= \sum _{t=0}^{H} - r_{t} - \beta {\mathcal {H}}(\pi '(\cdot | z_{t})) \end{aligned},$$

(with \(\beta\) possibly equal to zero) we have

$$\begin{aligned} (X' \circ {\phi })(\tau )&= \sum _{t=0}^{H} - r_{t} - \beta {\mathcal {H}}(\pi '(\cdot | \phi (s_{t}))) \\&= \sum _{t=0}^{H} - r_{t} - \beta {\mathcal {H}}(\pi (\cdot | s_{t})) = X(\tau ) \end{aligned}$$

Therefore, we get the following result.

Corollary 2

Let \(X \text {: } \Omega \rightarrow {\mathbb {R}}\) and \(X' \text {: } \Omega ' \rightarrow {\mathbb {R}}\) defined as \(X(\tau ): = \sum _{t=0}^{H} -r_{t} - \beta {\mathcal {H}}(\pi (\cdot | s_{t}))\) and \(X'(\tau '):= \sum _{t=0}^{H} - r_{t} - \beta {\mathcal {H}}(\pi '(\cdot | z_{t}))\). Then, if a policy \(\pi _{\star }'\) satisfies

$$\begin{aligned} \pi _{\star }' = \text {argmin}_{\pi ' \in \Pi '} \text {CVaR}_{\alpha }(X') \end{aligned}$$

its associated policy \(\pi _{\star }\) verifies

$$\begin{aligned} \pi _{\star } = \text {argmin}_{\pi \in \Pi } \text {CVaR}_{\alpha }(X) \end{aligned}$$

4.2 Discussion

The results presented in the last section establish a theoretical equivalence between minimizing the risk measure in the latent space and in the natural space. However, to obtain this guarantee we need to make some assumptions.

Assumptions (1) and (2) ensure that the reward distribution \(r(\cdot | s_{t}, a_{t})\) and the probability measure \({\mathbb {P}}(\cdot | s_{t}, a_{t})\) are insensitive to any change within \(\phi ^{-1}(z_{t})\). Moreover, \(\pi \in \Pi\) ensures that the policy distribution is also insensitive to any change within \(\phi ^{-1}(z_{t})\). Roughly speaking, these assumptions guarantee that we do not lose information by encoding \(s_{t}\) into \(\phi (s_{t})\), in terms of the reward distribution, the transition probability measure and the optimal policy.

These theoretical considerations highlight that in order to learn a meaningful latent representation of the natural MDP, we should consider all components of the MDP and not only focus on the environment space.

5 Latent Offline Distributional Actor-Critic

The theoretical results presented above justify our natural idea: encode the environment space S into a compact representation Z and then use a risk-sensitive offline RL algorithm to learn on top of this space. This is the general idea of LODAC.

5.1 Practical implementations of LODAC

In this section, we present the practical implementation of LODAC used in our experiments.

First, we need to implement and train a latent variable model. Following previous works [19, 23], our latent variable model contains the following components

$$\begin{array}{ll}\text{Image encoder: }& h_{t} = E_{\theta }(s_{t})\\ \text{Latent transition model: }& z_{t} \sim q_{\theta }(\cdot | z_{t-1}, a_{t-1}) \\ \text{Inference model: }& z_{t} \sim \phi _{\theta }(\cdot | h_{t}, z_{t-1}, a_{t-1}) \\ \text{Image decoder: }& {\hat{s}}_{t} \sim D_{\theta }( \cdot | z_{t}) \end{array}$$

The image encoder, denoted as \(E_{\theta }\), is a classical Convolutional Neural Network consisting of 4 2D convolutional layers. Each of them uses a kernel of size 4, a stride of 2 and Rectified Linear Unit (ReLU) as activation function. These layers use 32, 64, 128 and 256 filters, respectively.
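A minimal PyTorch sketch of this encoder is given below (an illustration based on the description above, not the exact implementation); with \(64 \times 64\) inputs, the four convolutions produce a \(2 \times 2 \times 256\) feature map, which is flattened into a 1024-dimensional embedding \(h_{t}\).

```python
# Sketch of the convolutional image encoder E_theta described above.
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=4, stride=2), nn.ReLU(),
        )

    def forward(self, s_t: torch.Tensor) -> torch.Tensor:
        # s_t: (batch, 3, 64, 64) image observation -> flat feature h_t
        h = self.net(s_t)
        return h.flatten(start_dim=1)

print(ImageEncoder()(torch.zeros(1, 3, 64, 64)).shape)  # torch.Size([1, 1024])
```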

The inference and latent transition models are implemented as a Recurrent State Space Model (RSSM) [66]. Specifically, a latent state contains two components, \(z_{t} = [d_{t}, x_{t}]\), where \(d_{t}\) is the deterministic part and \(x_{t}\) the stochastic part. \(d_{t}\) is computed using a dense layer which takes \(x_{t-1}, a_{t-1}\) as input, uses Exponential Linear Unit as activation function and contains 256 hidden units. This layer is followed by a gated recurrent unit (GRU) cell which uses the hyperbolic tangent as activation function and whose output has dimension 256. \(d_{t}\) corresponds to the hidden state of this GRU cell.

Fig. 1 Architecture of our latent variable model

Then, the latent transition model employs two dense layers. The first uses Exponential Linear Unit as activation function and yields an output of dimension 256. The second has no activation function and provides an output of dimension 128. This output is split into two parts: the first part is interpreted as the mean of a normal distribution, while the second part (after applying a softplus) represents the standard deviation. The stochastic component \(x_{t}\) of the latent state is sampled from this distribution.

The inference model computes the deterministic part \(d_{t}\) of the latent state using the first layers of the latent transition model. Then, \(d_{t}\) and \(h_{t}\) are fed into two consecutive dense layers. The first uses an Exponential Linear Unit activation function and 256 hidden units. The final layer has no activation function and its output has dimension 128. As in the latent transition model, this output is divided into two parts, interpreted as the mean and the standard deviation of a normal distribution from which \(x_{t}\) is sampled.
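The sketch below illustrates how the latent transition (prior) and inference (posterior) heads described above could be implemented in PyTorch; the stochastic state dimension (64, so that the 128-dimensional output splits into mean and standard deviation), the action dimension and the 1024-dimensional image embedding are assumptions for illustration.

```python
# Sketch of the RSSM prior (latent transition) and posterior (inference) heads.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

class RSSM(nn.Module):
    def __init__(self, stoch_dim=64, deter_dim=256, action_dim=6, embed_dim=1024):
        super().__init__()
        self.pre_gru = nn.Sequential(nn.Linear(stoch_dim + action_dim, 256), nn.ELU())
        self.gru = nn.GRUCell(256, deter_dim)          # d_t is the hidden state of this cell
        self.prior_head = nn.Sequential(nn.Linear(deter_dim, 256), nn.ELU(),
                                        nn.Linear(256, 2 * stoch_dim))
        self.post_head = nn.Sequential(nn.Linear(deter_dim + embed_dim, 256), nn.ELU(),
                                       nn.Linear(256, 2 * stoch_dim))

    def _gaussian(self, out):
        mean, std = out.chunk(2, dim=-1)
        return Normal(mean, F.softplus(std) + 1e-4)    # softplus keeps the std positive

    def step(self, x_prev, d_prev, a_prev, h_t=None):
        # One transition: d_t, the prior q(x_t | z_{t-1}, a_{t-1}) and, when an image
        # embedding h_t is given, the posterior phi(x_t | h_t, z_{t-1}, a_{t-1}).
        d_t = self.gru(self.pre_gru(torch.cat([x_prev, a_prev], dim=-1)), d_prev)
        prior = self._gaussian(self.prior_head(d_t))
        post = self._gaussian(self.post_head(torch.cat([d_t, h_t], dim=-1))) if h_t is not None else None
        return d_t, prior, post
```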

Finally, the decoder \(D_{\theta }\) consists of a dense layer with 1024 units followed by four deconvolutional layers, each with a stride of 2. All layers use ReLU as activation function, except the last one, which has no activation function. The first two deconvolutional layers have a kernel size of 5, while the last two have a kernel size of 6, and they contain 128, 64, 32 and 3 filters, respectively. The reconstructed image is sampled from a normal distribution whose mean is determined by \(D_{\theta }\) and whose standard deviation is fixed to one. The main components of our latent variable model are presented in Fig. 1.

These models have been trained using the Adam optimizer, a batch size of 64, a learning rate of \(6\times 10^{-4}\) and the following objective function [19]

$$\begin{aligned}&{\mathbb {E}}_{q_{\theta }} \left [ \sum _{t=0}^{H-1} \log D_{\theta }(s_{t+1} | z_{t+1}) \right. \\&\quad \left.- D_{KL} \left( \phi _{\theta }(z_{t+1} | s_{t+1}, z_{t}, a_{t}) || q_{\theta }(z_{t+1} | z_{t}, a_{t}) \right) \right ] \end{aligned}$$
(8)
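A sketch of how this objective could be computed is shown below (illustrative shapes; a Gaussian decoder with unit standard deviation, as stated above); it sums the reconstruction log-likelihood over time and subtracts the KL divergence between the inference and latent transition distributions.

```python
# Sketch of the negative of objective (8) used as a training loss.
import torch
from torch.distributions import Normal, kl_divergence

def latent_model_loss(decoder_mean, target_images, posterior, prior):
    # decoder_mean: (B, T, 3, 64, 64) mean of D_theta(s_{t+1} | z_{t+1});
    # posterior / prior: Normal distributions over z_{t+1} with batch shape (B, T, latent_dim).
    recon_ll = Normal(decoder_mean, 1.0).log_prob(target_images).sum(dim=[-3, -2, -1])
    kl = kl_divergence(posterior, prior).sum(dim=-1)
    elbo = (recon_ll - kl).sum(dim=1).mean()   # sum over time, average over the batch
    return -elbo                               # minimized by the Adam optimizer
```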

Reward estimates are obtained by sampling from a normal distribution whose mean is determined by the output of a neural network \(r_{\theta }\) and whose standard deviation is fixed to one. \(r_{\theta }\) contains one hidden dense layer with 128 units, uses Exponential Linear Unit as activation function and is trained by maximum log-likelihood.

After the training of these models, the dataset \({\mathcal {D}}\) is encoded into the latent space and stored in a replay buffer \({\mathcal {B}}_{\text {latent}}\). More precisely, \({\mathcal {B}}_{\text {latent}}\) contains transitions of the form \((z_{1:H}, r_{1:H}, a_{1:H})\), where \(z_{1:H} \sim \phi _{\theta }(\cdot | s_{1:H}, a_{1:H-1})\) and \(s_{1:H}, r_{1:H-1}, a_{1:H-1} \sim {\mathcal {D}}\). Next, we introduce a second buffer \({\mathcal {B}}_{\text {synthetic}}\), which contains rollout transitions generated with the policy \(\pi _{\theta }\), the latent model \(q_{\theta }\) and the reward estimator \(r_{\theta }\).
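The sketch below illustrates how \({\mathcal {B}}_{\text {synthetic}}\) could be filled; the helper names (policy, latent_transition, reward_model) and the rollout horizon are placeholders, not the exact interfaces of our implementation.

```python
# Sketch: short imagined rollouts in the latent space to populate B_synthetic.
import torch

@torch.no_grad()
def imagine_rollouts(z_start, policy, latent_transition, reward_model, horizon=5):
    # z_start: batch of latent states sampled from B_latent.
    rollouts, z = [], z_start
    for _ in range(horizon):
        a = policy(z).sample()                       # a_t ~ pi_theta(. | z_t)
        z_next = latent_transition(z, a).sample()    # z_{t+1} ~ q_theta(. | z_t, a_t)
        r = reward_model(z, a)                       # mean of the estimated reward distribution
        rollouts.append((z, a, r, z_next))
        z = z_next
    return rollouts
```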

The policy \(\pi _{\theta }\) comprises three dense layers. The first two use ReLU as activation function and have 256 hidden units. The final layer has no activation function and yields two outputs, interpreted as the mean and the standard deviation of a normal distribution from which the action is sampled.

The critic \(Q_{\theta }\) is implemented as a quantile distributional critic network [71]. Specifically, we have \(Q_{\theta }(\eta , z, a) = f_{\theta }(\psi _{\theta }(z, a) \odot \xi _{\theta }(\eta ))\), where \(\odot\) denotes the element-wise (Hadamard) product. \(\psi _{\theta }\) is a dense layer with ReLU as activation function, an output of dimension 256 and layer normalization [72]. \(\xi _{\theta }\) consists of an embedding of \(\eta\) into a space of dimension 64 followed by a dense layer. Specifically, \(\eta\) is embedded into a vector whose components take the form \(\cos (i \eta \pi )\), with i ranging from 1 to 64. This embedded vector is then fed to a dense layer which uses sigmoid as activation function, produces an output of dimension 256 and also employs layer normalization. Finally, the model \(f_{\theta }\) consists of two dense layers: the first uses layer normalization, ReLU as activation function and 256 hidden units; the last has no activation function and produces a scalar output.
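The following PyTorch sketch illustrates this critic architecture (the latent and action dimensions are placeholders); the cosine features \(\cos (i \eta \pi )\), \(i = 1, \ldots , 64\), are combined with the state-action features by an element-wise product, as described above.

```python
# Sketch of the quantile critic Q_theta(eta, z, a) = f_theta(psi_theta(z, a) * xi_theta(eta)).
import math
import torch
import torch.nn as nn

class QuantileCritic(nn.Module):
    def __init__(self, latent_dim=320, action_dim=6, n_cos=64, hidden=256):
        super().__init__()
        self.register_buffer("i_pi", math.pi * torch.arange(1, n_cos + 1).float())
        self.psi = nn.Sequential(nn.Linear(latent_dim + action_dim, hidden),
                                 nn.LayerNorm(hidden), nn.ReLU())
        self.xi = nn.Sequential(nn.Linear(n_cos, hidden), nn.LayerNorm(hidden), nn.Sigmoid())
        self.f = nn.Sequential(nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
                               nn.Linear(hidden, 1))

    def forward(self, eta, z, a):
        # eta: (batch, 1) quantile fractions; cosine embedding cos(i * eta * pi), i = 1..64
        cos_feat = torch.cos(eta * self.i_pi)
        return self.f(self.psi(torch.cat([z, a], dim=-1)) * self.xi(cos_feat))
```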

The actor \(\pi _{\theta }\) and the critic \(Q_{\theta }\) are trained on \({\mathcal {B}}:= {\mathcal {B}}_{\text {synthetic}} \cup {\mathcal {B}}_{\text {latent}}\). To achieve this, and based on our empirical results, we follow Ma et al. [56]. Thus, the critic \(Q_{\theta }\) is iteratively chosen to minimize

$$\begin{aligned} \alpha {\mathbb {E}}_{\eta \sim U}&\Biggl [ {\mathbb {E}}_{ z \sim {\mathcal {B}}} \bigg [ \log \sum _{a} \exp (Q_{\theta }(\eta , z, a)) \bigg ] \\&- {\mathbb {E}}_{ (z, a) \sim {\mathcal {B}}} \left[ Q_{\theta }(\eta , z, a) \right] \Biggr ] + {\mathcal {L}}_{k}(\delta , \eta ') \end{aligned}$$
(9)

where \(U = \text {Uniform}[0, 1]\),

$$\begin{aligned} \delta = r_{t} + \gamma Q_{\theta }(\eta ', z_{t+1}, a_{t+1}) - Q_{\theta }(\eta , z_{t}, a_{t}) \end{aligned}$$

with \((z_{t}, a_{t}, r_{t}) \sim {\mathcal {B}}\), \(a_{t+1} \sim \pi _{\theta }(\cdot | z_{t+1})\) and \({\mathcal {L}}_{k}\) the Huber quantile regression loss defined in (5). Following Kumar et al. [22] and Ma et al. [56], we add two parameters \(\zeta , \omega \in {\mathbb {R}}_{> 0}\) to the last equation

$$\begin{aligned} \text {max}_{\alpha \ge 0} \text { }&\alpha \text { } {\mathbb {E}}_{\eta \sim U} \Biggl [ \omega \Biggl ( {\mathbb {E}}_{ z \sim {\mathcal {B}}} \biggl [ \log \sum _{a} \exp (Q_{\theta }(\eta , z, a)) \biggr ] \\&- {\mathbb {E}}_{ (z, a) \sim {\mathcal {B}}} \biggl [ Q_{\theta }(\eta , z, a) \biggr ] \Biggr ) - \zeta \Biggr ] + {\mathcal {L}}_{k}(\delta , \eta ) \end{aligned}$$

\(\zeta\) is used to threshold the difference between \({\mathbb {E}}_{(z, a) \sim {\mathcal {B}}} \left[ Q_{\theta }(\eta , z, a) \right]\) and the regularizer \({\mathbb {E}}_{ z \sim {\mathcal {B}}} \left[ \log \sum _{a} \exp (Q_{\theta }(\eta , z, a)) \right]\), and the parameter \(\omega\) scales this difference. Remark that if this difference (scaled by \(\omega\)) is smaller than \(\zeta\), then \(\alpha\) is set to 0 and only the term \({\mathcal {L}}_{k}(\delta , \eta )\) is considered.

Furthermore, since the action space \({\mathcal {A}}\) is continuous, the computation of \(\log \sum _{a} \exp (Q_{\theta }(\eta , z, a))\) is intractable. To overcome this problem, and as introduced in Kumar et al. [22], we use the following approximation

$$\begin{aligned}&\log \sum _{a} \exp (Q_{\theta }(\eta , z, a)) \\&\quad \approx \log \Biggl ( \frac{1}{2M} \sum _{a_{i} \sim U({\mathcal {A}})}^{M} \left[ \frac{\exp (Q_{\theta }(\eta , z, a_{i}))}{U({\mathcal {A}})} \right] \\ &\qquad + \frac{1}{2M} \sum _{a_{i} \sim \pi (\cdot |z)}^{M} \left[ \frac{\exp (Q_{\theta }(\eta , z, a_{i})) }{\pi (a_{i}|z)} \right] \Biggr ) \end{aligned}$$
(10)

where \(U({\mathcal {A}}) = \text {Unif}({\mathcal {A}})\) and we choose \(M=10\).
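Putting the last two elements together, the sketch below (illustrative shapes and helper names; the policy is assumed to return a torch Normal over actions in \([-1, 1]^{d}\)) estimates the conservative gap that multiplies \(\alpha\): the log-sum-exp is approximated as in (10) with M uniform and M on-policy actions, then scaled by \(\omega\) and shifted by \(\zeta\).

```python
# Sketch of the conservative gap used in the critic loss (Eqs. 9-10 with omega and zeta).
import math
import torch

def conservative_gap(critic, policy, eta, z, a_data, M=10, omega=1.0, zeta=10.0):
    B, act_dim = a_data.shape
    dist = policy(z)                                  # assumed: Normal with loc of shape (B, act_dim)
    a_pi = dist.sample((M,))                          # (M, B, act_dim) on-policy proposals
    logp_pi = dist.log_prob(a_pi).sum(-1)             # (M, B)
    a_unif = torch.rand(M, B, act_dim) * 2 - 1        # proposals from Unif([-1, 1]^act_dim)
    logp_unif = torch.full((M, B), act_dim * math.log(0.5))

    def q(actions):                                   # evaluate Q_theta(eta, z, a) for each proposal set
        return torch.stack([critic(eta, z, actions[m]).squeeze(-1) for m in range(M)])  # (M, B)

    # log( 1/(2M) * sum_i exp(Q(a_i)) / density(a_i) ) over both proposal sets
    logits = torch.cat([q(a_unif) - logp_unif, q(a_pi) - logp_pi], dim=0)   # (2M, B)
    logsumexp = torch.logsumexp(logits, dim=0) - math.log(2 * M)            # (B,)
    q_data = critic(eta, z, a_data).squeeze(-1)
    # The penalty added to L_k is alpha * gap, with alpha >= 0 adjusted by dual ascent:
    # alpha grows while the gap is positive and is driven to zero otherwise.
    return omega * (logsumexp.mean() - q_data.mean()) - zeta
```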

The actor is trained to minimize Conditional Value-at-Risk of the negative cumulative reward, which can be computed using the formula

$$\begin{aligned} \text {CVaR}_{\alpha }(Z_{\pi }) = \frac{1}{1 - \alpha } \int _{\alpha }^{1} Q_{\theta }(\eta , z, a) \textrm{d}\eta \end{aligned}$$
(11)
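A sketch of the resulting actor loss is given below (helper names are placeholders): CVaR\(_{\alpha }\) is estimated by a Monte Carlo version of (11), averaging \(Q_{\theta }(\eta , z, a)\) over quantile fractions \(\eta\) drawn uniformly from \([\alpha , 1]\).

```python
# Sketch of the actor objective: minimize an estimate of CVaR_alpha of the return distribution.
import torch

def actor_loss(critic, policy, z, alpha=0.7, n_quantiles=32):
    dist = policy(z)
    a = dist.rsample()                                               # reparameterized action sample
    eta = alpha + (1 - alpha) * torch.rand(z.shape[0], n_quantiles)  # eta ~ Uniform[alpha, 1]
    q = torch.stack([critic(eta[:, i:i + 1], z, a).squeeze(-1)
                     for i in range(n_quantiles)], dim=1)            # (B, n_quantiles)
    cvar = q.mean(dim=1)        # Monte Carlo estimate of 1/(1-alpha) * int_alpha^1 Q_theta d(eta)
    return cvar.mean()          # the actor is updated to minimize this quantity
```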

\(\pi _{\theta }\) and \(Q_{\theta }\) have been trained using the Adam optimizer with a batch size of 256. Following previous works [19, 23], batches mixing equal amounts of data from \({\mathcal {B}}_{\text {latent}}\) and \({\mathcal {B}}_{\text {synthetic}}\) are used. While the critic has been trained with a learning rate of \(3\times 10^{-4}\), we used a value of \(3\times 10^{-5}\) for the actor. Finally, it is worth mentioning that we used the clipped double Q-learning trick [73] for training \(Q_{\theta }\). A summary of LODAC can be found in Algorithm 1.

Remark that to train a policy using the approach presented above, we need to compute an expectation over a uniform distribution (Eq. 9), approximate \(\log \sum _{a} \exp (Q_{\theta }(\eta , z, a))\) using (10) and estimate Conditional Value-at-Risk using formula (11) to compute the policy loss. All these computations take time and are unavoidable. Thus, it is not surprising that optimizing a policy with this approach takes more time than classical methods.

Algorithm 1 LODAC

5.2 Experimental setup

In this section, we evaluate the performance of LODAC.

First, our method is compared with LOMPO [19], an offline, high-dimensional, risk-free algorithm. Then, we build a version of LODAC in which the actor and the critic are trained using O-RAAC [54]; we denote it LODAC-O. Moreover, it is also possible to use a risk-free offline RL algorithm in the latent space; thus, a risk-free policy is also trained in the latent space using COMBO [23].

These algorithms are evaluated on the standard walker walk task from the DeepMind Control suite [8], but here we learn directly from pixels of dimension \(3\times 64 \times 64\). As is standard practice, each action is repeated two times on the ground environment, and episodes of length 1000 are used. The algorithms are tested on three different datasets: expert, medium and expert_replay. Each dataset consists of 100K transition steps. More precisely, the following datasets are built.

  • Expert For the expert dataset, actions are chosen according to an expert policy that has been trained online, in a risk-free environment using SAC for 500K training steps. The states used for this training are the classical states provided by DeepMind Control suite.

  • Medium In this dataset, actions are chosen according to a policy trained using the same method as above, except that training was stopped when the policy achieved about half the performance of the expert policy.

  • Expert_replay The expert_replay dataset consists of episodes sampled from the expert policy during its training.

The setup presented above allows us to test our approach on deterministic environments.

However, we would also like to evaluate our algorithm on stochastic environments. To achieve this, we use the same datasets but modify the reward according to the following formula

$$\begin{aligned} r_{t} \sim \biggl ( r(s_{t}, a_{t}) - \lambda \mathbbm {1}_{\{r(s_{t}, a_{t})> {\overline{r}} \}} {\mathcal {B}}_{p_{0}} \biggr ) \end{aligned}$$

where r is the original reward function, \({\overline{r}}\) and \(\lambda\) are hyperparameters and \({\mathcal {B}}_{p_{0}}\) is a Bernoulli distribution with parameter \(p_{0}\). In our experiments, we choose \(\lambda =8\) and \(p_{0}=0.1\). A different value of \({\overline{r}}\) is used for each dataset, chosen such that about half of the states satisfy \(r(s_{t}, a_{t}) > {\overline{r}}\).
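The sketch below shows how the rewards of an existing dataset could be perturbed according to this formula (the array layout is an assumption for illustration).

```python
# Sketch of the stochastic reward modification used to build the stochastic datasets.
import numpy as np

def stochastify_rewards(rewards, r_bar, lam=8.0, p0=0.1, seed=0):
    rng = np.random.default_rng(seed)
    bernoulli = rng.binomial(1, p0, size=rewards.shape)     # B_{p0} drawn once per transition
    return rewards - lam * (rewards > r_bar) * bernoulli    # penalty only above the threshold r_bar
```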

The hyperparameters of each model have been manually tuned based on our experiments and on the hyperparameters used in previous papers. However, due to the substantial time required to train these models (specifically LODAC and LODAC-O), we were unable to test a wide range of hyperparameter combinations.

The algorithms have been tested using the following procedure. First of all, to avoid excessive computation time and for a fairer comparison, we use the same latent variable model for all algorithms.

We evaluate each algorithm using 100 episodes, reporting the mean and CVaR\(_{\alpha }\) of the returns. LODAC and LODAC-O are trained to minimize CVaR\(_{0.7}\), while LOMPO and COMBO are trained to maximize the expected return. We run 4 different random seeds. As introduced in Fu et al. [74], we use the normalized score to compare the algorithms: a score of 0 corresponds to a fully random policy and a score of 100 to an expert policy on the deterministic task. However, as suggested in Agarwal et al. [75], instead of taking the mean of the results, we report the interquartile mean (IQM).
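For reference, the following sketch (our own helper functions, not part of the released benchmarks) computes the normalized score and the interquartile mean used to aggregate the results.

```python
# Sketch of the evaluation metrics: D4RL-style normalized score [74] and IQM [75].
import numpy as np

def normalized_score(ret, random_ret, expert_ret):
    # 0 corresponds to a fully random policy, 100 to the expert policy.
    return 100.0 * (ret - random_ret) / (expert_ret - random_ret)

def iqm(scores):
    # Interquartile mean: average of the middle 50% of the scores.
    scores = np.sort(np.asarray(scores).ravel())
    n = len(scores)
    return scores[n // 4: n - n // 4].mean()
```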

5.3 Results discussion

In this section, we discuss the results of our experiments, which are presented in Table 1 for the stochastic environment and in Table 2 for the deterministic environment. We bold the highest score across all methods. Complete results of all different runs can be found in Appendix A.

Stochastic environment The first general observation is that LODAC and LODAC-O generally outperform the risk-free algorithms. The only exception is the expert_replay dataset, where LODAC-O provides worse results. This is not really surprising, since actors trained with O-RAAC contain an imitation component and the imitation agent performs poorly on this dataset. LODAC provides very interesting results, as it significantly outperforms the risk-free algorithms in terms of \(\text {CVaR}_{0.7}\) and return on all datasets. Moreover, it provides better results than LODAC-O on the medium and expert_replay datasets, while achieving comparable results on the expert dataset. A final observation is that risk-free RL policies generally perform poorly in this stochastic environment.

Deterministic environment First, a drop in performance between the deterministic and the stochastic environment can be noticed. This is not surprising, as stochastic environments are generally more challenging than deterministic ones. However, this change affects risk-free algorithms more significantly than LODAC-O or LODAC. Indeed, we observe a difference of more than \(27\%\) in return for LOMPO and COMBO on the medium and expert_replay datasets between the deterministic and the stochastic environment; this difference even reaches \(42\%\) for COMBO on the expert_replay dataset. For LODAC-O on the same tasks, the deterioration is less than \(17\%\). In contrast to the risk-free methods, and to a lesser degree LODAC-O, adding stochasticity to the dataset does not seem to have a significant impact on the performance of LODAC: we observe a drop in performance of less than \(9\%\), and for the medium dataset the difference is only \(5.23\%\).

Table 1 Performances for the stochastic offline high-dimensional walker_walk task on expert, medium and expert_replay datasets
Table 2 Performances for the deterministic offline high-dimensional walker_walk task on expert, medium and expert_replay datasets

6 Conclusion

While offline RL appears to be an interesting paradigm for real-world applications, many of these real-world applications are high-dimensional and stochastic. However, current high-dimensional offline RL algorithms are trained and tested in deterministic environments. Our empirical results suggest that adding stochasticity to the training dataset significantly decreases the performance of high-dimensional risk-free offline RL algorithms.

Based on this observation, we developed LODAC, which can be used to minimize various risk measures such as Conditional Value-at-Risk. The theoretical considerations in Sect. 4 show that our algorithm relies on a strong theoretical foundation. Finally, using LODAC to minimize CVaR empirically outperforms previous algorithms in terms of CVaR and return on stochastic high-dimensional environments.