Deep learning for CVA computations of large portfolios of financial derivatives

In this paper, we propose a neural network-based method for CVA computations of a portfolio of derivatives. In particular, we focus on portfolios consisting of a combination of derivatives, with and without true optionality, \textit{e.g.,} a portfolio of a mix of European- and Bermudan-type derivatives. CVA is computed, with and without netting, for different levels of WWR and for different levels of credit quality of the counterparty. We show that the CVA is overestimated by up to 25\% when the standard procedure of not adjusting the exercise strategy for the default risk of the counterparty is used. For the Expected Shortfall of the CVA dynamics, the overestimation was found to be more than 100\% in some non-extreme cases.


Introduction
In this paper, we consider a set of financial contracts, which we refer to as the portfolio of derivatives, or just the portfolio, written between two parties. The first party is referred to as the bank and is considered to be default-free. The second party, which may default, is referred to as the counterparty. We take the perspective of the default-free bank in order to investigate some of the risks associated with a defaultable counterparty. It is straightforward to extend the methodologies used in this paper to a defaultable bank as well as to multiple counterparties.

Risk-free valuation
We consider the problem of finding the value of a portfolio of derivatives with early-exercise features. In particular, we focus on portfolios with multiple derivatives with true optionality, e.g., American or Bermudan derivatives. We construct a portfolio of $J$ derivatives, where the individual derivatives depend on $d_1, d_2, \ldots, d_J$ risk factors. This means that we could face high dimensionality in two ways: 1. Derivative $j$ could depend on a large number of risk factors, i.e., $d_j$ could be large; 2. We could have many derivatives in the portfolio, i.e., $J$ could be large.
In [1], a neural network-based method for valuation of a single Bermudan derivative was proposed and proved to be highly accurate for derivatives with up to 100 risk factors. Later, the algorithm was extended in [2] to also include pathwise valuations of the derivative (in contrast to only finding the value at the initial time). In this paper, we extend [1] and [2] to the portfolio case, i.e., finding the value of a large portfolio of, possibly high-dimensional, derivatives with true optionality, without having to compute the value of each individual derivative.
In a traditional setting, the so-called continuation value is computed, and subsequently, the value of the derivative is given by the maximum of the continuation value and the immediate pay-off. For a single derivative, this is straight-forward. For instance, the continuation value can be computed by solving an associated PDE, which is done in e.g., [3], [4], [5], [6] and [7], or the continuation value can be approximated by a Fourier transform methodology, which is done in e.g., [8], [9] and [10]. Furthermore, classical tree-based methods such as [11], [12] and [13], can be used. These types of methods are, in general, highly accurate but they suffer severely from the curse of dimensionality, meaning that they are computationally feasible only in low dimensions (say up to 4 risk factors), see [14]. In higher dimensions, Monte-Carlo-based methods are often used, see e.g., [15], [16], [17], [18] and [19]. Monte-Carlo-based methods can generate highly accurate derivative values at the initial time, but often less accurate values between the initial time and maturity of the contract.
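To make the regression-based Monte-Carlo approach concrete, the following is a minimal Longstaff-Schwartz-style sketch for a single Bermudan put; the GBM model, the strike, and the quadratic polynomial basis are illustrative assumptions, not choices made in this paper.

```python
import numpy as np

# Minimal Longstaff-Schwartz sketch for a Bermudan put (illustrative only;
# model, strike and polynomial basis are assumptions, not from the paper).
rng = np.random.default_rng(0)
S0, K, r, sigma, T, n_steps, n_paths = 100.0, 100.0, 0.05, 0.2, 1.0, 10, 50_000
dt = T / n_steps

# Simulate geometric Brownian motion paths on the exercise-date grid.
z = rng.standard_normal((n_paths, n_steps))
S = S0 * np.exp(np.cumsum((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z, axis=1))
S = np.hstack([np.full((n_paths, 1), S0), S])

payoff = lambda s: np.maximum(K - s, 0.0)
cash = payoff(S[:, -1])                      # cash-flow if held to maturity
for n in range(n_steps - 1, 0, -1):          # backward induction over exercise dates
    cash *= np.exp(-r * dt)                  # discount one step back
    itm = payoff(S[:, n]) > 0                # regress only on in-the-money paths
    if itm.sum() > 0:
        beta = np.polyfit(S[itm, n], cash[itm], 2)
        cont = np.polyval(beta, S[itm, n])   # estimated continuation value
        ex = payoff(S[itm, n]) > cont        # exercise where intrinsic beats continuation
        cash[itm] = np.where(ex, payoff(S[itm, n]), cash[itm])
price = np.exp(-r * dt) * cash.mean()
```

This illustrates the point made above: the regression delivers an accurate value at the initial time, while the intermediate continuation-value estimates are only as good as the chosen basis.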
In contrast to the single-derivative case, it is not enough to know the continuation value of a portfolio (with more than one derivative) in order to decide optimally which derivatives should be exercised. Therefore, it is common to do the valuation at the level of each derivative, and then add the individual values to obtain the portfolio value. This becomes cumbersome for large portfolios. As mentioned above, the methodology used in this paper generalizes [1] and [2], in which the optimal exercise policy is approximated by maximizing expected discounted cash-flows, i.e., the continuation value is not computed. By not relying on computations of the continuation value, the algorithm is able to compute the portfolio value without having to compute the individual value of each derivative.

Risky valuation and CVA
The Credit Valuation Adjustment (CVA) is the difference between the risk-free portfolio value and the risky portfolio value, where the risky portfolio value is defined as the portfolio value when taking default risk of the counterparty into account. While there is no ambiguity about the risk-free portfolio value, it is not completely clear how the risky portfolio value should be computed. The question is whether the exercise policy should be adjusted for the fact that the counterparty may default. For instance, if the counterparty ends up in financial distress, it is reasonable to assume that the bank (which in this paper is assumed to be the risk-free party) would be more willing to exercise the callable derivatives, in order to lower its exposure to the counterparty. Even though it seems common to ignore the effect of a defaultable counterparty when computing risky derivative values, it has been discussed in the literature, see e.g., [20], [21], [22] and [23]. In the case of a single derivative, [20] states that the exercise region for a risk-free derivative is always a subset of the exercise region for its risky counterpart. However, in the case of a portfolio, the situation is more complex and depends on contractual details such as the close-out and netting agreements. One consequence is that, in the presence of a netting agreement, the exercise decisions can no longer be made individually. To explain this, we give a simple example.
Example 1.1. Assume that we have a portfolio consisting of three derivatives: one European future and two American options. All contracts are initialized at time 0, mature at time $T$ and depend on the same risk factor $(X_t)_{t \in [0,T]}$. Assume that, at time $t \in (0,T)$, and given $X_t = x$, the intrinsic values are $V^{\text{future}}(t,x) = -10$, $V^{\text{Am}}_1(t,x) = 10$, $V^{\text{Am}}_2(t,x) = 10$, and the immediate pay-offs of the American options satisfy $g^{\text{Am}}_1(t,x) < 10$, $g^{\text{Am}}_2(t,x) < 10$.
In a risk-free environment (non-defaultable counterparty), it is sub-optimal to exercise the American options. However, in the case of a defaultable counterparty, the situation is less trivial. In Table 1, the exposure to the counterparty, given different exercise decisions at $t$, is shown with and without a netting agreement.

Table 1: Exposure to the counterparty at time $t$ for different exercise decisions.

                                                    Without netting   With netting
  Exposure - no exercise                                  20               10
  Exposure - exercise one of the American options         10                0
  Exposure - exercise both American options                0                0

If the counterparty is in severe financial distress, then it is likely optimal for the bank to exercise both American options in the case of no netting agreement, and one of them in the case of a netting agreement. From this simple example, two things become clear: 1) the exercise decisions for the American options are affected not only by a risky counterparty, but also by whether or not a netting agreement exists; 2) in the presence of netting, exercise decisions cannot be made for one derivative in isolation, but only for all the American options simultaneously.
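The exposure numbers in Table 1 can be recomputed programmatically; the sketch below hard-codes the intrinsic values assumed in the example, and the helper function name is an illustrative choice.

```python
# Exposure from Example 1.1, recomputed for every exercise choice (a sketch;
# the intrinsic values are those assumed in the example).
v_future, v_am1, v_am2 = -10.0, 10.0, 10.0

def exposure(exercise_am1: bool, exercise_am2: bool, netting: bool) -> float:
    # Exercised options leave the portfolio, so only remaining positions count.
    values = [v_future] + [v for v, ex in [(v_am1, exercise_am1), (v_am2, exercise_am2)] if not ex]
    if netting:
        return max(sum(values), 0.0)          # one netted claim against the counterparty
    return sum(max(v, 0.0) for v in values)   # positive parts add up without netting

rows = [(e1, e2, exposure(e1, e2, False), exposure(e1, e2, True))
        for e1, e2 in [(False, False), (True, False), (True, True)]]
```

Running the three exercise choices reproduces the exposure pattern discussed above: without netting only exercising both options removes the exposure, while with netting one exercise suffices.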
In general, for a risky portfolio, it is not possible to describe the value of a single derivative, but only the value of the entire portfolio. This is an interesting problem since almost all existing algorithms rely on exercise decisions made in isolation and risky derivative values that can be added up to obtain the risky portfolio value.
If this is not taken into account, the risky portfolio is valued under a sub-optimal exercise strategy, which biases its value downwards. Since the CVA is the difference between the risk-free and risky portfolio values, we would obtain an overestimation of the CVA. Furthermore, this effect is likely to increase with decreasing credit quality of the counterparty. In practice, this means that the counterparty is paying a CVA which is based on a sub-optimal exercise strategy used by the bank, which is beyond the counterparty's control. Even more problematic is that the overestimation of the CVA is larger for counterparties that are already in financial distress.
One could argue that it is reasonable for the bank to charge the counterparty the higher CVA, since the bank will probably not follow the theoretically optimal risky exercise strategy. However, there is another level of complexity not yet discussed. When the mark-to-market (MtM) CVA moves in time against the bank, the bank could face losses, not because the counterparty actually defaults, but because of disadvantageous changes in the MtM CVA. For instance, in Basel III [24] the following is stated: "Under Basel II, the risk of counterparty default and credit mitigation risk were addressed but mark-to-market losses due to credit valuation adjustments (CVA) were not. During the global financial crisis, however, roughly two-thirds of losses attributed to counterparty credit risk were due to CVA losses and only about one-third were due to actual defaults." This is further discussed in [25], in which the authors also recommend computations of different risk measures for the future distribution of CVA. Two examples of such measures are the Value at Risk of the CVA (VaR-CVA) and the Expected Shortfall of the CVA (ES-CVA). The advantage of the ES-CVA is that it is a coherent risk measure, and we therefore focus on ES-CVA in this paper.

Structure of the paper
In Section 2, the mathematical problem formulation is given. We define the risk-free and risky portfolios, close-out agreements both with and without netting agreements, and the associated CVA. Furthermore, the problems are formulated in terms of so-called decision functions, which control the exercise strategies. In Section 3, the algorithms are presented. In the first part, the algorithm for learning optimal exercise strategies is given and in the second part, an algorithm for learning pathwise entities, such as the pathwise portfolio exposure, is presented. Finally, in Section 4, numerical experiments are presented. The experiments include a first part, in which risk-free values are computed and compared to a well-established regression-based method. In the second part, we compare CVA computed with the risk-free and the risky exercise strategies to verify that, indeed, the CVA is often overestimated with algorithms in use today. We present comparisons with and without netting, for different levels of Wrong Way Risk (WWR), and for different levels of credit quality of the counterparty. As a final example, we analyse the effect of the different exercise strategies on ES-CVA. In the Appendix, we provide some additional details on the algorithms and the specific choice of neural networks.

Problem formulation
Let $(\Omega, \mathcal{F}, \mathbb{Q})$ be a probability space completed with the $\mathbb{Q}$-null sets of $\mathcal{F}$. For $T \in (0,\infty)$ and $d \in \mathbb{N}$, let $X \colon [0,T] \times \Omega \to \mathbb{R}^d$ and $r \colon [0,T] \times \Omega \to \mathbb{R}$ represent the (market) risk factors of the portfolio and the short rate, respectively. Furthermore, we denote by $\tau_D$ the default event of the counterparty, which is a stopping time defined on $(\Omega, \mathcal{F}, \mathbb{Q})$, and we let $\mathbb{1}^D \colon [0,T] \times \Omega \to \{0,1\}$ be the jump-to-default process given by $\mathbb{1}^D_t = I_{\{\tau_D > t\}}$, i.e., $\mathbb{1}^D_t = 1$ as long as no default has occurred prior to, or at, $t$. The information structure is given by the sub-$\sigma$-algebras generated by $X$, $r$ and $\mathbb{1}^D$, i.e., $\mathcal{H}^X_t = \sigma(X_s,\, s \le t)$, $\mathcal{H}^r_t = \sigma(r_s,\, s \le t)$ and $\mathcal{H}^D_t = \sigma(\mathbb{1}^D_s,\, s \le t)$, and we define the enlarged filtrations $\mathcal{H}_t = \mathcal{H}^X_t \vee \mathcal{H}^r_t$ and $\mathcal{F}_t = \mathcal{H}_t \vee \mathcal{H}^D_t$. In this paper, we use either a constant short rate (risk-free rate), or we view the short rate as one of the risk factors. In the latter case, we model the short rate as one of the $d$ component processes of $X$, which implies that $\mathcal{H}_t = \mathcal{H}^X_t$. The motivation for introducing a separate notation for the short rate is to simplify the notation when the short rate is used to discount cash-flows. For commonly used conditional expectations, we introduce the short-hand $\mathbb{E}_t[\,\cdot\,] := \mathbb{E}[\,\cdot \mid \mathcal{H}_t]$. We use a numéraire, which, for $t \in [0,T]$, is defined by $B_t := \exp\big(\int_0^t r_s\, \mathrm{d}s\big)$, which should be interpreted as the value at time $t$ of a savings account which was worth 1 at time 0. For $t, u \in [0,T]$ with $t \le u$, we use $D_{t,u} := B_t / B_u$ to discount a cash-flow obtained at time $u$ back to time $t$. The measure $\mathbb{Q}$ is the risk-free measure, under which all tradeable assets are martingales relative to the numéraire, e.g., if component $i \in \{1,2,\ldots,d\}$ of $X$ is tradeable, then $(X_t)_i / B_t$ is a $\mathbb{Q}$-martingale. If not specifically stated otherwise, equalities and inequalities of random variables should be interpreted in a $\mathbb{Q}$-almost sure sense.
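As a small numerical illustration of the savings-account numéraire and the discount factor $D_{t,u} = B_t/B_u$, the sketch below approximates the rate integral on a time grid; the grid, the left-point quadrature rule and the constant short rate are assumptions made for the example only.

```python
import numpy as np

# Discounting with the savings-account numeraire B_t = exp(int_0^t r_s ds),
# approximated on a time grid (left-point rule; grid and rate path are assumptions).
t_grid = np.linspace(0.0, 1.0, 101)            # one-year horizon, dt = 0.01
r_path = np.full_like(t_grid, 0.03)            # constant short rate for the sketch
dt = np.diff(t_grid)
# B_t on the grid: B_0 = 1, then cumulative exponential of the integrated rate.
B = np.concatenate([[1.0], np.exp(np.cumsum(r_path[:-1] * dt))])

def discount(i: int, j: int) -> float:
    """D_{t_i, t_j} = B_{t_i} / B_{t_j}: discounts a cash-flow at t_j back to t_i."""
    return B[i] / B[j]
```

With a constant rate the left-point rule is exact, so `discount(0, 100)` recovers $e^{-0.03}$; for a stochastic short-rate path the same code gives the pathwise discount factors.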

A portfolio of derivatives
We assume a portfolio of $J \in \mathbb{N}$ derivatives. For $t \in [0,T]$, and for derivative $j \in \{1,2,\ldots,J\}$, we denote the set of exercise dates greater than or equal to $t$ by $\mathbb{T}_j(t) \subseteq [0,T]$, and set $\mathbb{T}(t) = \{\mathbb{T}_1(t), \mathbb{T}_2(t), \ldots, \mathbb{T}_J(t)\}$. Note that for a European-type contract, the only exercise date is at the maturity, for a Bermudan-type contract there are multiple exercise dates, and for an American-type contract, there are infinitely many exercise dates. We emphasize that the exercise dates are simply subsets of the time interval $[0,T]$, and provide no information on which exercise policy to follow, except in some trivial cases, e.g., when there is only one exercise date.
Since we want to be able to treat derivatives with early-exercise features, we need to introduce a framework for stopping times. For $j \in \{1,2,\ldots,J\}$, an $X$-stopping time with respect to $\mathbb{T}_j(0)$ is a random variable, $\tau_j$, defined on $(\Omega, \mathcal{F}, \mathbb{Q})$, taking on values in $\mathbb{T}_j(0)$, such that for all $s \in \mathbb{T}_j(0)$, it holds that the event $\{\tau_j = s\} \in \mathcal{H}_s$. Furthermore, we define an $X^{t,x}$-stopping time as an $X$-stopping time, conditional on $X_t = x$ and satisfying $\tau_j \ge t$.
For each derivative $j \in \{1,2,\ldots,J\}$, we use an individual pay-off function, $g_j \colon [0,T] \times \mathbb{R}^d \to \mathbb{R}$. Since we are treating portfolios where the individual derivatives may have different maturities, we set each pay-off function to zero for all times larger than its maturity, i.e., for $j \in \{1,2,\ldots,J\}$, $x \in \mathbb{R}^d$ and $t > \max\{\mathbb{T}_j(0)\}$, we set $g_j(t,x) \equiv 0$, where $\max\{\mathbb{T}_j(0)\}$ represents the largest element of the set $\mathbb{T}_j(0)$.

Risk-free and risky portfolio valuation without netting
The value of a derivative when (not) taking default risk of the counterparty into account is referred to as the risky (risk-free) value. We define the risk-free and the risky values of derivative $j \in \{1,2,\ldots,J\}$ by
$$V_j(t,x) = \sup_{\tau_j \in \mathcal{T}_j(t)} \mathbb{E}\left[ D_{t,\tau_j}\, g_j\big(\tau_j, X^{t,x}_{\tau_j}\big) \right], \tag{3}$$
$$U_j(t,x,\nu) = \nu \sup_{\tau_j \in \mathcal{T}_j(t)} \mathbb{E}\left[ I_{\{\tau_j < \tau_D\}}\, D_{t,\tau_j}\, g_j\big(\tau_j, X^{t,x}_{\tau_j}\big) + I_{\{\tau_D \le \tau_j\}}\, D_{t,\tau_D} \Big( R \big(V_j(\tau_D, X^{t,x}_{\tau_D})\big)^+ + \big(V_j(\tau_D, X^{t,x}_{\tau_D})\big)^- \Big) \right], \tag{4}$$
where $\mathcal{T}_j(t)$ is the set of all $X$-stopping times taking on values in $\mathbb{T}_j(t)$ and, for $x \in \mathbb{R}$, $(x)^+ = \max\{0,x\}$ and $(x)^- = \min\{0,x\}$. In the above, we assume a close-out agreement which uses the risk-free derivative values as reference valuation. At default of the counterparty, the bank receives only a fraction, $R \in [0,1)$, referred to as the recovery rate, of the positive part of each derivative. On the other hand, each derivative with a negative risk-free value at default needs to be added entirely to the portfolio. Note that for the risky value we need additional information about prior defaults of the counterparty, which is captured in the realization, $\nu \in \{0,1\}$, of the jump-to-default process, i.e., $\nu = 1$ if no default has occurred prior to, or at, $t$, and $\nu = 0$ otherwise. The notation above trivially holds for European-type derivatives, since the only exercise date is at maturity of the contract. Furthermore, a barrier-type feature could be added by also including a spatial dimension to $\mathbb{T}_j(0)$. The values of a portfolio consisting of $J$ derivatives, at market state $(t, X_t = x)$ and default state $\mathbb{1}^D_t = \nu$, are the sums of the individual derivative values. Using (3) and (4), they can be written as
$$\Pi_V(t,x) = \sum_{j=1}^J V_j(t,x), \tag{5}$$
$$\Pi_U(t,x,\nu) = \sum_{j=1}^J U_j(t,x,\nu). \tag{6}$$
Since the aim is to approximate the optimal exercise policy with neural networks, we wish to reformulate the problem into an optimization problem, in which the target function can be represented by a neural network. Following [1] and [2], we use so-called decision functions to determine, for each derivative and given a market state, whether or not to exercise the derivative. For $j \in \{1,2,\ldots,J\}$, decision function $j$, denoted by $f_j$, is of the form $f_j \colon [0,T] \times \mathbb{R}^d \to \{0,1\}$. In order to guarantee that an exercise decision can only occur at an exercise date, we require, for $s \notin \mathbb{T}_j(0)$, that $f_j(s,\cdot) \equiv 0$.
We now restrict our attention to the case where there is, for each derivative, a finite number of exercise dates, i.e., for $j \in \{1,2,\ldots,J\}$, it holds that $|\mathbb{T}_j(0)| \in \mathbb{N}$. From a theoretical perspective, this excludes American-type derivatives, but from a practical perspective, an infinite number of exercise dates is often approximated by a large, but finite, number of exercise dates. This implies that we can still consider American-type derivatives by increasing the number of exercise dates until the derivative value converges (until the value does not increase with additional exercise dates). We denote by $\mathbb{T}_\Pi(t)$ the set of dates which represent an exercise date for at least one of the $J$ derivatives. Mathematically, we define the exercise dates of the portfolio as
$$\mathbb{T}_\Pi(t) = \bigcup_{j=1}^J \mathbb{T}_j(t),$$
and the number of unique exercise dates in the portfolio is given by $N = |\mathbb{T}_\Pi(0)|$. We assume that the initial time $T_0 = 0$ is not an exercise date for any of the derivatives.
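The portfolio exercise-date set $\mathbb{T}_\Pi(0)$ is simply the union of the per-derivative date sets; a minimal sketch, where the specific dates are illustrative assumptions:

```python
# Building the portfolio exercise-date set T_Pi(0) as the union of the
# per-derivative date sets (the dates here are illustrative assumptions).
T_euro = {1.0}                        # European: only the maturity
T_berm = {0.25, 0.5, 0.75, 1.0}      # Bermudan: several exercise dates
T_berm2 = {0.5, 1.0, 1.5}            # a different maturity is allowed

T_pi = sorted(T_euro | T_berm | T_berm2)   # unique dates, increasing
N = len(T_pi)                              # number of portfolio exercise dates
```

Note that a date shared by several derivatives (here 0.5 and 1.0) appears only once in $\mathbb{T}_\Pi(0)$, so $N$ can be much smaller than the total number of per-derivative dates.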
To simplify, we use the following notation for the $N$ exercise dates, the risk factors evaluated at the $N$ exercise dates, and the discounting between exercise dates: $T_1 < T_2 < \cdots < T_N$, $X_k := X_{T_k}$, and $D_{k,\ell} := D_{T_k, T_\ell}$ for $k, \ell = 1, 2, \ldots, N$.
Furthermore, for $t \in [0,T]$, the risk-factor process on $[t,T]$, conditional on $X_t = x$, is denoted by $X^{t,x} = (X^{t,x}_u)_{u \in [t,T]}$, where we also note that $X^{0,x_0} = X$. The above notation allows us to express an $X^{t,x}$-stopping time in terms of decision functions: for $t \in (T_{n-1}, T_n]$,
$$\tau_{n,j}(f_j; X^{t,x}) = \min\big\{ T_k \in \mathbb{T}_j(t) : f_j(T_k, X^{t,x}_k) = 1 \big\}. \tag{10}$$
The notation above is used to emphasize that the decision function, $f_j$, controls the exercise strategy, given the stochastic process $X^{t,x}$. Moreover, $X^{t,x}$ is not just a random value at a specific time, but the entire process, starting at $X_t = x$ and until stopping occurs. In later sections, the valuation of a derivative or a portfolio is formulated as an optimization problem, which is optimized by varying $f_j$. Although the notation is practical when optimization is discussed, it is cumbersome to use when we define the value of a derivative. We therefore use the short-hand notation
$$\tau_{n,j} := \tau_{n,j}(f_j; X^{t,x}), \tag{11}$$
and keep in mind that the strategy is controlled by a decision function, $f_j$, and that, for $u \ge t$, the event $\{\tau_{n,j} \le u\}$ is $\sigma(X^{t,x})$-measurable. We can now define the value of the risk-free and risky derivatives, given an exercise strategy expressed in terms of decision functions. For derivative $j \in \{1,2,\ldots,J\}$ and market state $(t,x) \in (T_{n-1}, T_n] \times \mathbb{R}^d$, we define the parametrized valuation functions
$$V_j^{f_j}(t,x) = \mathbb{E}\left[ D_{t,\tau_{n,j}}\, g_j\big(\tau_{n,j}, X^{t,x}_{\tau_{n,j}}\big) \right], \tag{12}$$
$$U_j^{f_j}(t,x,\nu) = \nu\, \mathbb{E}\left[ I_{\{\tau_{n,j} < \tau_D\}}\, D_{t,\tau_{n,j}}\, g_j\big(\tau_{n,j}, X^{t,x}_{\tau_{n,j}}\big) + I_{\{\tau_D \le \tau_{n,j}\}}\, D_{t,\tau_D} \Big( R \big(V_j(\tau_D, X^{t,x}_{\tau_D})\big)^+ + \big(V_j(\tau_D, X^{t,x}_{\tau_D})\big)^- \Big) \right]. \tag{13}$$
Similarly, we define the portfolio values with respect to the exercise strategy given by $f$, as the parametrized functions
$$\Pi_V^{f}(t,x) = \sum_{j=1}^J V_j^{f_j}(t,x), \qquad \Pi_U^{f}(t,x,\nu) = \sum_{j=1}^J U_j^{f_j}(t,x,\nu), \tag{14}$$
where the value of the risky portfolio also depends on the default state of the counterparty, $\mathbb{1}^D_t = \nu \in \{0,1\}$. We now want to find decision functions such that, when inserted in (12) and (13), we obtain (3) and (4). With this in mind, we define, for $j \in \{1,2,\ldots,J\}$, $Z \in \{V,U\}$ (where $U$ is evaluated at $\nu = 1$) and $t \in [0,T]$, the (optimal) exercise regions, $\mathcal{E}^Z_j(t)$, in which it is optimal to exercise, and the (optimal) continuation regions, $\mathcal{C}^Z_j(t)$, in which it is optimal to hold on, by
$$\mathcal{E}^Z_j(t) = \big\{ x \in \mathbb{R}^d : Z_j(t,x) = g_j(t,x) \big\} \text{ for } t \in \mathbb{T}_j(0), \qquad \mathcal{E}^Z_j(t) = \emptyset \text{ otherwise}, \tag{15}$$
$$\mathcal{C}^Z_j(t) = \mathbb{R}^d \setminus \mathcal{E}^Z_j(t). \tag{16}$$
The above states that the derivative should be exercised if its value equals the immediate exercise value and we are at an exercise date, and that the derivative should not be exercised if its value is greater than the immediate exercise value or if we are not at an exercise date.
Furthermore, we denote by $f^Z$ the vector consisting of the individual decision functions, $f^Z = (f^Z_1, f^Z_2, \ldots, f^Z_J)^T$. For market states $(t,x) \in [0,T] \times \mathbb{R}^d$ and for a derivative $j \in \{1,2,\ldots,J\}$, it holds that
$$f^Z_j(t,x) = I_{\{x \in \mathcal{E}^Z_j(t)\}}$$
is an optimal decision function. The validity of the above is a direct consequence of Proposition 4 in [1]. In turn, this implies that by inserting the optimal decision functions in the functionals in equation (14), we obtain the risky and risk-free portfolio values, i.e.,
$$\Pi_V^{f^V}(t,x) = \Pi_V(t,x), \qquad \Pi_U^{f^U}(t,x,\nu) = \Pi_U(t,x,\nu).$$
In subsequent sections, the optimal decision function $f^Z$, for $Z \in \{V,U\}$, is approximated with a series of neural networks. The reason for using the rather complicated notation, (10), is that this structure allows us to view the valuation of the derivatives as an optimization problem over the set of decision functions, which we approximate on some finite-dimensional function space. One example of such a function space is the set of functions generated by a series of neural networks with a fixed number of parameters. When we use a specific strategy, e.g., $f^V$ or $f^U$, this is specified by adding a superscript referring to the particular strategy. For $Z \in \{V,U\}$, we define the short-hand notation
$$\tau^Z_{n,j} := \tau_{n,j}(f^Z_j; X^{t,x}), \tag{17}$$
where it is assumed that $t \in (T_{n-1}, T_n]$.
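A decision-function-based stopping rule can be sketched as follows: along a simulated path, the derivative is stopped at the first exercise date where the binary decision function fires. The helper name and the convention of forcing exercise at the last date (so that stopping always occurs) are illustrative assumptions.

```python
import numpy as np

# From decision functions to a stopping time (a sketch): the derivative is
# stopped at the first exercise date where its decision function equals 1.
def stopping_index(decisions: np.ndarray) -> int:
    """decisions[n] in {0, 1}: exercise decision at date T_{n+1} along one path.
    Returns the index of the first date with decision 1, or the last date
    if none fires (exercise at the final date is enforced as a fallback)."""
    hits = np.flatnonzero(decisions)
    return int(hits[0]) if hits.size else decisions.shape[0] - 1

path_decisions = np.array([0, 0, 1, 0, 1])   # f_j(T_n, X_n) evaluated along one path
n_star = stopping_index(path_decisions)      # stops at the first firing date
```

The measurability requirement of the previous subsection is respected by construction, since the decision at $T_n$ only uses the state at $T_n$.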

Risky portfolio valuation with netting
When considering the risky portfolio value with netting, the problem becomes nonlinear in the sense that the risky portfolio value is no longer the sum of the individual risky derivative values. In fact, there no longer exists "a risky value for a single derivative", since the valuation needs to be carried out on a portfolio level. Before we define the risky value of a netted portfolio, we need to define the exercise-state process $A \colon [0,T] \times \Omega \to \{0,1\}^J$, whose component $j$ is given by
$$(A_t)_j = \begin{cases} 0, & \text{if derivative } j \text{ has been exercised prior to } t, \\ 1, & \text{else.} \end{cases}$$
Similar to (9), we use the short-hand notation for $A$ at the initial date, $T_0$, and the exercise dates, $T_1, \ldots, T_N$: $A_k := A_{T_k}$ for $k = 0, 1, \ldots, N$. The process $A$ is $\mathcal{H}_t$-measurable, but it is not enough to know $X_t = x$ in order to determine $A_t$. The reason for defining $A$ is that the exercise decisions for the netted risky portfolio depend on both the market state and the exercise state. This means that, at each exercise date, in addition to the current market state, we need to know which derivatives in the portfolio have been exercised prior to the current time, in order to make optimal exercise decisions. We denote by $\mathcal{T}(t)$ the space of $(X^{t,x}, A^{t,\alpha})$-stopping times vectors taking on values in $\mathbb{T}(t)$, and we denote by $\tau_{n,j}$ element $j$ of a stopping times vector $\tau_n \in \mathcal{T}(t)$. The netted portfolio value with a risky counterparty, given market state $X_t = x$, default state of the counterparty $\mathbb{1}^D_t = \nu$ and portfolio state (exercise state of the derivatives in the portfolio) $A_t = \alpha$, is given by
$$\Pi_A(t,x,\nu,\alpha) = \nu \sup_{\tau_n \in \mathcal{T}(t)} \mathbb{E}\Bigg[ \sum_{j=1}^J I_{\{\tau_{n,j} < \tau_D\}}\, \alpha_j\, D_{t,\tau_{n,j}}\, g_j\big(\tau_{n,j}, X^{t,x}_{\tau_{n,j}}\big) + I_{\{\tau_D \le T\}}\, D_{t,\tau_D} \Bigg( R \Big( \sum_{j=1}^J \big(A^{t,\alpha}_{\tau_D}\big)_j V_j\big(\tau_D, X^{t,x}_{\tau_D}\big) \Big)^{+} + \Big( \sum_{j=1}^J \big(A^{t,\alpha}_{\tau_D}\big)_j V_j\big(\tau_D, X^{t,x}_{\tau_D}\big) \Big)^{-} \Bigg) \Bigg]. \tag{19}$$
The optimal stopping strategies of the individual derivatives in the portfolio are no longer independent of each other, as in (5) and (6). Furthermore, the optimal strategy depends on earlier exercise decisions, meaning that in order to make the exercise decisions Markovian, we need to include information about earlier decisions. The reason for this is the nonlinearity introduced by the two sums inside the expectation in (19). Therefore, the optimal stopping strategies need to be computed for the entire portfolio simultaneously. To the best of our knowledge, this has not been done in an ordinary least squares setting before. However, it is discussed in a PDE framework in the case of a portfolio of American swaptions in [23].
Remark 2.2. The value of the risky portfolio with netting depends on the $J$ risk-free derivative values, and we are therefore required to approximate the risk-free derivative values. The reason for this is that, at default, the risk-free value of the portfolio is used as the reference value in the close-out agreement (see Equation (19)). If we restrict our portfolio to derivatives with positive pay-off functions, then $V_j$ can be replaced by $g_j$ (by the definition of $V_j$ and the law of iterated expectations). Furthermore, in the restricted portfolio, the values with and without netting coincide.
An $(X^{t,x}, A^{t,\alpha})$-stopping times vector can be defined component-wise by
$$\big(\tilde{\tau}_n(f; X^{t,x}, A^{t,\alpha})\big)_j = \min\big\{ T_k \in \mathbb{T}_j(t) : \big( f(T_k, X^{t,x}_k, A^{t,\alpha}_k) \odot A^{t,\alpha}_k \big)_j = 1 \big\}, \tag{20}$$
where $\odot$ is element-wise multiplication and $\mathbf{1}_J$ is the $J$-dimensional vector with only ones, $(1,1,\ldots,1)^T$. We denote element $j$ of the stopping times vector by $\tilde{\tau}_{n,j}$. We emphasize that each element of the stopping times vector depends on the full vector $f$, and not only on element $j$, which is the case without netting. Similar to (11), we introduce a short-hand notation, which simplifies the valuation function,
$$\tilde{\tau}_{n,j} := \big(\tilde{\tau}_n(f; X^{t,x}, A^{t,\alpha})\big)_j. \tag{21}$$
We here use "$\tilde{\tau}$" instead of "$\tau$", as in (11), to emphasize that the stopping time also takes $A^{t,\alpha}$ as an argument. The netted risky portfolio value, given the exercise strategy obtained by decision function $f$, is then given by
$$\Pi_A^{f}(t,x,\nu,\alpha) = \nu\, \mathbb{E}\Bigg[ \sum_{j=1}^J I_{\{\tilde{\tau}_{n,j} < \tau_D\}}\, \alpha_j\, D_{t,\tilde{\tau}_{n,j}}\, g_j\big(\tilde{\tau}_{n,j}, X^{t,x}_{\tilde{\tau}_{n,j}}\big) + I_{\{\tau_D \le T\}}\, D_{t,\tau_D} \Bigg( R \Big( \sum_{j=1}^J \big(A^{t,\alpha}_{\tau_D}\big)_j V_j\big(\tau_D, X^{t,x}_{\tau_D}\big) \Big)^{+} + \Big( \sum_{j=1}^J \big(A^{t,\alpha}_{\tau_D}\big)_j V_j\big(\tau_D, X^{t,x}_{\tau_D}\big) \Big)^{-} \Bigg) \Bigg], \tag{22}$$
where we, again, remind ourselves that the exercise strategy is controlled by $f$, and, for $u \ge t$, the event $\{\tilde{\tau}_{n,j} \le u\}$ is $\sigma(X^{t,x}, A^{t,\alpha})$-measurable. For a netted portfolio, the optimal exercise regions, described in Section 2.2, are less trivial. Firstly, they become dependent on the state of earlier exercise decisions, $A_t = \alpha_t \in \{0,1\}^J$. Secondly, the exercise region for derivative $j \in \{1,2,\ldots,J\}$ is expressed under the condition that an optimal exercise strategy for the other $J-1$ derivatives is applied. Therefore, we only describe the optimal decision function as the one attaining the supremum over the space, $\mathcal{D}$, of all measurable functions, $f \colon [0,T] \times \mathbb{R}^d \times \{0,1\}^J \to \{0,1\}^J$, i.e.,
$$f^A = \underset{f \in \mathcal{D}}{\arg\sup}\ \Pi_A^{f}(0, x_0, \nu_0, a_0), \tag{23}$$
where $\nu_0 = 1$ (no default prior to, or at, $t = 0$) and $a_0 = (1,1,\ldots,1)^T$ (no derivatives have been exercised prior to $t = 0$). We then assume that, given the state $(t, x, \nu, \alpha)$, the same decision function is optimal, i.e.,
$$\Pi_A^{f^A}(t,x,\nu,\alpha) = \Pi_A(t,x,\nu,\alpha). \tag{24}$$
Similar to (17), when we want to emphasize the particular choice of decision function, $f^A$, we use the short-hand notation
$$\tilde{\tau}^A_{n,j} := \big(\tilde{\tau}_n(f^A; X^{t,x}, A^{t,\alpha})\big)_j, \tag{25}$$
where it is assumed that $t \in (T_{n-1}, T_n]$.

Credit valuation adjustment of a derivative portfolio
The formal definition of CVA is the difference between the risk-free and the risky portfolio value. Given models of the underlying market and default events of our counterparty, the above definition of CVA is straightforward for a portfolio consisting of derivatives without optionality, e.g., European options, barrier options, etc. When it comes to portfolios consisting of derivatives with true optionality, e.g., Bermudan options, American options, etc., the standard procedure is not clear. In this section, we define the CVA for portfolios of derivatives with true optionality, as well as some approximations which simplify the computations. In the definitions of CVA, we use the portfolio valuations in terms of optimally chosen decision functions given in equations (14) and (24). The CVA at $(t=0, X_0 = x_0)$, without and with netting, respectively, are given by
$$\mathrm{CVA} = \Pi_V(0, x_0) - \Pi_U(0, x_0, 1), \tag{26}$$
$$\mathrm{CVA}^{\mathrm{Net}} = \Pi_V(0, x_0) - \Pi_A(0, x_0, 1, a_0). \tag{27}$$
A commonly used approximation is to apply the same exercise strategy to the risk-free and risky portfolios. One such approximation is defined as
$$\widehat{\mathrm{CVA}} = \Pi_V(0, x_0) - \Pi_U^{f^V}(0, x_0, 1), \tag{28}$$
$$\widehat{\mathrm{CVA}}^{\mathrm{Net}} = \Pi_V(0, x_0) - \Pi_A^{f^V}(0, x_0, 1, a_0). \tag{29}$$
The only difference between (26)-(27) and (28)-(29) is that in the latter the risk-free strategy is used also for the risky portfolios. One could also think of other definitions, e.g., using the risky strategies for both portfolios. This particular choice is motivated by the fact that $f^U$ and $f^A$ are, in general, dependent on $f^V$ through the close-out agreements in (13) and (22). Moreover, $f^V$ is a sub-optimal strategy for both risky portfolios, leading to $\mathrm{CVA} \le \widehat{\mathrm{CVA}}$ and $\mathrm{CVA}^{\mathrm{Net}} \le \widehat{\mathrm{CVA}}^{\mathrm{Net}}$, which is beneficial for the bank (but certainly not for the counterparty). If we instead used only the risky decision functions, i.e., replacing $f^V$ with $f^U$ in (28) and with $f^A$ in (29), we would obtain an underestimation of the CVAs, which would be unacceptable for the bank.
As mentioned in the Introduction, the bank is exposed to the risk of CVA losses, as a consequence of the MtM value of the CVA moving against the bank. We therefore want to follow the evolution of the CVA over time, to gain insight into its distribution. Of particular interest is the tail distribution of the CVA, for times between the initial time and the maturity of the portfolio. To explore this, we define the dynamic versions of (26)-(29), which are stochastic processes depending on the market and portfolio state processes $X$ and $A$. For $t \in [0,T]$, the dynamic versions of the CVAs (and their approximations) are given by the following random variables:
$$\mathrm{CVA}(t) = \Pi_V(t, X_t) - \Pi_U(t, X_t, 1), \qquad \widehat{\mathrm{CVA}}(t) = \Pi_V(t, X_t) - \Pi_U^{f^V}(t, X_t, 1),$$
$$\mathrm{CVA}^{\mathrm{Net}}(t) = \Pi_V(t, X_t) - \Pi_A(t, X_t, 1, A_t), \qquad \widehat{\mathrm{CVA}}^{\mathrm{Net}}(t) = \Pi_V(t, X_t) - \Pi_A^{f^V}(t, X_t, 1, A_t).$$
In the above, the CVA is conditional on the counterparty not having defaulted prior to, or at, $t$ (it does not make sense to calculate the CVA if the counterparty has already defaulted). From the above, we can define the Expected value of the CVA (E-CVA), and, for $\alpha \in (0,1)$, the $\alpha$-level Value at Risk of the CVA (VaR-CVA) and Expected Shortfall of the CVA (ES-CVA),
$$\text{E-CVA}(t) = \mathbb{E}\big[\mathrm{CVA}(t)\big], \quad \text{VaR-CVA}_\alpha(t) = \inf\big\{ c \in \mathbb{R} : \mathbb{Q}\big(\mathrm{CVA}(t) \le c\big) \ge \alpha \big\}, \quad \text{ES-CVA}_\alpha(t) = \frac{1}{1-\alpha} \int_\alpha^1 \text{VaR-CVA}_u(t)\, \mathrm{d}u.$$
In a similar way, $\widehat{\text{E-CVA}}(t)$, $\widehat{\text{ES-CVA}}_\alpha(t)$, $\text{E-CVA}^{\mathrm{Net}}(t)$, $\text{ES-CVA}^{\mathrm{Net}}_\alpha(t)$, $\widehat{\text{E-CVA}}^{\mathrm{Net}}(t)$ and $\widehat{\text{ES-CVA}}^{\mathrm{Net}}_\alpha(t)$ are defined. The expression for the ES-CVA looks complicated but is basically just the expected value of the $\alpha$-tail of the CVA distribution. We focus on ES-CVA instead of VaR-CVA because it is a coherent risk measure, while VaR-CVA is not. Remark 2.3. Since ES-CVA is a non-traded risk measure, it should ideally be computed under the real-world measure $\mathbb{P}$, see e.g., [25] for a detailed discussion. To be precise, $(X_t, A_t)$ should be generated under the $\mathbb{P}$-measure and the CVA, which is a tradeable asset, should be computed under the $\mathbb{Q}$-measure. It is straightforward to adjust the algorithms in this paper to be able to compute ES-CVA under the $\mathbb{P}$-measure, see [2] for details in the special case $J = 1$.
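Given simulated samples of CVA(t), the empirical VaR-CVA and ES-CVA at level α are straightforward to estimate; the lognormal stand-in distribution below is purely an assumption for illustration, not a model from the paper.

```python
import numpy as np

# Empirical VaR-CVA and ES-CVA at level alpha from simulated CVA(t) samples
# (the sample distribution and level are assumptions; ES is the mean of the tail).
rng = np.random.default_rng(1)
cva_samples = rng.lognormal(mean=0.0, sigma=0.5, size=100_000)  # stand-in CVA(t) draws
alpha = 0.99

var_cva = np.quantile(cva_samples, alpha)              # VaR-CVA: alpha-quantile
es_cva = cva_samples[cva_samples >= var_cva].mean()    # ES-CVA: mean beyond VaR
```

For a continuous distribution this tail-mean estimator coincides with the integral definition of ES, which is why ES-CVA is "basically just the expected value of the α-tail" of the CVA distribution.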

Exposure profiles
In this subsection, we discuss the concept of exposure profiles for a portfolio of derivatives. The financial exposure (of the bank) is defined as the maximum amount the bank stands to lose if the counterparty defaults. The exposure profile is loosely defined as the distribution of the exposure over time. The exposures, without and with netting, are defined as
$$E(t) = \sum_{j=1}^J (A_t)_j \big( V_j(t, X_t) \big)^+, \qquad E^{\mathrm{Net}}(t) = \Big( \sum_{j=1}^J (A_t)_j\, V_j(t, X_t) \Big)^{+},$$
where we recall that $(A_t)_j = I_{\{\tau_j > t\}}$, with $\tau_j$ being the exercise date for derivative $j$. Furthermore, for a portfolio without netting, the expected exposure (EE), and, for $\alpha \in (0,1)$, the potential future exposure (PFE), are defined as
$$\mathrm{EE}(t) = \mathbb{E}\big[ E(t) \big], \tag{33}$$
$$\mathrm{PFE}_\alpha(t) = \inf\big\{ y \in \mathbb{R} : \mathbb{Q}\big( E(t) \le y \big) \ge \alpha \big\}. \tag{34}$$
Both the expectation and the probability in (33) and (34) should be interpreted as conditional on $X_0 = x_0 \in \mathbb{R}^d$. The EE and PFE in the presence of netting, denoted by $\mathrm{EE}^{\mathrm{Net}}(\cdot)$ and $\mathrm{PFE}^{\mathrm{Net}}_\alpha(\cdot)$, are obtained by instead using the netted exposure in (33) and (34).
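EE(t) and PFE_α(t) can be estimated pathwise from simulated exposures; in the sketch below a toy Brownian exposure replaces the portfolio exposure defined above, so the model and the 95% level are assumptions for illustration.

```python
import numpy as np

# Expected exposure EE(t) and potential future exposure PFE_alpha(t) from
# simulated exposure paths (the toy Brownian exposure model is an assumption).
rng = np.random.default_rng(2)
n_paths, n_times = 20_000, 50
t = np.linspace(0.0, 1.0, n_times)
# Toy exposure: positive part of a driftless Brownian motion, per path and time.
w = np.cumsum(rng.standard_normal((n_paths, n_times - 1)) * np.sqrt(t[1]), axis=1)
exposure = np.maximum(np.hstack([np.zeros((n_paths, 1)), w]), 0.0)

ee = exposure.mean(axis=0)                        # EE(t): pathwise mean per time point
pfe = np.quantile(exposure, 0.95, axis=0)         # PFE_0.95(t): 95% quantile per time point
```

The same two lines at the end apply unchanged to netted exposures: only the `exposure` array (sum of positive parts versus positive part of the sum) changes.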
If we assume a constant recovery rate $R \in [0,1)$, and that $X$ and $\mathbb{1}^D$ are independent, i.e., the default event of the counterparty is independent of the risk factors, then (28) and (29) can be written as
$$\widehat{\mathrm{CVA}} = (1-R) \int_0^T \mathbb{E}\big[ D_{0,t}\, E(t) \big]\, \mathrm{d}\mathbb{Q}(\tau_D \le t),$$
which can be approximated as
$$\widehat{\mathrm{CVA}} \approx (1-R) \sum_{m=1}^{M} \mathbb{E}\big[ D_{0,t_m}\, E(t_m) \big] \big( \mathbb{Q}(\tau_D \le t_m) - \mathbb{Q}(\tau_D \le t_{m-1}) \big),$$
for some partition of $[0,T]$ with $t_0 = 0$ and $t_M = T$, and similarly for $\widehat{\mathrm{CVA}}^{\mathrm{Net}}$ with $E$ replaced by $E^{\mathrm{Net}}$. The above formulations require access to the density of default events, but may be more accurate, especially for large $M$ and a small probability of default (with a simulation-based approach, problems with a low probability of default can often be tackled with variance reduction techniques).
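Under the independence assumption, the discretized approximation can be sketched as follows; the constant hazard rate, the grid, and the toy discounted-EE profile are assumptions made for illustration only.

```python
import numpy as np

# Discretized CVA under independence of market and default (a sketch):
# CVA ~ (1 - R) * sum_m EE(t_m) * Q(t_{m-1} < tau_D <= t_m),
# with a constant hazard rate lambda as an assumption.
R, lam = 0.4, 0.02
t = np.linspace(0.0, 5.0, 61)                    # partition t_0 = 0, ..., t_M = T = 5
ee = 10.0 * np.sqrt(t)                           # toy discounted EE profile
surv = np.exp(-lam * t)                          # survival probabilities Q(tau_D > t)
pd_incr = surv[:-1] - surv[1:]                   # default probability per interval

cva_approx = (1.0 - R) * np.sum(ee[1:] * pd_incr)
```

Note that for a small hazard rate the default-probability increments are tiny, which is exactly the setting where variance reduction helps a simulation-based alternative.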

Algorithms
In the first part of this section, we present a neural network-based method to approximate the decision functions introduced in the previous section. The method generalizes the Deep Optimal Stopping algorithm proposed in [1] and extended in [2], which approximates stopping decisions for a single derivative, to be applicable also to portfolios of derivatives with early-exercise features. Furthermore, for the risky portfolios, the algorithm is extended to be able to deal with default risk of the counterparty. The algorithm is based on a series of neural networks, which are optimized backwards in time with the objective of maximizing the expected discounted cash-flows.
In the second part of this section, the exercise policy obtained from the approximate decision functions is applied pathwise to realizations of the risk factors of each derivative in the portfolio to generate pathwise cash-flows. These cash-flows are used in a neural network-based regression algorithm to approximate pathwise derivative values. These pathwise derivative values can then be used to compute important risk management measures.
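The idea of regressing realized discounted cash-flows on the market state to obtain pathwise values can be sketched with an ordinary polynomial least-squares fit standing in for the neural network regression; the state distribution, the "true" value function, and the noise level below are all assumptions for illustration.

```python
import numpy as np

# Pathwise valuation sketch: regress realized discounted cash-flows on the
# market state at time t to estimate the value function V(t, x) pathwise.
# A polynomial least-squares fit stands in for a neural network regression.
rng = np.random.default_rng(3)
x_t = rng.uniform(80.0, 120.0, size=10_000)                # state at time t
true_value = np.maximum(100.0 - x_t, 0.0) + 5.0            # assumed "true" value function
cash_flows = true_value + rng.normal(0.0, 2.0, x_t.shape)  # noisy discounted cash-flows

beta = np.polyfit(x_t, cash_flows, deg=5)
value_hat = np.polyval(beta, x_t)                          # pathwise value estimates
rmse = np.sqrt(np.mean((value_hat - true_value) ** 2))
```

The regression averages out the pathwise cash-flow noise, so the fitted values approximate the conditional expectation of the cash-flows given the state, which is exactly the pathwise value needed for the exposure and CVA computations above.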

Phase I: Learning exercise strategy
As indicated above, the core of the algorithm is to approximate decision functions, in order to obtain good approximations of the value of a portfolio of derivatives. We approximate the decision functions $f^V$, $f^U$ and $f^A$ with fully connected neural networks. To be more precise, let $N = |\mathbb{T}_\Pi(0)|$; for $n \in \{1,2,\ldots,N\}$ and for $Z \in \{V,U\}$, the decision function $f^Z(T_n, \cdot)$ is approximated by a fully connected neural network of the form
$$f^{\theta_n} \colon \mathbb{R}^d \to \{0,1\}^J, \tag{35}$$
where $\theta_n \in \mathbb{R}^{q_n}$ is a vector containing all the $q_n \in \mathbb{N}$ trainable parameters in network $n$. The decision function $f^A(T_n, \cdot, \cdot)$ is approximated by similar neural networks, with the only difference that the input also includes information on which derivatives in the portfolio have been exercised prior to $T_n$, i.e.,
$$f^{\theta_n} \colon \mathbb{R}^d \times \{0,1\}^J \to \{0,1\}^J. \tag{36}$$
Since binary decision functions are discontinuous, and therefore unsuitable for gradient-type optimization algorithms, we use, as an intermediate step, the neural network $F^{\theta_n} \colon \mathbb{R}^d \to (0,1)^J$. Instead of a binary decision, the output of the neural network $F^{\theta_n}$ can be viewed as the probability for exercise to be optimal. This output is then mapped to 1 for values above (or equal to) 0.5, and to 0 otherwise, by defining $f^{\theta_n}(\cdot) = a \circ F^{\theta_n}(\cdot)$, where $a$ is a component-wise round-off function, i.e., for $j \in \{1,2,\ldots,J\}$ and $p \in (0,1)^J$, the $j$:th component of $a(p)$ is given by $(a(p))_j = I_{\{p_j \ge 1/2\}}$. For each $Z \in \{V,U\}$, our aim is to adjust the parameters $\theta_1, \theta_2, \ldots, \theta_N$ such that $f^{\theta_n}(\cdot) \approx f^Z(T_n, \cdot)$ for each $n$, where we recall that $\Theta = \{\theta_1, \theta_2, \ldots, \theta_N\}$. For $n \in \{1,2,\ldots,N\}$, we define the sequence of neural networks, approximating the decision functions at exercise dates $T_n, T_{n+1}, \ldots, T_N$, by $f^{\Theta}_n := (f^{\theta_n}, f^{\theta_{n+1}}, \ldots, f^{\theta_N})^T$. Note that the input dimension of the neural networks is different when we want to approximate $f^V$ and $f^U$ compared to when we want to approximate $f^A$.
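The soft-to-hard decision construction $f^{\theta} = a \circ F^{\theta}$ can be sketched with a tiny fully connected network; the architecture, the ReLU activation, and the (untrained) random weights are stand-ins for illustration, not the paper's trained networks.

```python
import numpy as np

# Soft-to-hard exercise decisions (a sketch): a tiny fully connected network
# F_theta with sigmoid output in (0,1)^J, rounded to the binary decision
# f_theta = a(F_theta) with (a(p))_j = 1 iff p_j >= 1/2. The weights are
# random stand-ins, not trained parameters.
rng = np.random.default_rng(4)
d, hidden, J = 3, 8, 2                      # risk factors, hidden width, derivatives

W1, b1 = rng.standard_normal((hidden, d)), np.zeros(hidden)
W2, b2 = rng.standard_normal((J, hidden)), np.zeros(J)

def F_theta(x: np.ndarray) -> np.ndarray:
    h = np.maximum(W1 @ x + b1, 0.0)                  # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))       # sigmoid "exercise probabilities"

def f_theta(x: np.ndarray) -> np.ndarray:
    return (F_theta(x) >= 0.5).astype(int)            # component-wise round-off a(.)

probs = F_theta(np.ones(d))      # soft output used during gradient-based training
decisions = f_theta(np.ones(d))  # hard output used when exercising
```

During training the smooth output `F_theta` is differentiated, while the rounded `f_theta` is what defines the actual exercise strategy once training is done.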
To avoid introducing an extra layer of notation, we use $\Theta$ to denote the set of parameters for all networks, keeping in mind that the dimension depends on the specific problem considered. Although the above provides a good intuition for what we want to accomplish, it is not clear in which sense we want the functions to be similar, or how to adjust the parameters to achieve this. To approach a more tractable form from a computational perspective, for $t \in (T_{n-1}, T_n]$, we insert (35) in (10) and (36) in (20) to obtain (37) and (38). Note that (37) is a $J$-dimensional vector of $X$-stopping times and (38) is a $J$-dimensional vector of $(X, A)$-stopping times, which depend on $f^{\Theta}_n$ on a structural level, but also on the randomness of the stochastic process $X^{t,x}$ (and $A^{t,\alpha}$ for (38)). For notational convenience, we use the shorthand notation $\tau^{\Theta}_n$ (or $\bar\tau^{\Theta}_n$ when approximating $f_A$), and for element $j \in \{1, 2, \ldots, J\}$, $\tau^{\Theta}_{n,j} = (\tau^{\Theta}_n)_j$ (or $\bar\tau^{\Theta}_{n,j} = (\bar\tau^{\Theta}_n)_j$). We are now ready to define our objective, which, for $T_n \in \mathcal{T}_\Pi(0)$, is to find $\theta_n$ such that the expected future cash-flows are maximized. The cash-flows can be divided into three categories: 1. The cash-flows obtained from the derivatives exercised at the present time $T_n$; 2. The cash-flows obtained at later exercise dates prior to default of the counterparty; 3. The cash-flows obtained at default of the counterparty, according to the close-out agreement.
For $n \in \{1, 2, \ldots, N\}$ and $j \in \{1, 2, \ldots, J\}$, we denote dimension $j$ of the decision function $f_{\theta_n}$ by $f^{\theta_n}_j$, and similarly for $F^{\theta_n}_j$. Given that no default has occurred prior to $T_n$ (and, for the risky portfolio with netting, that the exercise state is $A_n = \alpha$), the expected cash-flows that we want to maximize are given below.
Risk-free portfolio: (39). Risky portfolio without netting: (40). Risky portfolio with netting: (41). We want to optimize $\theta_n$ such that the above are as close as possible (in the mean-squared sense) to $\Pi_V(T_n, X_n)$, $\Pi_U(T_n, X_n, 1)$ and $\Pi_A(T_n, X_n, 1, \alpha)$, respectively.
Remark 3.1. The objectives for the risky portfolios, in (40) and (41), both depend on the risk-free valuation of the derivatives. Therefore, in order to approximate the risky decision functions, we first need to approximate the risk-free exercise strategy and the risk-free derivative values. In the next subsection, we explain how $V_j$ can be approximated.
Although (39)-(41) are accurate representations of the optimization problems, they pose some practical problems. In general, we have no access to $f_V$, $f_U$ and $f_A$, which control $\tau^V_{n+1}$, $\tau^U_{n+1}$ and $\bar\tau^A_{n+1}$. Another problem is that, in general, we have no access to the true distributions of the portfolio values for comparison. However, if $T_{\tilde N}$ is the maturity of derivative $j \in \{1, 2, \ldots, J\}$, it is optimal to exercise as long as the pay-off value is positive, by the definition of the decision functions. We can therefore set the decision function at maturity accordingly. Recall that if $T_N$ is greater than the maturity of contract $j$, we have $g_j(T_N, \cdot) \equiv 0$. Since all components in (42) are known except for the decision function $f_{\theta_{N-1}}$, we want to find $\theta_{N-1}$ such that a Monte Carlo approximation of (42) is maximized. Given $M \in \mathbb{N}$ samples, distributed as $X$, which for $m \in \{1, 2, \ldots, M\}$ are denoted by $x(m) = (x_t(m))_{t\in[0,T]}$, we approximate (42) by (43). The only unknown entity in (43) is the parameter $\theta_{N-1}$ in the decision function. Furthermore, we wish to find $\theta_{N-1}$ such that (43) is maximized, since it represents the average cash-flow on $[T_{N-1}, T_N]$. Once $\theta_{N-1}$ is optimized, we use this parameter to set up a similar expression for the expected cash-flow on $[T_{N-2}, T_N]$, which is maximized by finding an optimal $\theta_{N-2}$. This procedure is then continued iteratively until $\theta_{N-3}, \theta_{N-4}, \ldots, \theta_1$ are also optimized. The procedure is similar for the risky portfolios, but based on (40) or (41) instead; this implies that we also need to sample default events of the counterparty. We denote by $\theta^*_n$ the optimized version of parameter $\theta_n$; the sequence of optimized parameters for the networks at exercise dates $T_n, T_{n+1}, \ldots, T_N$ is defined as $\Theta^*_n := \{\theta^*_n, \theta^*_{n+1}, \ldots, \theta^*_N\}$, and, for notational convenience, we define the complete sequence of parameters as $\Theta^* := \Theta^*_1$. Remark 3.2.
Since we are considering a portfolio in which the derivatives may have different sets of exercise dates, for $T_n \in \mathcal{T}_\Pi(0)$ there are $J^{\mathrm{Ex}}_n \in \{1, 2, \ldots, J\}$ derivatives that may be exercised. Therefore, we only need to compute $J^{\mathrm{Ex}}_n$ of the $J$ dimensions of $f_{\theta_n}$, and can by default set the remaining $J - J^{\mathrm{Ex}}_n$ dimensions of $f_{\theta_n}$ to 0. This can be done by appropriately adjusting some weights and biases.
To keep the flow of the paper, the details of the algorithms and the parameters $\theta_n$ are given in the Appendix.
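One backward step of the procedure described above can be sketched as follows, for a toy single-derivative portfolio. The market dynamics, the moneyness feature, the continuation cash-flows and the two-parameter logistic decision rule are all stand-ins for illustration, not the paper's setup: the point is only the mechanic of maximizing the Monte Carlo average cash-flow over the soft decision, then rounding to a hard decision and updating the pathwise cash-flows for the next step back in time.

```python
import numpy as np

rng = np.random.default_rng(1)
M, r, dt, K = 50_000, 0.05, 0.5, 100.0

# Simulated market states at one exercise date, and the two competing
# pathwise cash-flows: immediate exercise versus (discounted) continuation.
x = 100.0 * np.exp((r - 0.02) * dt + 0.2 * np.sqrt(dt) * rng.normal(size=M))
exercise_cf = np.maximum(K - x, 0.0)
continue_cf = np.exp(-r * dt) * np.maximum(K - 1.01 * x, 0.0)  # toy continuation

def soft_objective(w, b):
    """Monte Carlo average of F * exercise + (1 - F) * continuation,
    with a logistic soft decision F of the moneyness feature (K - x) / K."""
    F = 1.0 / (1.0 + np.exp(-(w * (K - x) / K + b)))
    return np.mean(F * exercise_cf + (1.0 - F) * continue_cf), F

w, b, lr = 0.0, 0.0, 0.5
J0, _ = soft_objective(w, b)
for _ in range(200):                 # plain gradient ascent on (w, b)
    _, F = soft_objective(w, b)
    weight = F * (1.0 - F) * (exercise_cf - continue_cf)
    w += lr * np.mean(weight * (K - x) / K)
    b += lr * np.mean(weight)
J1, F = soft_objective(w, b)

# Hard decision a(F): update the pathwise cash-flows for the next step back.
cf = np.where(F >= 0.5, exercise_cf, continue_cf)
```

In the paper's setting, the logistic rule is replaced by the network $F_{\theta_n}$, the gradient step by an Adam update, and the updated cash-flows `cf` feed the optimization of $\theta_{n-1}$.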

Phase II: Learning pathwise derivative values and portfolio exposures
As mentioned in Remark 3.1, the risk-free derivative values need to be approximated pathwise in order to approximate the risky decision functions. Moreover, the pathwise derivative values are required to approximate the exposure profiles for the risk-free, as well as the risky, portfolios. In this subsection, we focus on the risk-free portfolios, but the extension to risky portfolios is straightforward.
We use the risk-free stopping strategy from Subsection 3.1 to generate pathwise cash-flows, which are in turn used to approximate the pathwise derivative values. Let $T_{n-1}, T_n \in \mathcal{T}_\Pi(0)$; for $t \in (T_{n-1}, T_n]$, we denote the vector-valued discounting process and pay-off function, respectively, as in (44). Using this notation, we define the vector-valued cash-flow process as in (45), and for $j \in \{1, 2, \ldots, J\}$, we denote the $j$:th element of $Y_t$ by $Y_{t,j}$. We emphasize that $Y_t$ is not $\mathcal{H}_t$-measurable.
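The pathwise discounted cash-flow vector can be sketched as below. The exercise times and pay-offs are random placeholders for the quantities produced by the Phase I stopping strategy, and a constant short rate `r` is assumed for the discounting.

```python
import numpy as np

rng = np.random.default_rng(3)
M, J, r, t = 4, 3, 0.05, 0.0

# Placeholders for the pathwise exercise times tau[m, j] and the pay-offs
# g_j(tau, X_tau) produced by the Phase I stopping strategy.
tau = rng.uniform(0.5, 2.0, size=(M, J))
payoff_at_tau = rng.uniform(0.0, 10.0, size=(M, J))

# Y_{t,j} = D(t, tau_j) * g_j(tau_j, X_{tau_j}), with constant short rate r.
Y = np.exp(-r * (tau - t)) * payoff_at_tau
portfolio_cf = Y.sum(axis=1)   # aggregated cash-flow per path
print(Y.shape, portfolio_cf.shape)
```

The per-derivative columns of `Y` are the regression targets for DOS-IR, while the row sums `portfolio_cf` correspond to the portfolio-level targets used in DOS-PR.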

Regression problems
In this subsection we use standard regression theory, see e.g., [28], to show that derivative values, exposures, and other quantities of interest can be formulated as solutions to certain minimization problems. We introduce notation for measurable functions in (46). First, we recall a basic property of the regression function. Let $X\colon [0,T] \times \Omega \to C_1$ and $Y\colon [0,T] \times \Omega \to C_2$ be stochastic processes which, for $t, u \in [0,T]$ with $t \leq u$, satisfy $\mathbb{E}_0[|Y_t|^2] < \infty$. We define the regression function as a minimizer $m(t, \cdot)$ of the mean-squared distance in (47), where $\|\cdot\|_2$ is the Euclidean norm. It then holds, for $x \in \mathbb{R}^a$, that the regression function is given by the conditional expectation in (48). Using the notation from above, and by choosing $C_1$, $C_2$, $X$ and $Y$ in (47) and (48) wisely, we can approximate different quantities related to, e.g., the exposure profiles or the pathwise CVA. For instance, the exposures, both with and without netting, can be approximated from the solution of a minimization problem of the form in (47). For $T_{n-1}, T_n \in \mathcal{T}_\Pi(0)$, let $t \in (T_{n-1}, T_n]$ and denote $V(t, \cdot) = (V_1(t, \cdot), \ldots, V_J(t, \cdot))^{\mathrm{T}}$; it then holds that (49) and (50) are satisfied. In (49), we have $C_1 = \mathbb{R}^d$, $C_2 = \mathbb{R}^J$, $X = X$ and $Y = Y$, and in (50) the corresponding choices for the netted portfolio. For $s \in [0, T]$ and $j \in \{1, 2, \ldots, J\}$, by (2), it holds that $\mathbb{E}_0[|Y_{s,j}|^2] < \infty$, and therefore also $\mathbb{E}_0[\|Y_s\|_2^2] < \infty$. Now, (49) holds by definition, using linearity of expectations and the fact that $A_t$ is $\mathcal{H}_t$-measurable.
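The basic property above, namely that the $L^2$-minimizer over functions of $X$ is the conditional expectation $\mathbb{E}[Y \mid X]$, can be illustrated with a toy least-squares regression. Here $Y = X^2 + \text{noise}$ by construction, so the conditional mean is $x \mapsto x^2$, and a polynomial least-squares fit over a cubic family recovers it; the data and basis are illustrative stand-ins for the function classes in (47).

```python
import numpy as np

rng = np.random.default_rng(2)
M = 100_000
X = rng.uniform(-1.0, 1.0, size=M)
Y = X**2 + 0.1 * rng.normal(size=M)   # E[Y | X = x] = x^2 by construction

# Least squares over a cubic polynomial family approximates the minimizer
# of the mean-squared distance, i.e., the conditional expectation x -> x^2.
A = np.vander(X, 4)
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)

def m_hat(x):
    """Fitted regression function, approximating E[Y | X = x]."""
    return np.vander(np.atleast_1d(np.asarray(x, dtype=float)), 4) @ coef
```

In the paper, the polynomial family is replaced by a neural network class and $(X, Y)$ by the market state and the pathwise discounted cash-flows, but the minimization principle is the same.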

Neural network based regression algorithm
Since the specific details of the neural networks are transferable from Appendix A.1 and A.2, this subsection is less detailed. The main idea is to represent $\mathcal{D}(\mathbb{R}^a; \mathbb{R}^b)$ (measurable functions from $\mathbb{R}^a$ to $\mathbb{R}^b$, defined in (46)) by a parametrized neural network. A minimization problem of the form (47) can then be used as a loss function, which is minimized by adjusting a set of trainable parameters. However, in general we have no access to $\tau^V_n$ for $n < N$, where $N = |\mathcal{T}_\Pi(0)|$. On the other hand, we can use the exercise strategy from Subsection 3.1, i.e., approximate $\tau^V_n$ by $\tau^{\Theta^*}_n = (\tau^{\Theta^*}_{n,1}, \ldots, \tau^{\Theta^*}_{n,J})^{\mathrm{T}}$. We define, for $n \in \{1, 2, \ldots, N\}$, the neural networks $h^{\Phi^{\mathrm{IR}}_n}\colon \mathbb{R}^d \to \mathbb{R}^J$ and $h^{\Phi^{\mathrm{PR}}_n}\colon \mathbb{R}^d \times \{0,1\}^J \to \mathbb{R}$, which are parametrized by $\Phi^{\mathrm{IR}}_n \in \mathbb{R}^{u^{\mathrm{IR}}_n}$ and $\Phi^{\mathrm{PR}}_n \in \mathbb{R}^{u^{\mathrm{PR}}_n}$, where $u^{\mathrm{IR}}_n, u^{\mathrm{PR}}_n \in \mathbb{N}$ are the numbers of trainable parameters in each network. As loss functions, we use the empirical counterparts of (49) and (50), given in (53) and (54). "DOS" in DOS-IR and DOS-PR refers to the fact that the deep stopping strategy used to obtain $y(m)$ is generated by the DOS algorithm. "IR" and "PR" are abbreviations for "individual regression" and "portfolio regression", and refer to the regression at the level of each individual derivative and the regression at portfolio level, given in (53) and (54), respectively. The exact algorithm for computing pathwise CVA, in order to obtain ES-CVA, is not presented in detail here; however, it is straightforward to adjust (53) and (54) to approximate pathwise CVA instead.
The only important adjustment to the structure of the neural networks (details in Appendix A.1) is that we want the output to be unbounded, and we therefore use the identity as the scalar activation function in the output layers.

Combining Phase I and Phase II
Recall that the purpose of using regression at the level of each derivative was to be able to approximate the exposure of a portfolio of derivatives without a netting agreement. If we consider derivatives with non-negative values, the definitions of exposures with and without netting agreements coincide. Therefore, only derivatives with non-negative values are considered in this paper, to enable a comparison of regression at derivative level with regression at portfolio level. For $T_{n-1}, T_n \in \mathcal{T}_\Pi(0)$ and $t \in (T_{n-1}, T_n]$, we define the approximators in (55) and (56), where $\Phi^{\mathrm{IR}}_n$ and $\Phi^{\mathrm{PR}}_n$ are parameters optimized by minimizing (53) and (54), respectively, $\Theta^*$ are parameters optimized according to the training procedure described in Phase I, and $I_t \in \{0,1\}^J$ represents the exercise history of each derivative in the portfolio. Note that, even though $\Theta^*$ does not appear explicitly in the right-hand side of (55), it is crucial, since the cash-flow vector $y$ used in (53) is generated by the corresponding stopping strategy. The empirical Expected Shortfall is computed from the ordered pathwise portfolio values, where $i_\alpha$ is the index of the empirical $\alpha$-percentile of the vector $\big(\Pi^{\mathrm{DOS}\text{-}z}(t, x(1); \Phi^z_n, \Theta^*), \ldots, \Pi^{\mathrm{DOS}\text{-}z}(t, x(M); \Phi^z_n, \Theta^*)\big)$.
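The empirical $\alpha$-percentile index and the corresponding Expected Shortfall estimate can be sketched as below. The pathwise values are hypothetical lognormal draws standing in for the pathwise CVA produced by the approximators above.

```python
import numpy as np

rng = np.random.default_rng(4)
M, alpha = 10_000, 0.975

# Stand-in for the M pathwise CVA (or portfolio value) samples.
pathwise_cva = rng.lognormal(mean=0.0, sigma=0.5, size=M)

sorted_vals = np.sort(pathwise_cva)
i_alpha = int(np.ceil(alpha * M)) - 1       # index of the empirical alpha-percentile
var_alpha = sorted_vals[i_alpha]            # empirical Value-at-Risk at level alpha
es_alpha = sorted_vals[i_alpha:].mean()     # Expected Shortfall: mean of the tail beyond VaR
print(var_alpha, es_alpha)
```

By construction, the Expected Shortfall is never smaller than the corresponding percentile, since it averages only the tail beyond it.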

Numerical experiments
In the numerical experiments, we use a Geometric Brownian Motion (GBM) to model the asset processes and an intensity model for default events of the counterparty. To be able to incorporate WWR, the default intensity is linked to the market state of the asset processes. The default event is triggered by an exogenous component, independent of observable market information, while the intensity depends on the credit spread of the counterparty (observable from zero-coupon bonds) as well as a WWR parameter. Our model choices are not necessarily those used in practice, but they serve the purpose of being easy to analyse. In particular, the default model makes it straightforward to analyze the effects of the credit spread of the counterparty and the WWR parameter. It should be pointed out that the algorithms described in this paper are model-independent in the sense that they are fully data-driven. This means that as long as we can sample (or in any other way obtain) market data and default events, we can train the neural networks and perform the computations below. In addition, the algorithms have also been implemented for a portfolio of Bermudan swaptions with dynamics following the one-factor Hull-White model. The results are similar, and are therefore not included in this section.

Default model
Following [29], we model a default event of the counterparty via a random variable $E_1$, uniformly distributed on $[0, 1]$. Furthermore, the process $(\tilde S_t)_{t\in[0,T]}$ is the geometric average of the $d$ components of the dividend-free version of $S$. For simplicity, and without loss of generality, we assume from now on that there is no correlation between the components of the Brownian motions, i.e., $\rho_{ij} = 0$ for $i \neq j$. The geometric average $\tilde S$ is then a one-dimensional GBM, driven by a process $\tilde W = (\tilde W_t)_{t\in[0,T]}$ which can be checked to be a one-dimensional standard Brownian motion. With a suitable choice of parameters, we obtain the intensity in terms of $\tilde h$ and $b$. The economic interpretation of the above is that $\tilde h$ is the credit spread of the counterparty and $b$ controls the wrong-way risk (WWR).
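A hedged reading of this construction is sketched below. Purely for illustration, we assume an intensity of the form $\lambda_t = \max(\tilde h + b \log(\tilde S_t / \tilde S_0), 0)$, so that $\tilde h$ acts as a baseline credit spread and $b$ links the intensity to the market state; the paper's exact functional form is not reproduced here. Defaults are sampled by comparing the pathwise survival probability with the uniform variable $E_1$.

```python
import numpy as np

rng = np.random.default_rng(5)
M, n_steps, T = 10_000, 100, 2.0
dt = T / n_steps
r, q, sigma, s0 = 0.05, 0.1, 0.2, 100.0
h, b = 0.05, 0.1   # stand-ins for the credit spread and the WWR parameter

# The geometric average of d GBM components is itself a 1-d GBM; here we
# simply simulate a single 1-d GBM as a stand-in for S_tilde.
W = np.cumsum(np.sqrt(dt) * rng.normal(size=(M, n_steps)), axis=1)
tgrid = dt * np.arange(1, n_steps + 1)
S_geo = s0 * np.exp((r - q - 0.5 * sigma**2) * tgrid + sigma * W)

lam = np.maximum(h + b * np.log(S_geo / s0), 0.0)   # assumed state-dependent intensity
surv = np.exp(-np.cumsum(lam * dt, axis=1))         # pathwise survival probability
E1 = rng.uniform(size=M)                            # exogenous trigger, uniform on [0, 1]
defaulted = surv[:, -1] <= E1                       # default before T iff survival falls below E1
print("default fraction:", defaulted.mean())
```

Since the survival process is decreasing in $t$, a default occurs before $T$ exactly when the terminal survival probability has fallen below $E_1$, which is the comparison implemented above.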
For the definition of the jump-to-default process, $\mathbb{1}_D$, and further details, we refer to [29].

Experiments
Contract details: We consider a portfolio of J = 8 derivatives, depending on an asset process in d = 2 dimensions.

Dynamics details:
For $i \in \{1, 2\}$, we set $(s_0)_i = 100$, $q_i = 0.1$, $r = 0.05$, $\sigma_i = 0.2$, $\rho_{ii} = 1$ and $\rho_{12} = \rho_{21} = 0$.

Neural network details: We use $M_{\mathrm{train}} = M_{\mathrm{reg}} = M = 2^{20}$, and, for simplicity, the same structure and hyperparameter settings are used in all networks, i.e., all the networks in Phase I and Phase II. We use training batches of size 5000, 3 hidden layers, and 30 nodes in each hidden layer. Furthermore, the learning rate decreases step-wise, in equally sized steps every 100 training batches, from $10^{-2}$ to $10^{-6}$.
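The step-wise learning-rate schedule can be sketched as follows, assuming (since the text leaves this open) that "equally sized steps" means decade steps on a logarithmic scale; the number of levels then follows from the endpoints $10^{-2}$ and $10^{-6}$.

```python
import numpy as np

# Assumed decade steps: 1e-2, 1e-3, 1e-4, 1e-5, 1e-6.
rates = np.logspace(-2, -6, 5)

def learning_rate(batch):
    """Step-wise decaying rate: drops to the next level every 100 batches
    (0-indexed), staying at the lowest level thereafter."""
    return rates[min(batch // 100, len(rates) - 1)]

print(learning_rate(0), learning_rate(450))
```

A step-wise schedule like this is a common alternative to smooth exponential decay and is easy to reproduce exactly.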

Risk-free valuation
The purpose of the risk-free valuation is two-fold. Firstly, since the risk-free derivative values are used as input in the risky valuations, they need to be accurate. To ensure accuracy, the risk-free values are compared to a well-established existing valuation method, namely the Stochastic Grid Bundling Method (SGBM); see [19] for details. Secondly, it is, in its own right, an interesting and challenging problem to compute the value of a portfolio of complex derivatives with early-exercise features without having to perform one computation per derivative.
As mentioned above, we compare the algorithms introduced in earlier sections with the SGBM. We emphasize that the value of each derivative needs to be approximated individually when using SGBM, i.e., we perform 8 regressions, one for each derivative. In Table 3, we compare the value at $t = 0$ of each derivative, approximated with the portfolio version of the DOS algorithm and with the SGBM. For the DOS values, the neural network is trained five times and evaluated on new, independent samples, and the average values are reported. For the SGBM, the regression is performed five times for each derivative, and the average values of the direct estimators are reported. It should be mentioned that, for $j \in \{3, 4, 5, 6, 7, 8\}$, the SGBM values are biased high and, for all $j$, the DOS values are biased low.

Table 3: Valuation at the level of each derivative with the portfolio version of DOS, and the SGBM. The reference solution is computed by a binomial lattice model in [15].

In Figure 1, to the left, we compare EE, PFE$_{97.5}$ and PFE$_{2.5}$ according to (59) and (60). To the right, we compare the derivative-wise EE approximated with SGBM and with the DOS-IR-based algorithm. In the DOS-IR-based algorithm, the EE is approximated by evaluating (56) at $x(1), \ldots, x(M)$ and computing the component-wise sample mean. We conclude that, although very different in nature, the SGBM and the two methods presented in this paper agree on the values at time 0, on the tail distribution of the portfolio exposure over time (at least at the 97.5 and 2.5 percentiles), and on the average values of each derivative over time. We emphasize that we do not claim that one of the methods performs better than the others. For an analysis of the difference in performance between the methods (in the special case $J = 1$), we refer to [2].

Risky valuation
In this subsection, we focus on the impact of the exercise policy on CVA computations. To be more precise, we investigate to what extent the CVA is overestimated when using the risk-free exercise policy, with and without netting, for different levels of WWR and credit quality of the counterparty. The contracts described in Table 2 all have positive pay-offs, meaning that the corresponding derivative values are also positive. This eliminates the netting effect, and we therefore add a derivative to the portfolio which may take on negative values. The 9th derivative is a European-type future with pay-off $g_9(t, S_t) = \left(2 \times 80 - (S_t)_1\right)\mathbb{I}_{\{t=T\}}$. By the martingale property, its value is given by the discounted expectation of this pay-off. Furthermore, in the netted portfolio, we include an interest-rate-free collateral, $C = 35$. The collateral is such that, at a default, the bank's exposure is lowered by 35, which is added to the close-out amount. If no default occurs before maturity, the collateral plays no role.
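The effect of adding a contract with negative value, together with collateral, on the two exposure definitions can be illustrated with a toy computation; all values below are hypothetical and the collateral is applied only in the netted case, as in the setup above.

```python
import numpy as np

V_options = np.array([3.1, 0.0, 7.4, 2.2])  # hypothetical non-negative derivative values
V_future = -5.0                             # the added future can take negative values
C = 35.0                                    # collateral in the netted portfolio

# Without netting, each derivative's exposure is floored at zero individually,
# so the negative future provides no offset; with netting, values offset and
# the collateral C further lowers the bank's exposure.
exposure_no_netting = np.maximum(V_options, 0.0).sum() + max(V_future, 0.0)
exposure_netting = max(V_options.sum() + V_future - C, 0.0)
print(exposure_no_netting, exposure_netting)
```

The netted exposure is never larger than the un-netted one, which is why netting (and collateral) reduces the CVA in the tables that follow.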
In Table 4, the portfolio values with and without netting are displayed for different exercise policies and for different credit qualities and WWR parameters. We see that for low values of $\tilde h$ (credit spread), i.e., for high credit quality of the counterparty, the risk-free exercise policy is a relatively good approximation also for the risky portfolios. However, when the credit quality decreases, the importance of the correct exercise policy increases. This is also reflected in Tables 5 and 6, in which the corresponding CVA values are displayed.

Table 6: CVA for a portfolio with netting. The relative error in the column to the right represents, in percent, the overestimation of the CVA from using only the risk-free policy for early exercise. 'Rel. ov.est.' is short for 'relative overestimation'.

In practice, a counterparty with a bad credit quality is penalized twice: first by paying a larger CVA (justified), and second by paying a CVA which is more overestimated (not justified). The effect is even more significant when we look at expected shortfalls.
In Figure 2, we see that the ES-CVA, at the 97.5% level, is overestimated by between 19% and 67% for the portfolio without netting and by 27% to 103% for the portfolio with netting. Bear in mind that, for this particular choice of parameters, the overestimation of the CVA itself is only 7.4% and 11.4%, respectively. All values reported in this section on risky valuation are the results of a single experiment (no Monte Carlo averages are used). The reason for this is that we use many samples, and the variance of the results is therefore low. In addition, we have no reference values to compare with, and the main point is that we obtain different values depending on which exercise strategy is used, not high accuracy of the estimated quantities. The WWR parameter $b$ has a relatively small impact. This is not surprising, since a positive $b$ gives WWR for call options and right-way risk (RWR) for put options, and the other way around for a negative $b$. Since our portfolio consists of a combination of puts and calls, there is a significant off-setting effect.
In this appendix, we briefly explain the structure of the neural networks used in this paper. We present pseudo-code for the algorithm used to find the exercise strategy for the risk-free portfolio (the extensions to the risky portfolios are straightforward). Furthermore, we describe how the time-0 value of a risk-free portfolio is computed.

A.1 Specification of the neural networks used
For completeness, we introduce all the trainable parameters contained in each of $\theta_1, \theta_2, \ldots, \theta_{N-1}$, and present the structure of the networks.
In this section, the following notation is used:
• We denote the dimension of the input layers by $D_{\mathrm{input}} \in \mathbb{N}$, and we assume the same input dimension for all $n \in \{1, 2, \ldots, N-1\}$ networks. The input is assumed to be the market state $x^{\mathrm{train}}_n \in \mathbb{R}^d$, and hence $D_{\mathrm{input}} = d$. However, we can add additional information to the input that is mathematically redundant but helps the training, e.g., the immediate pay-off;
• For network $n \in \{1, 2, \ldots, N-1\}$, we denote the number of layers by $L_n \in \mathbb{N}$, and for layer $\ell \in \{1, 2, \ldots, L_n\}$, the number of nodes by $N_{\ell,n} \in \mathbb{N}$. Note that $N_{1,n} = D_{\mathrm{input}}$;
• For network $n \in \{1, 2, \ldots, N\}$ and layer $\ell \in \{2, 3, \ldots, L_n\}$, we denote the weight matrix, acting between layers $\ell-1$ and $\ell$, by $w_{n,\ell} \in \mathbb{R}^{N_{\ell-1,n} \times N_{\ell,n}}$, and the bias vector by $b_{n,\ell} \in \mathbb{R}^{N_{\ell,n}}$;
• For network $n \in \{1, 2, \ldots, N\}$ and layer $\ell \in \{2, 3, \ldots, L_n\}$, we denote the (scalar) activation function by $a_{n,\ell}\colon \mathbb{R} \to \mathbb{R}$ and the vector activation function by $\mathbf{a}_{n,\ell}\colon \mathbb{R}^{N_{\ell,n}} \to \mathbb{R}^{N_{\ell,n}}$, which, for $x = (x_1, x_2, \ldots, x_{N_{\ell,n}})$, is defined by applying $a_{n,\ell}$ component-wise, i.e., $\mathbf{a}_{n,\ell}(x) = \big(a_{n,\ell}(x_1), \ldots, a_{n,\ell}(x_{N_{\ell,n}})\big)^{\mathrm{T}}$;
• The output of the network should belong to $(0,1)^J \subset \mathbb{R}^J$, meaning that the dimension of the output, denoted by $D_{\mathrm{output}}$, should equal $J$. To enforce the output to only take on values in $(0,1)$, we restrict ourselves to scalar activation functions of the form $a_{n,L_n}\colon \mathbb{R} \to (0,1)$.
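The layer recursion in this notation can be transcribed directly as below. The layer sizes, the normal-weight initialization scale, and the choice of tanh for the hidden activations are illustrative assumptions; only the shapes and the bounded $(0,1)$ output activation are dictated by the description above.

```python
import numpy as np

rng = np.random.default_rng(7)

def sigmoid(x):
    """Output activation mapping R into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def build_network(layer_sizes):
    """Biases start at 0 and weights are i.i.d. normal, as in the training
    description; the scale 0.1 is an illustrative choice."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(params, x):
    """Layer recursion h -> a_{n,l}(h W_{n,l} + b_{n,l})."""
    h = np.atleast_2d(x)
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)     # hidden vector activation (assumed tanh)
    W, b = params[-1]
    return sigmoid(h @ W + b)      # output constrained to (0, 1)^J

params = build_network([2, 30, 30, 30, 8])  # D_input = d = 2, D_output = J = 8
out = forward(params, rng.normal(size=2))
```

Any bounded output activation would do here, since a bounded output can be scaled and shifted into $(0,1)$, as noted in A.2.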

A.2 Training and valuation
The main idea of the training and valuation procedure is to fit the parameters to some training data, and then use the fitted parameters to make informed decisions with respect to some unseen, valuation data independent of the training data. The training and valuation is described for the risk-free problem, but the procedure is similar for the risky portfolios.
Note that we use the continuous versions, $F^{\theta^*_n}_j$, in the optimization phase, and the discontinuous versions, $f^{\theta^*_n}_j$, when we update the cash-flows. The performance of the algorithm does not seem to be sensitive to the specific choice of the number of hidden layers, the number of nodes, the optimization algorithm, etc. Below is a list of the most relevant parameters/structural choices:
• Initialization of the trainable parameters, where a typical procedure is to initialize the biases to 0 and sample the weights independently from a normal distribution;
• The activation functions $a_{n,\ell}$, which add a non-linear structure to the neural networks.
In our case, we have the strict requirement that the activation function of the output layer maps $\mathbb{R}$ to $(0,1)$. This could, however, be relaxed as long as the activation function is bounded from above and below, since we can always scale and shift such an output to take on values only in $(0,1)$. For a discussion of different activation functions, see e.g., [26];
• The batch size, $B_n \in \{1, 2, \ldots, M_{\mathrm{train}}\}$, is the number of training samples used for each update of $\theta_n$, i.e., with $B_n = M_{\mathrm{train}}$, the loss function is of the form defined in step 1 above. If we want all batches to be of equal size, we need to choose $B_n$ to be a divisor of $M_{\mathrm{train}}$;
• For each update of $\theta_n$, we use an optimization algorithm, a common choice being the Adam optimizer, proposed in [27]. Depending on the choice of optimization algorithm, there are different algorithm-specific parameters to be chosen. One example is the so-called learning rate, which determines how much the parameter $\theta_n$ is adjusted after each batch.
Remark A.1. In the training for the risky portfolio with netting, the algorithm needs to be adjusted, since it depends on the exercise history. This cannot easily be included, since the algorithm is carried out backwards in time recursively, and therefore, at exercise date $T_n$, we do not yet know which derivatives have been exercised prior to $T_n$. This is resolved by, at each exercise date $T_n$, randomly assigning a state $\alpha_n(m) \in \{0,1\}^J$, representing $A_n$, for each sample $m$. Since the future cash-flows depend on $\alpha_n(m)$, they need to be recomputed for each $m$. This is done by iteratively evaluating the already approximated decision functions at $X$ for each exercise date greater than $T_n$, until all the derivatives are exercised or a default of the counterparty occurs. For sample $m$, i.e., $x^{\mathrm{val}}(m)$, we then obtain the stopping rule in (62), and the estimated portfolio value at $t = 0$ is given by (63). By construction, any stopping strategy is sub-optimal, implying that the estimate (63) is biased low. However, it should be pointed out that it is possible to derive a biased-high estimate of $\Pi_V(0, x_0)$ from a dual formulation of the optimal stopping problem, as described in [1]. In addition, numerical results in [1] show a tight interval between the biased-low and biased-high estimates for a wide range of problems.