Stochastic optimal well control in subsurface reservoirs using reinforcement learning

We present a case study of model-free reinforcement learning (RL) framework to solve stochastic optimal control for a predefined parameter uncertainty distribution and partially observable system. We focus on robust optimal well control problem which is a subject of intensive research activities in the field of subsurface reservoir management. For this problem, the system is partially observed since the data is only available at well locations. Furthermore, the model parameters are highly uncertain due to sparsity of available field data. In principle, RL algorithms are capable of learning optimal action policies -- a map from states to actions -- to maximize a numerical reward signal. In deep RL, this mapping from state to action is parameterized using a deep neural network. In the RL formulation of the robust optimal well control problem, the states are represented by saturation and pressure values at well locations while the actions represent the valve openings controlling the flow through wells. The numerical reward refers to the total sweep efficiency and the uncertain model parameter is the subsurface permeability field. The model parameter uncertainties are handled by introducing a domain randomisation scheme that exploits cluster analysis on its uncertainty distribution. We present numerical results using two state-of-the-art RL algorithms, proximal policy optimization (PPO) and advantage actor-critic (A2C), on two subsurface flow test cases representing two distinct uncertainty distributions of permeability field. The results were benchmarked against optimisation results obtained using differential evolution algorithm. Furthermore, we demonstrate the robustness of the proposed use of RL by evaluating the learned control policy on unseen samples drawn from the parameter uncertainty distribution that were not used during the training process.


Introduction
Optimal control problem involves finding controls for a dynamical system such that a certain objective function is optimized. Traditionally, most research for solving the optimal control problems for non-linear dynamical systems uses optimal control theory which, in principle, finds optimal control by deriving Pontryagin's maximum principle or by solving the Hamilton-Jacobi-Bellman equation. These classical strategies are offline and require a complete knowledge of system dynamics making them unsuitable for dynamical systems with uncertainties [1]. Recently, model predictive control (MPC) -a feedback control based on real-time optimisation -has attracted increasing attention in stochastic optimal control research [2,3,4]. Alternatively, optimal control problems could also be solved using reinforcement learning (RL) approaches. Koryakovskiy et al. [5] provide a benchmark study for comparison between model predictive control and model-free reinforcement learning approaches where the authors show that RL results are comparable to MPC. Further, after a certain break-even point model-free reinforcement learning is shown to perform better than nonlinear model predictive control with an inaccurate model. Furthermore, as opposed to MPC, RL provides an extra advantage of generality since it doesn't need complete knowledge of model dynamics. So far, research on application of RL in optimal control problems is advancing rapidly in fields like manufacturing [6], energy [7] and fluid dynamics [8]. However, research focused on developing RL strategy based on assessment of its robustness against the uncertainties is still an open area, especially for cases where model uncertainties has a substantial effect on the optimal control.
In this study, we utilize a model-free RL algorithm to solve simulation-based robust optimal control problem where the model information is assumed to be partially available. Robust optimal well control in petroleum reservoir management [9,10,11] forms a perfect candidate for such problem. In this problem, the reservoir well control variables are optimized in order to maximize the sweep efficiency of injection fluid throughout the reservoir life cycle. The assumption of partially available model information is based on the fact that the reservoir field data is generally only available at well locations. We consider reservoir permeability field as an uncertain model parameter. Results are demonstrated for two state of the art model-free RL algorithms: proximal policy optimisation [12] and advantage actor-critic [13]. Although we utilize robust well control in the current study, the presented techniques are general and is applicable to a wide range of simulation-based nonlinear optimal control problems.
We designed two test cases that underscore effect of parameter uncertainty on optimal controls. For ease of execution and demonstration, we represent the optimal well control problem with an advection equation for tracer flow through porous media in which uncertain parameter, permeability, is treated as a random variable. In the first test case, we use a permeability field distribution where the location and orientation of a linear high permeability channel is uncertain. The second test case uses a spatially correlated permeability field to represent smoother permeability field where the log-permeability field is modeled with a conditional Gaussian distribution with an exponential kernel. We employed proximal policy optimisation (PPO) and advantage actor-critic (A2C) algorithms to learn the optimal well control policy in order to maximize the sweep efficiency in the domain. The RL policies are learned using a selected number of realizations of uncertainty distribution and its robustness is evaluated by applying the learned policy on a set of unseen realizations during training, drawn from the same distribution.
The outline of the rest of this paper is as following: Section 2 provides the problem description and how RL algorithms are utilized to solve robust optimal well control. Information such as algorithms and methodologies used in this paper are also briefed in this section. Section 3 details the model parameters for test cases chosen for this study. Further, the approach used for problem formulation for RL algorithms is explained. Results for the given test cases are presented in section 4. Finally, section 5 concludes with the research study summary and few future research directions.

Methodology
The process of single-phase flow in porous media is of importance to a variety of engineers and scientists who are concerned with problems ranging from the financial aspects of oil movement in petroleum reservoirs to the social problems of groundwater flows in polluted aquifers [14]. We demonstrate the proposed techniques in the context of subsurface flow in subsurface reservoirs using an advection equation for tracer flow through porous media. In this dynamical system, the tracer enters the domain and pushes the non-traced fluid out of the domain. Flow in and out of the domain is defined as the source and sink terms in the advection equation. In the context of oil movement problem, the tracer corresponds to water and the nontraced fluid corresponds to oil (or hydrocarbons) in the reservoir while the source and sink locations correspond to injector and producer wells, respectively. Bear in mind that the oil flow problem can be more correctly modeled as a two-phase problem (water phase and oil phase). However for ease of execution and demonstration we chose single-phase flow problems where water injection is modeled as tracer flow that pushes the oil (non-traced fluid) in the reservoir, much like a contaminant in a fluid. Despite this approximation, the presented methodology is general enough and can be similarly applied to two-phase flow problems.

Problem description for robust optimal well control
We model the reservoir water injection process with an advection equation for tracer flow through porous media over the temporal domain T = [t 0 , T ] ⊂ R and spatial domain X ⊂ R 2 . The tracer flow models water flooding with the fractional variable s(x, t) ∈ [0, 1], where s(x, t) represents the fraction of water to oil at x ∈ X and t ∈ T . The well controls a(x , t) represent the source and sink terms in this equation, where x ∈ X (where X ⊂ X ) correspond to set of well locations. Injector and producer flow controls are represented with a + (which constitutes of all the source locations in the domain and formulated as max(0, a)) and a − (which constitutes of all the sink locations in the domain and formulated as min(0, a)), respectively. The optimal controls a * (x , t) are the solution of following closed-loop optimisation problem: where, the objective function defined in Eq. (1a) represents the total oil flow out of the reservoir (oil production) and is maximized over the finite time interval T . The intigrand in this function is referred as Lagrangian term in control theory and is often denoted by L(s, a). The water flow trajectory s(x, t) is governed by an advection equation (Eq. (1b)) which is solved in couple with pressure equation −∇·(k/µ)∇p = a, where p(x, t) ∈ R is pressure. Porosity φ(x, ·), permeability k(x, ·) and viscosity µ(x, ·) are the model parameters. Permeability k represents the model uncertainty and is treated as a random variable that follows a known probability density function K with K as its domain. The initial and no flow boundary condition is defined in Eq. (1c), where n denotes outward normal vector from the boundary of X . Note that, the velocity v follows Darcy's law: v = −(k/µ)∇p. Constraint defined in Eq. (1d) represents the fluid incompressibility assumption along with the fixed total source/sink term c which represents water injection rate in the reservoir. The controls a(x , t) are discretized on time interval t 0 < t 1 < · · · < t m < t m+1 = T .

RL framework for robust optimal well control
We propose to use model-free reinforcement learning algorithms to solve the optimal control problem defined in Eq. (1). A common approach in RL is to model the task as a Markov decision process. The process is defined as a quadruple S, A, P, R , where S ⊂ R ns is set of all possible states with dimension n s , A ⊂ R na is a set of all possible actions with dimension n a . State is represented with the tracer variable s(x, t m ) and action with controls a(x , t m ) such that it follows the constraint defined in Eq. (1d). P : S × A → S is a transition function that follows Markov property. That is, it returns s(x, t m+1 ) as a function of control a(x , t m ) and state s(x, t m ). Such transition function can be obtained by discretizing Eq. (1b) which returns s(x, t m+1 ) by executing the controls a(x , t m ) when in the state s(x, t m ). The reward function R : S × A × S → R returns a real valued reward, r(t m+1 ) = R(s(·, t m ), a(·, t m ), s(·, t m+1 )) for a particular transition between the states. The reward function for the problem (Eq. (1)) is obtained by discretizing the objective function (Eq. (1a)) into control steps such that, The control policy, π : S → A, defines an action a(x , t m ) given the current state s(x, t m ) (also written as π(a|s)).
The goal of reinforcement learning is to find an optimal policy π * (a|s) that maximizes expected discounted return, G = M m=1 γ m−1 r(t m ), where immediate rewards r are exponentially decayed by the discount rate γ ∈ [0, 1] and M is the final control time step. Essentially, RL algorithms attempt to learn the optimal policy π * (a|s) from an initial policy, π(a|s), by exploring state-action space with what is referred to as agent-environment interactions. Figure 1 shows a typical schematic of such agent-environment interaction. The term agent refers to the controller that follows the policy π(a|s) while the environment consists of transition function, P, and reward function, R. The optimum solution can be obtained by following the policy π * (a|s) throughout the complete control trajectory (also referred as an episode in RL literature) as shown in figure 2. Thus, optimal result (which refers to optimum oil recovery for optimal well control problem), R π * (a|s) , is obtained by adding the reward r(·) at each time-step of such episode: Note that the optimal result, which is used in the rest of the paper to evaluate the policy, is not exponentially decayed with the discount factor γ. The policy π * (a|s) is said to be robust if it is able to achieve optimal results for a stochastic environment controlled by the uncertain parameter k, defining the permeability in the Darcy flow equation.
If we treat the parameter k as a deterministic fixed value, policy learning is fairly straightforward. The policy learned in such a way is termed as frozen policy (a term π * (a|s) a 0 π * (a|s) a 1 π * (a|s) a T Figure 2: optimal controls for complete control trajectory which refers to an episode in RL algorithms. state s(x, t m ) and action a(x , t m ) are denoted with shorthand notations, s m and a m , respectively used by [5]). In this study we aim to find a robust policy that accounts for the variability in k. To the best of our knowledge, so far, such policy is learned by simply incurring samples from the distribution K, in each episode of the learning process (also known as static domain randomisation method [15]). Such policy can be robust enough if the samples used in the learning process cover most of its domain K (In other words, when K is well explored). For this reason, robust optimal policy learning naturally requires higher number of agent-environment interactions as compared to that in learning frozen policy. This could be problematic if each agent-environment interaction (solving the governing Equations (1b), for instance) is computationally intensive, which is common in most simulation-based optimal control problems like optimal well control. Furthermore, samples incurred in this learning process often tend to be from the high probability region of the distribution domain causing the policy to be biased towards them. robustness of frozen policy: We denote frozen policy learned by keeping the parameter k fixed as π * (a|s; k). Lets define a distance function D : Figure 3: a wells spread choice of samples for some uncertainty distribution K (depicted as onedimensional for illustration purpose) to learn the robust optimal policy π * (a|s; k). Members of k are colored in white while the members of k are shown in grey.
returns the distance between k and k as D(k, k ). Naturally, the frozen policy can be applied to the simulation (or environment) that uses k as the parameter instead of k (denoted as π * (a|s; k ⇒ k )). Due to continuous nature of governing Equations (1b), effectiveness of the policy, π * (a|s; k ⇒ k ), is inversely related to the distance D(k, k ). We can define an acceptable near-optimal solution limit obtained with π * (a|s; k ⇒ k ) when k is in the neighborhood (δ) of k. In other words, we can say the policy π * (a|s; k ⇒ k ) can be considered robust when D(k, k ) < δ. This argument can be extended for multiple parameters: k 1 , k 2 , · · · , k l . In this case, we learn the policy, π * (a|s; k), by randomly choosing any one of the parameter value from the vector k = (k 1 , k 2 , · · · , k l ) at every episode in the learning process. Subsequently, the policy, π * (a|s; k ⇒ k ), can yield a near optimal value for the vector k = (k 1 , k 2 , · · · , k l ), given that min i D(k i , k j ) < δ, ∀j ∈ {1, 2, · · · , l}. If the vector k is chosen such that the union of neighbourhood of all its values k i , ∀i ∈ {1, 2, · · · , l}, cover the domain K, the policy π * (a|s; k) yields at least near optimal solution for any sample k ∼ K. Figure 3 shows an example of such choice of k values for the uncertainty distribution K. For such a well spread choice of k, the policy π * (a|s; k) can be said to be robust under the uncertainty distribution K.
Implementation: Although the above-mentioned policy π * (a|s) provides optimal solution for the problem defined in Equation (1), it is not applicable for systems with partially observable state space. For instance, in optimal well control problems reservoir information is generally only available at well locations. As a result we provide the agent with the available observation o(x , t m ) as its state. For this study, observation o(x , t m ) is represented with a set of saturation and pressure values at well locations x and time t m . Note that, with such representation of states, we break the underlying assumption of Markov property of the transition function. Here, we assume that transition between the observations approximately follow the Markov property.
We choose l well spread samples of uncertainty distribution to learn a robust policy. This is achieved with a clustering analysis (using k-means clustering method) of the domain K. The training vector k is constructed with samples of K which are located at the cluster centers. The policy, π * (a|s; k), is learned by randomly selecting the parameter k from the training vector k = {k 1 , k 2 , · · · k l }. Average training return is calculated by averaging the returns of policy π(a|s; k) on l simulations, each with a separate parameter k from k.
The robustness of this policy is assessed by applying it on l simulations, each with a new unseen sample, k ∼ K, as its parameter. The samples for evaluation are chosen randomly from each cluster. The evaluation vector k = {k 1 , k 2 , · · · k l } represents the set of such samples. Robustness of the policy π(a|s; k) is evaluated by monitoring the average evaluation return R π(a|s;k⇒k ) : In a nutshell, we explect to train the policy π(a|s; k) such that R π(a|s;k⇒k ) is maximized.
We employ PPO and A2C algorithms to solve this problem. The RL algorithm parameters are tuned in order to maximize the average returns, R π(a|s;k⇒k) . As the values of training vector, k, fairly represent the variety of the domain K, we expect convergence of average evaluation return value towards R π * (a|s;k⇒k ) by the end of the learning process. The baseline optimal results are computed using differential evolution (DE) algorithm [16]. DE algorithm is used to solve l optimisation problems as described in Equations (1) each with the parameter k replaced by a fixed value from evaluation vector k . The reference value for R π * (a|s;k⇒k ) is taken as average of these l solved optimal objective functions. Both the RL algorithms are tuned such that R π(a|s;k⇒k ) converges in the range of this reference value.

RL algorithms
We choose model-free RL algorithms to avoid any effect of model learning on policy learning. Two state-of-the-art policy-based algorithms (A2C and PPO) are used to solve the optimal control problem under consideration.

Advantage actor-critic algorithm
A2C [13] is a policy gradient algorithm that models the stochastic policy, π θ (s|a), with a neural network. Essentially, the network parameters θ are obtained by opti-mizing for the objective function, where,Â t is an estimator for advantage function at timestep t and the term,Ê t [· · · ], is empirical average over finite batch of samples collected through agent-environment interactions. Gradient estimator of policy network, ∇ θ J(θ), is obtained by differentiating Eq. (6) which is done with automatic differentiation algorithm. The advantage function estimator,Â t , is computed using generalized advantage estimator [17] which is derived from the value function V t . The value function estimatorV t is learned through a separate neural network termed as the critic network. Definitions of advantage and value functions are described in Appendix A. Algorithm 1 illustrates a broad outline for implementation of A2C algorithm in this study. In order to reduce computational time, the iterative data sampling for objective function is performed in parallel on N processors followed by a synchronous gradient update. Note that in every policy iteration in total T control steps are run where environment is reset with a new permeability sample from the k after every terminal step of the episode.
Algorithm 1 Policy Robustness Evaluation using A2C 1: Input: Number of actors N , and number of steps in each policy iteration T 2: Obtain training vector k, and evaluation vector k using cluster analysis of the predefined uncertainty distribution K 3: for iteration = 1, 2, . . . do 4: for actor = 1, 2, . . . , N do

5:
Run policy π θ in environment for T time steps (which corresponds to in total E episodes where the environment permeability is set to a sample from training vector k, at the beginning of every episode )

Proximal policy optimisation algorithm
If the gradient step in A2C is too large, the policy may astray which in turn will produce bad samples causing divergence in the solution. As a result, we have to select very small step size which slows the learning process. Schulman et al. [12] introduced PPO algorithm that make sure the gradient steps are small enough to make the algorithm data efficient. This is done by formulating the network objective function in terms of a ratio of two policies (old and new) using principle of importance sampling. Appropriate gradient steps are chosen by clipping the ratio of old and new policy within the range [1 − , 1 + ], where is generally a small fractional number. The modified objective function for policy network is defined as, where r t (θ) = π θ (a t |s t )/π θ old (a t |s t ) and π θ old (a t |s t ) is old policy. Algorithm 2 illustrates the implementation of PPO algorithm in the presented methodology. Run policy π θ in environment for T time steps (which corresponds to in total E episodes where the environment permeability is set to a sample from training vector k, at the beginning of every episode )

Differential evolution algorithm
We employ DE algorithm [16] as a baseline for assessing the results obtained using PPO and A2C. Essentially, DE is a population based stochastic optimisation algorithm which employs evolutionary ideas like crossover and mutation to find optimal arguments. In order to solve the optimisation problem (Eq. (1)) using DE, we represent the argument with a set of controls a = {a(x , t 0 ), · · · , a(x , T )} such that it follows the constraints defined in Eq. (1d). The fitness of such argument is computed with the objective function (Eq. (1a)) by solving the governing Eq. (1b). DE algorithm initiates its argument search with a set A of random arguments which is referred as population (i.e. A = {a 1 , · · · , a p }). Using a crossover criteria, certain arguments (say, i th argument in A: a i ) are evolved as , where a i is updated value for a i , a * is the best argument (i.e. the one corresponding to maximum fitness) in the population so far, F ∈ [0, 2] is mutation parameter, a r1 and a r2 are randomly selected arguments from the population. a i is replaced with a i if the fitness of a i is higher than that for a i . The optimum solution is obtained by repeating such evolution for a number of iterations or until a certain convergence criteria is met.
Note that DE algorithm's parameter search space is wider than that for RL. This is because the optimum parameters do not have to follow a mapping π(a|s). For this reason, we expect DE algorithm to achieve more optimal controls as compared to RL algorithms due to its potential to achieve global optima.

K-means clustering
We employ connectivity distance [18] measure in order to represent the variation in dynamical response of permeability samples. The connectivity distance matrix D ∈ R N ×N for a large number (N ) of samples of K is written by, where, x are a set of spatial locations in the domain X and s(x , t; k i ) refers to solution of governing Eq. (1b) with the uncertainty parameter k i when all control wells are kept equally open. In order to be able to visualize the connectivity dissimilarity among samples of K, we employ multi-dimensional scaling on the distance matrix D to obtain a set of N two-dimensional coordinates represented with d 1 , d 2 , · · · , d N .
In other words, coordinates d 1 , d 2 , · · · , d N correspond to samples k 1 , k 2 , · · · , k N of K such that it represents connectivity distance measure defined in Eq. (8) among its values. The coordinates d 1 , d 2 , · · · , d N are divided in l sets S 1 , S 2 , · · · , S l obtained by solving the optimisation problem: where µ S i is average of all coordinates in the set S i . The training vector k is a set of l samples of K where each of its value k i correspond to the one nearest to µ S i .

Numerical experiments
We evaluate the effectiveness of RL in solving robust optimal well control problems using two test cases representing two distinct permeability uncertainty distributions. Numerical solutions of the governing equations are obtained by using finite volume discretization. The pressure equation is discretized using two point flux approximation method while the saturation equation is discretized using implicit upwind scheme. Readers are referred to [19] for more details on numerical methodology. For both cases, the values for model parameters emulate those in the benchmark reservoir simulation case, SPE-10 model 2 [20].

model parameters
Reservoir simulation parameters for both the cases, corresponding to Eq. (1), are delineated in table 1. As per the convention in geostatistics, we assume that the distribution of log (k) is known and is denoted by G. As a result, we treat g = log(k) as a random variable in the problem description defined in Eq. (1). Uncertainty distributions for test cases 1 and 2 are denoted with G 1 and G 2 , respectively.  Figure 4a shows schematic of the first test case (inspired from the case study done by [11]). A total number  of 31 injectors are placed on the left edge of the domain while an equal number of producers are placed symmetrically on the right side. A linear high permeability channel connects from left to right side of the domain. The channel location is parameterized with its left and right distance (l 1 and l 2 ) from the top and channel width w. These parameters follow uniform distributions defined as, w ∼ U (120, 360), l 1 ∼ U (0, L − w) and l 2 ∼ U (0, L − w), where L is domain length. log permeability g is sampled from the uncertainty distribution G 1 : To be specific, log permeability g at a location (x, y) is formulated as: where x and y are horizontal and vertical distances from the upper left corner of the domain illustrated in figure 4a. The values for g high and g low (5.5 and -2, respectively) are inspired from Upperness permeability distribution specified in SPE-10 model 2 case (refer to Appendix B for details). Figure 6a illustrate various samples drawn from the distribution G 1 . Test case 2 (Spatially correlated smooth permeability distribution): We use test case 2 to represent uncertainty distribution of a smoother permeability field. Figure  4b illustrates reservoir domain for this case. It comprises of four producers located at four corners of the domain and an injector located at the center of the domain. The permeability distribution for this case is considered as a log normal distribution which is constrained with fixed values at well locations. As a result, log permeability g is sampled from the normal distribution G 2 : where, C(x, x ) is the co-variance matrix between unconstrained domain locations x and constrained locations x while µ g correspond to the constrained log-permeability value at the well locations. The co-variance matrix is calculated using an exponential kernel as: We choose correlation length l as 240 ft (20% of domain length) and the variance amplitude σ as 2.5. The values µ g , σ and l were chosen to fit permeability distribution in the same range of that in Tarbert case specified in SPE-10 model 2 case (refer Appendix B). Examples of samples drawn from the distribution G 2 are shown in figure 6b.

RL problem formulation:
Both, PPO and A2C, algorithms attempt to learn neural network parameters θ to learn the policy π θ (a|s). We choose five step episode which is obtained by dividing the temporal domain T into five control steps. The optimisation potential of the problem can be improved if we choose a higher than 5 control steps. However, we choose 5 control steps for the ease of execution and demonstration. The episode steps are denoted with t m where m ∈ {1, 2, · · · , 5}. State is represented by observation o(x , t m ) which is a vector of saturation and pressure values at all well locations. However, since the saturation at injectors is always constant (one), we don't include it in the observation. As a result, the observation vector is of the size 2n p +n i (i.e. 93 for test case 1 and 9 for test case 2) which forms the input of the policy network π θ (a|s). Action a(x , t m ) is represented with a vector the size of number of control wells n p +n i (i.e. 62 for test case 1 and 5 for test case 2). In order to maintain constraint defined in Eq. (1d), we represent the action vector with weights, w i s, such that 0.001 ≤ w i ≤ 1 ( i.e. action vector is written as, a(x , t m ) = (w 1 , · · · , w np , w np+1 , · · · , w np+n i )). Using these weights, flow through ith producer, a − (x i , ·), is computed as, Similarly, flow through ith injector, a + (x i , ·), is written as, The reward function defined in Eq. (2) is normalized by dividing it with total pore volume (φ × lx × ly) in order to obtain the reward in the range [0,1]. The normalized reward represents recovery factor or sweep efficiency for oil movement problem in petroleum reservoir. We use clustering strategy explained in section 2.5 where we choose total number of samples, N , and clusters, l, to be 1000 and 16 for both uncertainty distributions, G 1 and G 2 . Training vector k is obtained with samples k 1 , · · · , k 16 each corresponding to a cluster center. Figure 5a and 5b show cluster plots for samples drawn from G 1 and G 2 permeability distributions, respectively. Permeability samples for test case 1 are distributed in the shape of an acute angle where samples in the vertex region correspond to high permeability channel at the central region, samples on the left arm correspond to high permeability channel in the upper region while samples in the right arm correspond to high permeability channel in the lower region of the domain. For test case 2, samples with more or less axisymmetric high permeability region are located in the central area in figure 5b (e.g., cluster 1). The samples corresponding to eccentric high permeability regions are located outside as shown with examples k 13 (lower left region), k 3 (upper left region), k 15 (upper right region) and k 0 (lower right region) in figure 5b. In order to represent well spread domain of G 1 and G 2 distributions, the 16 samples, each randomly chosen from a cluster, forms the evaluation vector k . These evaluation samples are shown in figure 6a and 6b for test cases 1 and 2, respectively. Figure 7 outlines the general schematics for agent-environment interactions in PPO and A2C algorithms for robust optimal control problem. Since these algorithms are stochastic in nature, we provide training and evaluation returns as a mean corresponding to three distinct seed values. These results are benchmarked against DE algorithm optimisation results. Parameters used for all algorithms along with the confidence range of learning plots are presented in Appendix C.

Results and discussion
We refer to the control policy in which all wells are equally open as the base policy. When the first test case is operated with base policy, most of the water flooding take place in the high permeability channel causing poor sweep efficiency in the low permeability region. Naturally, the optimal policy is to restrict the flow through wells which are in the region nearby high permeability channel. Using DE algorithm, we obtain 16 optimized solutions for optimal control problem defined in Equations (1) where each value in evaluation vector k is treated as fixed permeability k. These results act as a reference to optimal solutions obtained using PPO and A2C algorithms. Reference value for mean evaluation return R π * (a|s;k⇒k ) is obtained by averaging DE results of these 16 problems. Note that DE algorithm is not a suitable  : RL algorithm agent-environment interaction schematics to learn robust optimal well control policy method to solve the robust optimal control problem since it can provide optimal controls only for certain permeability samples as opposed to PPO or A2C algorithms where we try to learn the policy that is applicable to all samples of permeability distribution. However, DE results are used as the reference since they provide the upper bounds achieved by direct optimization on sample by sample basis. Figure 8 shows plots for training (R π(a|s;k⇒k) ) and evaluation (R π(a|s;k⇒k ) ) returns versus total number of episodes for PPO and A2C learning process. As can be seen, PPO and A2C algorithms successfully learned the robust optimal policy and their average evaluation returns R π * (a|s;k⇒k ) are within the range of DE results. We also present results for a frozen PPO policy trained using a fixed permeability located at index 1 in the training vector k (indicated with dotted green line in figure 8). We note that the training vector k for frozen PPO case only comprises of a single permeability realization. This frozen policy is not robust as it performs poorly on unseen permeabilities as demonstrated when we plot R π(a|s;k⇒k ) value in its learning process. Furthermore, learning plot for PPO algorithm with full state representation is illustrated with red dotted line. In this case, we provide the agent with saturation values at each grid point in the domain (i.e. with a vector of length 61 × 61 = 3721). Policy learning with full state information are in the same range of that with only well observation state representation. In other words, information of well observations is enough to form optimal policy for this case. Figure 9 plots optimum recovery factors (in %) corresponding to each evaluation permeability in the vector k . Results of PPO and A2C policy are comparable to the DE results which are independently optimized for each permeability field. Figure  10 illustrates the optimum controls corresponding to evaluation permeability at third and fifth indices of k . The controls for injectors and producers are shown in blue and red colored circles, respectively. Note that the radius of the circle at certain well location is proportional to flow through that well. That is, the radius of the circle at certain well location is proportional to the flow control opening of the corresponding well. As can be seen in figure 10, the optimal control policy to restrict flow controls in the high permeability region is successfully learned using PPO and A2C.
In the second test case, all reservoir parameters and well locations are axisymmetric except the permeability field. As a result, there is an imbalance in the flow direction from the central injector. The optimal flow control policy is to govern the well controls so that balanced amount of sweeping can be maintained in all four quadrants of the reservoir. For instance, if water sweeps uniformly in all quadrants of reservoir except the upper left, the optimal policy should increase the flow through upper left producer while restricting the flow through rest of the producers (i.e. govern the controls to cope with the imbalance in the upper left quadrant). For the five-spot case under consideration, the optimal policy has in total 10 modes: four due to imbalance in single quadrant and six due to imbalance in a pair of quadrants.     Figure 11: Test case 2: monitoring plots for average training return R π(a|s;k⇒k) (on left) and evaluation return R π(a|s;k⇒k ) (on right) for learning process in PPO and A2C. The evaluation return value is compared with the optimisation results obtained using DE confusion in RL policy learning: Policy π θ (a|s) is learned through numerous agentenvironment interactions experienced with 16 permeability field instances in training vector k. By definition, the optimal policy returns the action a that corresponds to maximum return episode from current state s. Since the first state s 0 is same for all permeability fields (according to initial condition defined in Eq. (1c)), the first action π θ (a 0 |s 0 ) is always the one that correspond to a certain permeability field in training vector k which produces maximum total return. So if we imagine k 5 to be such a permeability which happens to follow one of the ten optimal policy modes (say mode 7). The first action a 1 will always be the one that correspond to mode 7 policy. This is obviously undesirable when we apply this policy on permeability fields which correspond to another mode of optimal policy. In order to avoid this confusion in policy learning, we train RL policies from second step onward. Since the second step of the episode is different for different permeability fields, RL policy learning doesn't face this confusion anymore. By default we treat the first action to follow base policy (i.e. all wells open equally). Note that the first action for test case 1 RL optimal policies is also identical for all cases (refer to figure 10). However, since the optimal policy's nature is not modal, the resulting sub-optimality is not as much prominent for this case. Figure 11 shows learning plots for training and evaluation returns for PPO, A2C and DE algorithms. Similar to test case 1 results, PPO and A2C algorithms success-  fully learn robust optimal policy. RL policies learned with well observations show results in the same range with the PPO policy with full state representation. As expected, the frozen policy's lack of robustness can be seen in evaluation return learning plot. PPO, A2C and DE optimisation results for evaluation permeability fields in k are compared individually in figure 12. RL policies successfully capture the optimal policy behaviour as experienced with DE results. Figure 13 illustrate control policies for evaluation permeability fields at ninth and twelfth indices of k . For instance, optimal policy for k 9 which refers to increasing flow through producers in the lower region of the domain is clearly observed in its RL policies. Computational complexity is the major limitation of using RL in simulation based control problem. For all three optimisation methods (PPO, A2C and DE), we employ multiprocessing to reduce total computational time. RL algorithms use multiple CPUs to run episodes in parallel while GPUs are used for neural network back propagation computations. Depending on resource availability and parameter tuning, different number of CPUs and GPUs are used on case basis (refer Appendix C for number of CPU used in each case). Furthermore, parallelism behaviour is also varied between RL and DE algorithms. In RL algorithms, simulations are run in parallel just as in the case of DE. However, in RL algorithms, neural networks are backpropagated synchronously at the end of each iteration which causes extra computational time in waiting and data distribution. In order to compare computational efforts irrespective of computational resources and parallelism behaviours, it is therefore, fair to compare number of simulation runs which is a major source of computational cost. For both cases, RL algorithms were let run for 60000 episodes which correspond to 60000 simulation runs and each such algorithm was run for three seed values (In total 3 × 60000 = 180000 simulation runs). For the first test case, DE algorithms consisted of 750 generations each comprising 305 population size and since DE algorithm was used for all 9 samples in k , the total number of simulation runs is 2058750 (750 × 305 × 9). Similarly, for the second test case, DE comprised of total 135000 simulation runs (750 generations × 20 population size × 9 samples). Parameters and computational resources used for all algorithms are delineated in Appendix C.

Conclusions
We present a case study for using model-free RL algorithms to obtain robust optimal control policy for optimal well control problems. This policy is learned under the assumption that the system is partially observable and is governed by a system of nonlinear partial differential equations. The robustness of these policies were obtained using a domain randomisation scheme that uses only few samples from a predefined uncertainty distribution by utilizing cluster analysis. Further, the optimality of these policies were successfully benchmarked against reference solutions obtained by direct optimisation using DE algorithm. We consider the current framework as a first attempt towards application of narrow AI to the field of subsurface flow control where data is only available at the well locations.
In the current study we made the following key assumptions: • the optimal control problem is formulated in the form of Eq. (1) which comprises of an objective function (Eq. (1a)), a governing equation (Eq. (1b)), initial/boundary conditions (Eq. (1c)) and constraints (Eq. (1d)), • parameter uncertainty distribution is predefined, • transition between the partial observations of the system approximately follows the Markov property, • effectiveness of the optimal policy π * (a|s; k ⇒ k ), is inversely related to the distance D(k, k ).
As a result similar techniques to those presented here could be applied to other simulation-based applications as long as these assumptions are met.
In the current study, we train RL policies with a large number of simulation runs. This can be computationally intractable for large scale models with long simulations run times. In future studies, we aim to address this issue by utilizing fast surrogate modeling techniques to accelerate the reinforcement learning process.

Appendix A. Definition of value and advantage functions
In RL, the policy π(a|s) is said to be optimal if it maps the state s t with an action a t that correspond to maximum expected return value. These return values are learned through the data obtained in agent-environment interactions. Following are some definition of return values typically used in RL: Value function is the expected future return for a particular state s t and is defined as, where E π [· · · ] denotes expected value given that the agent follows the policy π. As short hand, we denote V (s) at state s t as V t .
Q function is similar to value function except that it represent the expected return when the agent takes action a t in the state s t . It is defined as, Advantage function is defined as the difference between Q function and value function and is denoted by A t at state s t and action a t .

Appendix B. Permeability uncertainty distribution parameters
Model parameters of case studies chosen for this paper were inspired from the SPE 10 model 2 parameters. First test case represents channel like permeability distribution which consists of a linear high permeability channel passing through a low permeability domain. The values of high and low permeabilities in this distribution were selected in reference to the Upperness formation in SPE10 model 2. Figure B.14 shows the plot of Upperness log permeability distribution and an example of log permeability field in test case 1. As can be seen, the high and low log permeability values in test case 1 are chosen from the peaks of Upperness log permeability distribution.
Second test case represents a smoother, spatially correlated permeability distribution often found in geoscience literature. The log permeability distribution is formulated with Equations (9), (10), (11) and (12). The correlation length l was selected to be 240 ft in order to consider 20% of the domain height. The idea is to choose a correlation length that is smaller than the quadrants of a domain. The distribution amplitude σ and µ g is chosen as 2.5 and 2.4, respectively. These values were chosen in order to match the log permeability distribution of test case 2 to match with Tarbert formation from SPE10 model 2. Figure B.15 shows comparison of a test case 2 log permeability field realization along with the superpositioning of test case 2 and Tarbert case permeability distribution.

Appendix C. RL algorithm parameters
We use stable-baselines 3 library [21] for PPO and A2C algorithms. Parameters used for PPO and A2C are tabulated in table C.3 and C.4, respectively, which were tuned using trial and error. The parameters used to obtain the frozen policy using PPO algorithms are same as those used in PPO parameters presented in these table. The DE algorithm is executed using python's SciPy library [22]. Its parameters are delineated in table C.5. For PPO algorithms with full state representation, same parameters (from table C.3) were used except the network MLP layers and learning rates: layers [3721, 4000, 2000, 800, 300, 61] and learning rate 1e-5 for test case 1 and layers [3721, 4000, 2000, 800, 300, 4] and learning rate 5e-6 for test case 2. These parameters are tuned in order to obtain the minimum variance in the learning plots. Figure C.16 demonstrate the spread of the learning plot for PPO and A2C for the text case 1 and 2. The code repository for both the test cases presented in this paper can be found on the link: https://github.com/atishdixit16/rl_robust_owc.