A divided and prioritized experience replay approach for streaming regression

Graphical abstract


Introduction
In supervised learning, generalizability is a primary focus during the development of a model. During training, the labels are available, and are used to supervise the learning of a function that maps from inputs to an output. The goal is for the model to learn a function that gives accurate predictions for unseen observations, for which labels are not available. This is the essence of generalization, and relies on the assumption that the unseen observations come from the same distribution as the training observations. However, this assumption fails for many real-word problems. Shifts in independent or dependent variables, an evolving underlying process, and dependence on variables not included in the model are all examples of nonstationarity, which is harmful to the predictive power of such models [1,2] . Nonstationarity occurs to some extent in most real world data sets, and has been a motivating factor for the paradigm of streaming learning, where a model is presented with a stream of data on which to continuously learn from in an online fashion. This allows adaptation to changing data distribution, although the approach is prone to catastrophic forgetting [3] . In the setting of streaming learning, the goal is to leverage newly available data to adapt to changing environments while still performing well on previous observations [4] . These two objectives might be conflicting, giving rise to the stability-plasticity dilemma [5] , asking how one can stay stable to irrelevant events, while plastic to new information. This problem has been addressed in several ways, perhaps most commonly by use of experience replay, where new observations from the data stream are mixed with older observations as they become available [6] .
Streaming data can occur in many situations, and several problems can be solved by learning from these streams. Classification with dynamic selection of appropriate window [7] , and classification using a streaming random forest [8] have been investigated. Clustering of data streams [9] , multitask learning with a global loss function [10] , and multiple output linear regression [11] are other examples. The challenges of streaming learning have been discussed in many earlier publications, with emphasis on managing catastrophic forgetting. Reducing the risk of catastrophic forgetting has been approached in several ways, among them regularization, Kirkpatrick et al. [12] , Li and Hoiem [13] , where weight updates are constrained so previously learned relationships are not erased, and ensembling methods, where multiple models are trained, and their outputs are combined with some form of majority voting [2,14,15] . Various experience replay configurations have also been developed for this purpose, among them stream clustering methods to retain valuable information [16] , and prioritization of samples to train on [17] , where the former has with benefit been applied to streaming classification, and the latter to reinforcement learning. These differ from the aforementioned methods, as they focus on which observations to retain and train on rather than on the model itself.
In the current work, a divided and prioritized experience replay approach for streaming regression is presented to mitigate the effects of catastrophic forgetting while allowing adaptation to nonstationarity. We adopt both the philosophy of retaining relevant knowledge in the replay, along with that of prioritization. A deep neural network (DNN) is trained and validated on historical data. This model serves as a baseline model which is deployed for streaming learning during operation, using this approach. To demonstrate its effect, the method is applied to a real-world dataset, and benchmarked against a standard sliding window approach. The method was first presented in Arnøet al. [18] as a case study. In the current work, however, a thorough comparison to the standard sliding window approach for streaming regression is given, with focus on rare, important events, and discussions of areas where the standard sliding window fails.
The paper is structured as follows: Section 2 presents the methods used in this work, and Section 3 presents our results in a case study, comparing our method with a standard sliding window. Lastly, Section 4 offers conclusions.

Method
The selected model architecture for this work was a DNN. DNNs consist of multiple layers of interconnected neurons with nonlinear activation functions. This allows extraction of latent, possibly nonlinear features within the data. First, we provide notation for the DNN: To make a prediction on a single observation of inputs x i , the input is mapped to ˆ y i by forward propagation through the DNN: Eq. (5) describes the linear part of the forward propagation, mapping from one layer to the next throughout the DNN. This starts at a [0] i = x i . At each hidden layer, these linear combinations are passed through nonlinear activation functions g [ l] , as described in Eq. (6) . In this work, the leaky ReLU activation functions given in Eq. (7) are used, where γ is the slope in the left half plane. Lastly, the output layer outputs the prediction, which for this work is a real number, resulting in a regression layer. This is described in Eq. (8) .
For learning on a mini-batch of size m b , inputs x b and outputs y b are used. The forward propagation for the mini-batch is given by: where 1 ∈ R 1 x m b is a row vector of ones broadcasting the bias term to each observation of the minibatch.
The weight initialization is He normal [19] , which pulls the weights in each layer from a truncated normal distribution with mean μ = 0 and standard deviation σ = 1 s [ l −1] . The weight initialization for layer l is then given by: The biases, b [ l] , are initialized as zeros. The optimizer used was Adam optimization [20] , which adaptively estimates appropriate momentum for the gradient updates.
The parameters w [ l] and b [ l] are iteratively updated by means of a gradient-based optimizer to minimize some cost function describing a distance between true labels and predictions. This is done by finding the gradients of the cost function J, w.r.t the trainable parameters, ∂J ∂w and ∂J ∂b . For this work the cost is the mean squared error between the true labels and the predictions, defined by: For supervised learning, the procedure of iteratively updating the trainable parameters, along with other hyperparameters such as model architecture, is typically done repeatedly in a validation process to assess the model's ability to generalize to unseen observations. However, the ability to generalize relies on the assumption that the unseen observations come from the same distribution as the training data, which for many real-world datasets is an invalid assumption. The target variable may depend on features not included in the model, shifts may occur in the independent or dependent variables, or the underlying process may evolve. These can all be contributors to poor generalization, and are called nonstationarity. In streaming learning, the goal is to bridge the gap between data distributions by adapting to new available observations. For this to work, labels must be available so that supervision of the updates can be achieved. In addition to adaption to changing distributions, it is of interest to "remember" older, relevant observations.
To stay stable to older observations while being plastic to new ones, the prioritized n −bin experience replay was developed. This experience replay configuration allows retention of observations spanning the prediction space y ∈ [ y min , y max ] by splitting it into bins. We denote the buffer as D ∈ R n x N , where n is the number of bins, and N is the capacity of each bin. When observation ( x i , y i ) becomes available from the data stream, it will be placed in a bin after assessment of which bin in the prediction space y i belongs to. Subsequently, the oldest observation in the same bin is discarded. Using this configuration, observations spanning the prediction space are retained, eliminating bias towards the distribution of the latest available observations, which occurs in the standard sliding window. Additionally, the mini-batch sampled from the experience replay for further learning is sampled by prioritization using the softmax function, so that each observation is assigned a probability of being sampled for training by: where C = nN is the total number of observations in the replay. Assigning the probabilities based on the softmax of the model's mean squared error on the observations results in added focus on observations for which the model performs poorly, as they are more likely to be sampled. By combining the prioritized n -bin with the DNN using the Adam optimizer, we obtain our streaming learning algorithm, given in Algorithm 1 . β 1 , β 2 and ε are tunable hyperparameters for Adam optimization, V dw and V db are biased first moment estimates, and S dw and S db are biased second raw moment estimates. Superscript c denotes their bias-corrected counterparts. α is the learning rate.

Problem description
The method presented in this work was developed in order to estimate the density of drilled lithology. This is traditionally measured using the density logging tool, which is a specialized logging while drilling (LWD) tool. Determining the density of the drilled lithology is of interest for several reasons, among them best-practice selection of drilling parameters to ensure a safe and efficient operation. Especially, accurate separation of high-density and low-density lithology can reduce the risk of dysfunctions like buckling, severe doglegs, washout and vibrations, resulting in lost time. However, the density log is delayed due to its placement behind the bit, making mechanical drilling parameters the earliest indicators of change in drilled lithology, although it is very difficult for humans to directly interpret density from these.
To eliminate the density log delay, we propose to estimate a virtual density log using parameters available at the bit, i.e. mechanical drilling parameters. Since the true label of the density log is available after the distance from the log to the bit is drilled, we can pair the delayed density log with drilling parameters measured earlier at the same depth to obtain complete input/output observations that can be used to further supervise updates to our model during operation. This turns the problem into a streaming regression problem with delayed labels. Drilling data from wells on a field operated by Equinor was used for this work. After data cleaning and removal of irrelavant observations, the training set contained approximately 740 0 0 0 observations from 5 different wellbores, while the validation set contained 365 0 0 0 observations from one wellbore. Lastly, the test set contained 227 0 0 0 observations from one wellbore. As per standard convention for deep learning, the data was normalized and scaled so that each feature had zero mean and unit variance.
Several mechanical drilling parameters are available during drilling. v is the drilling velocity (ft/hr), w is the weight on bit (lb), T is the torque (lb-ft), φ is the drillstring rotation (rpm). The driller may directly control v , w , and φ, while T is dependent on a variety of factors, among them rock properties, w , and φ. A metric commonly monitored during drilling is the mechanical specific energy, U ms , which quantifies the energy required to remove a unit volume of rock. It is independent of the driller's actions, is different for different lithologies [21] , and is given by: where a b is the bit area (in 2 ). To account for weakening of the rock ahead of the bit due to flow through the nozzles, the hydraulic mechanical specific energy [22] , U hms , can be defined by: where η is the hydraulig energy reduction factor, p b is the bit pressure drop at the nozzle (psi), and q is the flow rate of drilling fluid (gpm). From the available mechanical parameters, we define the input vector of predictors for the DNN, x , as:

Pre-training and validation
Pre-training and validation of the baseline model was performed in an informal search. Training was performed using Adam optimization, where the training set was randomly shuffled, and divided into mini-batches of size m b . The calculated gradients for each mini-batch are noisy estimates of the true gradients of the entire data set, and this additive noise is useful for avoiding getting stuck in poor local minima or saddle points early in training, and for improving generalization [23] . Neural network architecture and training configuration hyperparameters were iteratively tuned on the training and validation set split. Upon completion of this process, the streaming hyperparameters were tuned iteratively on the validation set. These hyperparameters are related to the streaming learning during operation. The density log limits on this field lies in the range 2.0-2.7 (g/cm 3 ). Tuning of the experience replay parameters resulted in 3 bins with limits at 2.115 (g/cm 3 ) and 2.535 (g/cm 3 ), effectively dividing the prediction space so that one bin retains observations below 2.115 (g/cm 3 ), the second between 2.115 (g/cm 3 ) and 2.535 (g/cm 3 ), and the last retains observations above 2.535 (g/cm 3 ). These parameters can be seen in Table 1 , which summarizes the hyperparameters. We can also see that the learning rate is decreased by one order of magnitude for the streaming learning, which was necessary in order to facilitate stability.

Test set results
After pre-training and validation of the baseline model, it was deployed for streaming regression during operation on the test set, where the density logging tool is placed 20 m behind the bit. This was done using both the prioritized n -bin experience replay, and a standard sliding window for comparison. As summarized in Table 1 , the prioritized n -bin replay consisted of 3 bins, each containing 256 observations at all times. The standard sliding window used for comparison contained the same amount of observations in total, 768. Although the algorithm is run on data in time domain, the density log (along with other LWD tools) are most interesting in depth domain. For this reason, the presented results are converted to depth domain with equidistant points at 1 m resolution by downsampling. At every integer depth, observations within 0.5 m are averaged. Since detection of the hard stringers is important, we evaluate both approaches in terms of mean absolute error (MAE) separately for low density ( < 2 . 35 (g/cm 3 )) and high density ( ≥ 2 . 35 (g/cm 3 )) observations. Fig. 1 illustrates measured and estimated at-bit density vs. depth on the entire test set for both methods. We define a false high density estimate a low risk error. If the driller takes action, lowering T and increasing w based on such an estimate, the drilling will simply be sub-optimal. Conversely, we define a false low density estimate as a high risk error. If hard stringers are not detected, and action is not taken, risk of dysfunctions such as buckling, severe doglegs, washout and vibrations are increased. On the test set as a whole, the prioritized n -bin experience replay leads to a 22% increase in MAE for true low density observations and a 22% decrease in MAE for true high density observations, compared to the standard sliding window.
We wish to investigate these results in more detail to provide some insight into the effect of applying the prioritized n -replay, both on low density and high density observations. First, we perform paired, two-tailed t -tests to investigate the statistical significance of the changes in MAE, through obtaining p-values. The null hypothesis becomes H 0 : μ 1 = μ 2 , where μ 1 is the population mean absolute error for the standard sliding window approach, and μ 2 is the population mean absolute error for our method. For this test, we select a significance level of α 0 = 0 . 05 . To quantify a standardized effect size , we also calculate Cohen's d. This value, in combination with α 0 , can in turn be used to calculate the statistical power, 1 − β, where β is the probability of a type II error. Table 2 summarizes the results of our statistical analysis, where m is number of observations, MAE s is the mean absolute error using a sliding window, MAE n is the mean absolute error using our method, and MAE = MAE s − MAE n . Through the t -tests, we confirm the statistical significance at our chosen significance level. Observing Cohen's d show that for true low density observations, our method leads to a standardized effect size of d = −0 . 3021 (where the negative sign signifies worsening), while for the true high density observations, d = 0 . 4578 . Interpreting these along a continuum, as proposed in literature [24] , where 0.2, 0.5, and 0.8 are low, medium and high effect sizes, we see that our approach leads to a small to medium worsening on low density observations, and a medium to large improvement on high density observations. Figs. 2 -5 are zoomed plots on the test set. In Fig. 2 , at 2755 m, a hard stringer occurs. Using the standard sliding window, this event is completely undetected. We can see at 2779 m that the model accurately detects the next stringer. Here, 24 m further down, high density observations from the missed stringer has been passed by the density logging tool (20 m behind), and these observations are available for training in the sliding window. We can assume that the model misses the first stringer since the sliding window only contains observations from the earlier low density observations, resulting in catastrophic forgetting . Using our approach, we can see that also the stringer missed by the standard sliding window is detected. In Fig. 3 , we can see that the sliding window misses the stringer at 3157-3170 m, along with the one at 3250 m. Our approach accurately detects these events. At the same time, we observe that some observations at approximately 3050 m and 3285 are overestimated. In Fig. 4 , stringers at 3500 m and 3523-3536 m are missed by the sliding window, but detected by our approach. Lastly, Fig. 5 shows missed stringers by the sliding window at 4450-4530 m, along with a poor transition to low density observations at approximately 4850 m. These are all improved using our approach. However, we observe some false high estimates at 4650-4700 m.
Since the density log is naturally divided into low-and high density observations, we rephrase the problem into a binary classification problem. Although this is an oversimplification, it adds to the previous analysis in terms of detection of hard stringers, and high risk/low risk errors, as defined previously. As can be seen from m in Table 2 , less than 10% of the observations in the test set are stringers, which makes this an unbalanced data set. Fig. 6 shows the resulting confusion matrices for both approaches, dividing the observations into low density and high density observations as previously. We see that the sliding window is very accurate in prediction of low density observations, with an accuracy rate of 0.97. However, only an accuracy of 0.55 is achieved for high density observations. Using the prioritized n -bin, the accuracy on low density observations is slightly decreased, to 0.92, while detection of high density observations is greatly improved, scoring an accuracy of 0.8. The balanced accuracies for the sliding window and the prioritized n -bin are 0.76 and 0.86, respectively. Due to drillstring compression and elongation, measurement error on depth can occur. To account for this, a 1 m acceptance was implemented, meaning that if a prediction is within 1 m of the correct label, it is accepted as correct. The confusion matrices using the 1 m acceptance are shown in Fig. 7 . We observe that the results for both approaches are improved. Here, the balanced accuracies are 0.89 for the sliding window and 0.95 for our approach.

Conclusions
A divided and prioritized experience replay suited for streaming regression has been presented, making use of known ideas such as retention of relevant observations along with prioritization. This makes the model less biased to the distribution of the latest available observations, and results in more frequent sampling of observations where the model performs poorly.
Comparison to a standard sliding window has been made on real-world data. From our results, we can deduce that the standard sliding window results in forgetting of old, rare events, leading to failure to detect them. Especially in cases where exclusively common events have been observed, the model becomes biased towards these observations, failing to detect new, important events. The presented n -bin method, on the other hand, retains observations from the entire range of the prediction space, resulting in more accurate estimates for these observations at a small cost in accuracy on the common events. Also, some false detections are observed. In addition to analyses on the regression formulation, a simplified rephrasing to a binary classification problem has been performed. These results divide the observations into two classes to provide added insight into the improved performance on the rare events.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.