Stream-based active learning with linear models

The proliferation of automated data collection schemes and the advances in sensorics are increasing the amount of data we are able to monitor in real-time. However, given the high annotation costs and the time required by quality inspections, data is often available in an unlabeled form. This is fostering the use of active learning for the development of soft sensors and predictive models. In production, instead of performing random inspections to obtain product information, labels are collected by evaluating the information content of the unlabeled data. Several query strategy frameworks for regression have been proposed in the literature but most of the focus has been dedicated to the static pool-based scenario. In this work, we propose a new strategy for the stream-based scenario, where instances are sequentially offered to the learner, which must instantaneously decide whether to perform the quality check to obtain the label or discard the instance. The approach is inspired by the optimal experimental design theory and the iterative aspect of the decision-making process is tackled by setting a threshold on the informativeness of the unlabeled data points. The proposed approach is evaluated using numerical simulations and the Tennessee Eastman Process simulator. The results confirm that selecting the examples suggested by the proposed algorithm allows for a faster reduction in the prediction error.


Introduction
The term big data seems to be ubiquitous in many fields of application, and industrial production is no different. However, in production, this can be somewhat misleading as it often refers to process data that is obtained through automated data collection schemes with minimal manual interference. Product-related data is usually scarcer particularly in high-volume manufacturing due to costs of inspection. This creates an imbalance in the amount of available data that can at times be quite substantial. Yet in many cases, predictive modeling relating process variables to product characteristics is sought after. Therefore, it will be beneficial to guide the data collection schemes for product characteristics through a real-time sampling methodology. In current production environments, sampling of the product characteristics is often performed at regular time intervals or at random. However, this approach can be ineffective as the informativeness of the observations at the time of sampling is not taken into account. This problem is reinforcing the interest of researchers and practitioners in active learning. Active learning-based sampling schemes use an instance selection criterion to strategically select data points that allow a faster reduction of the generalization error [1]. Over the last decades, many active learning approaches have been proposed, but most of the focus has been dedicated to the pool-based scenario [2]. Pool-based active learning refers to a circumstance in which a large amount of unlabeled data is collected all at once and made available to the learner, which can then select offline the data points to be labeled with a greedy approach [3].
In real-time applications for high-volume production, where samples are processed at a fast pace, evaluating all the available instances before making a choice might not be realistic. In these cases, the learner might only have a short time frame to make the sampling decision. Indeed, if a sample is not selected for the quality check, it might get lost in the downstream process and no longer be traceable. This is particularly relevant in high-volume production, where tracing individual parts is a challenge. Also, in a chemical process, we might not be able to measure the level of the variable of interest once a component undergoes a specific treatment. In these contexts, a much more sensible scenario is represented by stream-based active learning, which is sometimes referred to as selective sampling [4]. Stream-based active learning investigates a scenario where instances are processed one at a time and the learner has to determine immediately whether to keep the instance and query its label or discard it. The task is very similar to the one described by a well-known statistical riddle, the secretary problem [5], where an observer sequentially interviews a certain number of applicants and, after each interview, must decide whether or not to hire the applicant. An exhaustive survey of stream-based active learning has been provided by Lughofer [6], who classified existing online active learning methods by taking into account the data processing functionality, the model class (regression or classification), and many other relevant properties. The survey reveals that stream-based active learning methods have mostly been developed in the classification field. Regression models, on the other hand, are extremely useful in the development of soft sensors for hard-to-measure process variables or in quality control problems where a product's characteristic is measured on a continuous scale.
That is why active learning in conjunction with regression models is capturing the interest of many researchers [7][8][9][10].
In this paper, we focus on the use of linear regression models. These models are well suited for stream-based active learning because, having a small number of parameters, they can easily be trained on a small number of observations. This property is also very useful if we want to efficiently retrain the model each time the design is augmented by including an additional observation [11]. Moreover, despite recent advances in terms of interpretability for deep learning models, linear regression models are still amongst the most easily interpretable models. Indeed, their parameters offer a straightforward quantification of the contribution of each specific feature, and their input features are directly derived from the empirical observations [12]. Besides the direct interpretation that comes from the signs and magnitudes of the coefficients, linear models can also be used to construct confidence intervals on the parameter estimates, and variable selection can easily be incorporated into such models [13]. Recently, additional variable selection methods for linear regression models have been suggested by Zhang et al. [14]. Being able to offer a robust feature importance analysis is particularly important in industrial problems, where practitioners and engineers might need to intervene promptly in specific parts or components of the process to ensure safety and operational efficiency. The simplicity of these models and the low number of parameters that require tuning also foster their adoption and use in applications. Finally, linear regression models allow us to build on the optimal experimental design theory and leverage the criteria that are typically used to design highly efficient experiments. Although this paper focuses on linear models, nonlinear models have proved to be extremely useful in a wide variety of applications.
In particular, deep learning models are very effective in dealing with complex high-dimensional data to perform tasks like image recognition, shape extraction, and pose recovery [15][16][17][18][19].
In this work, we propose a novel strategy to perform stream-based active learning with linear models. Since observations cannot be ranked in real time, we provide an algorithm that only uses unlabeled data to set a threshold on the informativeness of data points. Unlabeled data is also exploited in a semi-supervised manner to increase the predictive performance [20]. We show that the proposed approach outperforms random sampling and state-of-the-art methods.
The remainder of this paper is organized as follows. In Section 2, we define some basic concepts and discuss related works focusing on active learning for regression. Section 3 introduces the proposed sampling strategy. In Section 4 we test our approach using numerical simulations; the Tennessee Eastman Process data is also used to evaluate its performance on a typical industrial process. Finally, Section 5 provides some conclusions.

Preliminaries
The active learning problem is defined by an imbalance between the availability of process variables $x \in \mathbb{R}^p$ and the corresponding labels $y \in \mathbb{R}$. In many circumstances, industrial processes are characterized by the presence of easy-to-measure process variables, which are collected through automated collection schemes, and hard-to-measure variables, whose values are difficult to track during routine operations. Large plants, measurement delays, and environments hostile to the survival of measuring devices are all situations where hard-to-measure process variables are commonly encountered [21]. Such situations can be addressed by using soft sensors based on predictive models to forecast the true values of hard-to-measure variables. For modeling purposes, we assume that the true underlying relationship between the process variables and the product information or hard-to-measure variable can be expressed with a linear model of the form
$$y = X\beta + \varepsilon, \qquad (1)$$
where $y$ is an $n \times 1$ vector of response variables, $X$ is an $n \times p$ model matrix, $\beta$ is a $p \times 1$ vector of regression coefficients, and $\varepsilon$ is an $n \times 1$ vector representing the noise, with covariance matrix $\sigma^2 I$. Here $n$ represents the total number of observations and $p$ the number of process variables (as well as the number of parameters in a model with main effects only and no intercept). If the predictors and the response are not centered, an intercept term may be added to the model. In that case, the size of the model matrix becomes $n \times (p + 1)$, and $\beta$ becomes a $(p + 1) \times 1$ vector. When $k \geq p$ observations are available to the learner, we can obtain a least squares estimator for $\beta$ using
$$\hat{\beta} = (X^\top X)^{-1} X^\top y, \qquad (2)$$
such that the fitted linear regression model is given by $\hat{y} = X\hat{\beta}$ and its residuals by $e = y - \hat{y}$. A key distinction between the experimental design approach and stream-based active learning concerns the assumption we make about the randomness of the process variables.
In design of experiments, the $x$ vectors are assumed to be fixed, while in this case we assume that $X$ is composed of random vectors, as the individual observations are sampled from a process subject to random variation and we are not able to set the precise location of the incoming data points. However, conditional on the observed $X$ variables, $(x_1, \ldots, x_p)$, Equation 2 still applies. It should be noted that the coefficients $\hat{\beta}$ determined using Equation 2 may not be stable if the data matrix $X$ is affected by multicollinearity. To deal with this issue and achieve robust results, a solution might be to use a ridge estimate for the coefficients, $\hat{\beta}_{\mathrm{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$. An alternative approach to tackle multicollinearity is to perform a pre-whitening of $X$ to remove the dependencies between the components. We assume a small, labeled training set is initially available and can be used to fit the first regression model, as is common practice in active learning applications [9,22,23]. The number of observations provided to the learner usually corresponds to a modest fraction (e.g., 5%) of the total number of instances available [24,25]. After the first model has been built, the learner is granted a certain operational budget $b$ to augment the design matrix by including additional observations. Some approaches focus on this problem in a pool-based context, in which the total number of observations $n$ is represented by a closed and static set $U$ and the label of a specific data point can always be queried. Among these approaches, query-by-committee (QBC) [22] suggests building an ensemble of regression models trained on bootstrap replicas of the original training set. Once the ensemble, or committee, has been built, the variance of the predictions made by the committee members is computed for each unlabeled observation $x \in U$.
This metric, also referred to as ambiguity, is used to rank the instances belonging to the unlabeled set $U$ by prioritizing the data points with the highest variance. Expected model change maximization (EMCM) [25] is another noteworthy approach, which focuses on the observations that have the largest impact on the current model's parameters. The model change is defined as the difference between the current model parameters and the parameters obtained after fitting the model on the augmented design, including the unlabeled observation $x \in U$ that is currently under evaluation. Because the learner does not have access to the true label for that data point, it estimates it using the mean prediction of a bootstrap ensemble, like the one employed by QBC. Another offline approach, inspired by statistical process control, combines the Hotelling $T^2$ statistic and the squared prediction error of a principal component regression (PCR) model to obtain a sampling evaluation index [9]. Besides the fact that all these methods focus on the pool-based scenario, it should be noted that the approaches that use ensembles may not be well suited for the online scenario, given the higher computational cost associated with training and updating the models.
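The QBC ambiguity score described above can be illustrated with a minimal sketch. It assumes ordinary least squares committee members fitted with NumPy; the function name `qbc_ambiguity`, the committee size, and the synthetic data are hypothetical choices made for illustration and are not taken from [22].

```python
import numpy as np

def qbc_ambiguity(X_lab, y_lab, X_pool, n_members=10, seed=0):
    """Variance of the committee predictions for each pool point.

    Each committee member is a least squares model fitted on a
    bootstrap replica of the labeled set (hypothetical helper).
    """
    rng = np.random.default_rng(seed)
    n = X_lab.shape[0]
    preds = np.empty((n_members, X_pool.shape[0]))
    for m in range(n_members):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        beta_m, *_ = np.linalg.lstsq(X_lab[idx], y_lab[idx], rcond=None)
        preds[m] = X_pool @ beta_m
    return preds.var(axis=0)

# Synthetic data, assumed only for this example
rng = np.random.default_rng(1)
X_lab = rng.normal(size=(20, 3))
y_lab = X_lab @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=20)
X_pool = rng.normal(size=(100, 3))

ambiguity = qbc_ambiguity(X_lab, y_lab, X_pool)
ranking = np.argsort(ambiguity)[::-1]  # most ambiguous pool points first
```

Note that every pool point must be scored before the ranking can be formed, which is the reason this kind of criterion does not transfer directly to the stream-based setting.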
Optimal experimental design theory is another field of research that is intrinsically related to active learning [26,27]. Optimal designs aim to reduce the cost of experimentation by proposing design matrices that allow a robust parameter estimation with the minimum number of runs. The most commonly employed optimality criteria are D-optimality [28] and A-optimality. Important properties of a design can be derived from the moment matrix, or information matrix, which is defined as
$$M = \frac{X^\top X}{N}, \qquad (3)$$
where $N$ represents the total number of runs. The moment matrix specifies the distribution of points in space and can be used to describe the design geometry. In a $2^k$ factorial design, where variables are expressed in coded units (−1, +1), the moment matrix is equal to the identity matrix $I_k$, as the columns of the design are orthogonal.
In an orthogonal design, all the parameters can be estimated independently of one another [29]. D-optimal designs pursue this property by focusing on good model parameter estimation. Inverting the moment matrix, we obtain the scaled dispersion matrix given by
$$M^{-1} = N (X^\top X)^{-1}. \qquad (4)$$
This matrix contains the variances and covariances of the estimated coefficients of the regression model, scaled by $N/\sigma^2$ [27]. Indeed, if the $k$ observations used to estimate $\beta$ are i.i.d. and $\varepsilon \sim N(0, \sigma^2 I)$, we have
$$\mathrm{Var}(\hat{\beta}) = \sigma^2 (X^\top X)^{-1}. \qquad (5)$$
It can be shown that increasing the determinant of $M$ reduces the generalized variance of the coefficient estimates, that is, the determinant of their covariance matrix, leading to a better estimation of $\beta$.
A D-optimal design is attained by maximizing the determinant of the moment matrix. Formally, we are seeking the design $D^*$ that satisfies
$$D^* = \arg\max_{D} \left| M(D) \right|. \qquad (6)$$
A-optimality is another important optimality criterion that tries to achieve good parameter estimation by minimizing the sum of the individual variances of the coefficients. This is achieved by the design $D^*$ that satisfies
$$D^* = \arg\min_{D} \operatorname{tr}\left( M(D)^{-1} \right), \qquad (7)$$
since the diagonal of the scaled dispersion matrix contains the variances of the coefficients, scaled by $N/\sigma^2$. It should be noted that A-optimality does not consider the covariances between coefficients.
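As a small illustration of the two criteria, the sketch below evaluates the determinant of the moment matrix and the trace of the scaled dispersion matrix for a $2^2$ factorial design and for a non-orthogonal design with the same number of runs. The helper names and the second design are hypothetical and chosen only for this example.

```python
import numpy as np

def moment_matrix(X):
    """Moment matrix M = X'X / N for a design with N runs."""
    return X.T @ X / X.shape[0]

def d_criterion(X):
    """Determinant of the moment matrix (larger is better)."""
    return np.linalg.det(moment_matrix(X))

def a_criterion(X):
    """Trace of the scaled dispersion matrix M^{-1} (smaller is better)."""
    return np.trace(np.linalg.inv(moment_matrix(X)))

# 2^2 factorial design in coded units: the columns are orthogonal
factorial = np.array([[-1.0, -1.0],
                      [ 1.0, -1.0],
                      [-1.0,  1.0],
                      [ 1.0,  1.0]])

# Hypothetical design with strongly correlated columns
correlated = np.array([[-1.0, -1.0],
                       [ 1.0,  1.0],
                       [-1.0, -0.5],
                       [ 1.0,  0.5]])

for name, design in [("factorial", factorial), ("correlated", correlated)]:
    print(name, d_criterion(design), a_criterion(design))
```

For the factorial design the moment matrix is the identity, so the determinant is 1 and the trace of its inverse is 2; the correlated design scores worse on both criteria.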
Recently, the concept of A-optimality has been extended to stream-based active learning [30,31]. That is, the approach has been extended outside the design of experiments framework, assuming $X$ is composed of random vectors and the observations are sequentially drawn. Riquelme et al. [31] show how to set a threshold to perform online active learning for linear regression models by minimizing the sum of the individual variances of $\hat{\beta}$. They state that, in order to achieve A-optimality and minimize the trace of the inverse moment matrix, the eigenvalues of the moment matrix should be as balanced as possible. Moreover, the sum of the eigenvalues of $X^\top X$ equals the trace of $X^\top X$, which is also given by the sum of the squared norms of the observations, so observations with a large norm increase the eigenvalues. For this reason, they propose a norm-thresholding algorithm that pursues A-optimality by selecting observations with large, scaled norm. The scaling step can be ignored when whitening is used to remove dependencies. Finally, the design is augmented with the observations $x$ whose norm exceeds a threshold $\Gamma$ given by
$$P\left( \lVert x \rVert \geq \Gamma \right) = \alpha, \qquad (8)$$
where $\alpha$ is the ratio of observations we are willing to label out of the incoming data stream. This value is strongly dependent on the budget $b$ and the sampling rate used to collect the data. Another noteworthy approach focusing on stream-based active learning for regression tasks has been suggested by Lughofer and Pratama [32]. In their paper, the authors propose a single-pass selection criterion that takes into account ignorance about the input space, uncertainty in predictive model outputs, and uncertainty in model parameters. The main difference with our approach is that Lughofer and Pratama focus on the use of Takagi-Sugeno (TS) fuzzy models [33], combining adaptive error bars for the model output and A-optimality for the variances of the estimated parameters.
Conversely, our method relies on statistical linear regression and tries to combine the exploration of lesser-known input space regions with accurate parameter estimates by employing the idea of D-optimality.
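The norm-thresholding rule discussed above can be sketched as follows. This is only an illustration under stated assumptions: the threshold is taken as the empirical $(1 - \alpha)$ quantile of the norms of already whitened warm-up data, which is one simple way to approximate Equation 8 and is not necessarily how [31] estimates it. The function name `norm_threshold` and the synthetic data are hypothetical.

```python
import numpy as np

def norm_threshold(V, alpha):
    """Empirical (1 - alpha) quantile of the norms of whitened points.

    Assumption: the rows of V are whitened unlabeled observations, so
    no additional scaling of the norm is applied.
    """
    norms = np.linalg.norm(V, axis=1)
    return np.quantile(norms, 1.0 - alpha)

rng = np.random.default_rng(0)
V = rng.normal(size=(500, 5))          # stand-in for whitened warm-up data
gamma = norm_threshold(V, alpha=0.10)

# Apply the rule to a synthetic stream drawn from the same distribution
stream = rng.normal(size=(10_000, 5))
selected = np.linalg.norm(stream, axis=1) > gamma
rate = selected.mean()                 # should be close to alpha
```

The decision for each incoming point only depends on its own norm and on $\Gamma$, so it does not change with the points that have already been labeled.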

Proposed approach
In this work, we try to improve the approach proposed by Riquelme et al. [31] by moving from A-optimality to D-optimality. We believe that taking into account the covariance between the estimates of the model coefficients might be particularly advantageous with large datasets and models, where many factors might be active and influence the response. To adapt the D-optimality criterion to stream-based active learning, we start from the connection between D-optimality and prediction variance (PV) highlighted by Myers et al. [27]. The PV at a point $x^{(m)}$ is the variance of the predictor $\hat{y}(x^{(m)})$, which corresponds to $\mathrm{Var}(x^{(m)\top}\hat{\beta})$, and is given by
$$\mathrm{PV}(x^{(m)}) = \mathrm{Var}\left(\hat{y}(x^{(m)})\right) = \sigma^2 x^{(m)\top} (X^\top X)^{-1} x^{(m)}, \qquad (9)$$
where $x^{(m)}$ represents the data point where the variance is being estimated, expanded to the model form. We can also express the variance in a scale-free form using the scaled prediction variance (SPV), which is computed as
$$\mathrm{SPV}(x^{(m)}) = \frac{N \, \mathrm{Var}\left(\hat{y}(x^{(m)})\right)}{\sigma^2} = N x^{(m)\top} (X^\top X)^{-1} x^{(m)}. \qquad (10)$$
It should be noted that the SPV is a quadratic form of the inverse moment matrix $M^{-1}$, as it can also be written as $x^{(m)\top} M^{-1} x^{(m)}$. Since the SPV considers the total number of runs $N$, it can be used to assess the quality of a design on a per-observation basis. In the online scenario, we are not interested in comparing designs of different sizes; rather, we investigate the individual contributions of incoming data points to the current design. In this circumstance, we can discard $N$ and use the unscaled prediction variance (UPV), which is calculated as
$$\mathrm{UPV}(x^{(m)}) = \frac{\mathrm{Var}\left(\hat{y}(x^{(m)})\right)}{\sigma^2} = x^{(m)\top} (X^\top X)^{-1} x^{(m)}. \qquad (11)$$
As anticipated in Section 2, we are already given an initial random design that contains some labeled examples, which is used to fit an initial model. Then, we are interested in augmenting our design by iteratively selecting observations from a continuous stream. Pursuing D-optimality, we aim at collecting observations that allow us to maximize the determinant of the moment matrix $M$.
If we consider that the current design is composed of $k$ observations, we can decompose the numerator of the moment matrix (Equation 3) before the design is augmented by including the $(k+1)$th observation as
$$X_k^\top X_k = X_{k+1}^\top X_{k+1} - x_{k+1} x_{k+1}^\top. \qquad (12)$$
We can then express the determinant of $X_k^\top X_k$ as the product of the determinant of the numerator of the augmented moment matrix and a second term, as in
$$\left| X_k^\top X_k \right| = \left| X_{k+1}^\top X_{k+1} \right| \left( 1 - x_{k+1}^\top (X_{k+1}^\top X_{k+1})^{-1} x_{k+1} \right). \qquad (13)$$
It should be noted that the second term of the above equation is a scalar, irrespective of the number of variables $p$ and the number of observations $k$. From there, we can observe that
$$\left| X_{k+1}^\top X_{k+1} \right| = \frac{\left| X_k^\top X_k \right|}{1 - x_{k+1}^\top (X_{k+1}^\top X_{k+1})^{-1} x_{k+1}}. \qquad (14)$$
From the properties of the hat matrix, which is generally defined as $H = X (X^\top X)^{-1} X^\top$, we know that $0 \leq h_{jj} \leq 1$ is true for each diagonal element $h_{jj}$ of $H$ [34]. It follows that $x_{k+1}^\top (X_{k+1}^\top X_{k+1})^{-1} x_{k+1} \leq 1$. Hence, we can conclude that the determinant of the new, enlarged, training set is maximized by seeking observations $x_{k+1}$ that maximize $x_{k+1}^\top (X_{k+1}^\top X_{k+1})^{-1} x_{k+1}$. That is, we will only select points that maximize the UPV. This may be explained by the fact that a data point for which we have a large prediction variance represents a less known region of the input space, and the regression model will highly benefit from its inclusion in the design. From Myers et al. [27] we have that maximizing $x_{k+1}^\top (X_{k+1}^\top X_{k+1})^{-1} x_{k+1}$ is equivalent to maximizing $x_{k+1}^\top (X_k^\top X_k)^{-1} x_{k+1}$, which is the UPV using the fitted model before the new point has been added to the training set. Finally, following the norm-thresholding approach, we can set an upper control limit on new observations as
$$P\left( x_{k+1}^\top (X_k^\top X_k)^{-1} x_{k+1} \geq \Gamma \right) = \alpha. \qquad (15)$$
In practice, as suggested by Riquelme et al. [31], when we start to observe the data points coming from the process, we allocate a first initial set of points to estimate the distribution of $x_{k+1}^\top (X_k^\top X_k)^{-1} x_{k+1}$. In this work, we used kernel density estimation (KDE) with a Gaussian kernel. The initial set is also used to estimate the sample covariance matrix $\Sigma$.
By performing an eigenvalue decomposition we can then express $\Sigma$ as $U \Lambda U^\top$, where $U$ is an orthogonal matrix, whose $i$th column corresponds to the $i$th eigenvector of $\Sigma$, and $\Lambda$ is a diagonal matrix with the eigenvalues of $\Sigma$ on the diagonal. The incoming observations $x$ can then be whitened using
$$z = \Lambda^{-1/2} U^\top x.$$
Before the whitening step, data can be centered and scaled using the sample mean and variances obtained from the initial set. In industrial contexts, when a lot of unlabeled process data is available in the form of a historical database, this step can also be performed offline. In this case, by fitting a principal component analysis (PCA) model to the large unlabeled dataset and using it to transform the incoming observations, we could improve the predictive performance using a semi-supervised PCR as suggested by Frumosu and Kulahci [20]. The use of semi-supervised classification models has also received some attention in active learning problems [35][36][37]. Indeed, semi-supervised learning and active learning are both techniques that deal with scarcity of labels. However, they do so in two different ways. With semi-supervised learning, we try to get the most out of the currently available unlabeled data, whereas with active learning we try to acquire new data in the most effective way.
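The determinant argument used earlier in this section to motivate the UPV criterion can be checked numerically. The sketch below uses synthetic data (an assumption made only for illustration) and verifies the decomposition of the determinant, the bound on the quadratic form, and the fact that the UPV computed after adding the point is a monotone function of the UPV computed before adding it, which follows from the Sherman-Morrison identity.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 4, 12
X_k = rng.normal(size=(k, p))       # current design with k observations
x_new = rng.normal(size=p)          # candidate (k+1)th observation
X_k1 = np.vstack([X_k, x_new])      # augmented design

A_k = X_k.T @ X_k                   # numerator of the moment matrix
A_k1 = X_k1.T @ X_k1                # same, after augmentation

# Quadratic forms x'(X'X)^{-1}x before and after adding the point
upv_before = x_new @ np.linalg.solve(A_k, x_new)
upv_after = x_new @ np.linalg.solve(A_k1, x_new)

det_k = np.linalg.det(A_k)
det_k1 = np.linalg.det(A_k1)

# |X_k'X_k| = |X_{k+1}'X_{k+1}| * (1 - upv_after)
print(np.isclose(det_k, det_k1 * (1.0 - upv_after)))
# Equivalent forward form: |X_{k+1}'X_{k+1}| = |X_k'X_k| * (1 + upv_before)
print(np.isclose(det_k1, det_k * (1.0 + upv_before)))
# upv_after is an increasing function of upv_before
print(np.isclose(upv_after, upv_before / (1.0 + upv_before)))
```

Because `upv_after` increases with `upv_before`, ranking or thresholding candidates on the UPV of the current model selects the same points as the UPV of the augmented model, and the former can be computed without refitting.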
Algorithm 1 describes the complete stream-based active learning procedure with the proposed approach, which might also be referred to as conditional D-optimality (CDO).
Algorithm 1 Stream-based active learning using CDO
Require: an initial random design $X$; a data stream $S$; a warm-up length $w$; a sampling rate $\alpha$; a budget $b$
    Set $W = \emptyset$ (warm-up set to estimate $\Sigma$ and $\Gamma$)
    Fill $W$ with the first $w$ observations of $S$
    $i \leftarrow 1$, $c \leftarrow 0$ ($c$ represents the currently used budget)
    Estimate the covariance matrix $\Sigma$ of $W$ and perform the eigendecomposition of $\Sigma$ as $U \Lambda U^\top$
    Whiten the initial design by computing $Z = \Lambda^{-1/2} U^\top X$
    Whiten the warm-up observations by computing $V = \Lambda^{-1/2} U^\top W$
    Estimate $\Gamma$ using KDE on $V$ with the desired sampling rate $\alpha$, using Equation 15 with $Z$ and $V$
    while $c < b$ and $i \leq |S|$ do
        Observe the $i$th data point $x_i$ and whiten it: $z_i = \Lambda^{-1/2} U^\top x_i$
        if $z_i^\top (Z^\top Z)^{-1} z_i > \Gamma$ then
            Ask for the label $y_i$ and augment the labeled dataset: $Z = Z \cup z_i$
            $c \leftarrow c + 1$ (pay for the label)
            Update threshold $\Gamma$ to measure the UPV of the enlarged design
        else
            Discard $x_i$
        end if
        $i \leftarrow i + 1$
    end while

An alternative representation of the CDO active learning routine is reported in the flowcharts in Figures 1 and 2. The first flowchart depicts the warm-up phase, which is represented by the first 10 steps of Algorithm 1. The warm-up set is very important for the algorithm and serves two main purposes. First, it allows us to estimate the covariance matrix of the data, which is later used for whitening the incoming observations. Second, it provides a set of unlabeled observations that can be leveraged to estimate the distribution of the UPV. The primary purpose of the whitening step is to address the multicollinearity issue in linear regression modeling, which can be aggravated when dealing with real-world data. The whitening step also ensures comparability with the norm-thresholding approach. Indeed, the norm-thresholding method without whitening would require computing a weighted norm to deal with dependencies between the components. The second flowchart represents the instance selection phase, the core of the active learning strategy. At this stage, we compute the UPV for the new observation sampled from the stream and we compare it to a pre-defined threshold.
If the UPV computed at this point exceeds Γ, we query its label and include the labeled example in the training set. After the inclusion of the new point, a new threshold is estimated. The threshold is found by applying Equation 15 to the whitened warm-up set V. That is, X k is substituted by Z, the currently labeled training set after whitening, and x k+1 is given by each unlabeled data point belonging to V. By doing so, we obtain a one-dimensional array that has the same cardinality as the number of observations in V. These statistics are then used to approximate the distribution of the UPV using KDE and determine the α-upper percentile.
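The selection procedure can be sketched in a few lines of NumPy. This is a simplified illustration rather than the implementation used in the experiments: the threshold is the empirical $(1 - \alpha)$ quantile of the UPV values on the warm-up set, used here in place of the KDE-based estimate; centering uses the warm-up mean; label querying and model fitting are omitted because the selection rule only depends on the inputs. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def whitening_transform(W):
    """Mean and matrix T such that (x - mean) @ T has identity covariance."""
    mean = W.mean(axis=0)
    eigval, U = np.linalg.eigh(np.cov(W, rowvar=False))
    return mean, U / np.sqrt(eigval)   # T = U Lambda^{-1/2}

def upv(Z, points):
    """UPV z'(Z'Z)^{-1}z of each row of `points` for the labeled design Z."""
    A_inv = np.linalg.inv(Z.T @ Z)
    return np.einsum("ij,jk,ik->i", points, A_inv, points)

def cdo_select(X0, W, stream, alpha, budget):
    """Indices of the stream points whose UPV exceeds the running threshold."""
    mean, T = whitening_transform(W)
    Z = (X0 - mean) @ T                # whitened initial design
    V = (W - mean) @ T                 # whitened warm-up set
    # Empirical quantile used instead of KDE (assumption of this sketch)
    gamma = np.quantile(upv(Z, V), 1.0 - alpha)
    chosen = []
    for i, x in enumerate(stream):
        if len(chosen) >= budget:
            break
        z = (x - mean) @ T
        if upv(Z, z[None, :])[0] > gamma:
            chosen.append(i)           # the label y_i would be queried here
            Z = np.vstack([Z, z])      # augment the design
            gamma = np.quantile(upv(Z, V), 1.0 - alpha)  # update threshold
    return chosen, Z

rng = np.random.default_rng(0)
p = 5
W = rng.normal(size=(500, p))          # warm-up observations
X0 = rng.normal(size=(p + 2, p))       # initial random design
stream = rng.normal(size=(5000, p))
chosen, Z = cdo_select(X0, W, stream, alpha=0.10, budget=20)
```

Recomputing the threshold after every accepted point reflects the fact that the UPV distribution shifts as the design grows: regions that have just been sampled become less informative.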

Experiments
In the experiments, we compare the proposed method to the norm-thresholding approach and random sampling. The methods are tested using numerical simulations and data from a chemical process simulator. All the approaches start from the same labeled training set and then iteratively augment the design until the budget constraint $b$ is met. The performance of the models is expressed, in predictive terms, by the root mean squared error (RMSE) of the predictions on a separate test set of $n$ observations,
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}.$$

Numerical simulations
To analyze the validity of the proposed method in the stream-based scenario, multiple datasets were created, each with a different dimensionality in terms of the number of process variables $p$. Within each dataset, incoming observations $x$ are distributed according to a joint multivariate normal distribution $N_p(0, \Sigma_0)$, where $\Sigma_0$ is given by $\sigma^2 I$, with $\sigma^2 = 1$. We ran 50 simulations for each value of $p$ and, for each simulation run, the true coefficients are generated as $\beta \sim U(-5, 5)$. It should be noted that $\beta$ has the same dimensionality as $x$. This means that, using a first order model, a coefficient for each process variable needs to be estimated. The noise is given by $\varepsilon \sim N(0, 1)$. For each scenario, an initial random design $X$ is assumed available to the learner. We selected $p + 2$ observations for the initial design, as $k \geq p$ observations are needed to uniquely estimate $\beta$. The learning curves reported in Figures 3 and 4 show the difference between the RMSE obtained with the two active learning strategies, using random sampling as the baseline. For each learning step, the percentage RMSE difference reported in the plots is obtained by computing $(\mathrm{RMSE}_{\text{Active Learning}} - \mathrm{RMSE}_{\text{Random}}) / \mathrm{RMSE}_{\text{Random}} \times 100$. This allows us to display a scale-free performance metric while comparing the different scenarios. The methods are tested using $b = 50$ and with different levels for the $\alpha$ shown in Equations 8 and 15. In the case of random sampling, $\alpha$ represents the probability of selecting an incoming observation. That is, each time a new sample arrives, a number $s \sim U(0, 1)$ is generated and the data point is only selected if $s \geq 1 - \alpha$. The warm-up length $w$ was set to 500 observations and it is used by all the methods to estimate the covariance matrix, which is used for whitening the observations in a semi-supervised fashion. Moreover, it ensures comparability between the three strategies by setting the same starting points for the data streams.
The models have been fitted without the intercept term as both process variables and outcome are centered. Figure 3 shows the performance when using an $\alpha$ equal to 10%. The x-axis reports the learning steps, each of which corresponds to the inclusion of an additional observation in the training set. Indeed, when the design is augmented, the model is updated and new predictions are obtained for the same separate test set. It should be noted that the RMSE obtained in the first learning step is the same for the three methods, as all the models start from the same random design. The performance of the two active learning methods converges to the one obtained through random sampling as the number of labeled examples in the training set increases. Instead, when the number of labeled examples is lower, active learning proves to be particularly convenient. Overall, the proposed approach dominates the other strategies in all the scenarios. Furthermore, the norm-thresholding algorithm seems to worsen as more and more parameters need to be estimated, whereas CDO consistently provides enhanced predictive performance. We believe this may be due to the fact that, by imposing a threshold on the norm, A-optimality seeks only points that are far from the design's center, without ensuring a distance between the data points that have already been collected. CDO, on the other hand, emphasizes points that correspond to a poor prediction, which is more likely associated with a design area that the learner has not thoroughly explored. As a result, we are less prone to acquire the labels of data points in locations where we have already collected a significant number of observations. It should be noted that in real-time applications the improvement offered by active learning is not as large as the one that can be obtained in offline scenarios, where we can deterministically maximize the desired optimality criterion over a closed set of observations.
Moreover, by setting $\alpha = 10\%$ we are not being too demanding in terms of selecting observations with large norms for A-optimality or high prediction variances for CDO. In Figure 4, we try to widen the gap with the random strategy by lowering $\alpha$, in this case down to 0.01. By raising the threshold, we can be more demanding in terms of the desirability of the selected instances. The only drawback is that the algorithms will need to scan more observations to achieve the desired size for the augmented design and meet the budget constraint. We believe this may not represent an issue since data is nowadays collected at very high sampling rates. However, in the final decision concerning the level of $\alpha$, practitioners will need to make a trade-off between the desired prediction improvement and the time required to select the new labeled examples. Figure 4 reports the learning curves obtained using a smaller $\alpha$. As expected, the enhancement obtained using the proposed strategy is increased with respect to the passive random sampling. However, it is worth noting that the improvement is more evident when the number of parameters is smaller, as the gain obtained in the high-dimensional cases was already significant with $\alpha = 10\%$.
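The simulation setup and the performance metric described in this section can be sketched as follows. The dimensionality, the test set size of 1000, and all helper names are assumptions made for illustration; the sketch generates one scenario, fits the initial model, and shows the random sampling rule and the percentage RMSE difference, without reproducing the full experiment.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error of the predictions."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def pct_rmse_difference(rmse_active, rmse_random):
    """Percentage RMSE difference, with random sampling as the baseline."""
    return (rmse_active - rmse_random) / rmse_random * 100.0

rng = np.random.default_rng(0)
p, alpha = 10, 0.10
beta = rng.uniform(-5, 5, size=p)      # true coefficients, beta ~ U(-5, 5)

def draw(n):
    """Draw n observations with x ~ N_p(0, I) and noise ~ N(0, 1)."""
    X = rng.normal(size=(n, p))
    return X, X @ beta + rng.normal(size=n)

X0, y0 = draw(p + 2)                   # initial random design
X_test, y_test = draw(1000)            # separate test set (size assumed)

# Initial model without intercept, fitted by least squares
beta_hat, *_ = np.linalg.lstsq(X0, y0, rcond=None)
initial_rmse = rmse(y_test, X_test @ beta_hat)

# Random sampling baseline: keep a point only if s >= 1 - alpha
s = rng.uniform(size=10_000)
take = s >= 1.0 - alpha
```

A negative value of `pct_rmse_difference` means that the active learning strategy achieves a lower RMSE than random sampling at that learning step.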
Finally, we analyze the computational time required by the two active learning strategies. To this end, we introduce a measure called average decision time, which quantifies the time required to decide whether to query the label of an unlabeled observation or discard it. The results obtained on the numerical simulations, for different numbers of process variables, are reported in Table 1. Both active learning strategies are highly efficient and do not require a high computational time. According to the CDO strategy, at each iteration we are simply computing the UPV for the new data point, which requires less time than computing the norm of the new observation. It should be noted that the average decision time is lower because the inverse of the whitened moment matrix, $(Z^\top Z)^{-1}$, does not need to be computed at each iteration. However, it must be updated when the design is augmented by including an additional observation. From an operational point of view, the average decision time is a highly relevant metric and it is closely related to the specific sampling frequency of the process. Indeed, to allow for a timely instance selection, the decision time should be strictly lower than the expiry date of the unlabeled data point, which is given by the time window where it is possible to query its label.
Fig. 4 Percentage difference in RMSE between random sampling and the active learning methods, using α=1% (50 simulations).
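One way to keep the decision step cheap is to cache $(Z^\top Z)^{-1}$ and refresh it with a rank-one update only when a point is added. The sketch below uses the Sherman-Morrison formula for this update and a simple wall-clock measurement of the average decision time; both are illustrative assumptions, not a description of how the timings in Table 1 were obtained, and the function name and synthetic data are hypothetical.

```python
import time
import numpy as np

def sherman_morrison_update(A_inv, z):
    """Inverse of (A + z z') given A_inv = A^{-1}, with A symmetric."""
    Az = A_inv @ z
    return A_inv - np.outer(Az, Az) / (1.0 + z @ Az)

rng = np.random.default_rng(0)
p = 5
Z = rng.normal(size=(20, p))           # whitened labeled design (synthetic)
A_inv = np.linalg.inv(Z.T @ Z)         # cached inverse, reused across decisions

# Average decision time: only a quadratic form is evaluated per point
stream = rng.normal(size=(2000, p))
start = time.perf_counter()
upvs = [z @ A_inv @ z for z in stream]
avg_decision_time = (time.perf_counter() - start) / len(stream)

# When a point is selected, update the cached inverse in O(p^2)
z_new = stream[0]
A_inv_updated = sherman_morrison_update(A_inv, z_new)

# Check against recomputing the inverse on the augmented design
Z_aug = np.vstack([Z, z_new])
print(np.allclose(A_inv_updated, np.linalg.inv(Z_aug.T @ Z_aug)))
```

In this setup the per-point cost is one matrix-vector product, and the more expensive inversion is avoided even when the design grows.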

Tennessee Eastman Process
The Tennessee Eastman Process (TEP) is a commonly used benchmark in industrial and chemical engineering research and it has been thoroughly investigated in terms of process dynamics and control [38][39][40][41][42]. Recently, it has also been used to validate active learning or soft sensor modeling approaches [43][44][45][46][47]. It was initially published in 1993 [48] but since then it has been further developed and improved. For this study, we used a recently released MATLAB simulator to generate the data [49,50]. We generated 50 datasets with the process running in normal operating conditions, using a sampling rate of approximately 1 minute. Figure 5 depicts the TEP flowchart, which shows that the process is primarily composed of a reactor, a product condenser and separator, a stripper, and a compressor.
Fig. 5 The TEP piping and instrumentation diagram [49].
The TEP, like many other industrial processes, includes some easy-to-measure process variables whose real value can easily be monitored online, and some hard-to-measure variables, which are difficult to track during routine operations. Data-driven soft sensors are often developed to predict the latter in real-time. However, training regression models frequently necessitates a large number of labeled examples, and conducting quality inspections on chemical products may be costly and time-consuming. For this reason, optimizing the sampling strategy using active learning is highly desirable. The 16 process variables shown in Table 2 are often used as predictors for the hard-to-measure process variables when testing active learning or soft sensor modeling approaches on the TEP. In most cases, the response variable is one of the composition measurements, such as the purge or product streams [43,45,46]. In this work, we selected two purge streams (Stream 9A and Stream 9E) and two product streams (Stream 11D and Stream 11E) as the response to be predicted using the easy-to-measure variables.
Table 2 Variables of the TEP used as predictors in the regression models.

| Number | Process Variable | Code |
|---|---|---|
| 1 | A feed | XMEAS1 |
| 2 | D feed | XMEAS2 |
| 3 | E feed | XMEAS3 |
| 4 | A and C feed | XMEAS4 |
| 5 | Recycle flow | XMEAS5 |
| 6 | Reactor feed rate | XMEAS6 |
| 7 | Reactor temperature | XMEAS9 |
| 8 | Purge rate | XMEAS10 |
| 9 | Product separator temperature | XMEAS11 |
| 10 | Product separator pressure | XMEAS13 |
| 11 | Product separator underflow | XMEAS14 |
| 12 | Stripper pressure | XMEAS16 |
| 13 | Stripper temperature | XMEAS18 |
| 14 | Separator steam flow | XMEAS19 |
| 15 | Reactor cooling water outlet temperature | XMEAS21 |
| 16 | Separator cooling water outlet temperature | XMEAS22 |

As in the case of the numerical simulations, 50 datasets have been generated, and the average RMSE results are presented in the learning curve plots in Figures 6 and 7. Most of the experimental parameters correspond to the ones used in the numerical study. The number of observations allocated to the first training set is equal to p + 2, which in this case corresponds to 18. The warm-up length w is equal to 500 and the budget b is set to 50. The main difference from the models used in Section 4.1 is that, in this case, all the models include the intercept term. We can see in Figure 6 how the results obtained in Section 4.1 are still valid with data coming from a realistic industrial process simulator. Indeed, both the random and norm-thresholding approaches are outperformed by the proposed strategy. With regard to the level of α, the behavior observed in the numerical study does not seem to be altered and, as the threshold is raised, the performance gap between random sampling and active learning strategies widens.
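The quantity plotted in the learning curves can be sketched as follows. This is a minimal illustration under two assumptions that the text does not confirm: the learning curves are first averaged over the simulated datasets, and the percentage difference is then computed as 100 × (RMSE_random − RMSE_active) / RMSE_random, so that positive values favor the active learning strategy. The function names and the numbers in the usage example are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted responses."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mean_curve(curves):
    """Average several learning curves (one list of RMSE values per dataset) step by step."""
    n = len(curves)
    return [sum(step) / n for step in zip(*curves)]

def percent_difference(rmse_random, rmse_active):
    """Percentage by which the active-learning RMSE lies below the random-sampling RMSE."""
    return [100.0 * (r - a) / r for r, a in zip(rmse_random, rmse_active)]

if __name__ == "__main__":
    # Two hypothetical datasets, three learning steps each.
    random_curves = [[2.0, 1.8, 1.6], [2.2, 2.0, 1.8]]
    active_curves = [[2.0, 1.5, 1.2], [2.2, 1.7, 1.3]]
    gap = percent_difference(mean_curve(random_curves), mean_curve(active_curves))
    print([round(g, 1) for g in gap])
```

A gap that grows with the number of learning steps corresponds to a widening distance between the random and active learning curves.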
Fig. 6 Percentage difference in RMSE between random sampling and the active learning methods, using α=10% (50 simulations).
The plots in Figure 8 show the residuals for the first composition measurement analyzed, stream A of the purge. For illustrative purposes, the residuals refer to a smaller test set, composed of 100 observations. The first plot (a) shows the residuals obtained with the initial random design, which is common to all the compared approaches. The remaining plots (b-d) illustrate the residuals obtained after five learning steps with each strategy. In general, we can see how the predictive performance improves when more observations are included in the design. However, the predictions obtained with the proposed strategy are markedly better than the ones obtained with random sampling and norm-thresholding. Indeed, the RMSE obtained with the fifth model using CDO is 55 percent lower than the RMSE obtained with random sampling, and 23 percent lower than the RMSE obtained with the alternative active learning scheme. Finally, the improvement of CDO from the initial RMSE is higher than 65 percent. These results show that a simple linear regression model fitted on a small training set can achieve compelling prediction results when the labeled examples are appropriately selected. This is true even when testing our approach on data from the TEP, which is characterized by highly nonlinear relationships.
Figure 9 shows the predictions obtained for stream D of the product. In this case, to offer an additional view, we compared the models obtained after 10 learning steps. It can be seen that the behavior of the different schemes follows the same trend observed in Figure 8. Indeed, after 10 iterations, the RMSE obtained with CDO is 18 percent lower than the one obtained by norm-thresholding and 30 percent lower than the one obtained with random sampling.
From the initial design, the RMSE is reduced by more than 60 percent with CDO.
Fig. 9 Residuals of the Stream 11D predictions: with the initial training set (a) and after augmenting the design with 10 additional labeled examples with the different methods (b-d) (one simulation with α = 1%).

Conclusion
In many industrial processes and real-life applications, data is often abundant only in an unlabeled form. Moreover, the prohibitive cost of quality inspections and the time required by manual annotation make it infeasible to label each data point with its quality characteristic. In these cases, active learning can significantly improve the predictive performance of regression models by smartly selecting the instances to include in the training set. In situations where many observations are sequentially processed, it is necessary to provide a real-time sampling strategy for selecting the most informative instances. In this paper, we propose an optimal strategy for performing stream-based active learning with linear regression models. Two case studies, one using numerical simulations and the other using the TEP, show that the proposed approach offers improved predictive performance and reduces the prediction error faster.