FAILURE RATE REGRESSION MODEL BUILDING FROM AGGREGATED DATA USING KERNEL-BASED MACHINE LEARNING METHODS

The problem of building a regression model of equipment failure rate is considered, using datasets that contain the numbers of failures of recoverable systems and measurements of the technological and operational factors affecting the reliability of the production system. This problem is important for choosing an optimal strategy for preventive maintenance and restoration of process equipment, which in turn significantly affects the efficiency of the production management system. From a practical point of view, of greatest interest is the development of methods for building regression models that assess the impact on the failure rate of the various technological and operational factors monitored during system operation. The usual approach to constructing regression models involves preselecting the model structure in the form of a parameterized functional relationship between the failure rate and the affecting technological variables, followed by statistical estimation of the unknown model parameters or training of the model on datasets of measured covariates and failures. The main problem lies precisely in the choice of the model structure, whose complexity should correspond to the amount of data available for training; in failure rate modeling this choice is greatly complicated by the lack of a priori information about the dependence of the failure rate on the affecting variables. In this work, the problem is solved using machine learning methods, namely kernel ridge regression, which makes it possible to approximate complex nonlinear dependences of the equipment failure rate on technological factors effectively, without the need to preselect the model structure. Preliminary aggregation of the data by a combination of factor and cluster analysis can significantly simplify the model structure. The proposed technique is illustrated by solving a practical problem of building a failure rate model for semiconductor production equipment based on real data.

Introduction. Among the problems of production management, an important place is occupied by preventive maintenance, which is an effective tool for reducing the costs associated with failures and interruptions in production, with repair and restoration, and with maintenance [1]. Preventive maintenance makes it possible to determine when maintenance should be carried out by estimating and predicting the degradation state of production items and the expected time of failure occurrence. Preventive maintenance also plays an important role in ensuring the dependability of production, information, and control systems.
In preventive maintenance implementation, the failure rate is widely used as an important metric for evaluating equipment reliability, and failure rate prediction is an effective aid to maintenance decision making [2].
The problem of estimating and predicting the failure rate using statistical data on failure streams has been studied in detail in the literature. From a practical point of view, of greatest interest is the development of methods for failure rate prediction using models that take into account the influence of the technological and operational factors controlled during system operation. This approach is implemented by building multivariate regression models that estimate the relationship between the affecting factors (predictors or covariates) and the failure rate. A common approach to building regression models involves preselecting the model structure as a parameterized functional relationship between the output variable and the affecting variables, followed by statistical estimation of the unknown model parameters or model training on datasets of measured covariates and outputs. The main problem lies precisely in the choice of the model structure, whose complexity should correspond to the amount of data available for training; in the problem under consideration this choice is greatly complicated by the lack of a priori information about the dependence of the failure rate on the affecting factors. In this paper, the problem of failure rate regression model building is solved in a machine learning framework using the kernel regression method, which makes it possible to approximate nonlinear dependences of the equipment failure rate on technological factors effectively. The advantage of the proposed approach is that complex multivariable dependences of the failure rate on the affecting factors can be approximated without preselecting the model structure, whose complexity under this approach is determined only by the amount of training data available.
Preliminary data aggregation using factor and cluster analysis can significantly simplify the regression model structure. At the same time, the learning-on-clusters method made it possible to significantly reduce the complexity of the resulting model and to ensure its smoothing, preventing the effect of "spatial oscillation".
The proposed technique is illustrated by solving a practical problem of failure rate model building for a semiconductor production process based on real data.
State of the art. The problems of failure rate model building have been considered in detail in the literature; the main attention has been paid to estimating constant or slowly time-varying failure rates using only failure statistics, to building models of the change of the failure rate over time, and to predicting its trend using time series models [3].
A further development of this approach was the creation of methods for failure rate prediction using models that take into account the various affecting technological and operational factors controlled during production system operation. This approach is implemented by building regression models of the failure rate, the foundations of which were laid in the fundamental work [4]; methods for building failure rate regression models were subsequently studied in [5, 6]. Regression analysis provides a set of statistical methods to estimate the relationship between the affecting (explanatory) variables and the failure rate as the model outcome. Explanatory variables (predictors or covariates) are variables that may affect a response variable, which in the reliability context is the failure rate.
The usual approach to building regression models includes a preliminary selection of the model structure in the form of a parameterized functional dependence between the output variable and the explanatory variables (covariates), followed by statistical estimation of the unknown model parameters or model training on datasets of measured covariates and failures. The main problem lies precisely in the choice of the model structure, whose complexity must be consistent with the amount of data available for training; in the problem under consideration this choice is greatly complicated by the lack of a priori information about the dependence of the failure rate on the affecting variables. Recently, machine learning methods and neural networks have come into use for building regression models [7]. In this paper, to solve this problem, we use the kernel-based machine learning method [8], developed for building high-dimensional multivariate nonlinear regression models.
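As an illustration of the kernel-based approach discussed above, the following minimal sketch (with synthetic data and assumed hyperparameters, not the study's actual dataset) fits a kernel ridge regression model without any preselected model structure:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Synthetic stand-in data: a nonlinear "failure rate" depending on two covariates.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.exp(-X[:, 0] ** 2) + 0.5 * np.sin(3 * X[:, 1]) + rng.normal(0, 0.05, 200)

# Kernel ridge regression: no explicit model structure is preselected;
# complexity is controlled only by the ridge strength and the kernel width.
model = KernelRidge(kernel="rbf", alpha=0.1, gamma=1.0)
model.fit(X, y)
mae = np.mean(np.abs(y - model.predict(X)))
print(mae)
```

The model's complexity grows with the training set, since one kernel function is centred on every training point.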
Problem statement. Consider the problem of building a regression model of the failure rate using a dataset containing information on the numbers of failures of recoverable systems in certain time intervals and the results of measurements of the technological and operational factors affecting system reliability. A set of restored items is considered; on each time interval n of length Δt, the number of failures f_n, n = 0, 1, 2, ..., is observed. At the same time, a set of technological and/or operational factors (covariates in the terminology of reliability regression) is measured. It is assumed that within each observation interval the relative variations of the affecting factors are insignificant, so that with sufficient accuracy they can be replaced by their average values over the interval and treated as a feature vector x_n. We interpret the number of failures y_n at the current observation interval as the result of a "measurement" of the failure rate for the corresponding value of the feature vector at this interval. In this way, the problem of failure rate model building is reduced to restoring the unknown dependence

λ(x) ≈ λ̂(w, x) = w^T φ(x),   (1)

where φ(x) is a vector of coordinate functions and w is a vector of unknown parameters. The parameter estimate can be obtained as the solution of the following optimization problem:

w* = arg min_w Σ_{n=1..N} (y_n − w^T φ(x_n))² + γ ‖w‖²,   (5)

which is the well-known regularized least-squares (ridge regression) problem.
Materials and methods. In this regard, to solve the problem of failure rate model building, it is advisable to use the kernel-based machine learning method [8], which removes the problem of choosing a system of coordinate functions. Let us choose coordinate functions such that their inner products define a kernel K(x, x′) = φ(x)^T φ(x′). Then we obtain the solution of optimization problem (5) in the form

λ̂(x) = Σ_{n=1..N} α_n K(x, x_n) = κ_x^T α,   α = (K + γ I)^{-1} y,

where κ_x = (K(x, x_1), ..., K(x, x_N))^T, K is the Gram matrix with elements K_{ij} = K(x_i, x_j), and y = (y_1, ..., y_N)^T. The kernel approach allows approximating complex multidimensional dependencies; however, since the number of kernels used is determined by the number of points in the training dataset, the restored dependence may exhibit the property of "spatial oscillation". To prevent this undesirable effect, it is advisable to carry out additional smoothing of the restored dependence on the data cloud.
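A minimal numerical sketch of the dual-form kernel ridge solution α = (K + γI)^{-1}y, on synthetic data with an assumed RBF kernel, cross-checked against scikit-learn's KernelRidge (which solves the same closed form):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))   # feature vectors x_n
y = rng.normal(size=50)        # observed outputs y_n
gam = 0.5                      # regularization strength

# Dual-form solution: alpha = (K + gam*I)^{-1} y, prediction = kappa_x^T alpha
K = rbf_kernel(X, X, gamma=1.0)
alpha = np.linalg.solve(K + gam * np.eye(len(X)), y)

def predict(x_new):
    return rbf_kernel(x_new, X, gamma=1.0) @ alpha

# Cross-check against scikit-learn's implementation of the same estimator
kr = KernelRidge(kernel="rbf", alpha=gam, gamma=1.0).fit(X, y)
print(np.allclose(predict(X), kr.predict(X)))  # True
```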
To do this, we use an additional regularization of optimization problem (5), based on a graph model of the data [9] defined over graph nodes [10] and reflecting the geometric structure of the data cloud. The additional regularization term, smoothing the evaluated function λ̂(w, x) on the training dataset with edge weights v_ij, is the following:

Σ_{i,j} v_ij (λ̂(w, x_i) − λ̂(w, x_j))².   (10)

Thus, graph regularizer (10) ensures that the restored function λ̂(w, x) has similar values for close data points from the training dataset.
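The edge-weight smoothness penalty over the data graph equals the quadratic form of the graph Laplacian, which can be checked numerically; the k-nearest-neighbour graph and weights below are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))

# Illustrative edge weights v_ij: a symmetrized k-nearest-neighbour graph
V = kneighbors_graph(X, n_neighbors=5, mode="connectivity").toarray()
V = np.maximum(V, V.T)
L = np.diag(V.sum(axis=1)) - V   # graph Laplacian

f = np.sin(X[:, 0])              # some function evaluated on the data cloud
# Smoothness penalty over graph edges equals the Laplacian quadratic form:
# 0.5 * sum_ij v_ij (f_i - f_j)^2 == f^T L f
pen_edges = 0.5 * np.sum(V * (f[:, None] - f[None, :]) ** 2)
pen_quad = f @ L @ f
print(np.isclose(pen_edges, pen_quad))  # True
```

The Laplacian form is what makes the regularizer a quadratic (and hence tractable) addition to the ridge objective.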
In accordance with the approach of [11], let us take the manifold regularization term as a data-dependent semi-norm. To construct the smoothing regularizer, we use the data-dependent kernels corresponding to this semi-norm, derived in [11]:

K̃(x, x′) = K(x, x′) − κ_x^T (I + M K)^{-1} M κ_{x′},

where κ_x = (K(x, x_1), ..., K(x, x_N))^T, K is the Gram matrix over the training points, and M is the matrix of the graph semi-norm built from the edge weights v_ij. Such a choice of kernel functions implements the smoothness assumption with respect to the geometric structure of the training dataset.
The obtained transformed kernels can be used to modify the failure rate kernel-based regression model:

λ̂(x) = κ̃_x^T (K̃ + γ I)^{-1} y,   (15)

where κ̃_x = (K̃(x, x_1), ..., K̃(x, x_N))^T and K̃ is the transformed Gram matrix. Thus, dependence (15) describes the obtained smoothed kernel-based regression model that determines the relationship between the failure rate and the affecting factors.
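An end-to-end sketch of the smoothed kernel model, with a synthetic dataset, an assumed k-NN graph for the edge weights v_ij, and assumed values of the smoothing and ridge parameters (none of these are the study's actual settings):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(4)
n = 120
X = rng.uniform(-1, 1, size=(n, 2))
y = np.exp(-3 * (X ** 2).sum(axis=1)) + rng.normal(0, 0.05, n)

K = rbf_kernel(X, X, gamma=2.0)                  # base kernel Gram matrix
V = kneighbors_graph(X, n_neighbors=6, mode="connectivity").toarray()
V = np.maximum(V, V.T)
L = np.diag(V.sum(axis=1)) - V                   # graph Laplacian
mu, gam = 0.5, 0.05                              # assumed smoothing / ridge weights

# Transformed (data-dependent) kernel matrix: K~ = K - K (I + M K)^{-1} M K
M = mu * L
K_t = K - K @ np.linalg.solve(np.eye(n) + M @ K, M @ K)

# Smoothed model: predictions on the training cloud via (K~ + gam*I)^{-1} y
alpha = np.linalg.solve(K_t + gam * np.eye(n), y)
mae = np.mean(np.abs(y - K_t @ alpha))
print(mae)
```

The transformed matrix K_t remains symmetric, so it can be plugged into any kernel ridge solver in place of K.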
Case study. The proposed technique was investigated using real data from the semiconductor manufacturing industry, provided by Michael McCann and Adrian Johnston [12]. The study uses the Semiconductor Manufacturing (SECOM) dataset available from the University of California, Irvine Machine Learning Repository (UCI Machine Learning Repository: SECOM Data Set). The SECOM dataset contains data collected during a semiconductor manufacturing process and represents a series of monitoring signals from sensors at process measurement points. Each example of the SECOM dataset represents a single production entity with its measured features (affecting technological factors); the labels represent a simple pass/fail yield in production quality control (Yield = −1: pass, no failure; Yield = +1: fail), together with an associated date-time stamp.
The training dataset contains 1567 examples, each with 590 measured features, and only 104 of them represent yield failures.
Two datasets are provided [12]:
- secom_data.csv contains time series of feature measurements from sensors;
- secom_labels.csv contains time units and, for each time unit, the output result (yield pass/fail).
Calculations for the numerical experiments were carried out using the free machine-learning library scikit-learn (https://scikit-learn.org/stable/).
The large number of observations and their heterogeneous structure and statistical properties predetermine the need for preliminary data cleaning and preprocessing.
Some results of the preliminary analysis of the SECOM data are illustrated in fig. 1. The proportion of failures in the total set of observations (fig. 1, a) indicates that the dataset is unbalanced: only 7% of the items are "failed". Analysis of the statistics of the failure distribution in fixed time intervals (fig. 1, b) confirms the hypothesis of a Poisson distribution and supports the assumption that the failure rate is constant within certain time intervals.
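The Poisson hypothesis for interval failure counts can be checked with a chi-square goodness-of-fit test; the counts below are synthetic stand-ins for the SECOM interval statistics, and the rate value is an assumption:

```python
import numpy as np
from scipy import stats

# Synthetic failure counts per fixed-length interval (stand-in for real data)
rng = np.random.default_rng(5)
counts = rng.poisson(lam=2.0, size=200)

lam_hat = counts.mean()  # maximum-likelihood estimate of the Poisson rate
k = 6                    # pool the tail into a single ">= k" bin
observed = np.array([(counts == i).sum() for i in range(k)] + [(counts >= k).sum()])
probs = np.append(stats.poisson.pmf(np.arange(k), lam_hat),
                  stats.poisson.sf(k - 1, lam_hat))   # probs sum to 1
# One extra degree of freedom is lost to estimating lambda (ddof=1)
chi2, p = stats.chisquare(observed, probs * counts.size, ddof=1)
print(chi2, p)
```

A large p-value means the Poisson hypothesis is not rejected at the chosen significance level.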
Analysis of the empirical distributions of the affecting technological factors showed that in most cases the features are distributed approximately normally; for example, histograms of the measurements from the first 9 sensors are presented in fig. 2. A heatmap of the feature covariance matrix is presented in fig. 3; the number of factor pairs having positive or negative covariance values of more than 70% is 262. At the data-cleaning stage, pairs of features with a covariance of more than 70% were removed; after this manipulation, a total of 212 features remain.
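The removal of strongly inter-correlated features can be sketched as follows; the features are synthetic, the column names are illustrative, and the 0.7 threshold follows the text:

```python
import numpy as np
import pandas as pd

# Synthetic sensor features: s4..s7 are near-duplicates of s0..s3
rng = np.random.default_rng(6)
base = rng.normal(size=(300, 4))
X = pd.DataFrame(np.hstack([base, base + rng.normal(0, 0.05, size=(300, 4))]),
                 columns=[f"s{i}" for i in range(8)])

# Drop one feature from every pair whose absolute correlation exceeds 0.7
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.7).any()]
X_clean = X.drop(columns=to_drop)
print(X_clean.shape[1])  # 4
```

Using only the upper triangle of the correlation matrix ensures that exactly one feature of each highly correlated pair is dropped, not both.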

Fig. 3. Features covariance matrix heatmap
Preliminary analysis showed that the training dataset has significant redundancy and contains many collinear feature vectors. Thus, there is a need to aggregate the data, which is a prerequisite for reducing model complexity and training time and for increasing model performance.
Data aggregation was carried out in three stages. At the first stage, the entire set of observations was divided into successive intervals; for each interval the observed number of failures was recorded, and the feature vectors were replaced by their average values over the interval. At the second stage, the number of features was reduced by factor analysis, which performs a linear transformation of the original feature vectors into a space of hidden factors of smaller dimension. The number of hidden factors aggregating the original features was chosen experimentally as 5; the structure of correlations between the initial features and the found hidden factors is illustrated by the corresponding covariance matrix shown in fig. 4. At the third stage, cluster analysis was applied in the hidden-factor space, and the model was trained on the obtained clusters. As a performance metric for the built regression models, the mean absolute error (MAE) was chosen. The results of the computational experiments show that the cluster-based smoothed model demonstrates a slight decrease in accuracy (MAE = 0.0211 compared to MAE = 0.0199 for the base model), but avoids "spatial oscillation" as well as overfitting, and is more robust to outliers and noise.
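The factor-analysis and learning-on-clusters stages can be sketched with scikit-learn; the data, the number of clusters, and all hyperparameters here are illustrative assumptions, not the values used in the study:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import FactorAnalysis
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_absolute_error

# Synthetic redundant features driven by 5 hidden factors
rng = np.random.default_rng(7)
n, d = 400, 30
hidden = rng.normal(size=(n, 5))
X = hidden @ rng.normal(size=(5, d)) + 0.1 * rng.normal(size=(n, d))
y = np.exp(-(hidden ** 2).sum(axis=1) / 5) + rng.normal(0, 0.02, n)

# Factor-analysis stage: compress correlated features into 5 hidden factors
F = FactorAnalysis(n_components=5, random_state=0).fit_transform(X)

# Learning-on-clusters stage: train on cluster centres with averaged outputs
km = KMeans(n_clusters=40, n_init=10, random_state=0).fit(F)
y_centres = np.array([y[km.labels_ == c].mean() for c in range(40)])

model = KernelRidge(kernel="rbf", alpha=0.05, gamma=0.1)
model.fit(km.cluster_centers_, y_centres)
mae = mean_absolute_error(y, model.predict(F))
print(mae)
```

Training on cluster centres keeps the number of kernels equal to the number of clusters rather than the number of observations, which is what reduces model complexity and suppresses spatial oscillation.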
Conclusions. The results obtained indicate that the proposed method for kernel-based failure rate regression model building using machine learning techniques provides an opportunity to analyze the influence of a large number of affecting technological and operational factors and to approximate complex dependences of reliability indicators with high accuracy. Further development of the proposed approach is expedient in terms of methods and algorithms for forecasting the failure rate, and the failures themselves, from time series of affecting factor measurements in real time. To build such a predictive time-series model, it is also advisable to use kernel machine learning methods.