tsrobprep - an R package for robust preprocessing of time series data

Data cleaning is a crucial part of every data analysis exercise. Yet, the currently available R packages do not provide fast and robust methods for the cleaning and preparation of time series data. The open-source package tsrobprep introduces efficient methods for handling missing values and outliers using model-based approaches. For data imputation a probabilistic replacement model is proposed, which may consist of autoregressive components and external inputs. For outlier detection a clustering algorithm based on finite mixture modelling is introduced, which considers time series properties in terms of the gradient and the underlying seasonality as features. The procedure returns, for each observation, the probability of being an outlier as well as the specific cause of the outlier assignment in terms of the provided feature space. The methods are robust and fully tunable. Moreover, the auto_data_cleaning function carries out the entire data preprocessing in a single step, without comprehensive tuning, while still providing suitable results. The primary motivation of the package is the preprocessing of energy system data. We present applications to electricity load, wind and solar power data.


Introduction
The increasing availability of high-frequency data measured, recorded and processed in real time leverages the potential benefit of data analytics in day-to-day business. However, as the data is usually collected by sensors and real measurement stations, measurement and transmission failures are inevitable, so that data preprocessing is required. Focusing on energy systems, two aspects are crucial: firstly, the accrued data is often processed immediately, so that fast and automatic algorithms are needed. Secondly, the data is usually unlabelled, so that the detection requires purely data-driven approaches. Hence, data cleaning and preprocessing require an automatic, robust and mainly data-driven procedure, which is easily tunable and applicable in real time to provide verified data for a safe operation of the energy system.
In the literature several R packages are available, which address the task of data cleaning and preparation. On the one hand, there are several packages specifically designed for dealing with missing values [1]. Here, the packages are mainly focused either on data imputation in multivariate data by utilizing inter-attribute correlations as done in Amelia [2], mice [3] and VIM [4], or on univariate data by employing time dependencies as done in imputeTS [5], zoo [6] and forecast [7]. On the other hand, there are several packages specifically designed for dealing with outlying data. Here, among others HDoutliers [8], mvoutlier [9], and tsoutliers [10] can be named. However, packages combining both outlier detection and data imputation methods for time series data in a robust and efficient way, as implemented e.g. in forecast [7] and imputeFin [11] for univariate data, are rare.
The presented package tsrobprep specializes in dealing with outliers and missing values in time series data in a robust and easily applicable way. Its core functions are an algorithm for modelling missing values and an algorithm for detecting outliers. To enhance usability, the package comes with three auxiliary functions: a robust decomposition algorithm, an algorithm to impute the modelled values, and a wrapper function to apply the aforementioned functions at once. The first core function, for modelling missing values, is based on a two-step approach, which can handle autoregressive effects and given external inputs. The second core function, for outlier detection, is based on a feature-based clustering approach with finite (Gaussian) mixture models. Both core functions provide a probabilistic framework, such that for data imputation a whole distribution can be modelled and for outlier detection the probability of being an outlying data point can be provided. Focused on robustness, efficiency and usability, the package tsrobprep is well suited for cleaning various time series data sets in univariate and multivariate settings, as done e.g. in [12]. The package is available online in the CRAN repository.
The paper is structured as follows. Section 2 describes the utilized methods in detail and Section 3 presents the usability and data preprocessing potential. Section 4 compares the methods to competitors from literature. Finally, Section 5 discusses the impact of the package and concludes the paper.

Software description
For dealing with missing values, the model_missing_data function is proposed for handling univariate and multivariate time series data. It takes the time series data to be cleaned as an argument and returns the modelled missing values to be imputed. The function uses a two-step model-based approach to replace missing values. First, the data is decomposed into trend, seasonal and external input components, as well as their interactions. This is done using the robust_decompose function that is available in the package. Then, on the remainder we utilize the following regression

Target_τ(Y_t) = β_0 + Σ_{l∈L} φ_l Y_{t−l} + Σ_{i∈I} β_i X_{i,t},

where Target_τ is either the expected value or a τ-quantile, Y_t is the time series remainder with the missing data to be replaced, and X_{i,t} are the decomposition components (including external regressors), which in a multivariate framework allow utilizing inter-attribute correlations in addition to time dependencies. Additionally, L is a set of lags for the autoregression, and I is the set of decomposition components. By allowing for positive and negative lags in L, past and forthcoming (if available) values can be utilized for modelling. By default, the set of lags L takes reasonable values, depending on the provided vector of seasonalities S = (S_1, ..., S_K) and the input time series data. Precisely, L contains all lags of a multi-seasonal AR(p+1, p, ..., p) with seasonalities S and p depending on the data size, except those which involve multiple seasonal parameters, together with their negative counterparts. For example, for S = (24, 168) and p = 1, this is L = {1, 2, 24, 25, 168, 169, −1, −2, −24, −25, −168, −169}.
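The construction of the default lag set can be sketched in a few lines of base R. This is an illustration of the rule stated above, not the package's internal code; the function name default_lags is hypothetical.

```r
# Illustrative sketch (not package code): the default lag set L for
# seasonalities S and order p, excluding cross-seasonal terms.
default_lags <- function(S, p) {
  # non-seasonal lags 1, ..., p + 1 plus each seasonal lag s, ..., s + p
  pos <- c(seq_len(p + 1), unlist(lapply(S, function(s) s + 0:p)))
  pos <- sort(unique(pos))
  c(pos, -pos)  # negative lags correspond to forthcoming values
}

default_lags(S = c(24, 168), p = 1)
# 1  2  24  25  168  169  -1  -2  -24  -25  -168  -169
```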
If the user provides a matrix of external inputs, the algorithm allows selecting the most appropriate inputs by user-tunable thresholds based on the linear correlation between the externals and the explained data. To utilize not only inter-attribute information but also autoregressive effects, lags of the externals can additionally be applied for modelling. If the Target is the expected value, the weighted lasso implemented in the glmnet function of the glmnet package [13] can be applied. Otherwise, a quantile regression implemented with the fast Frisch-Newton algorithm in the rq.fit.fnb function of the quantreg package [14] can be applied. Moreover, the algorithm uses recursive replacement by default, which improves the modelling accuracy in the case of larger gaps, but may also cause error accumulation, so we give the user the option to switch it off.
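The two estimation backends correspond to two loss functions: squared error (with a lasso penalty) for the expected value, and the check (pinball) loss for a τ-quantile. A minimal base-R illustration of the check loss minimized by quantile regression:

```r
# Check (pinball) loss for a residual u = y - yhat at quantile level tau;
# for tau = 0.5 it reduces to half the absolute error.
pinball <- function(u, tau) ifelse(u >= 0, tau * u, (tau - 1) * u)

pinball(c(-2, 2), tau = 0.5)  # 1 1
pinball(c(-1, 1), tau = 0.95) # 0.05 0.95: under-predicting a high
                              # quantile is penalized much more strongly
```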
For dealing with outlying data, the second core function detect_outliers is proposed for handling univariate data. It takes the time series data to be cleansed and returns the positions of the detected outliers, the probability of each being an outlying instance, the applied feature matrix and the specific cause of each outlier assignment in terms of the applied features. Through recursive application within the wrapper function auto_data_cleaning, multivariate data sets can be handled as well. The algorithm can be tuned to control robustness, computational time and detection sensitivity. By utilizing features specifically designed for time series data, various patterns can be identified. As features, the gradient, absolute gradient, relative gradient, deterministic seasonal trend, seasonal gradient and absolute seasonal gradient are applied. For example, the absolute gradient tracks two-sided jumps with regard to the closest neighbours; the absolute seasonal gradient works the same way, but applied to the closest neighbours in terms of the provided seasonality S. The relative gradient in turn tracks two-sided jumps in relation to the local seasonal variance to account for moderate spikes in low-volatility periods. By applying an unsupervised learning approach, the procedure is well suited to handle unlabelled and unstructured data. The approach is based on model-based clustering with finite mixture models as outlined in [15]. It is assumed that the multivariate distribution of the feature space is a mixture of G components. The likelihood of the mixture model can be defined as

L(θ_1, ..., θ_G; π_1, ..., π_G | y) = ∏_{t=1}^{n} ∑_{k=1}^{G} π_k f_k(y_t | θ_k),

where y is the observation vector (y_1, ..., y_n), f_k is the density of the k-th component (assumed to be multivariate normal (Gaussian)), θ_k are the parameters of the k-th component in the mixture and π_k is the probability that an observation belongs to the k-th component. Note that ∑_{k=1}^{G} π_k = 1. Within the mixture, one component is intended to represent outlying data.
This component is introduced by artificially augmenting the number of outliers in the data set by blowing up the covariance matrix of the mixture. For modelling the data as a Gaussian finite mixture with different covariance structures and numbers of mixture components, the Mclust function of the mclust package [16] is applied. For estimation, the expectation-maximization algorithm initialized by hierarchical model-based agglomerative clustering is used, as outlined in [15]. To reduce the computational time and increase the robustness of the method, the estimation is carried out repeatedly on varying data subsets. To provide the specific cause of an outlier assignment, each observation that has been identified as an outlying data point is shifted to the corresponding feature mean and multivariate feature combination mean, respectively. Then, the probability of being outlying data is recomputed, which allows naming a specific feature, or combination of features, that caused the outlier assignment.
The auxiliary function impute_modelled_data eases the usage of the package. It takes as argument a tsrobprep object, i.e. the object returned by the model_missing_data function, imputes the modelled values for the given quantiles τ and returns a list of matrices (each element of the list corresponds to one quantile). The auto_data_cleaning function combines everything described above. First, it replaces missing values (if there are any) with a median forecast. This step is required as the outlier detection algorithm does not allow for missing values. Second, unreliable outliers are detected, so that in the third step the detected outliers and the initially modelled missing values are modelled and imputed in a single run. The repeated modelling of the initially modelled missing values is reasonable, as potential outliers might have negatively influenced the modelling process in the first run. The workflow of the function is presented in Figure 1.

Illustrative Examples
We present the potential of the package by using its core functions on three time series data sets: the electricity load in Great Britain, the solar generation in Germany (TransnetBW control zone), and the wind onshore generation in Italy (Sardinian control zone). All these data sets span the range from January 01, 2015 to June 30, 2020. The first time series is half-hourly, i.e. contains 48 values per day, the second is quarter-hourly, i.e. contains 96 values per day, and the third is hourly, i.e. contains 24 values per day. The utilized data is freely available at the ENTSO-E Transparency platform [17]. The considered time series are plotted over time in Figure 2. We can clearly see that the time series of electricity load in Great Britain exhibits many outliers that are very likely unreliable. The other data sets do not seem to contain such values, at least at first sight. All data sets contain missing values, which is hard to notice in Figure 2 due to the high frequency of the data.
For a better illustration, a subset of each series containing either missing values, outliers or both is depicted in Figure 3. The cleansing of the named series was done with the auto_data_cleaning function. The first time series, the electricity load in Great Britain, and the third time series, the wind onshore generation in Italy, were handled in a univariate setting without external inputs, with S = (48, 7 · 48) and S = 24, respectively. To account for the unreliable zeros during the day shown in Figure 3, the consider.as.missings parameter has been used to recognize zeros as missing values. The second series, the solar generation in Germany, has been handled with S = 96 in a multivariate setting, such that external regressors as plotted in Figure 4 were utilized for data modelling and imputation. Additionally, we set the whole.period.missing.only parameter to TRUE to consider zeros as missings only if the whole day consists of zeros, and we set replace.recursively to FALSE to avoid error accumulation during night hours. All other arguments were set to their default values. The models were estimated for quantiles τ ∈ {0.025, 0.05, 0.25, 0.5, 0.75, 0.95, 0.975}.
The results provided in Figure 3 are very promising. This holds in particular for the first data set, where only autoregressive effects were used. It can be seen that the outliers were smoothly replaced, i.e. not only the outliers themselves were replaced, but also their direct neighbours. The cleaning of the other two data sets is also satisfactory and the imputed values seem reasonable.
Although the algorithm works well and robustly, it has to be noted that specific causes of an anomaly (e.g. technical failures, true behavioural issues or natural phenomena) are not always clearly distinguishable and detectable. An example of such a situation is presented in Figure 5. Here, the algorithm is able to detect the anomalous negative spike. However, due to the fact that this spike corresponds to a partial solar eclipse, the detection is considered a false positive. By providing external inputs, i.e. the generation in the other German control zones, the algorithm is able to correct this misclassification by imputing very similar values. However, this example shows that even though the provided package is robust, supervision by humans, in particular when dealing with outliers and extreme values, is recommended. Figure 5: A drop in the solar generation in Germany (TransnetBW) in 2015 that may seem unreliable while it was caused by a natural phenomenon, i.e. a partial solar eclipse. Red lines indicate 95% confidence intervals, orange 90%, green 50%, and blue indicates the median.

Performance evaluation against competitors
To verify the potential of the proposed package, its functions are compared and evaluated against benchmarks from the literature. The performance of model_missing_data is evaluated in an imputation study. Here, the electricity load in Great Britain ranging from January 01, 2019 to December 31, 2019 is used. To allow a performance evaluation, the named data set is manipulated by replacing existing values with missing entries. Observations are removed in continuous blocks, assuming that values are missing at random (MAR) in such a way that the likelihood of an observation being missing depends on whether observations closer in time have also been missing. This sampling approach is implemented using a log-normal distribution parametrized with µ = 12 and σ = 6. To account for a varying number of missing values, shares of 0.01, 0.05, 0.1, 0.2 and 0.5 missing values are considered. Each simulation is repeated 20 times to allow a robust evaluation. As most methods in the literature are restricted to univariate time series data, the imputation study is implemented in a univariate setting. As competitors, the impute_AR1_t function of the imputeFin package [11], the na.StructTS, na.locf and na.aggregate functions of the zoo package [6], the na_ma, na_kalman and na_interpolation functions of the imputeTS package [5], the prophet function of the prophet package [18], as well as the ts_impute_vec function of the timetk package [19] are applied. All methods are provided with default settings and named in accordance with the package and function names. As evaluation measures, the mean absolute error (MAE) and the computational time (single-core on an Intel Core i5-8250U CPU @ 1.80GHz) are used. The results are shown in Table 1. It can be seen that the proposed function model_missing_data dominates all considered benchmarks for each simulation in terms of the MAE. In terms of the computational time, the model_missing_data function performs moderately in comparison to the benchmarks.
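The block-wise removal of observations can be sketched in base R as follows. This is an illustration only: the function remove_blocks and its meanlog/sdlog parametrization are hypothetical and do not reproduce the study's exact log-normal parametrization (µ = 12, σ = 6).

```r
# Illustrative sketch: remove observations in contiguous blocks, with
# block lengths drawn from a log-normal distribution, until a target
# share of the series is missing.
remove_blocks <- function(x, share, meanlog = 2, sdlog = 1) {
  n <- length(x)
  target <- ceiling(share * n)
  while (sum(is.na(x)) < target) {
    len   <- min(n, max(1L, round(rlnorm(1, meanlog, sdlog))))
    start <- sample.int(n - len + 1L, 1L)
    x[start:(start + len - 1L)] <- NA  # drop one contiguous block
  }
  x
}

set.seed(42)
y <- remove_blocks(sin(seq_len(17520) * pi / 24), share = 0.1)
mean(is.na(y))  # roughly 0.1, slight overshoot from the last block
```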
Focusing on the second core function detect_outliers and the wrapper function auto_data_cleaning, a confident evaluation is not straightforward, as the definition of outliers cannot be easily narrowed down in a generally valid way. Hence, the performance is presented illustratively by comparing the cleansed data set returned by the auto_data_cleaning function with the results of two benchmark procedures from the literature, namely the tsclean function of the forecast package [7] and the impute_AR1_t function of the imputeFin package [11]. As data set, the electricity load in Great Britain without any manipulation, ranging from January 01, 2018 to December 31, 2019, is used. Two examples are shown in Figure 6. It is observable that the proposed procedure auto_data_cleaning detects all three anomalies in the data. The outlying instances on the left side of the figure are clearly technical failures, and the outlying data on the right side of the figure is at least questionable. Overall, the results in Table 1 and Figure 6 are promising and underline the suitability of the package, especially for energy system data.

Impact and conclusions
So far, the available R packages for time series data preprocessing are based on either simple but not robust, or complex and very slow algorithms. The introduction of the tsrobprep package fills this gap by providing robust and efficient methods to replace missing values and handle unreliable outliers. Moreover, by providing the function auto_data_cleaning, the package is easy to apply and provides satisfactory results already with the default parameter settings. More advanced users can take advantage of the large set of tuning parameters to further adapt the functions to their needs.
The software has already been used for academic research [12] and in industrial projects in the area of power plant electricity generation. Moreover, the software is written in such a manner that it can be utilized on a daily basis with almost no effort.
The package is open-source, available online in the CRAN repository, well-documented and constantly developed. The presented examples have shown that the package might be a powerful tool in various time series data preprocessing exercises.

Conflict of Interest
No conflict of interest exists: We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.