Automating the Smoothing of Time Series Data

Modelling requires comparison of model outputs to measurements, for calibration and verification. A key aspect data smoothing is to “filter out” noise. Often, data must be adjusted to a model’s time step (e.g. hourly to daily). For noisy data, LOWESS/LOESS (Locally Weighted Scatterplot Smoothing) is a popular piecewise regression technique. It produces a “smoothed” time series. LOWESS/LOESS is often used to visually assess the relationship between variables. The selection of LOWESS tuning parameters is usually performed on a visual trial and error basis. We investigate the so-called robust AIC (Akaike Information Criteria) for automatic selection of smoothing. Robust Pearson correlation coefficient and mean-squared error are employed to determine the polynomial degree of piecewise regression. The exclusion of outliers is attempted using a Hampel outlier identifier


Introduction
LOWESS is a powerful non parametric technique for fitting a smoothed line for a given data set either through univariate or multivariate smoothing [1]. It implements a regression on a collection of points in a moving range, and weighted according to distance, around abscissa values in order to calculate ordinal values. "LOWESS" and "LOESS" are acronym for "Locally Weighted Scatterplot Smooth" because, for data smoothing, locally weighted regression is used. Furthermore, a robust weight function can be used to compensate for undue influence of extreme points. Differentiation of a regression model depends on the way it is used: a linear polynomial is used for LOWESS whereas a quadratic polynomial is used for LOESS [2]. From the literature review, most authors consider LOWESS/LOESS same, but they are different. LOWESS is derived from term "locally weighted scatterplot smoothing" whereas according to Potts LOESS stands for "Locally Estimated Scatterplot Smoothing" [3].
LO(W)ESS is one of the most widely used method for data smoothing and trend estimation. Its graphical representation helps to visualize the overall trends in a time series and identify times of changes. As an example, in 2012 presidential elections [4], LOWESS fit is used to predict the presidential candidate. Moreover, Niblett [5] used LOWESS fit for the evolution of legal rules. There are many more other examples where LOWESS fit has been used for trend estimation. This is the easiest way to communicate trends especially for the non-technical people.
There are numerous techniques for data smoothing: splines, beziers, kernel and polynomial regression. Local regression provides some attractive features, according to Cleveland [6]. LOESS fit is extremely informative when the data set is very large [7]. Moreover, it is used to solve the problems of precision, noise filtering, and outliers and is known to adapt well to bias problems, as opposed to these other methods. Also, LO(W)ESS is computationally efficient [8]. Our experiments with polynomial regression and Bezier curves show that if we increase the degree of polynomial, it increases undulations in the curve fit being attempted, and we cannot estimate the actual picture. Moreover, increasing degree of polynomial often creates the problem of data over-smoothing. If outliers are present in the dataset, robust LOWESS/LOESS (rLOWESS, rLOESS) procedure is used to overcome the problem of distorted values. The presence of outliers can be detected using a Hampel identifier. It is considered to be the most efficient and robust method for identification of outliers [9]. A data point is considered as an outlier, based on Hample Identifier (HI), if it goes beyond ±3σ where the variance estimate σ is calculated by subtracting median absolute deviation and median. It signals the index of the outlier as the final outcome.
The selection of smoothing parameter, α, is often entirely based on a "repeated trial" basis. Some researchers argue that it should lie between 0.2 and 0.8 while others consider 0.5 as an ideal parameter value [10]. There is no specific technique for selection of the exact value of α. Selection of α may lead to "over-smoothing" or "under-smoothing" of data, does not necessarily provide good information for LO(W)ESS fit. Figure 1 shows a sample LO(W)ESS fit using different smoothing parameters. The LO(W)ESS fit which follow the almost all the data points is called "under smoothing" or "over-fitting" whereas if does not follow the data and produce a smooth line is called "lack of fit" or "under-smoothing".
For calculating smoothed values from the robust method, there are two extra steps; firstly, it requires the calculations of residuals from LOWESS/LOESS and then robust weight function is applied using the bisquare function [11]. Once the regression function values are calculated with flexible weights and polynomial degree, LOESS fit is complete [11]. Robust AIC (Akaike Information Criteria) is used for the selection of the best fitted smoothing parameter, α, and akaike weights are used to evaluate the best selected model [12]. Moreover, we can use the robust Pearson correlation coefficient for the selection of degree of polynomial, λ.
We have noted from the literature that there are nine different methods being used for selection of α in non-parametric regression, but, according to Aydin [13], Improved Akaike Information Criteria (AICc) and Generalized Cross Validation (GCV) are the best methods based on small and large datasets [14,15]. The best model is selected is the one with the smallest AIC score. Tharmaratnam [16] proposed the robust version of AIC which produces promising results for model selection in the presence of outliers compared to non-robust AIC. We will be using AIC M-estimator, robust AIC, for α. The best selected model can be estimated using Akaike weights [12].
In this paper, we are using synthetic as well as real data [17], to illustrate the usefulness of our proposed method.

LOWESS analysis for noisy synthetic linear data
We have generated noisy linear data using (1).
An artificial dataset with 50 data points is generated for the experiment. Normally distributed noise, e, with zero mean and 0.01 variance has been added to produce outliers in the linear data, whereas X is generated uniformly.

LOWESS/LOESS analysis for Rock Creek River
The data set used for our research analysis is from Heidelberg University [17]. It is collected from Rock Creek River for 10 different parameters and we will analyze "Suspended Solid" data collected hourly but intermittently between 1982-2013.
Calculating the mean of stratified samples does not necessarily provide an accurate picture of the data. Our data readings vary from hourly to daily data and therefore, weights should be considered to calculate average. We will reduce the values using Flow-Weighted Mean Concentration (FWMC) for accuracy and consistency which should give a clear picture of actual loadings.
The flow-weighted average is considered to be an accurate method for use in calculating average for stratified samples in which the readings have different time intervals, varying from hourly to daily readings [18,19]. These different weights should be considered for calculating average. Calculating FWMC does not have effect of missing data [19]. Equation (2) shows the formula to calculate flow-weighted average is [18,19].
where q i =flow in the i th sample, c i =concentration of the i th sample, t i =time window for the i th sample

Proposed Methodology
The steps for appropriate selection of smoothing parameters for LO(W)ESS are shown in Figure 2. The experimental steps for selection of α and λ are as follows:

Analyze the presence of outliers in the data
In order to detect the presence of outliers, a Hample Identifier (HI) is employed.

Differentiate LO(W)ESS from rLO(W)ESS
If outliers are present in the data set rLO(W)ESS is used otherwise, LO(W)ESS is used for analysis.

Examine the presence of a monotonic relationship
This step is extremely important to identify which degree of polynomial is used, linear or quadratic. If X and Y show a monotonic relationship, then λ=1 otherwise, λ=2 [11]. The strength of correlation can be measured based on values give in Table 1. For automatic selection of λ, weighted Pearson correlation, r w , or Mean Squared Error (MSE) can be used [20,21]. The best fit polynomial is considered to be the one for which MSE is less or having high r w value. The r w is more suitable for testing correlation compared to Pearson correlation [20]. The Pearson Correlation coefficient is calculated for the express purpose of identifying whether the data is monotonic.

Computational steps for LOWESS/LOESS
In the literature, the selection of smoothing parameter, α, is often entirely based on trial and basis. Some researchers argue that it should be between 0.2 and 0.8 while other considers 0.5 as an ideal starting point [10]. There is no specific technique for selection of the exact value of α. Random selection of α may lead to over-smoothing or undersmoothing of data, which in turn does not provide good information for LOWESS fit. The LOWESS/LOESS fit which follow the almost all the data-point is called "under-smoothing" or "over-fitting" whereas if does not follow the data and produce a smooth line is called "lack of fit" or "under-smoothing". The step by step calculation of LOWESS/LOESS and rLOWESS/rLOESS are as follows [1,10,22].
A: Compute tricube weights, equation 7, using scaled distance. These weights are calculated for set of numbers in local neighbourhood.

( ) ( )
B: Run weighted least square regression for those set of numbers.
C: If outliers are present in data; calculate residuals, median of residuals and robust weight, equation 8, using robust weight function.
D. Run weighted least regression using robust weights.

Scatterplot smoothing
For demonstration of the LOWESS procedure, we will first examine a synthetic dataset and finally real life data, the suspended solids parameter, from Rock Creek River. Experiment using synthetic data: Figure 3 shows the scatterplot for our synthetic data. We will examine our dataset with the proposed methods, given in section on proposed methodology.
Step 1 Presence of outliers: The Hample Identifier (HI) detected two outliers at index 21 and 46.

Step 2 LO(W)ESS vs. rLO(W)ESS: Since two outliers are detected
Equation (3) shows the formula to calculate MSE and Equation (4) for r w .
where Y is observed values, Ŷ is the predicated values and n is the number of values.
where x, y are the set of variables, w i is the weight, w x , w y are the weighted mean of x and y variables.

Differentiating LOWESS, LOESS, rLOWESS and rLOESS
The selection of best LO(W)ESS model can be performed as follows: Select the best smoothing parameter using RobustAIC Tharmaratnam [16] proposed the robust version of AIC, which produces promising results for model selection in the presence of outliers compared to non-robust AIC. RobustAIC score is computed for all the values of α between 0.1 and 1. The model with lowest AIC score is considered as the best value of α based on the dataset. Section below on Computational Steps for LOWESS/LOESS provides the details of computing LO(W)ESS/rLO(W)ESS fit based on the smoothing parameter. The best selected model can be estimated using Akaike weights [12]. The algorithm for calculating Akaike weights is as follows: c.
Compute Akaike weights for each model and normalized relative likelihoods   Step 3 Identify monotonic relationships: The selection of appropriate α and λ is extremely important. Based on experimental results for linear data, the value of r w =0:896 whereas r=0.582. This indicates that there is strong relationship; therefore, λ=1 should be used for analysis. Similarly, the calculation value for MSE for λ=1 is 0.00184 whereas for λ=2 is 0.00188. Again, MSE confirms that λ=1 is the best for analysis.

Step 4 Differentiating LOWESS, LOESS, rLOWESS and rLOESS:
Since a monotonic relationship exists and presence of outliers is detected in Step 1, rLOWESS is chosen for the analysis.
Step 5 RobustAIC for parameter selection: Table 2 shows the RobustAIC score calculated from the equations given in section computational steps for LOWESS/LOESS robustAIC selects the best value for α based on the data. Based on Tharmaratnam robust AIC [16], Table 2 illustrates that α=0.3 is the smallest AIC score. Table 3 shows the best selected model using akaike weights, calculated from step 5 in section on proposed methodology. Figure 4 shows the rLOWESS fit based on α=0.3, black line is the original data and blue is the rLOWESS fit. In this situation, the original data to which the noise was "added" is not recovered. This is an artificial dataset. The criteria do not recover the noise-free data, but rather they smooth the dataset according to the noisy data as presented, some of which began as an added noise, and which is retained in the smoothed line.

Experiment on real data:
Step 1 Presence of outliers: Based on data set, HI detected 28 outliers in the dataset Step 2 LO(W)ESS vs. rLO(W)ESS: Since outliers are detected in the dataset, rLO(W)ESS is chosen.
Step 3 Identify monotonic relationships: Based on the analysis, the value of r w =0.0191 shows a very weak relationship; rLOESS is indicated.
Step 4 Differentiating LOWESS, LOESS, rLOWESS and rLOESS: Since monotonic relationship does not exist and presence of outliers is detected in Step 1, rLOESS is chosen for the analysis.
Step 5 RobustAIC for parameter selection: RobustAIC selects the best value for α based on the data, as described in section on experiment

Conclusion
LO(W)ESS is widely used in different application areas such as for normalization and accessing non-linear relationships between variables and considered as one of the important member of nonparametric regression in statistical circle. It is unfortunate that despite its wide application area, the important parameters are selected on trial and error basis. Over-smoothing and under-smoothing is neither acceptable nor desirable in such situations. Over-smoothing divulges trend but ignores local variations whereas under-smoothing results in too many local variations.
An automatic approach for selection of smoothing parameters for LO(W)ESS fit has been proposed and tested. The degree of polynomial and presence of outliers is used to select the type of LO(W)ESS. Also, the best value of smoothing parameter is chosen based on the least value of AIC values. AIC with mm-estimator is employed for the  selection of best model for smoothing parameters that works well in the presence of outliers. Also, weighted MSE is used for the estimation of best degree of polynomial. The accuracy of the aforementioned methodology has been tested and demonstrated using experimental results.
In the first experiment, an artificial data set has been generated to test if the proposed method works as per expectations. It is known in advance that the data is linear with the presence of outliers in it. A real data set from Rock Creek River is examined using the proposed method. Our proposed method is able to automate the smoothing parameter and degree of polynomial. At the same time, it eliminates the problem of over-smoothing and under-smoothing of data. The approach is flexible and easy to implement in a variety of situations.