A Robust Skewed Boxplot for Detecting Outliers in Rainfall Observations in Real-Time Flood Forecasting

The standard boxplot is one of the most popular nonparametric tools for detecting outliers in univariate datasets. For Gaussian or symmetric distributions, the chance of data occurring outside of the standard boxplot fence is only 0.7%. However, for skewed data, such as telemetric rain observations in a real-time flood forecasting system, the probability is significantly higher. To overcome this problem, the medcouple (MC), a measure that is robust to outliers and sensitive to skewness, was introduced to construct a new robust skewed boxplot fence. Three types of boxplot fences related to MC were analyzed and compared, and the exponential function boxplot fence was selected. Tested on uncontaminated as well as simulated contaminated data, the proposed method produced a lower swamping rate and higher accuracy than the standard boxplot and the semi-interquartile range boxplot. The outcomes of this study demonstrated that it is reasonable to use the new robust skewed boxplot method to detect outliers in skewed rain distributions.


Introduction
More real-time flood forecasting systems in China, particularly in large basins with many remote gauge stations, now use telemetry systems to transmit rainfall signals from rainfall stations, because telemetry provides timely, dense, and labor-saving hydrological information for remote stations [1]. However, it has been shown that telemetric rainfall information inevitably includes outliers caused by instrument malfunction, human-related errors, and/or signal acquisition errors resulting from signal leaks, collisions, or disturbances during signal transmission, in addition to random errors that are normally distributed with zero mean and a small variance [1][2][3]. The outliers have an unknown distribution with a much greater variance, appear to be inconsistent with the remainder of the dataset, and are relatively large in magnitude [4,5]. Therefore, outliers should be treated differently [6]. In this paper, observations containing outliers are called abnormal data.
In real-time flood forecasting systems, rainfall observations are the main input and determine the accuracy of the forecasting results. The presence of abnormal data can lead to unreliable forecasts. The need to increase the accuracy and reliability of telemetric rainfall data has prompted research on robust methods that efficiently detect abnormal data before they enter the hydrologic models [7].
One of the most frequently used nonparametric tools for detecting outliers in a univariate dataset is based on the concept of the boxplot. The method, suggested by Tukey [8], has come into common use and has been studied extensively (see, for example, [9][10][11][12][13][14][15][16]). An observation is considered "potential" abnormal data when its value does not belong to the interval (the fence) $(q_1 - 1.5 \cdot \mathrm{IQR},\ q_3 + 1.5 \cdot \mathrm{IQR})$, where $q_1$ and $q_3$ are the first and third quartiles, respectively, and IQR is the interquartile range, i.e., $\mathrm{IQR} = q_3 - q_1$. The standard boxplot is particularly suited to normal or symmetric distributions. For Gaussian data, the probability of lying below $q_1 - 1.5 \cdot \mathrm{IQR}$ or above $q_3 + 1.5 \cdot \mathrm{IQR}$ is 0.0035 (0.35%) each.
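As a rough illustration of the fence rule above, the following Python sketch computes the standard fence for a small sample and checks the 0.35% one-sided tail probability for Gaussian data. The helper names (`quantile`, `standard_fence`, `gaussian_tail_probability`), the linear-interpolation quantile rule, and the sample values are assumptions made for this example only; they are not taken from the study.

```python
from statistics import NormalDist


def quantile(sorted_values, p):
    """Linearly interpolated p-th quantile (0 <= p <= 1) of a sorted list."""
    position = p * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def standard_fence(values, k=1.5):
    """Return (lower, upper) of the fence (q1 - k*IQR, q3 + k*IQR)."""
    data = sorted(values)
    q1, q3 = quantile(data, 0.25), quantile(data, 0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def gaussian_tail_probability(k=1.5):
    """Probability that a standard Gaussian value lies above the upper fence."""
    z = NormalDist()
    q3 = z.inv_cdf(0.75)
    iqr = 2.0 * q3  # q1 = -q3 by symmetry
    return 1.0 - z.cdf(q3 + k * iqr)


# Made-up hourly rainfall values (mm) with one suspiciously large reading.
sample = [0.2, 0.4, 0.5, 0.6, 0.8, 1.0, 1.1, 9.5]
lower, upper = standard_fence(sample)
one_sided = gaussian_tail_probability()
print(f"fence = ({lower:.3f}, {upper:.3f}), one-sided Gaussian tail = {one_sided:.4f}")
```

Different quantile conventions give slightly different fences on small samples, so the interpolation rule used here is only one reasonable choice.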
However, real hourly rainfall observations are often neither normal nor symmetric. When the standard boxplot is applied to real hourly rainfall observations, the percentage of data outside the standard boxplot fence becomes excessively high. As an example, we selected the real hourly rainfall data of flood events from 1988 to 2008 measured at the Wuyigong rain gauge in the Qilijie basin in southeastern China.
The standard boxplot of the datasets is shown in Figure 1.
The figures for other rainfall stations were not included because they were nearly identical to that for the Wuyigong Station.
It is clear that the underlying distribution of the rainfall dataset was skewed to the right. Up to 13.04% of the observations were above $q_3 + 1.5 \cdot \mathrm{IQR}$. Clearly, it would not be correct to classify them all as real abnormal data.
To cope with this, several adjusted boxplot methods have been proposed for skewed data. Kimber [17] suggested the use of the semi-interquartile range (SIQR) rather than IQR, i.e., the fence of the SIQR boxplot is defined as $(q_1 - 3(q_2 - q_1),\ q_3 + 3(q_3 - q_2))$, where $q_2$ is the median.
The SIQR boxplot has also been applied to the real hourly rain observations from the Wuyigong rain gauge. The boxplots of the two methods are shown in Figure 2.
Compared with the standard method, the SIQR boxplot adjusts itself to the right skewness (Figure 2). The SIQR method expands the upper boundary slightly, and consequently fewer data are detected as outliers. However, the adjustment is not enough [18]. The probability of lying outside of the SIQR fence was 9.29%. It was still highly risky to identify these values as real abnormal data. The poor performance of the SIQR is further analyzed in Section 3.
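To make the difference between the two fences concrete, the sketch below evaluates both on a small right-skewed sample. The function names, the quantile interpolation rule, and the sample values are illustrative assumptions, not data from the Wuyigong gauge.

```python
def quantile(sorted_values, p):
    """Linearly interpolated p-th quantile (0 <= p <= 1) of a sorted list."""
    position = p * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def standard_fence(values):
    """Tukey fence (q1 - 1.5*IQR, q3 + 1.5*IQR)."""
    data = sorted(values)
    q1, q3 = quantile(data, 0.25), quantile(data, 0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def siqr_fence(values):
    """SIQR fence (q1 - 3*(q2 - q1), q3 + 3*(q3 - q2))."""
    data = sorted(values)
    q1, q2, q3 = (quantile(data, p) for p in (0.25, 0.5, 0.75))
    return q1 - 3.0 * (q2 - q1), q3 + 3.0 * (q3 - q2)


# Made-up right-skewed sample (mm).
sample = [0.1, 0.2, 0.3, 0.5, 0.8, 1.4, 2.6, 5.0, 9.0]
standard = standard_fence(sample)
siqr = siqr_fence(sample)
print("standard:", standard)
print("SIQR:    ", siqr)
```

For this sample the SIQR upper boundary lies above the standard one and the lower boundary is pulled toward the data, which is the qualitative adjustment described in the text.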
Carling [19], Schwertman et al. [14], and Schwertman and de Silva [20] suggested replacing $q_1$ and $q_3$ in the fences with the median $q_2$. They also suggested replacing the constant 1.5 with functions that combine sample size with skewness. These approaches can achieve a prespecified swamping, i.e., the potential of misclassifying an uncontaminated observation as real abnormal data [21]. The methods perform well when the distribution is a lambda distribution, but it is not clear how they perform with other distributions. Finally, the functions in these methods depend on the sample size, and the procedures require some characteristics of the uncontaminated distribution, which are often difficult to estimate for real-time hourly rainfall datasets.
The aim of this paper was to construct a new boxplot method that is robust to outliers and sensitive to skewness.
The new method is independent of the sample size and performs well with the rainfall distribution. It can reduce swamping and rapidly detect abnormal telemetric rainfall data before they are entered into the real-time flood forecasting model. This paper is organized as follows. In Section 2, we detail the proposed procedure, which includes a robust measure of skewness in the construction of the fences. Section 3 illustrates the differences between the standard, SIQR, and robust boxplots for uncontaminated as well as abnormal datasets. Finally, we provide conclusions in Section 4.

Study Catchments.
Three catchments located in southeastern and southern China were selected: the Qilijie catchment in Fujian Province, the Lushui reservoir catchment in Hubei Province, and the Yitang reservoir catchment in Guangdong Province.
The three catchments showed some similar hydrological characteristics, such as abundant precipitation and a humid climate. Their areas varied, so that basins of different sizes were represented.
The Qilijie catchment, representing a large basin, covers 14,787 km² with 43 telemetric rain gauges; the Lushui catchment, representing a medium-sized basin, covers 3,960 km² with 13 telemetric rain gauges; and the Yitang reservoir catchment, representing a small watershed, covers 251 km² with 6 telemetric rain gauges.

Data. Historic hourly rainfall observations from 1988 to 1998 from all the rain stations were compiled and used. The datasets were considered as nonoutliers or normal data. The total number of all the rainfall records was greater than 200,000.

Advances in Meteorology

2.3. Methods

Robust Skewness.
The asymmetry of a distribution can be described by the skewness coefficient.
The classical skewness coefficient depends on the second and third empirical moments of the datasets. However, the moments are sensitive to outliers, so the classical skewness coefficient can be strongly affected by them. Even a single outlier can distort it and make it difficult to interpret [22]. To overcome this problem, the medcouple (MC), introduced in Brys et al. [22], was chosen to estimate the skewness of rainfall observations. The datasets were sorted in ascending order, i.e., $p_1 \le p_2 \le \cdots \le p_s$, where $s$ is the number of observations in the dataset.
The MC of the datasets is defined as

$$\mathrm{MC} = \operatorname*{med}_{p_i \le m \le p_j} h(p_i, p_j), \tag{1}$$

where $m$ is the median of the observation samples, and for all $p_i \ne p_j$, the kernel function $h$ is given by

$$h(p_i, p_j) = \frac{(p_j - m) - (m - p_i)}{p_j - p_i}. \tag{2}$$

For the special case $p_i = p_j = m$, the kernel function is defined as follows. Let $n_1 < n_2 < \cdots < n_k$ denote the indices of the observations that are tied to the median $m$, i.e., $p_{n_l} = m$ for all $l = 1, 2, \ldots, k$. Then

$$h(p_{n_i}, p_{n_j}) = \begin{cases} -1, & i + j - 1 < k, \\ 0, & i + j - 1 = k, \\ +1, & i + j - 1 > k. \end{cases} \tag{3}$$

The MC equals the median of all $h(p_i, p_j)$ values for which $p_i \le m \le p_j$. According to the definitions in equations (1) and (2), MC is based on the quantiles and is therefore not as vulnerable to outliers as the classical skewness. It is clear that MC always lies between −1 and 1. A distribution that is skewed to the right has a positive MC, whereas a left-skewed distribution has a negative MC. Finally, a symmetric distribution has zero MC. As shown in Brys et al. [22], MC is robust to outliers and sensitive to skewness. It has a bounded influence function and a breakdown value of 25%, which means that MC can resist up to 25% of outliers in the data.
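The definition above can be sketched as a naive O(s²) computation that evaluates the kernel for every pair with $p_i \le m \le p_j$ and takes the median. This is only an illustrative implementation under the stated definition; the function name `medcouple` and the test values are assumptions, and faster algorithms exist for large samples.

```python
from statistics import median


def medcouple(values):
    """Naive O(n^2) medcouple: median of the kernel over all pairs p_i <= m <= p_j."""
    data = sorted(values)
    m = median(data)
    left = [p for p in data if p <= m]    # candidates for p_i
    right = [p for p in data if p >= m]   # candidates for p_j
    k = sum(1 for p in data if p == m)    # observations tied to the median
    first_tie_in_left = len(left) - k     # the ties close the sorted left list
    kernel = []
    for i_pos, p_i in enumerate(left):
        for j_pos, p_j in enumerate(right):
            if p_i == p_j:
                # Both values equal m: apply the tie rule with 1-based tie ranks.
                i = i_pos - first_tie_in_left + 1
                j = j_pos + 1
                s = i + j - 1
                kernel.append((s > k) - (s < k))  # -1, 0, or +1
            else:
                kernel.append(((p_j - m) - (m - p_i)) / (p_j - p_i))
    return median(kernel)


symmetric_mc = medcouple([1, 2, 3, 4, 5])
right_skewed_mc = medcouple([1, 2, 3, 5, 12])
left_skewed_mc = medcouple([-12, -5, -3, -2, -1])
print(symmetric_mc, right_skewed_mc, left_skewed_mc)
```

The three calls illustrate the sign behavior described in the text: zero for a symmetric sample, positive for a right-skewed one, and negative for its mirror image.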

Combination of the Robust Skewness with the Boxplot Fence. To construct the boxplot fences for skewed data, we propose inserting the medcouple (MC) into the boxplot method. The constant 1.5 in the standard boxplot fence is replaced by functions of MC, denoted $f_l(\mathrm{MC})$ and $f_u(\mathrm{MC})$. The new robust skewed fence is defined by

$$\left(q_1 - f_l(\mathrm{MC}) \cdot \mathrm{IQR},\ q_3 + f_u(\mathrm{MC}) \cdot \mathrm{IQR}\right). \tag{4}$$

We require $f_l(0) = f_u(0) = 1.5$ so that the fence equals the standard boxplot fence for symmetric distributions. When distributions are asymmetric, $f_l(\mathrm{MC})$ and $f_u(\mathrm{MC})$ adjust the fence to fit the skewness. Different choices of $f_l(\mathrm{MC})$ and $f_u(\mathrm{MC})$ produce different adjustments. In Section 2.3.3, we compare the differences between the different functions.

Comparison of Different Functions.
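Three types of simple functions with only a few parameters were selected, because simplicity is important for operational real-time flood forecasting systems: a linear function,

$$f_l(\mathrm{MC}) = 1.5 + a\,\mathrm{MC}, \qquad f_u(\mathrm{MC}) = 1.5 + b\,\mathrm{MC}, \tag{5}$$

a quadratic function,

$$f_l(\mathrm{MC}) = 1.5 + c_1\,\mathrm{MC} + c_2\,\mathrm{MC}^2, \qquad f_u(\mathrm{MC}) = 1.5 + d_1\,\mathrm{MC} + d_2\,\mathrm{MC}^2, \tag{6}$$

and an exponential function,

$$f_l(\mathrm{MC}) = 1.5\,e^{g\,\mathrm{MC}}, \qquad f_u(\mathrm{MC}) = 1.5\,e^{f\,\mathrm{MC}}. \tag{7}$$

To determine the values of $a$, $b$, $c_1$, $c_2$, $d_1$, $d_2$, $g$, and $f$, the expected percentage of observations beyond the robust skewed fence (equation (4)) was set to 0.7%, which matches the rule of the standard boxplot for the Gaussian distribution. According to this rule, the fence boundaries must satisfy

$$q_1 - f_l(\mathrm{MC}) \cdot \mathrm{IQR} \approx q_\alpha, \qquad q_3 + f_u(\mathrm{MC}) \cdot \mathrm{IQR} \approx q_\beta, \tag{8}$$

where $q_\alpha$ and $q_\beta$ are the $\alpha$th and $\beta$th quantiles of the distribution, respectively, with $\alpha = 0.0035$ and $\beta = 0.9965$. Substituting equations (5)-(7) into equation (8), we obtain

$$\frac{q_1 - q_\alpha}{\mathrm{IQR}} - 1.5 \approx a\,\mathrm{MC}, \qquad \frac{q_\beta - q_3}{\mathrm{IQR}} - 1.5 \approx b\,\mathrm{MC}, \tag{9}$$

$$\frac{q_1 - q_\alpha}{\mathrm{IQR}} - 1.5 \approx c_1\,\mathrm{MC} + c_2\,\mathrm{MC}^2, \qquad \frac{q_\beta - q_3}{\mathrm{IQR}} - 1.5 \approx d_1\,\mathrm{MC} + d_2\,\mathrm{MC}^2, \tag{10}$$

$$\ln\!\left(\frac{2}{3} \cdot \frac{q_1 - q_\alpha}{\mathrm{IQR}}\right) \approx g\,\mathrm{MC}, \qquad \ln\!\left(\frac{2}{3} \cdot \frac{q_\beta - q_3}{\mathrm{IQR}}\right) \approx f\,\mathrm{MC}. \tag{11}$$

Based on the historic hourly rainfall records, equations (9)-(11) were constructed for each rain gauge.
The parameter values of the three functions were then derived using linear least squares estimation.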
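As a hedged sketch of the estimation step for the exponential function, the code below fits a straight line through the origin to pairs of MC and the log-transformed boundary ratio, which is what linear least squares reduces to when $f_u(0) = 1.5$ is imposed. The function names and the synthetic MC values and boundary gaps are assumptions for illustration; the synthetic data are generated from a known slope of 3 so that the fit can be checked.

```python
import math


def exponential_response(gap, iqr):
    """ln((2/3) * gap / IQR), where gap is q1 - q_alpha or q_beta - q3."""
    return math.log((2.0 / 3.0) * gap / iqr)


def fit_slope_through_origin(x_values, y_values):
    """Least squares slope of y = slope * x (no intercept)."""
    numerator = sum(x * y for x, y in zip(x_values, y_values))
    denominator = sum(x * x for x in x_values)
    return numerator / denominator


# Synthetic "gauges": MC values and upper gaps q_beta - q3 built from a known slope.
true_slope = 3.0
iqr = 2.0
mc_values = [0.1, 0.2, 0.3, 0.4, 0.5]
upper_gaps = [1.5 * math.exp(true_slope * mc) * iqr for mc in mc_values]

responses = [exponential_response(gap, iqr) for gap in upper_gaps]
f_hat = fit_slope_through_origin(mc_values, responses)
print(f"estimated f = {f_hat:.3f}")
```

With real gauge data the points scatter around the line, and the same routine applied to the lower gaps $q_1 - q_\alpha$ would give an estimate of $g$.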

Results and Discussion
3.1. Data Preprocessing. In humid catchments, the gap between the maximum and minimum of the hourly rainfall records is large, and analyzing all records together makes it difficult to detect outliers correctly. To overcome this issue, the data records must be preprocessed. First, the hourly areal mean rainfall (HAMP) was calculated using the Thiessen polygon method. Then, the HAMP was divided into three groups: (0, 1 mm] (Group 1), (1 mm, 2 mm] (Group 2), and (2 mm, +∞) (Group 3). Finally, the rainfall records for each station were divided into the three groups based on the HAMP. The analysis was repeated for each group.
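The preprocessing steps can be sketched as follows, assuming that HAMP is the area-weighted mean of the station values with Thiessen polygon area fractions as weights. The station names, weights, rainfall values, and function names are hypothetical; hours with zero HAMP fall outside the three groups and are skipped here, which is an assumption of this sketch.

```python
def areal_mean(rain_by_station, weights):
    """Area-weighted mean rainfall; weights are polygon area fractions summing to 1."""
    return sum(weights[name] * rain for name, rain in rain_by_station.items())


def hamp_group(hamp_mm):
    """Group index for a HAMP value: (0, 1] -> 1, (1, 2] -> 2, (2, inf) -> 3."""
    if hamp_mm <= 0.0:
        return None  # outside all three groups
    if hamp_mm <= 1.0:
        return 1
    if hamp_mm <= 2.0:
        return 2
    return 3


def split_station_records(station_series, hamp_series):
    """Assign each hourly record of one station to the group of that hour's HAMP."""
    groups = {1: [], 2: [], 3: []}
    for value, hamp in zip(station_series, hamp_series):
        group = hamp_group(hamp)
        if group is not None:
            groups[group].append(value)
    return groups


# Hypothetical three-gauge catchment and four hours of rainfall (mm).
weights = {"A": 0.5, "B": 0.3, "C": 0.2}
hours = [
    {"A": 0.4, "B": 1.0, "C": 0.5},
    {"A": 2.0, "B": 1.0, "C": 1.5},
    {"A": 6.0, "B": 3.0, "C": 2.0},
    {"A": 0.0, "B": 0.0, "C": 0.0},
]
hamp_series = [areal_mean(hour, weights) for hour in hours]
station_a_groups = split_station_records([hour["A"] for hour in hours], hamp_series)
print(station_a_groups)
```

Each station's records are then analyzed separately within each group, so that light and heavy rainfall hours are not mixed in one boxplot.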

MC Results.
The average MC values for the three groups are listed in Table 1.
It is clear that the average MCs were greater than zero, i.e., the rainfall datasets of the three groups were right skewed. The average MCs were less than 0.5, which demonstrated that the distributions were not extremely skewed. Nevertheless, it was risky to use the standard and SIQR boxplots to detect outliers for skewed rainfall distributions.

Results of the Function Comparisons.
To compare the behavior of the three different functions, they were fitted to the lower and upper boundaries of the fence. The fits of the linear, quadratic, and exponential functions are displayed in Figures 3 and 4. The vertical axis is $\ln\!\left(\frac{2}{3} \cdot \frac{q_1 - q_\alpha}{\mathrm{IQR}}\right)$ in Figure 3 and $\ln\!\left(\frac{2}{3} \cdot \frac{q_\beta - q_3}{\mathrm{IQR}}\right)$ in Figure 4. This specially defined vertical axis is the same as in equation (11), so that the exponential functions form straight lines.
When MC was greater than 0.43, the linear function decreased abruptly and separated fully from the observation samples (Figure 3). The quadratic and exponential functions fitted the samples better, and the fit of the exponential function was slightly better than that of the quadratic function (Figure 4). Moreover, the exponential function requires fewer parameters than the quadratic function. As a result, the exponential function was selected to construct the robust boxplot fence. The new robust skewed boxplot fence is defined by

$$\left(q_1 - 1.5\,e^{g\,\mathrm{MC}} \cdot \mathrm{IQR},\ q_3 + 1.5\,e^{f\,\mathrm{MC}} \cdot \mathrm{IQR}\right). \tag{12}$$

The parameters in the new fence were estimated using the rainfall records from the three basins mentioned above. To simplify the practical application of the new method, the estimates $g = -3.96$ and $f = 3.35$ were rounded down to $g = -4$ and $f = 3$. Note that rounding the values to the nearest smaller integer yields a smaller fence and consequently a more robust model. The new robust skewed fence is

$$\left(q_1 - 1.5\,e^{-4\,\mathrm{MC}} \cdot \mathrm{IQR},\ q_3 + 1.5\,e^{3\,\mathrm{MC}} \cdot \mathrm{IQR}\right). \tag{13}$$

Note that, although MC is a robust estimator, it can be affected by outliers, particularly at high percentages of outliers. To reduce the effects of a high percentage of outliers on MC and the new boxplot fences, only a low percentage of outliers (≤5%) is considered.
This low percentage of outliers coincides with the characteristics of the telemetric rainfall observations.
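A minimal sketch of the rounded fence with $g = -4$ and $f = 3$ is given below. The quartile values are hypothetical, the function names are illustrative, and the quartiles and MC are passed in as arguments so that any quantile and medcouple routine can be used upstream.

```python
import math


def robust_skewed_fence(q1, q3, mc, g=-4.0, f=3.0):
    """Fence (q1 - 1.5*exp(g*MC)*IQR, q3 + 1.5*exp(f*MC)*IQR)."""
    iqr = q3 - q1
    lower = q1 - 1.5 * math.exp(g * mc) * iqr
    upper = q3 + 1.5 * math.exp(f * mc) * iqr
    return lower, upper


def flag_outliers(values, fence):
    """True for every value that does not belong to the open interval (lower, upper)."""
    lower, upper = fence
    return [not (lower < value < upper) for value in values]


# Hypothetical quartiles (mm) for a right-skewed group with MC = 0.52.
q1, q3, mc = 0.5, 2.0, 0.52
standard = robust_skewed_fence(q1, q3, 0.0)  # MC = 0 recovers the standard fence
robust = robust_skewed_fence(q1, q3, mc)
flags = flag_outliers([0.4, 3.0, 6.0, 20.0], robust)
print("standard:", standard)
print("robust:  ", robust)
print("flags:   ", flags)
```

For a positive MC the upper boundary moves up and the lower boundary moves toward $q_1$, so a value such as 6.0 mm, which the standard fence would flag in this example, stays inside the robust fence.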

Performance of Noncontaminated Data.
In this section, we compare the performance of the standard boxplot, SIQR [17], and the proposed robust skewed boxplot using real data without outliers, namely the real rain observations from the three basins. The differences between the three boxplots were analyzed by calculating the swamping rate (SR), i.e., the proportion of "good" data identified as outliers [23], and the accuracy, i.e., the proportion of outliers and "good" data identified correctly [23]. The SR and accuracy are defined as

$$\mathrm{SR} = \frac{\text{number of good data identified as outliers}}{\text{total number of good data}} \times 100, \tag{14}$$

$$\text{accuracy} = \frac{\text{number of outliers correctly identified} + \text{number of good data correctly identified}}{\text{total number of data}} \times 100. \tag{15}$$

When there is no outlier, the total number of good data equals the total number of data. In this case, SR plus accuracy equals 1.0 (100%). Table 2 lists the average SR and accuracy for the three methods. For clarity, Figure 5 displays the average SR of every rain gauge for the standard boxplot and the robust skewed boxplot only.
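The two measures can be computed from a list of true labels and a list of flags, as in the sketch below. The function names and the label vectors are made up for illustration; both measures are returned as percentages, following the factor of 100 in the definitions.

```python
def swamping_rate(is_true_outlier, is_flagged):
    """Percentage of good data that were flagged as outliers."""
    flags_for_good = [flag for truth, flag in zip(is_true_outlier, is_flagged) if not truth]
    return 100.0 * sum(flags_for_good) / len(flags_for_good)


def accuracy(is_true_outlier, is_flagged):
    """Percentage of data (outliers and good data) identified correctly."""
    correct = sum(truth == flag for truth, flag in zip(is_true_outlier, is_flagged))
    return 100.0 * correct / len(is_true_outlier)


# Uncontaminated case: every record is good, and one is wrongly flagged.
clean_truth = [False] * 8
clean_flags = [False, False, True, False, False, False, False, False]
clean_sr = swamping_rate(clean_truth, clean_flags)
clean_accuracy = accuracy(clean_truth, clean_flags)

# Contaminated case: the last record is a true outlier and is detected.
mixed_truth = [False, False, False, False, True]
mixed_flags = [False, True, False, False, True]
mixed_sr = swamping_rate(mixed_truth, mixed_flags)
mixed_accuracy = accuracy(mixed_truth, mixed_flags)
print(clean_sr, clean_accuracy, mixed_sr, mixed_accuracy)
```

In the uncontaminated case the two percentages sum to 100, while in the contaminated case they generally do not, because SR is normalized by the number of good data only.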
Based on equation (14), SR is an estimator of method risk. Table 2 and Figure 5 clearly show that, without outliers, the robust skewed boxplot had a lower SR than the standard boxplot and SIQR. The SR of the proposed boxplot was 1.3%, close to the portion outside of the standard boxplot fence for Gaussian data (0.7%). SIQR was slightly superior to the standard boxplot; however, its SR was still far greater than 0.7%. The accuracy of the robust boxplot was the highest among the three methods.
The standard boxplot performed much worse than the SIQR.
We also observed that the proposed robust skewed boxplots for each rain gauge yielded much better SR values on the skewed uncontaminated observations than the other boxplots. This was because the robust skewed boxplot used MC to adjust the fences to the skewed data.
The rain observations (HAMP ∈ (1, 2]) at Wuyigong Station were selected to demonstrate that the proposed method is able to adjust the fence to the skewed data. The detailed boxplots are shown in Figure 6. MC equaled 0.52, and the distribution was skewed to the right.
It is clear that the robust skewed boxplot yielded a larger upper boundary than the standard boxplot and SIQR, and that it was adjusted to better reflect the right-skewed data. The proposed boxplot identified fewer large good data points as upper outliers. At the same time, the new method extended the lower fence less than the other methods, which may result in the smallest good data being marked as lower outliers (Figure 6). However, for the flood forecasting system, the effect on the forecasting results of large good data being identified as outliers is far greater than that of the smallest good data being identified as outliers. The robust skewed boxplot therefore has practical value.

Performance under Contamination.
We now compare the robustness of the three boxplots using contaminated data. To control the characteristics of the outliers, synthetic datasets were generated by superimposing upper outliers $e$ on the real rain observations described in Section 2.2, where $r$ is a random number and $T$ is a constant that controls the maximum of $e$. $L$ is the frequency of outliers; for example, $L = 20$ means that the outlier percentage is 5%. By adjusting $T$ and $L$, outliers of different magnitude and frequency could be generated. We generated outlier samples and ran the experiment 1000 times for every $T$ ($T$ = 5, 10, 20, 30, 40, 50, and 100 mm) with $L = 20$.
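The exact outlier formula is not reproduced in this text, so the sketch below shows only one plausible way to generate such contamination. It assumes that every $L$-th observation receives an added error $e = r \cdot T$ with $r$ drawn uniformly from [0, 1); this form, the function name `contaminate`, and the seed handling are assumptions, not the authors' procedure.

```python
import random


def contaminate(observations, T, L, seed=0):
    """Add an upper outlier e = r * T to every L-th observation (assumed scheme).

    Returns the contaminated series and a list marking the contaminated positions.
    """
    rng = random.Random(seed)
    contaminated = list(observations)
    is_outlier = [False] * len(observations)
    for index in range(L - 1, len(observations), L):
        contaminated[index] += rng.random() * T  # r in [0, 1), so e < T
        is_outlier[index] = True
    return contaminated, is_outlier


# Hypothetical clean series of 100 hourly values (mm); L = 20 gives 5% outliers.
clean = [1.0] * 100
dirty, labels = contaminate(clean, T=40.0, L=20)
print(sum(labels), "of", len(labels), "records contaminated")
```

Repeating the call with different seeds would mimic the repeated experiments, and the `labels` list provides the ground truth needed to compute SR and accuracy.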
The MC for different $T$ values at Wuyigong Station in the Qilijie basin is listed in Table 3. We obtained comparable results for the other rain gauge stations.
The results in Table 3 showed that MC changed little when $T$ increased.
This demonstrated that MC was not noticeably influenced by the outliers.
The average performance of the three boxplots with outliers is shown in Table 4 for $T$ = 40 mm.
Under contamination, the total number of good data did not equal the total number of data, so SR plus accuracy did not equal 1.0 (100%).
Because they are based on quantiles, all three boxplots had the ability to resist outliers, and they maintained robust results for noncontaminated and contaminated data (Tables 2 and 4). In contrast to the standard boxplot and SIQR, the new boxplot used, besides the quantiles, the MC, which is robust to outliers and sensitive to skewness, to construct the new fences. By moving the upper boundary up, the proposed method had a much lower SR and higher accuracy. This illustrated again that the proposed boxplot accounted sufficiently for skewness.
To further analyze the effects of different values of $L$ on performance, we ran the experiment 1000 times for $L = 10$.
The average performance for $L = 10$ is listed in Table 5 ($T$ = 40 mm).
When the outlier frequency changed from 5% to 10%, the performance of the three boxplots varied only a little. It was clear that the size and the frequency of the outliers had only a small effect on SR and accuracy. However, the results hold only under some restrictive conditions; for example, the proportion of outliers cannot be too high.

Conclusions
The standard boxplot is a popular nonparametric method for detecting outliers in data series. Unfortunately, when it is used on skewed data, such as hourly rainfall series, the probability of identifying good data as outliers is high. Therefore, the MC, which is not only robust to outliers but also sensitive to skewness, was combined with different simple functions to adjust the standard boxplot fence to fit the skewed hourly rainfall distributions. The exponential function was then selected based on comparisons.
The comparison of the results using uncontaminated and abnormal data showed that the proposed method had robust performance and a lower risk of identifying good data as outliers, compared with the standard boxplot and SIQR.
In flood forecasting systems, the decision to eliminate data as outliers is a serious matter and should not be taken lightly.
Unusual good observations often provide valuable information. Therefore, a more conservative approach (with less risk of identifying good data as outliers), such as the robust skewed boxplot method, is reliable for practical applications.

Data Availability
The rainfall observation data used to support the findings of this study were supplied by the branch of hydrology and water resources investigation bureau of Fujian Province under license and so cannot be made freely available. Requests for access to these data should be made to the branch

Figure 3: The fit of the lower boundary.

Figure 5: The average SR of each rain gauge.

Table 2: Performance of the three boxplots without outliers.

Table 3: MC for outliers of different sizes at Wuyigong Station.

Table 4: Performance of the three boxplots with T = 40 mm.

Table 5: Performance of the three boxplots with L = 10.