Estimating Tukey Depth Using Incremental Quantile Estimators

The concept of depth represents methods to measure how deep an arbitrary point is positioned in a dataset and can be seen as the opposite of outlyingness. It has proved very useful and a wide range of methods have been developed based on the concept. To address the well-known computational challenges associated with the depth concept, we suggest to estimate Tukey depth contours using recently developed incremental quantile estimators. The suggested algorithm can estimate depth contours when the dataset in known in advance, but also recursively update and even track Tukey depth contours for dynamically varying data stream distributions. Tracking was demonstrated in a real-life data example where changes in human activity was detected in real-time from accelerometer observations.


Introduction
Several attempts have been made to provide a desirable ordering of multivariate data and the concept of depth has become very popular. Depth gives a center-outward ordering and a wide range of depth measures have been developed (Mosler;, such as depth based on distance metrics (Mahalanobis, spherical, projection and oja), weighted mean depths (Dyckerhoff and Mosler; and depth based on halfspaces and simplices (Tukey;1975;Zhang;2002;. The concept has further been extended to measure the depth of regression models (regression depth)  and the depth of functional data (López-Pintado and Romo;.
The depth concept has been used to develop a range of new statistical models and methods for classification and clustering (Kim et al.;Jörnsten;, functional autoregressive models (Martínez-Hernández et al.;2019), characterization of multivariate data (Serfling; such as multivariate kurtosis (Wang and Serfling;2005), multi quantile regression 2012b,a) and robust principal component analysis (Mozharovskyi;2016). Depth has been applied to a wide range of disciplines such as economy (Kim et al.;Kosiorowski and Zawadzki;, health and biology (Williams et al.;2008;2015), ecology (Cerdeira et al.; and hydrology (Chebana and Ouarda; to name a few. The earliest and most popular depth measure is Tukey depth (Tukey;1975). The Tukey depth of a point is defined as the minimum probability mass carried by any closed halfspace containing the point. However, computation of Tukey depth for higher dimensions is computationally demanding limiting its application 2019). Kong and Mizera (2012) defined halfspaces such that a specific portion of observations are on one side of the halfspace. Kong and Mizera (2012) showed that contours with a specific Tukey depth can be estimated from the intersection of such halfspaces over different directions. Such contours can again be used to estimate the depth of any point.
To apply this result to estimate α-depth contours in dimension p, then the positions of O(c p−1 ), c > 1 halfspaces must be estimated requiring estimators that are both memory and computationally efficient. In this paper we suggest to use incremental quantile estimators , 2019. These estimators only need to store a single value in memory, O(1), and only need to perform a single operation per observation resulting in a computational complexity of O(n) for n observations. In comparison, traditional quantile estimators have a memory requirement of O(n) and a O(n log n) computational complexity. Thus streaming quantile estimators are ideal to estimate high dimensional α-depth contours. However, the computational efficiency comes with a price and traditional estimators provide more precise estimates based on the same observations. We will demonstrate that incremental quantile estimators are useful when the data is known in advance, but the main application is for streams of data since the estimators are able to recursively update and even track the true positions of the halfspaces. Tracking will be demonstrated in a real-life data example where changes in human activity are detected in real-time from accelerometer observations. The QEWA and CondQ incremental quantile estimators, in Hammer et al. (2018) and Hammer et al. (2019), document state-of-the-art tracking performance, but these estimators are based on generalized exponentially weighted averages and are not robust to outliers which are common in real-life data streams. Among quantile estimators being robust to outliers, the Deterministic Update Based Multiplicative Incremental Quantile Estimator (DUMIQE) in Yazidi and Hammer (2017) and the ShiftQ estimator in Hammer et al. (2019) document state-of-the-art performance 2019), and thus the α-depth contour estimators in this paper will be based on these estimators.
The paper is organized as follows. In Section 2 the concept of depth is introduced including some results to compute Tukey depth. Section 3 provides an efficient procedure to estimate Tukey depth. Section 4 presents performance metrics that will be used to evaluate the algorithm and Sections 5 and 6 present synthetic and real-life data experiments.

The Concept of Depth
Let X = (X 1 , . . . , X p ) T represent a p-dimensional stochastic vector with probability distribution P . Let D(x, P ) denote the depth function of a point x with respect to the probability distribution P . A high (low) value of the depth function refers to a central (outlying) point of the probability distribution. A general depth function is defined by satisfying the natural requirements of affine invariance, maximality at center, monotonicity relative to deepest point and vanishing at infinity (Zuo and Serfling;2000).
The most used and popular depth function is Tukey depth defined as the minimum probability mass carried by any closed halfspace containing the point where U refers to the set of all vectors with unit length. Define the α-depth region with respect to Tukey depth, D(α), as the set of points whose depth is at least α The halfspace depth regions are closed, convex, and nested for increasing α. The boundary of D(α) is known as the α-depth contour. Following Kong and Mizera (2012), for any unit directional vector u ∈ U, define the directional quantile as where F u T X (x) refers to the univariate cumulative distribution function of the projection of X on u. Define the halfspace which is bounded away from the origin at distance Q(α, u T X) by the hyperplane with normal vector u. Consequently P (X ∈ H(α, u)) = 1 − α for any u ∈ U. Kong and Mizera (2012) proved that the α-depth region in (2) equals the directional quantile envelope Tukey depth may not be defined for depths above some threshold and the intersection becomes empty.

Efficient Estimation of Tukey Depth
For a given multivariate dataset, the result in (5) invites to a simple procedure to estimate α-depth regions. Simply select a set of directional vectors, u i , i = 1, . . . , n u , and estimate the directional quantiles as given in (3) from the dataset. To get a fully functional algorithm, four issues will be discussed below.
1. Estimating directional quantiles. As pointed out in the introduction, we suggest to use incremental quantile estimators. A prominent example is the DUMIQE algorithm which can update directional quantile estimates as follows when an observation x n is received ) The update is quite intuitive. If the sample u T i x n is above (respectively below) the current estimate, increase (respectively reduce) the corresponding directional quantile estimate. The tuning parameter λ > 0 controls the update size. For example, if the distribution of X n changes rapidly with time, a high value of λ should be selected to efficiently track the directional quantiles. The α-depth region can now be estimated with the intersection D n (α) = i ∈ 1,...,nu H n (α, u i ), k = 1, . . . , K where halfspaces are defined from the directional quantile estimates Often it is useful with joint estimates of multiple α-depth contours for example to efficiently estimate the depth of any point. The ShiftQ incremental algorithm makes joint quantile estimates for multiple probabilities and in particular ensures that the ordering of quantile estimates 2019). Thus using ShiftQ, the resulting α-depth contours will not intersect.
2. Estimating depth of any point. If multiple α-depth contours are estimated, the depth of any point w can be estimated by checking which of the halfspaces w is within to find which of the α-depth regions w is within. The naive algorithm runs in O(n u K) time, but can be sped up. If w is outside H n (α j , u i ) for some direction u i and depth α j , w will be outside all regions with smaller depth, D n (α k ), k = 1, . . . , j. By checking halfspaces for decreasing depth will substantially speed up the computations.
3. Selecting directional vectors. Generation of uniformly distributed directional vectors is simple: Let Z 1 , . . . , Z p be independent standard normally distributed stochastic variables and define Z = (Z 1 , . . . , Z p ) T . Then U = Z/ Z 2 will be uniformly distributed on the unit sphere, where · 2 refers to the Euclidean norm. Intuitively, it makes sense to use directional vectors that are more equidistantly spread on the unit sphere. A simple approach is to generate many uniformly distributed directional vectors, N u , and secondly filter out directional vectors that are closer than some threshold. The approach is however computationally demanding, O(N 2 u p 2 ), but only needs to be done in the initialization of the algorithm. There are other algorithms to generate fairly equidistantly spread direction vectors, see e.g spiral algorithms (Saff and Kuijlaars;1997). We have not evaluated the potential of these algorithms. The optimal distribution of direction vectors also depends on the shape of the multivariate distribution, e.g. directions where the α-depth contours have strong curvature, more directional vectors should be used. One can imagine to recursively update the directional vectors as one learns more about the distribution. We have not looked into this.
4. Convergence. D n (α) consists of two approximations compared to D(α) namely the finite number of directional vectors and the quantile estimates. Thus for D n (α) to converge to D(α), first, the directional vector selection procedure must cover the unit sphere when the number of directional vectors goes to infinity and, secondly, the directional quantile estimates must converge to the true directional quantiles, when the number of observations goes to infinity. By using the simple procedure above to select uniformly distributed directional vectors, the first requirement is satisfied. Further Yazidi and Hammer (2017) and Hammer et al. (2019) prove the second requirement.

Performance Metrics
We suggest to measure error along lines l i , i = 1, . . . , n v going trough the center of the true distribution and outward in uniformly distributed directions v i , i = 1, . . . , n v ( Figure 1). This approach scales well with dimension p. We suggest two error measures: 1. Depth error: Let w i,k denote the point of intercept between the line, l i , and the envelope and compute the true depth at this point, D( w i,k , P ). The error is computed using mean absolute depth error (MADE) over all the lines l i , i = 1, . . . , n v and again average over envelopes To compute MADE for higher dimensions, the true depth must to be computed for a large set of points w i,k . For non-elliptic distributions this is computationally demanding and was limited to p ≤ 6 in the experiments. For elliptic distributions, and in particular multivariate normal distributions, the true depth of any point can be computed analytically and thus MADE was computed up to dimension p = 10 in the experiments. Details are given in Appendix A.1. Of course, if we knew that the observations were multivariate normally distributed, other depth measures such as Mahalanobis depth would be more natural, but the computations are only used to evaluate the performance of the algorithm for high dimensions. 2. Euclidean distance: Along each line, l i , compute the point of intercept between the line and the true α-depth contour of depth α k , denoted w i,k . Compute the error as the average Euclidean distance (ED) where w i,k still refers to the intercept between line l i and the envelope. Further take average over envelopes

Synthetic Experiments
In this section the performance of the algorithm in Section 3 is evaluated in several synthetic experiments. The experiments focus on streaming data, except in Section 5.2. In Section 6 the algorithm is demonstrated in a real-life data example. All computations where run on a Dell PowerEdge R815 server with 64 1.8 GHz AMD CPU processors and Linux Ubuntu operating system version 16.04. The experiments were implemented in R (R Core Team; 2019), but with the most computer intensive parts in C++ integrated using Rcpp (Eddelbuettel and François;Eddelbuettel;.  Directional quantiles were estimated using DUMIQE with decreasing values of the tuning parameter, λ n = 1/n. We see that fairly good estimate is achieved with 200 observations and that the error is minimal with 2000 observations. Let X refer to the multivariate normal distribution defined above and define the lognormal distribution Y = exp(X). Y is both highly unsymmetrical and highly heavy tailed. The results are shown in Figure 3. Due to the flexibility of the depth concept, the method performs equally well for non-elliptic distributions.

Synthetic Experiments -Static Data Stream
In supplementary material S.1 a few examples of joint estimation of multiple α-depth regions using the ShiftQ algorithm are shown. A link to the supplementary material is found at the end of the document. The results show that multiple depth regions can efficiently be estimated for both Gaussian and non-Gaussian distributions.
Considered now joint estimation of α-depth regions for α = 0.05, 0.2 and 0.4 and for p > 2. Table 1 shows results for standard multivariate normally distributed observation. More detailed results are given in Figures 5 and 6 in supplementary material S.1. A link to the supplementary material is found at the end of the document. CPU time refers to the computational time needed per α-depth region to obtain estimates with a given precision using a single CPU core. The number of directional vectors (and thus CPU time) increases with p and estimation precision. The algorithm performs very well. For example for dimension p = 10 a MADE less than 0.02 this is obtained in about 1.5 seconds. MADE < 0.01 could be reached in shorter CPU time than what is shown in Table 1 using a higher number of directional vectors, but is not explored. Now assume that X = (X 1 , . . . , X p ) T is a multivariate normally distributed variable with zero expectation vector and strong dependencies Cov(X i , X j ) = exp(−0.2|i − j|), i, j = 1, . . . , p The results are shown in Table 2. More detailed results are given in Figures 9 and 10 in supplementary material S.1. By comparing Tables 1 and 2, we see that the number of directional vectors and CPU time needed increase when the variables of X are dependent. Let X still represent the multivariate normally distributed variable with covariances (12). Table 3 shows results for the multivariate lognormal distribution Y = exp(X). More detailed results are given in Figures 11 and 12 in supplementary material S.2. Tables 2 and 3 show that a specific level of MADE is reached faster for the lognormal distribution than for the multivariate distribution documenting that the procedure efficiently can characterize non-Gaussian distributions.

Synthetic Experiments -Offline Setting
In this section, we compare the performance of the incremental quantile estimator, DUMIQE, with state-of-the-art offline quantile estimators to estimate α-depth regions when data is known     The second and third columns shows the CPU time (in seconds) and the number of directional vectors used to obtain an mean absolute depth error (MADE) less than 0.05, respectively. The fourth and fifth and the sixth and seventh columns show the same to obtain MADE less than 0.02 and 0.01, respectively.
in advance. State-of-the-art offline quantile estimators are based on using weighted averages of consecutive order statistics is the jth order statistic of the sample, m a constant and N the sample size. We use m = α+1 3 and δ = N α + m − j and define α[k] = k−1/3 N +1/3 . The sample quantiles can be read from a linear interpolation between the points (α[k], y[k]), k = 1, . . . , N . This the method referred to as Type 8 in the quantile function in R and is the one recommended by Hyndman and Fan (1996).
We consider the multivariate normal distribution case with covariance matrix as given in (12), sample sizes N = 500, 2000, 10 4 and 5 · 10 4 and dimensions p = 2 and p = 3. For p = 2 and p = 3, we used 1500 and 7500 directional vectors, respectively, which were sufficiently many for the estimation error essentially to be due to the performance of the quantile estimators.
The results are shown in Table 4. We see that the estimation errors using DUMIQE are about 1.5 time that of the offline estimator. If fewer directional vectors were used, the differences in estimation error were substantially reduced. Further, the computational time of the offline estimator are about ten times that of the DUMIQE estimator. In other words, if computational time or memory usage are not an issue, the offline estimator combined with a large amount of directional vectors will give the most precise estimates from the samples. Else incremental quantile estimators are preferable even for offline settings.

Synthetic Experiments -Dynamically Changing Data Streams
In this section we consider the problem of tracking α-depth regions of dynamically varying data streams. Figure 4 illustrates the problem. In each panel, the expectation vector of the data   Hyndman and Fan (1996) to estimate α-depth contours for α = 0.05, 0.2 and 0.4. MADE, ED and CPU refers to the error measures in (9) and (10) (multiplied by 10 3 ) and CPU time used (in seconds), respectively. N refers to the sample size.
stream distribution moved from the bottom left of to the upper right. At the same time the correlation, changed from strongly positive, 0.8, to strongly negative, −0.8. For the 10 3 samples case (first row), the algorithm was able to track the α-depth regions satisfactory. With 10 4 samples the estimates improve significantly and with 10 5 observations, the estimates are very close to the true contours. With 10 4 and 10 5 samples, 50 directional vectors give better and smoother estimates than 10 directional vectors. Evaluation for p > 2 is given below. Due to the computational burden of evaluating estimation error of non-elliptic distributions, the analysis was restricted to Gaussian distributions. Let X n = (X n,1 , . . . , X n,p ) T be multivariate normally distributed with where ψ i , i = 1, . . . , p are independent uniformly distributed variables on the interval [0, 2π] ensuring that the marginal expectations are out of phase. Covariance between X n,i and X n,j is Cov(X n,i , X n,j ) = 0.4 sin 2π where ψ is uniformly distributed on the interval [0, 2π]. Tables 5 to 6 show results tracking α-depth regions for α = 0.05, 0.2 and 0.4 for periods T = 10 3 and T = 10 4 under optimal choices of the tuning parameter 5 . More detailed results are given in Figures 11 and 12 in supplementary material S.2. For T = 10 3 MADE is around 0.05 and estimation error does not decrease with increasing number of directional vectors which may seem surprising. The reason is that if the quantile estimates are poor, the intersections of the resulting halfspaces in (7), do not necessarily become better by adding more halfspaces. For T = 10 4 MADE is between 0.02 and 0.03. The optimal number of halfsspaces increases with dimension, but not dramatically.
The algorithm is computationally very efficient. For dimension p = 5 the algorithm can optimally process 10 4 to 10 5 observations from a data stream every second on a single CPU processor.
By using more equidistant directional vectors, we expect reduction in tracking error. Consider the dynamic case above except that the directional vectors are chosen more equidistantly. Directional vectors were generated using the filtering procedure in Section 3 with N u = 10n u .
The results are shown in Table 7 and more detailed results are given in Figure 14 in supplementary material S.2. By comparing Tables 5 and 6 with 7, we see that for T = 10 3 and p = 2  Table 5: Tracking of α-depth regions for α = 0.05, 0.2 and 0.4 for the distribution characterized by (13) and (14) with a period T = 10 3 . The columns 'Freq' refers to how many times per millisecond the algorithm can update an α-depth region when running on a single 1.8 GHz CPU processor.  Table 6: Tracking of α-depth regions for α = 0.05, 0.2 and 0.4 for the distribution characterized by (13) and (14) with a period T = 10 4 . The columns 'Freq' refers to how many times per millisecond the algorithm can update an α-depth region when running on a single 1.8 GHz CPU processor.  Table 7: Tracking of α-depth regions for α = 0.05, 0.2 and 0.4 for the distribution characterized by (13) and (14) using fairly equidistant directional vectors. Tracking error is measured using MADE. The left and right columns show results for T = 10 3 and 10 4 , respectively. Dimension is p = 2.
T = 10 4 , minimum MADE is reduced from 0.045 to 0.040 and from 0.0226 to 0.0216, respectively. However more importantly, by using equidistant vectors, the best results are obtained using fewer directional vectors. For both T = 10 3 and T = 10 4 , the optimal number of vectors are reduced from 25 to 10. Finally we observe significant improvement if only five directional vectors were used. Using equidistant directional vectors adds an additional computational cost in the initialization of the algorithm, but will result in gained peak performance and fewer directional vectors, and thus less computation time and memory, needed during tracking.

Real-life Data Examples
In this section we use the algorithm on a real-life dataset related to activity change detection. A second real-life data example related to real-time event detection using Twitter data is given in supplementary material S.3. We demonstrate how the algorithm can be used to detect outliers and events and perform classifications in dynamic settings. For example related to event detection, by characterizing a data stream distribution with multiple depth contours, in practice any change in the data stream distribution can be detected. Not only changes in common properties such as expectation and covariance structure, but also changes in shape such as a change from an elliptic to a non-elliptic distribution.

Activity Change Detection
Activity recognition is a highly active field of research where sensory information is used to automatically detect and identify activities of users. E.g. to detect sedentary lifestyle and prompt the user to perform healthy exercises. We will focus on identifying changes in activities using observations from accelerometer which is available on almost any smart cell phone today.
We consider an accelerometer dataset from the WISDM (Wireless Sensor Data Mining) project (Kwapisz et al.;. Accelerations in x, y and z directions where observed, with a frequency of 20 observations per second, while users were performing the activities walking, jogging, walking up a stairway and walking down a stairway. A total of 36 users were observed and the dataset contains a total of 989 875 observations. Current research focuses on supervised approaches where historic and annotated activity observations are used to train an activity classification model. E.g. Kwapisz et al. (2011) trained models such as decision trees and neural networks. However such an approach is highly sensitive to any temporal changes in the data, e.g. if the user changes to an activity that is not part of the training material, becomes fitter, sick etc. In this example we rater take an unsupervised approach and the goal is to detect whenever the user changes activity. Since we receive 20 accelerometer observations per second, it is important that the streaming approach is computationally efficient.
Change detection is useful as part of a supervised scheme. Whenever a change is detected, the observations from the last activity can be classified and the supervised classifier retrained. If the supervised learner is sufficiently uncertain about the last activity, it may in real time ask the user for feedback. Figure 5 shows in gray x, y and z acceleration for an arbitrary user. The red lines show when the user changed activity. Acceleration distributions are fairly stationary within an activity, but with some gradual and abrupt changes. The users often changed activities as often as every 30 second making this a challenging tracking and change detection problem. Figure 6 shows scatterplots of the accelerometer observations for two arbitrary sessions with minimal temporal trend. The simultaneous acceleration distributions vary a lot between sessions and are often far from elliptical. Further, even though the distributions may look quite different, the mean and covariances often are quite similar making change detection based on elliptic  distributions challenging. We thus suggest the following simple depth based change detection procedure: 1. Track α-depth contours of the simultaneous acceleration distribution by tracking n u directional quantiles using the DUMIQE algorithm with tuning parameter λ.
2. Compute the Euclidean distance between the current α-depth contours and the contours h seconds back in time using Equation (10). Let ED t denote the distance at time t.
3. Track the expectation and standard deviation of ED t distribution using exponential moving average 4. When the user changes activity, we expect ED t to rapidly increase. We detected a new activity when ED t is more than η standard deviations higher then E(ED t ), i.e. ED t ≥ E(ED t ) + η SD(ED t ).
5. When a new activity was detected, restart the tracking of the α-depth contours and go back to step 1.
The beauty of the approach above is that since it measures difference in depth contours, it can in practice detect any kind of changes in the shape simultaneous acceleration distribution: for example a change from a symmetric to a non-symmetric distribution. Given the properties of the observations in this application, this flexibility is important. We compare the approach against an identical approach except that in the first part of the algorithm the mean and covariance structure (and not depth contours) were tracked using multivariate exponentially weighted moving average (MEWMA) (Lowry et al.;1992).
We measured the performance of the depth and the MEWMA approaches for a wide range of values for the tuning parameters. Several sessions lasted for only 30 seconds and it was thus important for the tracking algorithms to rapidly adapt to a session before a new change of activity took place. In the first step of the procedures we thus chose decreasing values of the tuning parameters, but with a minimum value to take into account the dynamic changes in accelerations within a session, λ t = max{1/t, λ min }, and tried the values 0.1, 0.05 and 0.01 for λ min 6 . This performed better than using constant values of the tuning parameter. We further tried the values 0.1, 0.05 and 0.01 for δ, 2.5, 5 and 10 seconds for h and 2, 5 and 8 for η. Further for the depth approach we used three depth contours with α equal to 0.2, 0.05 and 0.01 and tried n u = 20 or 50 directional vectors. We ran the two change detection approaches for the whole dataset for all the combinations of the parameters. This resulted in a total of 162 and 81 experiments for the depth and the MEWMA approaches, respectively.
Precision, recall and the F1 score was used to measure performance (Sokolova and Lapalme;). If the approach detects more than one change between two true changes, we characterize the first change as a correct detection and the others as false detections and define    Tables 8 and 9 show the top ten results with respect to the F1 score. The depth approach outperforms the MEWMA with respect to the F1 score and in addition detects the true changes more rapidly. The performance of the depth approach does not seem to be particularly sensitive on the number of directional quantiles used.

Closing Remarks
In this paper we have presented a computationally and memory efficient procedure to estimate and track Tukey α-depth contours using incremental quantile estimators. Due to the flexibility of Tukey depth, the procedure characterize elliptic and non-elliptic distributions equally well. The real-life data examples demonstrate that the procedure is useful to make real-time decisions from complex multidimensional streaming data.
To estimate α-depth contours, the number of directional vectors, n u , and values of tuning parameters in the incremental quantile tracking algorithms must be chosen. We are currently working on procedures that use use information from the history of the data stream to recursively update such values.