Detecting Metachanges in Data Streams from the Viewpoint of the MDL Principle

This paper addresses the issue of how we can detect changes of changes, which we call metachanges, in data streams. A metachange refers to a change in patterns of when and how changes occur, referred to as “metachanges along time” and “metachanges along state”, respectively. Metachanges along time mean that the intervals between change points significantly vary, whereas metachanges along state mean that the magnitude of changes varies. It is practically important to detect metachanges because they may be early warning signals of important events. This paper introduces a novel notion of metachange statistics as a measure of the degree of a metachange. The key idea is to integrate metachanges along both time and state in terms of “code length” according to the minimum description length (MDL) principle. We develop an online metachange detection algorithm (MCD) based on the statistics to apply it to a data stream. With synthetic datasets, we demonstrated that MCD detects metachanges earlier and more accurately than existing methods. With real datasets, we demonstrated that MCD can lead to the discovery of important events that might be overlooked by conventional change detection methods.


Purpose of This Paper
In this study, we are concerned with detecting changes in data streams. The goal of change detection is to detect the time points at which the nature of the data-generating mechanism significantly changes.
Thus far, many algorithms have been proposed to detect change points in data streams (e.g., [1][2][3][4][5][6][7][8][9][10][11]), and several studies addressed or have been related to the issue of changes of changes [12][13][14][15][16][17][18]. In this paper, we refer to the changes of changes as metachanges. A metachange refers to a change in the pattern of when or how changes occur. It is practically important to detect metachanges because they may be early warning signals of important events [12,13]. Metachanges have been treated from a viewpoint of metachanges along time. Metachanges along time indicate that the interval significantly varies between the change points. Such metachanges were called burstiness [12] and volatility [13] in previous studies. The detection of metachanges along time provides users with useful information from data streams. For example, in a machine in a manufacturing factory, a decrease in the interval between change points might be a sign of a serious failure.
There is also another type of metachange: metachanges along state. Here, "state" refers to the parameter value of a probability density function. We consider a situation where change points $t_1, t_2, \ldots$ are detected for a data stream $y_1, y_2, \ldots$, and $y_t$ is drawn from $p_y(y_t; \eta)$, where $p_y$ is a probability density function and $\eta$ is the associated parameter. We call $\eta$ the state in this paper; it takes different values before and after a change point. A metachange along state is a change in how significantly $\eta$ varies before and after a change point. Metachanges along state might provide information such as changes of magnitude and velocity, which indicate an important change in the underlying data-generating mechanism. For example, in a machine in a manufacturing factory, a shift from a gradual (incremental) change to an abrupt (sudden) change [19], or the reverse shift, might be a sign of serious events.
A conceptual illustration of metachanges is shown in Figure 1, where the upper graph shows a data stream $y_1, y_2, \ldots$ and change points $\{t_i\}_{i=1}^{8}$ on the horizontal axis. The lower left graph shows the intervals between change points, $\Delta t = t_i - t_{i-1}$, on the vertical axis. Metachanges along time occur at $t_4, t_5, t_6, t_7$: for example, $t_4 - t_3$ is different from $t_3 - t_2$ and $t_2 - t_1$. The lower right graph shows the states estimated piecewise between the change points. Here, we assume $y_t$ is drawn from the univariate normal distribution $p_y(y_t; \mu, \sigma)$, where $\mu$ is the mean and $\sigma$ is the standard deviation. In this case, $(\mu, \sigma)$ is the state. In Figure 1, because there is no significant change in the magnitude of state change between $t_1$ and $t_2$, a metachange along state does not occur at $t_2$. However, there is a significant change in the magnitude of change of $\mu$ between $t_2$ and $t_3$; thus, a metachange along state occurs at $t_3$. Because the magnitudes of the changes of $\mu$ and $\sigma$ are almost the same between $t_3$ and $t_4$, a metachange along state does not occur at $t_4$. Using the same procedure, we conclude that metachanges along state occur at $t_3$ and $t_7$ with respect to $\mu$. Moreover, metachanges along state occur at $t_6$ and $t_8$ with respect to $\sigma$: the magnitude of the change of the standard deviation around $t_6$ ($t_8$) is greater than that around $t_5$ ($t_7$). As a result, metachanges along state occur at $t_3$, $t_6$, $t_7$, and $t_8$. We can infer that metachanges along both time and state occur at $t_6$ and $t_7$.
Metachanges along time have been investigated in previous studies [12,13]. Although several studies are related to metachanges along state [14][15][16][17][18], their focus was not on metachanges along state in particular. The purpose of this paper is to propose a framework and an approach to detect metachanges along time and state from a unified view with the minimum description length (MDL) principle [20].
Therefore, our framework and approach not only include previous notions such as burstiness [12] and volatility [13] but also extend these notions to metachanges along state. MDL asserts that the best statistical decision strategy is the one that compresses the data best.
Description and coding with MDL are suitable for quantifying changes, and they enable us to easily integrate the code lengths of time and state.
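The piecewise state estimation described for Figure 1 can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes a univariate normal model, change points given as 0-based indices at which a new segment starts, and the hypothetical helper names `segment_states` and `change_magnitudes`.

```python
import math

def segment_states(y, change_points):
    """Return the (mean, std) maximum likelihood estimates of each segment.

    `change_points` holds 0-based indices at which a new segment starts
    (an assumed convention for this sketch).
    """
    bounds = [0] + sorted(change_points) + [len(y)]
    states = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        seg = y[start:end]
        mu = sum(seg) / len(seg)
        var = sum((v - mu) ** 2 for v in seg) / len(seg)  # MLE divides by n
        states.append((mu, math.sqrt(var)))
    return states

def change_magnitudes(states):
    """Absolute change of (mean, std) across each change point."""
    return [(abs(m2 - m1), abs(s2 - s1))
            for (m1, s1), (m2, s2) in zip(states[:-1], states[1:])]

# Toy stream: the mean jumps by 1 at index 4 and by 4 at index 8.
y = [0.0] * 4 + [1.0] * 4 + [5.0] * 4
states = segment_states(y, [4, 8])
mags = change_magnitudes(states)
```

In this toy stream the magnitude of the mean change grows from 1 to 4 between the two change points, which is the kind of pattern that the text calls a metachange along state.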

Related Work
Change detection has been extensively explored in the area of data mining. Thus far, several methods have been proposed to detect metachanges in data streams [12,13], and there have been several studies related to metachanges along state [14][15][16][17][18].
Kleinberg [12] and Huang et al. [13] proposed algorithms for detecting metachanges along time. Kleinberg [12] proposed an algorithm to detect bursts in a time series. This algorithm assumes that intervals between successive events are drawn from an exponential distribution. The discretized values of the parameters of the exponential distribution are regarded as states. For intervals between successive events, states are estimated with dynamic programming. Changes of state indicate changes of intervals between the successive events. Huang et al. [13] proposed an algorithm, called the volatility detector, which detects changes of rates of change. The volatility detector prepares two buckets, called the buffer and the reservoir, to store intervals between change points. The intervals are put into the buffer sequentially. When the buffer is full, an interval is dropped from the buffer and moved to the reservoir in a first-in-first-out fashion. The reservoir stores the dropped interval by randomly replacing one of its stored intervals. If the ratio of variances of the buffer and the reservoir is over or under the specified threshold, the algorithm judges that the intervals change between change points. The authors called this event volatility shift. Both the burst detector and volatility detector are assumed to be used in two steps. That is, change points are detected with other change detection algorithms, and then changes of intervals between the change points are detected. While the burst detector works in an offline fashion, the volatility detector works in an online fashion.
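The buffer and reservoir mechanism described above can be sketched as follows. This is a simplified illustration rather than the implementation of [13]: the class name `VolatilitySketch`, the initial filling phase of the reservoir, and the use of population variance are assumptions made for this example. The method returns the variance ratio and leaves the thresholding to the caller.

```python
import random
from collections import deque

class VolatilitySketch:
    """Simplified buffer/reservoir bookkeeping for intervals between change points."""

    def __init__(self, buffer_size, reservoir_size, seed=0):
        self.buffer = deque()
        self.buffer_size = buffer_size
        self.reservoir = []
        self.reservoir_size = reservoir_size
        self.rng = random.Random(seed)

    @staticmethod
    def _variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    def add(self, interval):
        """Store an interval; return buffer/reservoir variance ratio, or None if not ready."""
        self.buffer.append(interval)
        if len(self.buffer) > self.buffer_size:
            dropped = self.buffer.popleft()  # first-in-first-out
            if len(self.reservoir) < self.reservoir_size:
                self.reservoir.append(dropped)  # assumed filling phase
            else:
                slot = self.rng.randrange(self.reservoir_size)
                self.reservoir[slot] = dropped  # random replacement
        if len(self.reservoir) < self.reservoir_size:
            return None
        reservoir_var = self._variance(self.reservoir)
        if reservoir_var == 0:
            return None  # ratio undefined in this sketch
        return self._variance(self.buffer) / reservoir_var

detector = VolatilitySketch(buffer_size=2, reservoir_size=2)
ratios = [detector.add(x) for x in [10, 12, 10, 12]]
```

A ratio far above or below 1 would then be compared against a user-specified threshold to flag a volatility shift.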
Moreover, there have been several studies related to metachanges along state [14][15][16][17][18]. Aggarwal [15] introduced velocity density estimation to understand, visualize, and determine trends in the evolution of fast data streams. Spiliopoulou et al. [16,17] proposed an algorithm, called MONIC, to model and track cluster transitions. Ntoutsi et al. [18] proposed an algorithm, called FINGERPRINT, to summarize cluster evolution. Huang et al. [14] proposed a change type detector, intended to categorize change types into three relative types, some of which correspond to the concept drifts proposed in [19]. Although these algorithms [14][15][16][17][18] are related to metachanges along state, they are not intended to characterize and detect metachanges directly. In addition, many change detection algorithms have been proposed based on detecting changes of state (e.g., [6][7][8][9][21][22]). Dynamic model selection [6,7] is the seminal work applying MDL to the tasks of dynamic model selection and change detection. The MDL change statistics [8], SCAW [9], and STREAMKRIMP [22] are change detection algorithms based on MDL. However, these algorithms are not intended to characterize and detect metachanges along state directly.

Significance of This Paper
In the context of Sections 1.1 and 1.2, the contributions of this paper are summarized in the following subsections.

Proposal of Concept of Metachange
To detect changes of changes in data streams, we define a concept of metachanges along both time and state. Previous studies [12,13] considered metachanges along time only. In this paper, we deal with metachanges along both time and state. Metachanges along time include the notions proposed in previous studies such as burstiness [12] and volatility [13]. Metachange along state could capture changes of changes of the parameters of distribution between change points.
Our concept of metachange can detect the potential change of changes in data streams, which was overlooked by previous studies.

Novel Algorithm for Detection of Metachanges
We define metachange statistics along both time and state. There is a challenge to combining the metachange statistics along time and those along state. In this paper, these statistics are defined based on the MDL principle. Metachange statistics along time (MCAT) is defined as the code length of an interval between the change points, whereas metachange statistics along state (MCAS) is defined as the difference between the predictive code length and the normalized maximum likelihood (NML) code length [23] after a change. It is possible to simply add these statistics because they are defined as code lengths, which enables us to detect metachanges along both time and state in a unified manner.

Theoretical Background of Metachange Statistics
In this section, we consider how to encode both the intervals between change points and the states around the change points. We assume that, for a data stream $y_1, y_2, \ldots$, change points $t_1, t_2, \ldots$ are detected, and that the intervals between change points $x_i = t_i - t_{i-1}$ and the data $y_t$ are drawn, respectively, from $p_x(x; \xi)$ and $p_y(y; \eta)$, where $p_x$ and $p_y$ are probability density functions and $\xi$ and $\eta$ are the associated parameters. Here, $\eta$ is the state whose metachanges are addressed in this paper.

Definitions of Metachanges
In this subsection, we give definitions of metachanges.
where q 1 and q 2 are distributions of intervals. q 1 → q 2 means that x t ∼ q 1 at t = t i−1 and x t ∼ q 2 at t = t i . d is a distance function between the probability density functions.

Definition 2.
(Metachange along state) For a data stream y 1 , y 2 , . . . , we say that a metachange along state occurs at a change point t i for a threshold parameter δ s > 0 if and only if where q 1 , q 2 , and q 3 are distributions of values of the data stream. Equation (2) means that y t ∼ q 1 at Here, d is the same as that in Definition 1.

Problem Setting
In this subsection, we consider a situation where $(m + 1)$ change points $t_1, \ldots, t_{m+1}$ are given. We consider how to encode $x_i$ and $y_t$ with as short a code length as possible. The ideal code length required for encoding $x_i$ is given by what we call the predictive code length, which is the sum of the negative logarithms of the predictive density $p_x$ at each change point, $\sum_i -\log p_x(x_i; \hat{\xi}_{x^{i-1}})$ (Equation (3)), where the estimates $\hat{\xi}_{x^{i-1}}$ are computed at each change point. Similarly, the ideal code length required for encoding $y_t$ around change points is given by the predictive code length $\sum_i \sum_{t \in \mathrm{Neighbor}(t_i)} -\log p_y(y_t; \hat{\eta}_{y^{t-1}})$ (Equation (4)), where $\mathrm{Neighbor}(t_i)$ indicates the neighborhood of a change point $t_i$. In practice, the neighborhood is specified as explained in Section 3.3. The estimates $\hat{\xi}_{x^{i-1}}$ and $\hat{\eta}_{y^{t-1}}$ are computed from $x^{i-1} = x_1 \ldots x_{i-1}$ and $y^{t-1} = y_1 \ldots y_{t-1}$, respectively. A change of $\hat{\eta}_{y^{t-1}}$ indicates a change of state. Detection of a metachange along time is thus formulated as the problem of detecting a change of $\hat{\xi}_{x^{i-1}}$ in Equation (3). Detection of a metachange along state, on the other hand, is formulated as the problem of detecting a change in how $\hat{\eta}_{y^{t-1}}$ in Equation (4) behaves around a change point, compared across successive change points.

Metachange Detection Algorithm
In this section, we present our online algorithm, called the metachange detection algorithm (MCD), for detecting metachanges along both time and state. We consider how to achieve Equations (3) and (4) in an online fashion. A schematic description of MCD is shown in Figure 2. First, we detect change points from the data stream (A). Next, we concurrently detect metachanges along time (B) and along state (C). We introduce metachange statistics to quantify these metachanges. Finally, we integrate the metachange statistics along time and state into a single statistic (D).
The key challenge of detecting metachanges along time and state is how to describe and integrate them. Our approach describes both metachanges as code lengths with MDL; therefore, it is easy to combine them.

Detecting Change Points
First, we detect change points t 1 , t 2 , . . . . As our proposed algorithm MCD works in an online fashion, it is necessary for the change detection algorithm to work in an online fashion (e.g., [1][2][3][4]8,9]).
In general, the performance of MCD depends on the errors made by the change detection algorithm and on the choice of its threshold parameter. We empirically investigate and discuss this point in detail in Section 4.

Detecting Metachanges along Time
For the detected change points $t_1, t_2, \ldots$, let us consider the intervals between successive change points, $x_i = t_i - t_{i-1}$. For an interval sequence $x^i = x_1 \ldots x_i$, we consider how to achieve Equation (3) in an online fashion. We define the metachange statistics along time (MCAT) $a_{t_i}$ as the predictive code length of the latest interval, $a_{t_i} = -\log p_x(x_i; \hat{\xi}_{x^{i-1}})$ (Equation (5)), where $\hat{\xi}_{x^{i-1}}$ is an estimate of $\xi$ computed from $x^{i-1}$. For example, we can take $\hat{\xi}_{x^{i-1}}$ to be the maximum likelihood estimator. To deal with nonstationary data streams, we use the online discounting maximum likelihood estimator [24] (Equation (6)), where $0 < r < 1$ is a discounting parameter. The larger $r$ is, the faster past data are forgotten.
In this paper, we introduce a parametric class of exponential distributions (Equation (7)). Substituting Equation (7) into Equation (6) gives Equation (8). Expanding the expression inside the argmax on the right-hand side of Equation (8) gives Equation (9), whose right-hand side is maximized by differentiating it with respect to $\xi$ and setting the derivative to zero. This yields the closed-form optimal solution in Equation (10). Substituting Equation (10) into Equation (5) then gives the MCAT at $t_i$ (Equation (11)). In practice, we judge that a metachange occurs along time when MCAT changes greatly between change points. Technically, we use the change rate of MCAT: a metachange occurs along time when $|(a_{t_i} - a_{t_{i-1}})/a_{t_{i-1}}| > \epsilon_t$ holds, where $\epsilon_t > 0$ is a threshold parameter. We call the algorithm described above the metachange detection along time algorithm (MCD-T).
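The maximization step can be written out as follows. This is a hedged reconstruction rather than a verbatim copy of the source's equations: it assumes the exponential density $p_x(x; \xi) = \xi e^{-\xi x}$ and geometric discounting weights $w_j = r(1-r)^{i-j}$, a common form for discounting estimators.

```latex
\begin{align*}
\hat{\xi}_{x^{i}}
  &= \operatorname*{argmax}_{\xi} \sum_{j=1}^{i} w_j \left( \log \xi - \xi x_j \right),
  \qquad w_j = r (1-r)^{i-j}, \\
0 &= \frac{\partial}{\partial \xi} \sum_{j=1}^{i} w_j \left( \log \xi - \xi x_j \right)
   = \frac{1}{\xi} \sum_{j=1}^{i} w_j - \sum_{j=1}^{i} w_j x_j, \\
\hat{\xi}_{x^{i}}
  &= \frac{\sum_{j=1}^{i} (1-r)^{i-j}}{\sum_{j=1}^{i} (1-r)^{i-j} x_j}, \\
a_{t_i}
  &= -\log p_x\!\left(x_i; \hat{\xi}_{x^{i-1}}\right)
   = -\log \hat{\xi}_{x^{i-1}} + \hat{\xi}_{x^{i-1}}\, x_i .
\end{align*}
```

The objective is concave in $\xi$ (its second derivative is $-\sum_j w_j / \xi^2 < 0$), so the stationary point is the maximizer. Under these assumptions the estimate is the reciprocal of a discounted mean of the past intervals.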
As for the computational cost of MCAT, Equation (10) can be rewritten in terms of a quantity $s_i$, where $s_i$ and $s_{i-1}$ satisfy a recurrence relation. Therefore, the computational cost of MCAT $a_{t_i}$ is $O(i)$.
The intervals $x_i$ are calculated from successive change points. Figure 3 shows the time intervals at the change points (top), the MCATs $a_{t_i}$ (second graph), the change rate of the MCATs $|(a_{t_i} - a_{t_{i-1}})/a_{t_{i-1}}|$ (third graph), and $\hat{\xi}_{x^{i-1}}$ (bottom). We observe in Figure 3 that the metachange along time can be detected when a suitable threshold $\epsilon_t$ is chosen. Here, the discounting parameter is set to $r = 0.5$.
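An online computation of MCAT can be sketched as below. The sketch assumes an exponential interval model with density $\xi e^{-\xi x}$ and geometric discounting, so that the estimate is a ratio of two discounted sums that can each be updated recursively. The class name `OnlineMCAT` and its attributes are hypothetical and are not taken from the authors' code.

```python
import math

class OnlineMCAT:
    """Online MCAT under an exponential interval model (illustrative sketch)."""

    def __init__(self, r):
        if not 0.0 < r < 1.0:
            raise ValueError("r must lie in (0, 1)")
        self.r = r
        self.weight_sum = 0.0  # discounted count: sum_j (1 - r)^(i - j)
        self.weighted_x = 0.0  # discounted sum:   sum_j (1 - r)^(i - j) * x_j

    def update(self, x):
        """Score interval x with the previous estimate, then learn from x.

        Returns None for the first interval, when no estimate exists yet.
        """
        score = None
        if self.weighted_x > 0.0:
            xi = self.weight_sum / self.weighted_x
            score = -math.log(xi) + xi * x  # -log(xi * exp(-xi * x))
        # Recursive updates: discount the past, then add the new interval.
        self.weight_sum = (1.0 - self.r) * self.weight_sum + 1.0
        self.weighted_x = (1.0 - self.r) * self.weighted_x + x
        return score

mcat = OnlineMCAT(r=0.5)
scores = [mcat.update(x) for x in [10, 10, 10, 2]]
```

Each update touches only two running sums, so the per-change-point work in this sketch is constant. In the toy run the score changes when the interval drops from 10 to 2, and it is this change, measured as a change rate, that MCD-T thresholds.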

Detecting Metachanges Along State
For a change point $t_i$ detected in Section 3.1, we consider how to achieve Equation (4) in an online fashion. We consider a subset of time points around $t_i$ for $\mathrm{Neighbor}(t_i)$ in Equation (4). The subset is $\{t_i - h, \ldots, t_i + h\}$, where $h \in \mathbb{N}$ is a window size. Thus, we consider a sequence $y_{t_i-h}^{t_i+h} = y_{t_i-h} \ldots y_{t_i+h}$ of length $n = 2h + 1$. We introduce a parametric class of probability distributions $F_s = \{p_y(Y; \eta); \eta \in H\}$. Here, $Y$ is a random variable, $\eta$ is a real-valued parameter, and $H$ is the associated parameter space.
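The window around a change point can be extracted as in the short sketch below. It assumes 0-based indexing and returns `None` when the window does not fit inside the observed stream, which in an online setting means waiting for $h$ more observations after $t_i$. The function name `neighborhood` is hypothetical.

```python
def neighborhood(y, t_i, h):
    """Return y[t_i - h], ..., y[t_i + h] (0-based), or None if unavailable."""
    if h < 1:
        raise ValueError("h must be a positive integer")
    lo, hi = t_i - h, t_i + h
    if lo < 0 or hi >= len(y):
        return None  # window not yet (or never) fully observed
    return y[lo:hi + 1]

window = neighborhood(list(range(10)), t_i=5, h=2)
```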
Next, we define the metachange statistics along state (MCAS) at change point $t_i$. First, two statistics, $b^{+}_{t_i}$ and $b^{-}_{t_i}$, are introduced. These are defined as the difference between two code lengths for $y_{t_i+1}^{t_i+h}$. The former is the predictive code length in Equation (12), computed with a parameter $\hat{\eta}_{\pm}$ that assumes the parameter changes at $t_i$ in the same way as at the previous change point $t_{i-1}$, either to the same side ($+$) or to the opposite side ($-$). Here, $\hat{\eta}_{y_{\tau_1}^{\tau_2}}$ denotes the maximum likelihood estimator of $\eta$ computed from $y_{\tau_1}^{\tau_2}$. The latter is the NML code length in Equation (14), which is defined as the negative logarithm of the NML distribution [20]. The difference between Equation (12) and Equation (14) is given by Equation (15), whose normalization term is computed using Rissanen's approximation formula under some regularity conditions [23], where $k$ is the dimension of $H$ and $I(\eta)$ is the Fisher information matrix at the parameter value $\eta$. Intuitively, Equation (15) quantifies the redundant code length for coding $y_{t_i+1}^{t_i+h}$ with the parameters estimated from the parameter change at $t_{i-1}$ and the parameter values in the former part of the window around $t_i$.
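For reference, the NML code length and Rissanen's asymptotic approximation of its normalization term are commonly stated as below. This is the standard textbook form, written for a generic sequence $y^n$ of length $n$; the notation may differ from the source's own numbered equations.

```latex
\begin{align*}
L_{\mathrm{NML}}(y^{n})
  &= -\log p_y\!\left(y^{n}; \hat{\eta}(y^{n})\right) + \log C_n,
  \qquad
  C_n = \int p_y\!\left(z^{n}; \hat{\eta}(z^{n})\right) \mathrm{d}z^{n}, \\
\log C_n
  &= \frac{k}{2} \log \frac{n}{2\pi}
   + \log \int_{H} \sqrt{\lvert I(\eta) \rvert}\, \mathrm{d}\eta + o(1).
\end{align*}
```

Here $\hat{\eta}(\cdot)$ is the maximum likelihood estimator, $k$ is the dimension of $H$, and $\lvert I(\eta) \rvert$ is the determinant of the Fisher information matrix. The integral over $H$ must be finite for the approximation to apply, which is one reason parameter ranges are often restricted in practice.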
Finally, we define MCAS $b_{t_i}$ as in Equation (16), which means that, in this paper, metachanges along state are quantified by the relative magnitude of changes in the parameters. Because the window size $h$ is fixed, the computational cost of MCAS is $O(h) = O(1)$. We judge that a metachange along state occurs at $t_i$ when $b_{t_i} > \epsilon_s$ holds, where $\epsilon_s > 0$ is a threshold parameter. We call the algorithm described above the metachange detection along state algorithm (MCD-S).
Example: We generate a data stream with length 11,250. Figure 4 shows that the statistics $b_{t_i}$ increase when there is a change in how the parameters behave around a change point, compared across successive change points. At $t_2 = 2001$ and $t_3 = 3001$, the values of $b_{t_i}$ are relatively small, which shows that the parameter changes (i.e., their magnitudes) do not differ much between $t_1 = 1001$ and $t_2 = 2001$ or between $t_2 = 2001$ and $t_3 = 3001$. However, $b_{t_i}$ increases at $t_4 = 4001$ because the change shifts from an abrupt change to a gradual one. These results indicate that MCAS provides information regarding changes in the behavior around the change points.

Integrating Metachange Statistics
Finally, we consider how to integrate MCAT $a_{t_i}$ and MCAS $b_{t_i}$ at a change point $t_i$. Because $a_{t_i}$ and $b_{t_i}$ are both code lengths, they can be summed. Therefore, we propose adding $a_{t_i}$ and $b_{t_i}$ with weighting. The integrated metachange statistics (MCI) $s_{t_i}$ at $t_i$ is defined as this weighted sum in Equation (17), where $\lambda \in \mathbb{R}$ is the weighting hyperparameter. We should choose $\lambda$ carefully based on data. In Section 4.3, $\lambda$ is determined using a grid search.
In practice, we judge that a metachange along both time and state occurs at $t_i$ when MCI changes greatly between change points. As in the case of metachanges along time in Section 3.2, we use the change rate of MCI: a metachange along both time and state occurs at $t_i$ when $|(s_{t_i} - s_{t_{i-1}})/s_{t_{i-1}}| > \epsilon_{ts}$ holds, where $\epsilon_{ts} > 0$ is a threshold parameter. We call the overall algorithm described above MCD; it is summarized in Algorithm 1.

Detect change point with a change detection algorithm.
if $t$ is a change point then
    $t_i \leftarrow t$.
    Calculate metachange statistics along time $a_{t_i}$ according to Equation (11).
    Calculate metachange statistics along state $b_{t_i}$ according to Equation (16).
    Calculate integrated metachange statistics $s_{t_i}$ according to Equation (17).
    Raise an alarm if and only if $|(s_{t_i} - s_{t_{i-1}})/s_{t_{i-1}}| > \epsilon_{ts}$.
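The alarm rule in the last step, which is also the rule used for MCAT in MCD-T, can be sketched as a small helper. The function name `change_rate_alarms` is hypothetical, and skipping a zero denominator is a choice made for this sketch only.

```python
def change_rate_alarms(stats, eps):
    """Return indices i >= 1 where |(s_i - s_{i-1}) / s_{i-1}| > eps."""
    alarms = []
    for i in range(1, len(stats)):
        prev = stats[i - 1]
        if prev == 0:
            continue  # change rate undefined; skipped in this sketch
        if abs((stats[i] - prev) / prev) > eps:
            alarms.append(i)
    return alarms

alarms = change_rate_alarms([2.0, 2.1, 4.0, 4.1], eps=0.5)
```

Because the rule is relative, a statistic that stays at a new level after a jump raises a single alarm at the jump rather than at every later change point.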

Experiment
We conducted five experiments to confirm the effectiveness of the proposed algorithm MCD (https://github.com/s-fuku/metachange).

Synthetic Dataset 1 (Metachanges along Time)
We defined six levels of time intervals between change points referring to the work in [13,25]. The interval lengths were 100,000, 50,000, 10,000, 5000, 1000, and 500. The change points were set using a Bernoulli distribution oscillating between µ = 0.2 and µ = 0.8. For each combination of two intervals, we generated the streams based on the scheme above. Each stream contained 100 change points. In what follows, L 1 and L 2 indicate the first and second interval lengths, respectively.
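A stream of this kind can be generated with the sketch below. It is an illustration under stated assumptions, not the authors' generator: the function name `bernoulli_stream` is hypothetical, and the number of segments generated with each interval length is a free parameter because the text does not say how the 100 change points are split between $L_1$ and $L_2$.

```python
import random

def bernoulli_stream(l1, l2, n_segments_each, seed=0, mu_pair=(0.2, 0.8)):
    """Generate a 0/1 stream whose Bernoulli mean alternates between mu_pair.

    Segments have length l1 first and l2 afterwards, so the interval between
    change points shifts once (a metachange along time).
    Returns (stream, change_points) with 0-based change point indices.
    """
    rng = random.Random(seed)
    stream, change_points = [], []
    lengths = [l1] * n_segments_each + [l2] * n_segments_each
    for seg_index, length in enumerate(lengths):
        if stream:
            change_points.append(len(stream))  # a new segment starts here
        mu = mu_pair[seg_index % 2]
        stream.extend(1 if rng.random() < mu else 0 for _ in range(length))
    return stream, change_points

stream, cps = bernoulli_stream(l1=100, l2=50, n_segments_each=3, seed=1)
```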
We confirmed the effectiveness of MCD by comparing it with the volatility detector (VD) [13]. We used the SEED algorithm [13] and the sequential MDL-change statistics algorithm (SMDL) [8] for change detection. SEED was based on ADWIN2 [21], and its parameters were set to $\delta = 0.05$, $\Gamma = 75$, $\epsilon = 0.025$, and $\alpha = 0.025$, which are the same as those in [13]. The window size $w$ of SMDL was set to $w = 0.2 L_1$, and the threshold parameter was set to $\epsilon = 0.01$. For the Bernoulli distribution, the change score $\Psi_t$ of SMDL at time $t$ was calculated from the estimates $\hat{\mu}_0 = \sum_{i=t-w}^{t+w} y_i/(2w + 1)$, $\hat{\mu}_1 = \sum_{i=t-w}^{t-1} y_i/w$, and $\hat{\mu}_2 = \sum_{i=t}^{t+w} y_i/(w + 1)$. If $\Psi_t > \epsilon$, $t$ is regarded as a change point. We determined that $t$ was a change point if the change score $\Psi_t$ was the maximum. The parameter of MCD-T was set to $r = 0.2$. Below, we discuss the dependency of MCD-T on $r$ in Figure 5. For VD, the buffer size was $B = 32$ and the reservoir size was $R = 32$, which were the same as in [13]. We also discuss the dependency of VD on $B$ and $R$ below in Figure 6. In running SEED [13], we used the Java source code provided by the authors (https://www.cs.auckland.ac.nz/research/groups/kmg/DavidHuang.html). For both MCD-T and VD, we started to use change points once their number reached $B + R$, because the buffer and the reservoir of VD are not full until $B + R$ intervals arrive.
We investigated the trade-off between detection delay and accuracy in terms of benefit and false alarm rate, defined as in [8,26]. For MCD-T, we first fixed the threshold parameter $\epsilon_t$ and converted the change rate of MCAT into binary alarms $\alpha_{t_i} = 1(|(a_{t_i} - a_{t_{i-1}})/a_{t_{i-1}}| > \epsilon_t)$, where $1(\cdot)$ denotes the binary function that takes 1 if and only if its argument is true. We evaluated MCD-T by varying $\epsilon_t$.
We let $\tau$ be the maximum tolerable delay of metachange detection. When the metachange actually started at $t^*$, we calculated the benefit of an alarm at time $t$ and the number of false alarms as in [8,26]. We visualized the performance by plotting the recall rate of the total benefit, $b$, against the false alarm rate, $n / \sup_{\epsilon_t} n$, with $\epsilon_t$ varying. Likewise, for VD, $\alpha_{t_i}$ was calculated using the relative volatility between the variances of the buffer and the reservoir by varying the threshold parameter $\beta$. We evaluated all four combinations of the change detectors SEED and SMDL and the metachange detectors MCD-T and VD by calculating the average and standard deviation of the area under the curve (AUC) of the benefit vs. false alarm rate (FAR) curves. The AUC scores were calculated over 50 sequences. The delay parameter was set to $\tau = 5 L_2$. Table 1 shows the average AUC scores. MCD-T with SEED or with SMDL outperforms VD with SEED or with SMDL, which indicates the effectiveness of MCD-T.
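The evaluation above can be sketched as follows. The linearly decaying benefit used here is an assumption: it is one form used in the change detection literature, and the source's exact expression is not reproduced in this text. The function names `benefit` and `auc` are hypothetical, and the AUC is computed with the trapezoidal rule.

```python
def benefit(alarm_time, true_time, tau):
    """Assumed benefit: 1 at zero delay, decaying linearly to 0 at delay tau."""
    delay = alarm_time - true_time
    if 0 <= delay < tau:
        return 1.0 - delay / tau
    return 0.0  # alarms before the true point or later than tau earn nothing

def auc(false_alarm_rates, benefits):
    """Area under the benefit vs. false alarm rate curve (trapezoidal rule)."""
    points = sorted(zip(false_alarm_rates, benefits))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points[:-1], points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

half_delay = benefit(alarm_time=105, true_time=100, tau=10)
area = auc([0.0, 0.5, 1.0], [0.0, 1.0, 1.0])
```

Each point of the curve corresponds to one threshold value, so sweeping the threshold from strict to loose traces the trade-off between missed or late detections and false alarms.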
Because MCD-T depends on the discounting parameter $r$ and on the change detection algorithm used, we investigated these effects. First, we examined the dependency of AUC on $r$ for all combinations of $L_1$ and $L_2$. We calculated AUC 30 times for each of $r = 0.01, 0.05, 0.1, 0.2, 0.3, 0.4$, and $0.5$. We used SEED [13] as the change detection algorithm, and its parameters were set to the same values as above. The dataset used was also the same as in the previous experiment. Figure 5 shows that, when $L_1$ is relatively small (e.g., $L_1 = 500$, 1000, 5000, 10,000), AUC is not heavily dependent on $r$. When $L_1$ is larger, however, we observe that the larger $r$ is, the smaller AUC is. This is because, as $L_1$ increases, the number of false alarms of SEED also increases. In such situations, MCD-T is more sensitive to the false alarms when $r$ is larger. Figure 6 shows the dependency of the AUC of VD on the buffer size $B$ and the reservoir size $R$ ($B = R$) for comparison. We calculated AUC 50 times. We observe from Figure 6 that AUC decreases as $B$ increases. In addition, by comparing Figure 5 with Figure 6, we see that MCD-T outperforms VD for various combinations of $r$ and $B$ ($= R$).
Next, we investigated the effect of the change detection algorithm used. We used SEED with the parameter $\hat{\epsilon} = 0.0025$, 0.005, and 0.0075. Other conditions and the dataset were the same as in the previous experiment. Here, $\hat{\epsilon}$ is a hyperparameter that controls the threshold parameter [13]. Figure 7 shows that AUC does not heavily depend on $\hat{\epsilon}$ for all combinations of $L_1$ and $L_2$. In general, the threshold parameter of the change detection algorithm controls the performance of MCD-T. Hence, it should be carefully set.

Synthetic Dataset 2 (Metachanges along State)
We generated a data stream with length $24L$, where $L = 500, 1000, 2000$. The generated data stream contained a metachange along state. In the former part, data were generated by a scheme that was repeated 10 times, which gave a subsequence of length $20L$. The latter part, of length $4L$, was generated by a different scheme, so that a metachange along state occurred at $t = 20L + 1$. For change detection, we employed four algorithms for comparison: (1) SMDL [8], a semi-instant method with the MDL change statistics; (2) ChangeFinder (CF) [1,2,4], a state-of-the-art method of abrupt change detection; (3) Bayesian online change point detection (BOCPD) [3], a retrospective online change point detection method with a Bayesian scheme; and (4) ADWIN2 [21], an adaptive windowing method. As we assumed a situation where the change and metachange mechanisms do not vary significantly, we chose the best combination of parameters for each change detection algorithm by grid search, as in [8,27]. We generated 10 sequences with the scheme above and calculated the F-scores for each combination of candidate parameters. After optimizing the parameters of each change detection algorithm, we generated 30 data streams with the scheme above and detected change points and the metachange. In the metachange detection, we compared MCD-S with SMDL. We chose SMDL for comparison because it calculates a change score at each time based on changes of parameters with MDL. Hence, the change rate of its scores between change points can be regarded as the degree of metachange along state. Hereafter, we refer to SMDL for metachange detection as SMDL metachange (SMDL-MC) and to its window parameter as $w_{mc}$. We calculated MCAS in Equation (16) for MCD-S and the change rate $|(\Psi_{t_i} - \Psi_{t_{i-1}})/\Psi_{t_{i-1}}|$ for SMDL-MC.
$\Psi_t$ is the change score at time $t$ for a univariate normal distribution [8]. It is computed from $\hat{\sigma}_0$, $\hat{\sigma}_1$, and $\hat{\sigma}_2$, the maximum likelihood estimators of the standard deviation calculated for $y_{t-w_{mc}+1}^{t+w_{mc}}$, $y_{t-w_{mc}+1}^{t-1}$, and $y_{t}^{t+w_{mc}}$, respectively, and from $C_k$, the normalizer of the normalized maximum likelihood code length [20], which involves the hyperparameters $\mu_{\max}$ and $\sigma_{\min}$ and the gamma function $\Gamma$. In this paper, $\mu_{\max} = 2$ and $\sigma_{\min} = 0.005$. The window parameters $h$ of MCD-S and $w_{mc}$ of SMDL-MC were set to $h = w_{mc} = 100$ ($L = 500$), $h = w_{mc} = 200$ ($L = 1000$), and $h = w_{mc} = 400$ ($L = 2000$). In calculating the F-scores, the maximum tolerable delay was set to $\tau = 0.5L$. Table 2 shows the average AUC values of MCD-S and SMDL-MC for the detection of metachanges along state at $t = 20L + 1$. The first and second rows in the header represent change detection and metachange detection algorithms, respectively. The best parameters for each combination of change detection and metachange detection algorithms are $\epsilon = 0.7$, $w = 100$ ($L = 500$); $\epsilon = 0.7$, $w = 200$ ($L = 1000$); and $\epsilon = 0.7$, $w = 400$ ($L = 2000$). Table 2 shows that MCD-S outperforms SMDL-MC overall because MCD-S deals with metachanges along state directly in terms of MCAS, whereas SMDL-MC only quantifies the difference in code lengths between situations where there is a change and where there is no change.
We further investigated the effects of the window size $h$ and of the threshold parameter of the change detection algorithm. We chose SMDL [8] for change detection. Figure 8 shows the dependency of AUC on $h$ and on the threshold parameter $\epsilon$ of SMDL. The interval length was set to $L = 500$, the threshold parameter was set to $\epsilon = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7$, and the window sizes were set to $h = w = 50, 100, 150$, where $w$ is the window parameter of SMDL. Figure 8 (top and bottom) shows the dependency of the AUC of MCD-S on the threshold parameter $\epsilon$ of SMDL and the dependency of the F-score of SMDL on $\epsilon$, respectively.
We observe in Figure 8 (top) that the AUC of MCD-S decreases between $\epsilon = 0.2$ and $0.4$, but, when $\epsilon$ exceeds 0.4, the AUC begins to increase for $h = 50, 100, 150$. This reflects the fact that there are many local maximum points of the change scores of SMDL, leading to false alarms of change points around $\epsilon = 0.2$ to $0.4$. It is noticeable that the F-scores of SMDL decrease for $\epsilon = 0.1$ ($h = 100$) and for $\epsilon = 0.2$ ($h = 150$), but the AUCs of MCD-S do not decrease as much. This is because SMDL detects many false positive change points but still detects the metachange point accurately.
As for the dependency of AUC on the window size $h$, we observe that AUC generally increases as $h$ increases for the same $\epsilon$.

Synthetic Dataset 3 (Metachanges along Both Time and State)
In total, we obtained a data stream with length $100 L_1 + 4 L_2$. A metachange along both time and state occurred at $t = 100 L_1 + L_2 + 1$. We chose the lengths $L_1$ and $L_2$ among 400, 450, and 500.
We detected the metachange in the following three ways: we first detected change points with the same algorithms as in Section 4.2, and then detected the metachanges with MCD-T, MCD-S, and MCD. The parameters of the change detection algorithms were tuned as in Section 4.2. The ranges of parameters were the same as those in Section 4.2, except that, for SMDL, the threshold parameter was chosen among $\epsilon = 0.05, 0.1, 0.15$ for all combinations of $L_1$ and $L_2$. The parameter of MCD-T was selected among $r = 0.1, 0.2, 0.3$, and that of MCD-S among $h = 0.1 L_1, 0.2 L_1$. The window size of SMDL was set to $w = h$, and the maximum tolerable delay was $\tau = L_2$. We chose the weight parameter $\lambda$ in Equation (17) among $\lambda = 0.001, 0.01, 0.1, 1, 5, 10$. For VD, the buffer and reservoir sizes ($B$ and $R$) were selected among 16, 24, 32. All the parameters were selected by grid search so as to maximize the AUC of metachange detection. Table 3 shows the average AUC values: Table 3a-c show the average AUC values with MCD-T, MCD-S, and MCD, respectively. Table 3a shows that MCD combined with SMDL as the change detection algorithm outperforms MCD-S and MCD-T.
Table 3. Average AUC scores of metachange detection on Synthetic Dataset 3. The first and second headers represent change detection and metachange detection algorithms, respectively. Boldface indicates the best performance.
(a) Metachange detection along time.
Table 4 shows the best parameters for each combination of intervals. We observe that the more intense a metachange along time is, the larger $r$ becomes and the smaller $\lambda$ becomes. These results reflect the fact that it is necessary to adapt to recent data, and MCAT increases in such a situation, leading to the decrease of $\lambda$.

Real Dataset: Human Action Recognition Data
We applied MCD to the detection of metachanges in human action recognition data, the HASC-PAC2016 dataset [28] (publicly available at http://hub.hasc.jp/). The data were collected from the Human Activity Sensing Consortium (HASC, http://hasc.jp/). The HASC-PAC2016 dataset contains sequences of acceleration data for three axes, and each sequence is segmented into one of six action labels: "stay", "walk", "jog", "skip", "stair up" (go upstairs), and "stair down" (go downstairs). For this experiment, we aimed to evaluate the effectiveness of our proposed algorithm MCD by using a data stream with ground truth for "changes of action changes" and "changes of intervals of actions". The former corresponds to metachanges along state, and the latter to metachanges along time. We combined the actions into a data stream as follows: first, we repeated "stay" and "walk" alternately 15 times; then "jog" and "skip" 15 times; and, finally, "stair up" and "stair down" 15 times. We repeated each pair of actions 15 times because "stair up" and "stair down" have only 15 files each, the fewest among the six actions. We obtained a data stream of length 89,324. Table 5 shows the files used for a participant named Person06023. We read the files sequentially in alphabetical order for each action. Figure 9 shows the data stream we obtained. Here, acc_X, acc_Y, and acc_Z represent accelerations for the x-, y-, and z-axes, respectively.
First, we detected change points with SMDL [8]. It was a challenge to determine the hyperparameters of SMDL (window size $w$ and threshold parameter $\epsilon$) in an online change detection setting. We tuned $w$ and $\epsilon$ with the remaining data for Person06023, in which "stay" and "walk" alternated four times, and "jog" and "skip" likewise. Although this dataset lacked "stair up" and "stair down", we considered it sufficient to estimate the best configuration of $w$ and $\epsilon$.
We calculated the F-score as described in Section 4.2 for the change points between different action labels. We selected w = 900 and a threshold of 0.75 from w ∈ {500, 600, 700, 800, 900, 1000} and threshold values in {0, 0.25, 0.5, 0.75, 1}. Figure 10 shows histograms of intervals for each action label. We observe in Figure 10 that most of the intervals are around 960-970 for "jog", "walk", and "skip", whereas, for "stay", "stair up", and "stair down", the intervals are around 1020. Since the typical intervals exceed 900, w = 900 was sufficient to detect the changes. We applied SMDL to the stream and obtained the estimated change scores {Ψ_t} at each time point. We calculated Ψ_t with the multivariate normal distribution, as given in Equation (21). Note that C_w in Equation (21) is the normalizer of the NML code length [29,30] (Equation (22)), where m is the dimension of the data stream, Γ is the gamma function, and Γ_m is the multivariate gamma function. We set µ_max = 50 and σ_min = 0.005. Next, we defined the ground truths for metachanges along state at the two time points where changes of action label changes occurred: t = 29,752, from "jog" to "stair up", and t = 59,588, from "walk" to "skip". Moreover, we defined the ground truths for metachanges along time at the time points where changes of intervals occurred. We see in Figure 10 that the interval distributions differ significantly across four types of "changes of action changes": from "stay" to "jog", from "jog" to "stair up", from "stair up" to "walk", and from "skip" to "stair down".
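The multivariate gamma function Γ_m that enters the NML normalizer has a standard closed form, Γ_m(x) = π^{m(m−1)/4} ∏_{j=1}^{m} Γ(x − (j − 1)/2). Assuming this textbook definition is the one intended here, its logarithm can be evaluated as in the following minimal sketch. The function name log_multigamma is hypothetical and is not part of SMDL or MCD.

```python
import math


def log_multigamma(x: float, m: int) -> float:
    """Return log Gamma_m(x) for the standard multivariate gamma function.

    Assumes the textbook definition
        Gamma_m(x) = pi^{m(m-1)/4} * prod_{j=1..m} Gamma(x - (j - 1) / 2),
    which requires x > (m - 1) / 2.
    """
    if m < 1:
        raise ValueError("m must be a positive integer")
    if x <= (m - 1) / 2:
        raise ValueError("x must exceed (m - 1) / 2")
    # Power-of-pi prefactor, taken in log space.
    log_pi_term = m * (m - 1) / 4 * math.log(math.pi)
    # Sum of log-gamma terms; lgamma avoids overflow for large arguments.
    log_gamma_terms = sum(math.lgamma(x - (j - 1) / 2) for j in range(1, m + 1))
    return log_pi_term + log_gamma_terms
```

Working in log space matters in practice: with window sizes in the hundreds, the gamma function itself overflows double precision, whereas its logarithm remains well behaved.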
We detected metachanges along time with MCD-T and the volatility detector (VD) [13], and compared the two. Figure 11 shows the estimated MCAT with MCD-T and the relative volatility with VD. The parameter of MCD-T was set to r = 0.1, 0.2, and 0.3, whereas the buffer size B and the reservoir size R of VD were set to B = R = 10, 15, and 20.
We observe in Figure 11 that MCAT detects the metachanges along time between the four action pairs for each of r = 0.1, 0.2, and 0.3. However, the relative volatility fails to detect some of these metachanges along time.

We detected metachanges along state with MCD-S and the change rate of the MDL change statistics [8]. Figure 12 shows the estimated MCAS with MCD-S and the MDL change statistics. We observe in Figure 12 that both MCD-S and the MDL change statistics detect a time point around t = 29,752, from "jog" to "stair up". However, the MDL change statistics do not change significantly around t = 59,588, where a metachange along state occurred from "walk" to "skip". This indicates that the change rate of the MDL change statistics failed to detect the metachange along state around t = 59,588, whereas MCD-S detected it successfully. In summary, the proposed algorithm MCD detected metachanges along both time and state more accurately than the other methods.

Real Dataset: Production Condition Data
We applied MCD to the detection of metachanges in production condition data. The data were collected from a factory of a manufacturing company. Each datum comprised eight attributes, and the length of the stream was 26,450. The factory reported that important events occurred 10 times during the study period, at t = 668, 2634, 2635, 9663, 13,230, 13,231, 17,372, 17,832, 20,131, and 25,441. Figure 13 shows the attributes of the stream. The dashed lines indicate the time points at which important events occurred. We investigated whether the detected metachanges were signs of important events and concluded that they might be. The details are as follows.

Figure 13 shows that the scales of the attributes differed. Hence, we normalized each attribute X to (X − µ)/σ, where µ and σ are the sample mean and standard deviation, respectively, calculated from the first 250 time points. First, we applied SMDL [8] to the stream and obtained the estimated change scores {Ψ_t} at each time point. We calculated Ψ_t with the multivariate normal distribution in Equation (21). The window sizes w of SMDL and h of MCD were set to w = h = 250, based on field knowledge that this length roughly represents one unit of production. Moreover, µ_max and σ_min in Equation (22) were set to 60 and 0.001, respectively. Next, we detected change points t_1, t_2, ... as the time points at which the change score Ψ_{t_i} was locally maximal within an interval where Ψ_t exceeded the threshold. We set the threshold to 0.3 so that the number of detected change points was less than 0.5% of the total length. This reflects a business requirement of the factory that there should not be too many alarms. The number of detected change points was 97 (0.37%). Finally, we determined the discounting parameter r and the weight parameter λ of MCD in Equation (17) with the first 5000 time points. We selected r = 0.1 and λ = 0.2 so that the AUC score for the events at t = 2634 and t = 2635 would be maximized. The AUC score was calculated using Equations (18) and (19).
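The preprocessing and change-point extraction steps described above can be illustrated with a small sketch. It rests on two assumptions that the text does not settle: the standard deviation is taken as the n − 1 sample estimate, and "locally maximal within an interval" is read as one change point per maximal run of scores above the threshold. The function names normalize_with_prefix and pick_change_points are hypothetical.

```python
from statistics import fmean, stdev


def normalize_with_prefix(values, n_prefix=250):
    """Standardize one attribute using statistics of its first n_prefix points."""
    prefix = values[:n_prefix]
    mu = fmean(prefix)
    sigma = stdev(prefix)  # assumption: n - 1 sample standard deviation
    if sigma == 0:
        raise ValueError("prefix has zero variance")
    return [(x - mu) / sigma for x in values]


def pick_change_points(scores, threshold):
    """Return one index per maximal run with score > threshold (the run's argmax).

    Assumption: each contiguous run above the threshold yields exactly one
    change point, located at the first position of the run's maximum.
    """
    change_points = []
    best = None  # index of the best score in the current run, if any
    for t, s in enumerate(scores):
        if s > threshold:
            if best is None or s > scores[best]:
                best = t
        elif best is not None:
            # The run has ended; record its maximum.
            change_points.append(best)
            best = None
    if best is not None:
        # The stream ended while still above the threshold.
        change_points.append(best)
    return change_points
```

With this reading, raising the threshold can only shrink or split the runs above it, which is one way to keep the alarm rate below a target such as 0.5% of the stream length.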
Figure 14 shows the MDL change statistics {Ψ_t} calculated with SMDL [8] (Figure 14, top), the estimated MCAT a_{t_i} (Figure 14, second), the logarithm of the estimated MCAS, log_10 b_{t_i} (Figure 14, third), and the logarithm of the estimated MCI, log_10 s_{t_i} (Figure 14, fourth). We also estimated the relative volatility with VD [13,25] (Figure 14, fifth) and the change rate of the MDL change statistics, |(Ψ_{t_i} − Ψ_{t_{i−1}})/Ψ_{t_{i−1}}| (Figure 14, bottom), for comparison in detecting metachanges along both time and state. For VD, the buffer size B and the reservoir size R were both set to 10. In Figure 14 (top), the red points indicate the detected change points.
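The baseline change rate of the MDL change statistics, |(Ψ_{t_i} − Ψ_{t_{i−1}})/Ψ_{t_{i−1}}|, is simple enough to state in code. The following is a minimal sketch in which the function name change_rate is hypothetical and the input is assumed to be the list of change scores at the detected change points, in time order.

```python
def change_rate(scores_at_change_points):
    """Return |(psi_i - psi_{i-1}) / psi_{i-1}| for each consecutive pair."""
    rates = []
    pairs = zip(scores_at_change_points, scores_at_change_points[1:])
    for prev, curr in pairs:
        if prev == 0:
            # The ratio is undefined when the previous score is zero.
            raise ValueError("previous score is zero; change rate is undefined")
        rates.append(abs((curr - prev) / prev))
    return rates
```

For example, scores of 0.5, 1.0, and 0.75 at three consecutive change points give change rates of 1.0 and 0.25.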
We summarize what can be seen for the metachange statistics in Figure 14 as follows:
• t = 9663: The trend of MCI increases roughly after t = 5000, which can be interpreted as a combination of MCAT and MCAS in Figure 14. The relative volatility and the change rate of the MDL change statistics do not show such a significant sign.
• t = 13,230, 13,231, 17,372, 17,832: For time points between t = 10,000 and t = 15,000, the trend of MCI increases. This is also due to the combination of MCAT and MCAS, but it is more influenced by MCAS. It might be a sign of the important events at t = 17,372 and t = 17,832 as well as those at t = 13,230 and t = 13,231. The relative volatility increases after t = 13,231, which might be a sign of the important event at t = 17,372. However, the change rate of the MDL change statistics does not show such a significant sign.
• t = 25,440: For time points between t = 20,000 and t = 25,000, the trend of MCI increases with large fluctuations. This is also more influenced by MCAS. It might be a sign of the important event at t = 25,440. The relative volatility increases over these time points, but the change rate of the MDL change statistics does not show such a significant sign.
In summary, we can observe a sign of metachange for each important event. We therefore infer that there might have been some symptoms that should be analyzed using field knowledge.

Conclusions
We proposed the concept of metachanges along time and state in data streams and introduced metachange statistics to quantify them from a unified view based on MDL. The key idea of the proposed method is to encode both the time intervals and the changes of state as code lengths in the same fashion. We then introduced the novel methodology of MCD. Using synthetic datasets, we empirically demonstrated that the proposed algorithm was highly effective in detecting metachanges along time and state. Using real datasets, we demonstrated that the proposed algorithm could detect metachanges along both time and state, some of which were overlooked by VD [13] and the MDL change statistics [8]. The estimated metachange statistics might have been a sign of important events.
Future work will be directed toward the theoretical guarantee of metachange statistics, especially integrated metachange statistics. We will also consider how to adapt to a non-stationary data stream by updating the weight parameter λ in Equation (17). Other research directions might lie in the extension of metachange statistics to transient periods between change points. Furthermore, metachange detection of model structure change and its change sign is another interesting line of research.