Introduction to Extreme Seeking Entropy

Recently, the concept of evaluating an unusually large learning effort of an adaptive system to detect novelties in the observed data was introduced. The present paper introduces a new measure of the learning effort of an adaptive system. Like its predecessors, the proposed method evaluates the updates of the adaptable parameters of the system. Instead of a multi-scale approach, however, the generalized Pareto distribution is employed to estimate the probability of unusual updates and thus to detect novelties. This measure was successfully tested in various scenarios with (i) synthetic data and (ii) real time series datasets, using multiple adaptive filters and learning algorithms. The results of these experiments are presented.


Introduction
Novelty detection (ND) plays an important role in signal processing. Many research groups have dealt with both methods and applications, because there are many complex tasks where accurate ND is needed. However, the success of these methods depends on the type of data, so current methods usually give good performance only for specific datasets. As ever more data are being analyzed, the need for new ND methods grows. Furthermore, increasing computational power makes approaches feasible that were out of reach a few decades ago but can now easily run in real time. For these reasons, we consider the topic of ND to be vital.
Two different approaches have been established over the last few decades. The first approach is based on the statistical features of the data [1], and some methods also use extreme value theory to estimate the novelty of the data [2][3][4][5]. The second approach uses learning systems [6][7][8]: the attributes of a learning system are used to obtain information about novelties in the data. Over the last decade, many new methods have been proposed in the field of machine learning [9]. The set membership algorithm [10][11][12] uses the prediction error to achieve better accuracy, reducing the computational resources required and assuring greater robustness with a properly chosen filter, especially for data without drift. Bukovsky et al. showed that the learning effort of a learning system can be used to estimate a measure of novelty for each data point [13,14], but a shortcoming of that method is that the ND score is hard to interpret. A similar approach, combining the prediction error with the adaptive weight increments, was proposed in [15]. That method also lacks a meaningful interpretation of the ND score. It has also been shown that the accuracy of the learning system is not necessarily correlated with the accuracy of the ND [16] and that simple predictors are useful even for signals produced by complex systems (e.g., EEG, ECG).
ND brings a new point of view to complex signal analysis. Research groups have started dealing with the early diagnosis of different diseases where ND plays an important role. Taoum et al. presented ND and data fusion methods to identify acute respiratory problems [17]. Rad introduced ND for gait and movement monitoring to diagnose Parkinson's disease and autism spectrum disorders [18]. Burlina used ND algorithms in the diagnosis of different muscle diseases [19].
Other fields where ND can be found are information and mechanical engineering. Hu introduced ND as an appropriate tool for monitoring the health of mechanical systems, where it is usually impossible to know every potential fault [20]. Surace described the application of ND to the simulation of an offshore steel platform [21].
In this article, a new method for ND is introduced. The proposed method combines both a statistics based approach and a learning systems based approach. The changes of the adaptive parameters of the learning system obtained via an incremental learning algorithm are evaluated. A new measure, called extreme seeking entropy, is then estimated. It is shown that the proposed measure corresponds to different types of novelties in various datasets and how it may be useful for diagnostics and failure detection tasks. It also outperforms the other unsupervised adaptive ND methods.
This paper is organized as follows. Section 2 describes the specifications of the learning system and learning algorithm used during the experiments. Section 3 recalls the learning entropy algorithm and an error and learning based novelty detection method; then, the generally suitable properties of learning based information are discussed. Section 4 introduces the new measure of novelty, and the ND algorithm based on this measure is presented. Section 5 describes a case study in which both synthetic and real datasets are used to show the usability of the proposed algorithm, and it also contains the rationale behind the selection of the experiments. Section 6 contains the detection rate of the proposed algorithm in two cases, namely the detection of a change in trend and the detection of a step change of a signal generator. The last two sections are dedicated to limitations and further challenges (Section 7) and our conclusions (Section 8).

Review of the Learning Systems Used
All the supervised learning systems used in the experimental analysis are introduced in this section. In general, assume that the output of the learning system is a function of the weights and the input data:

y = f(x, w),

where y ∈ R denotes the output, w ∈ R^n is the vector of its adaptable parameters, x ∈ R^n is a vector that contains the input data, and f is the mapping function that maps the input data and weights to the output. The adaptation is done in order to minimize the error

e(k) = d(k) − y(k),

where k is a discrete time index and d(k) ∈ R is the target of the supervised learning system (the desired output). The update of the weights w is done with every new sample as follows:

w(k + 1) = w(k) + ∆w(k),

where ∆w ∈ R^n is a vector that contains the updates of the adaptive parameters. This update depends on the learning algorithm used; the learning algorithms are discussed later.

Adaptive Models
The adaptive models used during the experiments are described briefly in this section.

Linear Adaptive Filter
One of the simplest adaptive models is the linear adaptive filter, also known as the linear neural unit (LNU), with finite impulse response (FIR). The output of this model at a discrete time index k is given by

y(k) = w^T(k) x(k) = Σ_{i=1}^{n} w_i(k) x_i(k),

where w(k) = [w_1(k), w_2(k), ..., w_n(k)]^T ∈ R^n is the vector of adaptive weights and x(k) = [x_1(k), x_2(k), ..., x_n(k)]^T ∈ R^n is the input vector. The vector of adaptive weights is updated with every new sample obtained, and the size of the update depends on the learning algorithm used. In general, x may contain the history of a single input or even the histories of multiple inputs.
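As a minimal sketch (not the authors' code), the LNU output is simply a dot product of the weight vector with a window of the most recent samples:

```python
import numpy as np

def lnu_output(w, history):
    """FIR/LNU output: y(k) = w^T x(k), where x(k) holds the last n samples."""
    x = np.asarray(history[-len(w):], dtype=float)  # most recent n samples
    return float(np.dot(w, x))
```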

An Adaptive Filter Based on Higher Order Neural Units
The quadratic neural unit (QNU) [22][23][24] (also known as a second order neural unit) is a non-linear predictive model. The output of the QNU is

y(k) = Σ_{i=0}^{n} Σ_{j=i}^{n} w_{ij}(k) x_i(k) x_j(k),

where often x_0 = 1. This is equivalent to

y(k) = w(k) · colx(k),

where the column input vector colx for n inputs has the general form

colx = [x_0 x_0, x_0 x_1, ..., x_0 x_n, x_1 x_1, x_1 x_2, ..., x_n x_n]^T,

and w is a row vector of adaptive weights that has the same length as colx. Note that the first term in colx, with x_0 = 1, should be used when the data have a non-zero offset.
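The construction of colx can be sketched as follows; the helper `qnu_colx` is illustrative, assuming the usual upper-triangular enumeration of the products x_i x_j:

```python
import numpy as np

def qnu_colx(x, include_bias=True):
    """Build the quadratic input vector colx: with the bias term x0 = 1, it
    contains 1, all linear terms x_i, and all products x_i * x_j for i <= j."""
    x = np.asarray(x, dtype=float)
    if include_bias:
        x = np.concatenate(([1.0], x))
    iu = np.triu_indices(len(x))   # upper triangle: each pair (i, j) with i <= j
    return np.outer(x, x)[iu]
```

For n = 4 inputs with the bias term, this yields (4 + 1)(4 + 2)/2 = 15 terms, which matches the 15 adaptive weights of the QNU used later in the Mackey-Glass experiment; the QNU output is then the dot product of w with colx.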

Learning Algorithms
To demonstrate the generality of the adaptive weight evaluation approach for novelty detection, two different learning algorithms were tested. Both are widely used in signal processing and machine learning.

Normalized Least Mean Squares Algorithm
The normalized least mean squares (NLMS) algorithm [25] is a variant of the least mean squares algorithm: the problem of selecting the learning rate is solved by normalizing by the power of the input. It is a stochastic gradient approach. The update of this adaptive algorithm is given by

∆w(k) = (µ / (ε + x^T(k) x(k))) e(k) x(k),

where ε ∈ R is a small positive constant used to avoid division by zero, µ ∈ R is the learning rate, and e ∈ R is the error defined as in (2). Because of the normalization of the learning rate shown in (9), it is necessary to choose a learning rate µ satisfying 0 ≤ µ ≤ 2 to preserve the stability of the NLMS algorithm.
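A one-step sketch of the NLMS update described above (the function name and defaults are illustrative):

```python
import numpy as np

def nlms_step(w, x, d, mu=1.0, eps=1e-8):
    """One NLMS update: dw = mu * e * x / (eps + x^T x)."""
    e = d - np.dot(w, x)                     # prediction error, Eq. (2)
    dw = mu * e * x / (eps + np.dot(x, x))   # normalized gradient step
    return w + dw, dw, e
```

With µ = 1, a single step drives the error on the current sample almost to zero, which gives some intuition for the stability range 0 ≤ µ ≤ 2.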

Generalized Normalized Gradient Descent
The generalized normalized gradient descent (GNGD) [26] algorithm is another algorithm for linear adaptive FIR filters. Thanks to its adaptation of the learning rate based on the signal dynamics, it converges in cases where the NLMS algorithm diverges. The update of this adaptive algorithm is given by

∆w(k) = η(k) e(k) x(k),

with

η(k) = µ / (x^T(k) x(k) + ε(k)),
ε(k + 1) = ε(k) − ρ µ (e(k) e(k − 1) x^T(k) x(k − 1)) / (x^T(k − 1) x(k − 1) + ε(k − 1))^2,

where η ∈ R is the adaptive learning rate, ε ∈ R is a compensation term, and ρ is the step size adaptation parameter, which should be chosen so as to satisfy 0 ≤ ρ ≤ 1.
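A minimal GNGD sketch following Mandic's formulation; the exact recursion for the compensation term and the safety floor are assumptions made for illustration, not the paper's code:

```python
import numpy as np

class GNGD:
    """Minimal GNGD sketch: an NLMS-like step whose compensation term eps is
    adapted from consecutive errors and inputs."""
    def __init__(self, n, mu=0.5, ro=0.01, eps=1.0):
        self.w = np.zeros(n)
        self.mu, self.ro, self.eps = mu, ro, eps
        self.prev_e, self.prev_x = 0.0, np.zeros(n)

    def step(self, x, d):
        x = np.asarray(x, dtype=float)
        e = d - np.dot(self.w, x)                  # prediction error
        eta = self.mu / (np.dot(x, x) + self.eps)  # adaptive learning rate
        self.w = self.w + eta * e * x              # weight update
        # adapt eps from consecutive errors and inputs
        denom = (np.dot(self.prev_x, self.prev_x) + self.eps) ** 2
        self.eps -= self.ro * self.mu * e * self.prev_e * np.dot(x, self.prev_x) / denom
        self.eps = max(self.eps, 1e-6)             # keep eps positive (extra safety)
        self.prev_e, self.prev_x = e, x
        return e
```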

On the Evaluation of the Increments in the Adaptive Weights in Order to Estimate the Novelty in the Data
This section recalls two ND methods that evaluate the increments in the adaptive weights, namely learning entropy, and error and learning based novelty detection. Those methods are compared with the proposed algorithm in Sections 4 and 6. Then, the general properties of the learning based information measure will be discussed.

Learning Entropy: A Direct Algorithm
The recent publication on Learning Entropy [14] specifies a direct algorithm to estimate the learning entropy (LE) as follows.
Here, z is a special Z-score, given as

z_i(k) = (|∆w_i(k)| − mean(|∆w_i^M(k − 1)|)) / σ(|∆w_i^M(k − 1)|),

where mean(|∆w_i^M(k − 1)|) is the mean of the last M increments of w_i, σ(|∆w_i^M(k − 1)|) is their standard deviation, and n_w is the number of adaptive weights. According to Equation (15), the function f in this case corresponds to the special Z-score function z, and the function A is represented by the sum over the adaptive weights.
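The direct LE evaluation can be sketched as follows, assuming the sum over weights as the aggregation (the normalization used in [14] may differ):

```python
import numpy as np

def learning_entropy(dw_window, dw_new):
    """Direct-LE sketch: Z-score of each new |dw_i| against the mean and std
    of its last M increments, summed over the n_w weights."""
    hist = np.abs(np.asarray(dw_window, dtype=float))  # shape (M, n_w)
    mu = hist.mean(axis=0)
    sd = hist.std(axis=0) + 1e-12                      # guard for zero deviation
    z = (np.abs(np.asarray(dw_new)) - mu) / sd         # the special Z-score
    return float(z.sum())
```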

Error and Learning Based Novelty Detection
Another recently published method that evaluates the increments of the adaptive weights together with the prediction error is ELBND [15]. ELBND assigns every sample the value

ELBND(k) = max_i |∆w_i(k) e(k)|,

or, alternatively,

ELBND(k) = Σ_i |∆w_i(k) e(k)|.

In this case, the function f is represented by multiplying the ith adaptive weight increment ∆w_i by the prediction error e. The function A is the maximum over the vector in the case of ELBND given by Equation (13) and the sum over the weights in the case of ELBND given by Equation (14).
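A sketch of the two ELBND variants; taking absolute values is assumed here:

```python
import numpy as np

def elbnd(dw, e, agg="max"):
    """ELBND sketch: each adaptive weight increment scaled by the prediction
    error, aggregated with max (Eq. 13) or a sum over weights (Eq. 14)."""
    v = np.abs(np.asarray(dw, dtype=float) * e)
    return float(v.max() if agg == "max" else v.sum())
```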

General Properties of a Suitable Learning Based Information Measure
Learning entropy was proposed in [13,14]. It is a learning based information measure L that, in general, evaluates unusually large learning increments as

L(k) = A( f(∆w(k)) ),

where A is a general aggregation function and f is a function that quantifies the irregularity in the learning effort [14].
The present paper proposes another form for f and A. Consider the function f first. Its value should be high when the increments ∆w are unusually high, and it should also take the history of those increments as input. A cumulative distribution function of each weight increment is therefore a suitable candidate; the particular cumulative distribution function (cdf) is discussed later in this paper. The remaining question is how to deal with the aggregation function A. Under the assumption that each weight is independent of the others, it is possible to choose the aggregation function A as

A = − Σ_{i=1}^{n_w} ln(1 − f_cdf,i),

where f_cdf,i denotes the cdf value of the ith weight increment. The function A in this form is high for high cdf values of the weight updates, and hence for values where the cdf is close to one. The function 1 − f_cdf,i can be viewed as the complementary cumulative distribution function (also known as the survival function or reliability function). This approach clearly avoids the need for a multi-scale approach, so far fewer parameters are needed for detecting potential novelties. Only the crucial choice of the cdf remains. In the next section, a suitable probability distribution is presented, together with the new novelty detection algorithm.

The Generalized Pareto Distribution
A normal distribution is used in some novelty detection algorithms [27][28][29]. However, the normal distribution cannot always be used, especially when describing the data by a mean and a symmetric range of variation would be misleading [30]. Let us mention the Pickands-Balkema-de Haan theorem [31,32], which states that if we have a sequence X_1, X_2, ... of independent and identically distributed random variables and F_u is their conditional excess distribution function over a threshold u, then for a large class of underlying distributions F,

F_u(x) → GPD(x; µ, σ, ξ) as u → x_F,

where GPD is the generalized Pareto distribution and F_u is defined by

F_u(x) = (F(u + x) − F(u)) / (1 − F(u))

for 0 ≤ x ≤ x_F − u, where x_F is the right endpoint of the underlying unknown distribution F. The probability density function of the GPD takes the form

f(x; µ, σ, ξ) = (1/σ) (1 + ξ (x − µ)/σ)^(−1/ξ − 1),

where, in general, µ ∈ (−∞, +∞) is a location parameter, σ ∈ (0, ∞) is the scale, and ξ ∈ (−∞, ∞) is a shape parameter. The corresponding cumulative distribution function takes the form

F(x; µ, σ, ξ) = 1 − (1 + ξ (x − µ)/σ)^(−1/ξ),

with the limit F(x; µ, σ, 0) = 1 − exp(−(x − µ)/σ) for ξ = 0. The support is x ≥ µ for ξ ≥ 0 and µ ≤ x ≤ µ − σ/ξ for ξ < 0. Figure 1 shows the ability of the GPD to model many possible shapes of distribution tails. Note that if ξ = −1, the GPD is equivalent to the uniform distribution; if ξ = 0, it is equivalent to the exponential distribution; if ξ = −0.5, it is the triangular distribution; if −0.5 < ξ < 0, it models a light tailed distribution (e.g., the normal distribution or the Gumbel distribution); if ξ > 0, it models a heavy tailed distribution (e.g., the Pareto distribution, the log-normal distribution, or Student's t-distribution); and if ξ < −1, it is a monotonically increasing distribution with compact support (e.g., the beta distribution). Since the distribution of the increments of the adaptive weights is unknown, it is appropriate to use the GPD due to its universality in modeling the tails of other distributions [33][34][35]. As the aim is to evaluate unusually high increments of an adaptive system, the need for a threshold arises; denote this threshold by z.
This threshold should divide the weight increments into two sets. An increment lower than the threshold belongs to the set of usual increments, denoted L; an increment greater than or equal to the threshold belongs to the set H. Assume that both sets exist for every adaptable parameter, so for the ith adaptable parameter w_i, a threshold z_i is set so that the weight updates belong to the sets as follows:

|∆w_i(k)| < z_i ⇒ |∆w_i(k)| ∈ L_i,
|∆w_i(k)| ≥ z_i ⇒ |∆w_i(k)| ∈ H_i.

The increments belonging to L_i are unlikely to contain any information about a novelty in the adaptation, so they are not evaluated. The set H_i should contain the weight increments that are drawn from the GPD, provided that the threshold was chosen appropriately. The threshold z_i depends on the chosen peaks over threshold method, which is discussed in the following subsection.
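To illustrate GPD tail modeling with an off-the-shelf tool, the following sketch fits a GPD to the excesses of an exponential sample over a high threshold using `scipy.stats.genpareto` (excesses of an exponential are again exponential, i.e., the fitted shape should be near ξ = 0):

```python
import numpy as np
from scipy.stats import genpareto

# Illustrative sketch (not the paper's code): fit a GPD to threshold excesses.
rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.0, size=5000)

u = np.quantile(sample, 0.9)          # threshold: keep the top 10% as excesses
excesses = sample[sample > u] - u

# scipy parameterizes the GPD as (c, loc, scale), with c playing the role of xi
xi, loc, sigma = genpareto.fit(excesses, floc=0.0)
survival = genpareto.sf(2.0, xi, loc=loc, scale=sigma)   # P(excess > 2)
```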

The Peaks over Threshold Method
The main issue in GPD fitting is the estimation of a suitable threshold z. If the threshold is too high (i.e., only a few points exceed it), then the estimated parameters of the GPD suffer from high variance. If the threshold is too low, then the GPD approximation is not reliable. The proper choice of threshold is therefore crucial for the performance of the ND algorithm. There are many approaches to estimating the threshold [36]. To show the usability of the proposed ND algorithm, multiple rules of thumb [37][38][39] for the choice of the threshold were used. Let l be the number of samples used for the GPD fitting and n_s be the total number of samples available. Note that the l highest adaptive weight increments are used to estimate the GPD parameters. The peaks over threshold (POT) method is crucial for deciding whether |∆w_i(k)| belongs to H_i or to L_i. The results with different techniques for choosing the threshold are presented in Section 5.
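A sketch of the POT split; the two rules of thumb shown (l = √n_s and the top 10%) are common choices from the POT literature and stand in here for the paper's specific Equations (23)-(25), so they are assumptions:

```python
import numpy as np

def pot_split(increments, rule="sqrt"):
    """Split |dw| values by a POT threshold chosen with a rule of thumb
    for the tail size l."""
    a = np.sort(np.abs(np.asarray(increments, dtype=float)))[::-1]  # descending
    n_s = len(a)
    l = int(np.sqrt(n_s)) if rule == "sqrt" else max(1, n_s // 10)
    z = a[l - 1]              # threshold: the l-th largest increment
    excesses = a[:l] - z      # the set H (shifted by z), used for GPD fitting
    return z, excesses
```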

Extreme Seeking Entropy Algorithm
In this subsection, the new novelty measure and the new novelty detection algorithm are presented. We introduce the extreme seeking entropy (ESE) measure, given as

ESE(k) = − Σ_{i=1}^{n_w} ln(p_i(k)),

where

p_i(k) = 1 − F_GPD,i(|∆w_i(k)|) if |∆w_i(k)| ∈ H_i, and p_i(k) = 1 otherwise.

The proposed algorithm evaluates the value of ESE for every newly obtained weight increment. Note that if a weight increment is smaller than the threshold from the POT method, its contribution to the novelty measure ESE is zero. Increments with a small survival probability, which are likely to correspond to a novelty, yield a high value of ESE. To estimate the parameters of the GPD, it is possible to process all available history samples, or only the n_s newest samples, with the POT method. The proposed algorithm is described by the following pseudocode (Algorithm 1).
Algorithm 1 Extreme seeking entropy algorithm.
1: set n_s, and choose the POT method
2: initial estimation of the parameters of the GPD: ξ_i, µ_i, σ_i for each adaptable parameter
3: for each new d(k) do
4:     update the adaptive model to get ∆w(k)
5:     proceed with the POT method
6:     if |∆w_i(k)| ∈ H_i then
7:         update the parameters ξ_i, µ_i, σ_i
8:     end if
9:     compute ESE according to (26)
10: end for

The proposed ND algorithm needs only one parameter to be set, which avoids the need for a multi-scale approach and overcomes the issues arising from setting multiple parameters. The parameter n_s can also cover all available samples, if needed. Furthermore, a proper POT method must be chosen, and this choice depends strongly on the nature of the data. A limitation of the proposed method is the need for an initial estimate of the parameters of the GPD: a priori information about ξ, σ, and µ is needed for each adaptive weight. If there are n_w adaptive weights, then 3 · n_w parameters are needed to start the extreme seeking entropy algorithm. If there is no a priori information about the parameters, at least n_s samples are needed to obtain the first results. Another problem may arise if the type of the underlying unknown distribution F or its parameters vary significantly in time.

The Design of the Experiments
The proposed ESE algorithm was studied in various testing schemes with synthetic data and with one real dataset. For each experiment, we also show the results of the ELBND and LE methods for the sake of comparison. The parameter M that specifies the number of increments for the LE evaluation was set as M = n_s in all experiments. The first experiment was the detection of perturbed data in the Mackey-Glass time series; it was chosen because it allows a comparison with the results published in [13]. The second experiment, with synthetic data, showed the ability of the ESE algorithm to detect a change in the standard deviation of the noise in a random data stream, which can be viewed as a novelty in the data. It was inspired by a problem that arises in hybrid navigation systems that use both GPS and dead-reckoning sensors [40]. The third experiment, involving a step change in the parameters of a signal generator, was analogous to a problem that may arise when evaluating multiple stream random number generators [41], where we may detect and evaluate the probability of changes in the parameters of those generators. The fourth experiment was the detection of the disappearance of noise. This experiment was chosen because neither of the compared methods (LE, ELBND) was able to deal with this problem, where the disappearance of the noise can also be viewed as a novelty in the signal. The fifth experiment was the detection of a change in trend, a common problem in fault detection and diagnosis [42]. The last experiment was performed on a mouse EEG dataset. The aim of this experiment was to show that the proposed ESE algorithm is suitable even for real-world complex phenomena that are characterized by non-linear dynamics [43,44]. This dataset contained the start of an epileptic seizure, and we wanted to show that it is possible to detect this seizure with the proposed ESE algorithm.
All of the experiments were carried out in the programming language Python [45], with the libraries Numpy [46], Scipy [47], and Padasip [48]. The graphs were plotted with the Matplotlib library [49]. The codes with the experiments can be obtained via email from the authors.

Mackey-Glass Time Series Perturbation
The first experiment was the detection of a perturbed sample in a deterministic chaotic time series. The time series data were obtained as the solution of the Mackey-Glass equation [50].
The data sample at discrete time index k = 523 contained the perturbation. The data series and a detail of the perturbation are depicted in Figure 2. The QNU was chosen for the data processing. The number of inputs to the QNU was set to n = 4, and hence the adaptive filter had 15 adaptive weights in all. The parameters were updated with every newly obtained sample by means of the NLMS algorithm. The setting was the same as in [13]. The learning rate during the experiment was set to µ = 1. The POT method was chosen according to (23) with n_s = 300. The details of the adaptive filter and the prediction error are depicted in Figure 3. The results of the ND are shown in Figure 4. Note that the global maximum of the ESE corresponds to the perturbed sample, whereas the global maxima of the ELBND and LE methods correspond to the biggest prediction error, not to the perturbed sample.

Change of the Standard Deviation of the Noise in a Random Data Stream
The detection of a change in the standard deviation of the noise in the obtained data was carried out in the following experiment. Assume there are two inputs x_1(k) and x_2(k) and an output y(k) related to them by a signal generator equation in which v(k) represents a Gaussian noise added to y(k). The Gaussian noise has zero mean and standard deviation 0.1, v ∼ N(0, 0.1). The values of x_1(k) and x_2(k) are drawn from a uniform distribution, so that x_1(k), x_2(k) ∈ [0, 1] for every k. At the discrete time index k = 500, the standard deviation of the noise changes to 0.2, so v ∼ N(0, 0.2). The QNU was chosen for the data processing. The number of inputs to the QNU was set to n = 2, so the inputs are x_1(k) and x_2(k), and the adaptive filter had three adaptive weights in all. The structure of the QNU corresponds to the structure of the data generator described by Equation (31). The parameters were updated with every newly obtained sample using the GNGD algorithm. The learning rate during the experiment was set to µ = 1. The POT method was chosen according to (24) with n_s = 500. The results of the novelty detection and details about the adaptive filters are depicted in Figure 5. The a priori values of the GPD for ESE and for LE were obtained using 500 samples, which are not shown in Figure 5. Note that the global maximum of the ESE corresponded to the change in standard deviation, while the detection by ELBND and LE was delayed.

Step Change in the Parameters of a Signal Generator
The scheme of this experiment was similar to the previous one. Assume there are two inputs x_1(k) and x_2(k) and one output y(k), related by a signal generator equation in which v(k) represents a Gaussian noise added to y(k). The Gaussian noise has zero mean and standard deviation 0.1, v ∼ N(0, 0.1). The values of x_1(k) and x_2(k) are drawn from a uniform distribution, so x_1(k), x_2(k) ∈ [0, 1] for every k. At the discrete time index k = 500, the parameters of the generator equation change. The QNU was chosen for the data processing. The number of inputs to the QNU was set to n = 2, so the inputs are x_1(k) and x_2(k), and the adaptive filter had three adaptive weights in all. Note that the structure of the QNU corresponded to the structure of the signal generator. The parameters were updated with every newly obtained sample using the GNGD algorithm. The learning rate during the experiment was set to µ = 1. The POT method was chosen according to (23) with n_s = 500. The a priori values of the GPD for ESE and for LE were obtained using 500 samples, which are not shown in Figure 6. The results of the novelty detection and details about the adaptive filters are depicted in Figure 6. Note that the ESE successfully detected the change in the parameters of the signal generator. The LE failed to detect this change, and the detection by ELBND was delayed. Furthermore, the value of the peak in ESE was significantly higher than in the ELBND case.

Noise Disappearance
In this experiment, it is shown that a slightly reformulated algorithm can also deal with an immediate decrease of the learning effort. Assume that instead of an unusually high learning effort, we want to focus on an unusually low learning effort. The only change in the proposed algorithm is that the POT method is used to get the l smallest weight updates, and the parameters of the GPD are estimated from those. The scheme of this experiment was similar to the previous one. We assumed there were two inputs x_1(k) and x_2(k) and one output y(k), related by (31). However, in this case, at discrete time index k = 500, the noise was removed, so Equation (31) for k ≥ 500 holds with v(k) = 0. The QNU was chosen for the data processing. The number of inputs to the QNU was set to n = 2, so the inputs are x_1(k) and x_2(k), and the adaptive filter had three adaptive weights in all. The structure of the adaptive filter was chosen to correspond to the structure of the signal generator. The parameters were updated with every newly obtained sample using the GNGD algorithm. The learning rate during the experiment was set to µ = 1. The POT method was chosen according to (23) with n_s = 500. Figure 7 shows that the peak in ESE corresponded to the disappearance of the noise, while the LE and ELBND methods failed to detect it. For ELBND, this result was to be expected, as the values of ELBND are high only when both the prediction error and the adaptive weight increments are high.

Trend Change
The last experiment with artificial data was the detection of a change in trend. Assume there are two inputs x_1(k) and x_2(k) and one output y(k), related by Equation (38), where v(k) represents a Gaussian noise that is added to y(k). The Gaussian noise had zero mean and standard deviation 0.1. At the discrete time index k = 500, there was a change in the trend, so for k ≥ 500, Equation (38) changes to Equation (39). The LNU was chosen for the data processing. The number of inputs to the LNU was set to n = 3, so the adaptive filter had three adaptive weights in all. The structure of the adaptive filter was chosen in accordance with the structure of the signal generator. The parameters were updated with every newly obtained sample by means of the GNGD algorithm. The learning rate during the experiment was set to µ = 1. The POT method was chosen according to (23) with n_s = 500. Figure 8 shows that the peaks in ESE, LE, and ELBND all corresponded to the trend change point. Note that the value of the peak in ESE was significantly higher than in LE and ELBND.

Detection of Epilepsy in Mouse EEG
The last experiment used a mouse EEG signal. Three channels of the EEG data containing a significant seizure were chosen. According to the expert, the seizure started at about k ≈ 1700, as shown in Figure 9, which depicts the z-scores of the EEG data.
The LNU was chosen for the data processing. The number of inputs to the LNU was set to n = 10, so the adaptive filter had 10 adaptive weights in all. The number of inputs and the filter structure were chosen experimentally. The parameters were updated with every newly obtained sample using the NLMS algorithm. The learning rate during the experiment was set to µ = 1. The POT method was chosen according to (25) with n_s = 1000. Figure 10 shows that the peak in ESE approximately corresponded to the beginning of the seizure; the peak was especially significant in channel C3. The positions of the peaks were k = 1735 for channel C3, k = 1698 for channel Pz, and k = 1727 for channel Fp1.

Figure 10. ESE value for the mouse EEG data channels containing a seizure. The peaks approximately correspond to the beginning of the seizure. Note that channel C3 contains a significant peak in ESE compared to the other channels.

Evaluation of the ESE Detection Rate
This section is dedicated to evaluating the detection rate in two different cases. The first case was a step change in the parameters of a signal generator (similar to the experiment described in Section 5.4). The second case was the detection of a change in trend.

Step Change in the Parameters of a Signal Generator: Evaluation of the Detection Rate
Assume there are two inputs x_1(k) and x_2(k), one output y(k), and parameters a_1, a_2, and a_3 of a signal generator equation in which v(k) represents a Gaussian noise added to y(k). The Gaussian noise had zero mean and standard deviation σ. The initial values of a_1, a_2, a_3 were drawn from the uniform distribution U(−1, 1). At discrete time index k = 200, there was a step change in a_1, a_2, and a_3, and their new values were drawn again from U(−1, 1). The structure of the adaptive filter was the same as described in Section 5.4. The parameters were updated with every newly obtained sample using the GNGD algorithm. The POT method was chosen according to (23) with n_s = 1200. The performance of the ESE algorithm was compared with those of LE, ELBND, and plain prediction error evaluation. The a priori values of the GPD for ESE and LE were obtained using 1200 samples with the initial values of the parameters a_1, a_2, a_3. For each experiment, the signal-to-noise ratio (SNR) was evaluated from σ_s, the standard deviation of the output of the system, and σ, the standard deviation of the noise. The evaluation of the detection rate was performed as follows:

1. choose the noise standard deviation σ;
2. for the given noise standard deviation σ, perform 1000 experiments, choosing new parameters a_1, a_2, and a_3 at the beginning of each experiment;
3. count a detection as successful when the global peak in ESE, LE, ELBND, or the prediction error lies between discrete time indexes k = 200 and k = 210, and compute the detection rate;
4. compute the SNR for each experiment according to (43), and compute the average SNR over all experiments for the given noise standard deviation σ.

The evaluation of the detection rate was performed for inputs x_1, x_2 whose values were drawn from the uniform distribution U(−1, 1) and from the normal distribution N(0, 1). The results for the inputs drawn from the uniform distribution are depicted in Figure 11; the corresponding table with results for various SNRs is Table A2 (see Appendix A). The results for the inputs drawn from the normal distribution are depicted in Figure 12; the corresponding table with results for various SNRs is Table A3 (see Appendix A). For SNR > 8 dB, the ESE algorithm outperforms LE, ELBND, and error evaluation in the detection rate, and for SNR > 34 dB, the ESE achieved a 100% detection rate.
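The SNR evaluation and the success criterion of step 3 can be sketched as follows; the exact constant in Equation (43) is assumed to be the standard 20 log10 form:

```python
import numpy as np

def snr_db(sigma_s, sigma):
    """SNR in dB from the signal and noise standard deviations:
    SNR = 20 * log10(sigma_s / sigma)."""
    return 20.0 * np.log10(sigma_s / sigma)

def detection_successful(score, start=200, stop=210):
    """Detection counts as successful when the global peak of the novelty
    score lies between discrete time indexes start and stop (inclusive)."""
    k = int(np.argmax(score))
    return start <= k <= stop
```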

Detection of a Change in Trend: Evaluation of the Detection Rate
Assume there are two inputs x 1 (k) and x 2 (k) and one output y(k), related by Equation (44), where v(k) represents Gaussian noise with zero mean and standard deviation σ that is added to y(k). At discrete time index k = 200, the trend changes, so for k ≥ 200, the output of the system y(k) contains an additional trend term whose slope a is drawn from the uniform distribution U(−0.02, 0.02). The structure of the adaptive filter was the same as in the experiment described in Section 5.6. The parameters were updated with every newly obtained sample using the GNGD algorithm. The POT method was chosen according to (23) with n s = 1200. The performance of the ESE algorithm was compared with LE, ELBND, and plain prediction error evaluation. The a priori values of the GPD for ESE and LE were obtained using 1200 samples where the output of the system was described by Equation (44). For each experiment, the SNR was evaluated according to (43). The evaluation of the detection rate was performed as follows:
1. choose the noise standard deviation σ;
2. for the given noise standard deviation σ, perform 1000 experiments where at k = 200, there is a change in trend;
3. a detection counts as successful when the global peak in the ESE, LE, ELBND, or prediction error occurs between discrete time indices 200 ≤ k ≤ 210; compute the detection rate;
4. compute the SNR for each experiment according to (43), and compute the average SNR over all experiments for the given noise standard deviation σ.

The detection rate was evaluated for inputs x 1 , x 2 whose values were drawn from the uniform distribution U(−1, 1). The results are depicted in Figure 13; the corresponding results for various SNRs are listed in Table A1 (see Appendix A).
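For reference, synthetic data for this trend-change scenario can be generated along the following lines. Since Equation (44) is not reproduced in this section, the linear two-input system y(k) = x1(k) + x2(k) + v(k) is our assumption; only the change point k = 200, the slope distribution U(−0.02, 0.02), the input distribution U(−1, 1), and the Gaussian noise model follow the text.

```python
import numpy as np

def generate_trend_change(n=400, k0=200, sigma=0.1, seed=None):
    """Two uniform inputs, an assumed linear output relation, and a
    linear trend with random slope a ~ U(-0.02, 0.02) added for k >= k0."""
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(-1.0, 1.0, n)
    x2 = rng.uniform(-1.0, 1.0, n)
    a = rng.uniform(-0.02, 0.02)
    k = np.arange(n)
    trend = np.where(k >= k0, a * (k - k0), 0.0)   # trend starts at k0
    y = x1 + x2 + trend + rng.normal(0.0, sigma, n)
    return x1, x2, y
```

Repeating this generator 1000 times per noise level reproduces the shape of the experiment, up to the assumed output relation.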

Limitations and Further Challenges
There is a significant limitation to using the ESE algorithm. As was already mentioned in Section 4, before the first results can be obtained, a priori information about the parameters of the GPD is needed, or a suitably large sample from which to estimate them. This limitation arises from the use of a probability distribution and is common to many statistical approaches to ND. It is the main drawback compared to, e.g., the ELBND method, which produces results immediately. Another limitation of the presented algorithm is the selection of a suitable POT method, since both the estimation of the GPD parameters and the selection of the threshold depend strongly on it. To mitigate this issue, a more sophisticated parameter estimator that handles optimal threshold selection could be implemented (e.g., Zhang's method [51], an estimator based on generalized probability weighted moment equations [52], or a method that combines the method of moments with the likelihood moment [53]), but these are outside the scope of this article. A further challenge is how to combine the ESE of unusually low and unusually high increments, because both can correspond to a novelty in the data. Further work will be oriented toward adaptive filters whose adaptive parameters are non-linearly related to the output, e.g., fuzzy adaptive filters or non-linear adaptive Kalman filters. Furthermore, more learning algorithms should be tested. Another topic, not addressed in this article, is deciding whether a given value of the ESE implies a novelty in the data, which requires a decision threshold. To evaluate the precision of such a classification, the area under the receiver operating characteristic curve [54,55] should be estimated. Due to the scope of this article, this was omitted, but it will be part of further work on the ESE.
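To make the coupling between threshold selection and parameter estimation concrete, a simple method-of-moments fit of the GPD to threshold excesses can be sketched as follows. This is a generic textbook estimator, not the paper's POT rule (23) and not the estimators of [51–53]; it is only valid when the shape parameter is below 1/2, and the fixed empirical threshold is our simplification.

```python
import numpy as np

def gpd_mom_fit(excesses):
    """Method-of-moments estimates of the GPD shape and scale from
    threshold excesses; valid for shape < 1/2."""
    m, v = np.mean(excesses), np.var(excesses)
    shape = 0.5 * (1.0 - m * m / v)
    scale = 0.5 * m * (m * m / v + 1.0)
    return shape, scale

def gpd_sf(x, shape, scale):
    """GPD survival function P(X > x) for x >= 0."""
    if abs(shape) < 1e-12:                      # exponential limit
        return float(np.exp(-x / scale))
    base = 1.0 + shape * x / scale
    return float(max(base, 0.0) ** (-1.0 / shape))
```

The sensitivity of `gpd_sf` to the fitted shape and scale is precisely why a poorly chosen threshold distorts the resulting ESE values.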

Conclusions
This paper introduced a new measure of data novelty, called extreme seeking entropy, together with a detection algorithm based on this measure, and presented an experimental study. The algorithm evaluates unusually high absolute values of the increments of the adaptive system weights. The generalized Pareto distribution is used to model those increments, and we tested whether a low probability of a weight increment corresponds to a novelty in the data. It was also shown that the prediction error need not be correlated with a novelty in the data, so relatively simple, even inaccurate, adaptive models can be used. Five experiments with synthetic data containing novelties and one experiment with a real mouse EEG signal were presented. The proposed novelty detection algorithm was able to detect novelties in both kinds of data (real and synthetic), which suggests that the proposed approach with simple adaptive models is suitable for adaptive novelty detection. The detection rate of the proposed algorithm was evaluated for various SNRs in the scenarios of trend change detection and of a step change in the parameters of a signal generator. The same scenarios were also tested with LE, ELBND, and prediction error evaluation. For higher SNRs, the proposed ESE algorithm outperformed the other tested algorithms in terms of the successful detection rate in both scenarios.

Acknowledgments: Jan Vrba would like to thank Matouš Cejnek for developing PADASIP (the Python Adaptive Signal Processing library) and Ivo Bukovský for helpful discussions about learning entropy and learning systems.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Table A2. Step change detection rates for inputs drawn from the uniform distribution U(−1, 1).

Table A3. Step change detection rates for inputs drawn from the normal distribution N(0, 1).