Defining a Trend for a Time Series Which Makes Use of the Intrinsic Time-Scale Decomposition

We propose criteria that define a trend for time series with inherent multi-scale features. We call this trend the {\it tendency} of a time series. The tendency is defined empirically by a set of criteria and captures the large-scale temporal variability of the original signal as well as the most frequent events in its histogram. Among other properties, the tendency has a variance no larger than that of the original signal; the histogram of the difference between the original signal and the tendency is as symmetric as possible; and with reduced complexity, the tendency captures essential features of the signal. To find the tendency we first use the Intrinsic Time-Scale Decomposition (ITD) of the signal, introduced in 2007 by Frei and Osorio, to produce a set of candidate tendencies. We then apply the criteria to each of the candidates to single out the one that best agrees with them. While the criteria for the tendency are independent of the signal decomposition scheme, it is found that the ITD is a simple and stable methodology, well suited for multi-scale signals. The ITD is a relatively new decomposition and little is known about its outcomes. In this study we take the first steps towards a probabilistic model of the ITD analysis of random time series. This analysis yields details concerning the universality and scaling properties of the components of the decomposition.


Introduction
Finding the trend of a time series is a fundamental analytical task. To varying degrees, the definition of the term "trend" is dependent on the methodology used to compute it. Some trending strategies are optimal and thus very attractive because the optimality criteria provide mathematical constraints with which to interpret the time series trend. Not all optimal trends deliver useful trends. An example of a non-optimal trend is the Hodrick-Prescott Filter (see (Hodrick & Prescott 1997)), which is widely used in econometrics. This paper proposes a new empirically-defined trend for an inherently multi-scale, finite time series.
In an econometric context, the trend is often used to capture the longer time scale structure of the markets, by filtering out high frequency events that might be more relevant to shorter time scale changes. It is used this way in the physical sciences as well. Our interest in this topic was motivated by the problem of trend determination in a geoscience context, where, in addition to the removal of biases, one is often confronted with the necessity of analysing signals with the aim of recovering structural aspects of the signal that can be captured or explained by physical models.
Geoscience problems often involve multi-physics and other sources of complexity which manifest themselves in a time series with a rich variety of time scales. There is no rigorous definition of "multi-scale" signals, but in the physical modeling community the adjective is applied to signals that are the result of the coupling of inherent degrees of freedom (sometimes given by spectral components). Furthermore, it is frequently the case in geoscience that the time series to be analysed is short, of length much shorter than the number of degrees of freedom of the system that generated the series; sometimes too short to be amenable to law-of-large numbers statistics.
The procedure we propose to find the tendency is a two-stage process: we first decompose the signal in a series of time series of progressively lower complexity, and then we apply a set of criteria to these and single out the decomposition mode that best satisfies the criteria. This mode is declared to be the tendency of the signal.
While we employ the Intrinsic Time Decomposition (ITD) of (Frei & Osorio 2007), the strategy could be applied to other algorithms, such as the Emprical Mode Decomposition (EMD), (Huang et al. 1998, Wu & Huang 2009, Wu et al. 2007). In fact, a similar procedure has been proposed in connection with EMD (see (Moghtaderi et al. 2011, Moghtaderi et al. 2013): the EMD modes are calculated, and a time series representing the trend is built according to certain prescriptions.
Most everything that is known to date about the ITD will be reviewed in Section 2; the algorithm that performs the decomposition appears in the Appendix. In Section 2 we summarize results of computer experiments on random signals that suggest certain probabilistic and scaling features in the ITD decomposition. In Section 3 we initiate a mathematical analysis of these numerical results. This analysis might be applicable to the EMD its variants, such as those proposed by (Hou & Shi 2011) and (Oberlin et al. 2012). We also use computer experiments to draw attention to the influence of boundary conditions on the outcomes, focusing only on the ITD. Boundary effects are seldom highlighted in the EMD and ITD papers, but we find that for some signals, the boundary conditions can have a significant effect on the outcomes and thus on the construction of the tendency from the components of the decomposition. This discussion appears in Section 4.
Section 5 introduces the criteria that are used to pick the ITD mode that is declared to be the tendency. The criteria are empirically-based notions of signal information whose implementation is discussed in Section 6. In that section, we illustrate the application of the two-step process for finding the tendency on deterministic and random signals as well as on real geophysical signals. In the latter group, we will feature an analysis of the post-industrial temperature record in Moscow (available from (NASA/GISS 2013)). Rahmstorf and Coumou (see (Rahmstorf & Coumou 2011)) set out to determine whether the extreme Moscow summer temperatures of 2010 were outlier samples of climate or the result of an ever warming Earth. Their analysis of extreme events depends on the proper determination of a sensible long-time trend, or "climate." In Section 7 we summarize the outcomes of the analysis and the outcomes of the tendency calculations, and take the opportunity to compare, in general terms, the ITD tendency and the EMD trend.

The Intrinsic Time-Scale Decomposition
The Intrinsic Time-Scale Decomposition (ITD) is a purely algorithmic, non-lossy iterative decomposition of a time series {Y (i)} N i=1 . At the first stage, the signal is decomposed into a proper rotation R 1 (i), an oscillating mode in which maxima and minima are positive and negative, respectively, and a residual B 1 (i) called baseline .
The baseline B 1 is now decomposed in the same fashion, producing a proper rotation R 2 and a baseline B 2 , and so on. The process stops when the resulting baseline has only two extrema, or is a constant.
If there are D steps altogether, the decomposition has the form Rotations and baselines satisfy the relation Parenthetically, we note that in the algorithm as described in (Frei & Osorio 2007) there is one, and only one, adjustable parameter denoted by α, which has been set to α = 1/2 in ouer study. In general, the rotation signal at the j th level will be "noisier" than the rotation signal at (j+1) th . The proper rotations are not orthogonal; moreover, the decomposition is not linear, in the sense that a decomposition of the sum of time series is not equal to the sum of the decompositions of each of the signals.
Let {τ j k }, k = 1, 2, .., K be the times at which the extrema of B j (i) occur. (In the event that there are several successive data points with the same extremal value, we take τ j k to correspond to the time of the rightmost of these extremal values). The baseline B j+1 (i) is constructed by a piecewise linear formula: in the interval i ∈ (τ j k , τ j k+1 ], between successive extrema, where the knots are B j k := B j (τ k ). The formula that generates the knots is The construction guarantees that the residual function is monotonic between adjacent extrema. Figures illustrating the construction may be found in (Frei & Osorio 2007). One must also decide on a boundary condition at the two ends. The effects of different choices will be discussed in Section 4. We shall interpret the end points as extrema, and take the corresponding baseline knots to be averages of the first and last pair of extrema, B j 1 = B j (1), and B j K j = B j (N): These will be called free boundary conditions. The decomposition ends when j = D, which is when a proper rotation cannot be constructed from this last baseline. Baseline B D (i) will only have two knots: B D k=1,2 , the two end points.
We now state a few important properties of the ITD decomposition, largely following (Frei & Osorio 2007).
(i) The baselines given in (3) can be rewritten as a convex combination; viz., , where s j k (i) ∈ [0, 1], and j = 0, 1, .., D; (ii) The knots, (3), at level j + 1 can also be written as where the overline indicates average of nearest neighbors; (iii) The ITD decomposition is ambiguous with regard to handling the end points of a finite time series, and thus, different end conditions can generate different ITD decompositions. See Section 4; (iv) The baseline extraction step can be thought of as a nonlinear operator L, homogeneous with respect to independent rescaling of the abscissa and also the ordinate: B j+1 = LB j and R j+1 = (1 − L)B j .
(v) The B j and R j are monotonic between successive extrema, since they are obtained, in succession, through linear transformations; (vi) It follows from the above property, that the ℓ 2 -norm of B j is similar (to within a constant) to the ℓ 2 -norm of an approximation of the same signal, built by connecting extrema with piece-wise linear segments; (vii) The extrema of B j are inflection points or extrema of B j+1 ; (viii) Between extrema of B j , B j+1 has the same smoothness as B j ; (ix) At extrema of B j , B j+1 will be continuous and differentiable, but not always twice differentiable; (x) R j+1 (i) will have extrema at the same locations as B j (i).

Random signals
We want to understand some basic features of the ITD before we try to extract the tendency of a realistic signal. Since we use the ITD to strip random noise from a time series, it is appropriate to begin by applying the method to purely random signals, and furthermore, since the ITD extracts the rotation components in order of increasing wavelength, we start with a random series in which every point is a local extremum. As already mentioned above, we study the scaling properties of the wavelengths of the baseline B j , numerically in this section, and analytically in the next.
Our time series has the form The random variables z i are drawn from a normal N (0, σ 2 ). Definition (4) for the baseline at the initial step becomes The corresponding proper rotation is By periodizing and taking N even data points, the ratio of the discrete Fourier transform of this B 1 to Z yieldŝ where ω = 2πν/N, and 0 ≤ ν ≤ N/2, the integer frequency. Similarly, the ratio of the transform of R 1 and Z giveŝ One sees that B 1 and R 1 are obtained by convolving the signal Z with a low-pass, resp. high-pass, filter. If Z were a discrete sinusoid with a highest frequency of N/2, R would be an exact copy of Z, while B would be zero, and there would be no further decompositions. Generally, however, the averaging operator (2.1) will tend to smooth features that appeared in the original signal, and thus, the resulting baseline will generally have a different distribution of extrema than the original signal, see Section 3. Parenthetically we note that the Hodrik-Prescott filter (see (Hodrick & Prescott 1997)), used in econometrics to find the large-scale trend of financial data, produces a trend H with a transfer function whereĤ is the Fourier transform of the filter output, and λ is a free parameter. This filter is a windowed low pass filter, capable of handling data from a non-stationary process, however, it is hard to make sense of its outcome if the time series is not at least I(2) (non-stationary and must be differenced twice to obtain stationarity). Figure 1 illustrates a typical ITD decomposition of a noisy signal; Figures 2, 3, and 4 depict empirical scaling properties that will be studied quantitatively in Section 3. Panel (2a) shows the spectum of the energy of (8). Panels (2b) and (2c) show the normalized enegy spectrum of B and R, respectively. We can see how the energy is shared between baseline and rotation: the ratio B 2 / Z 2 is about 0.37, and the ratio of R 2 / Z 2 is about 1.77. (Subscript 2 denotes the ℓ 2 norm.) Next, we estimate the rate at which the wavelength of the baselines in our allextrema signal increases as the high-frequency components R j are removed. We measure this by computing the ratio of the spacings between extrema from one stage to the next, averaged over an ensemble of decompositions of random all-extrema signals of the same length and statistical distribution. Experiments of this kind were done for the EMD in (Wu & Huang 2004), (Flandrin & Goncalves 2004); because of the very regular scaling  (8), normalized to Z 2 = 1. N = 1024, and z i drawn from N (0, 4). Normalized spectra of resulting B 1 and R 1 are in panels (b) and (c), respectively. Note that the original signal has most of its energy concentrated in the highest frequency, 512.
behavior, the EMD could be interpreted as a filter bank. We found that the ITD has similar scaling universality, and offer a partial analytical explanation in the next section. (a) The log 2 of the mean number of extrema, of the baselines, normalized to N ; (b) log 2 (mean B j 2 /mean Z 2 ). The total number of j levels in the ITD decomposition of random signals is of order log N .
The slopes of the lines in panels (3a) and (3b) and the data in Table 1 show that the spacing of the extrema of the B j increases by a factor 2.6, and the number of extrema drops by a factor .4, as j increases. The energy ratio B j 2 / Z 2 drops by about 0.4 for j = 1, and by about 0.63 for j > 1. As would be expected, the number D of levels required for a full decomposition increases with N. Table 1 summarizes the data for the N = 512 case in Figure 3, up to level j = 6. The trends shown in the table and the figures were very stable to changes in the variance of the original signal changes in the outcomes of the order of tenths of a percent for a range of variances between 1 and 20. We also tested an all-extrema time series with the z i drawn from a uniform distribution, and found that the scaling factors are close to those of the normal case reported above.
A decomposition of a signal that consists of 2 16 normal variates drawn from N (0, 4) (discrete white noise) yields the results portrayed in Figure 4.
There were 13 baselines (approximately log 2 2 16 ). From the slopes of the lines in panels (4a) and (4c) and the corresponding data (not shown) we estimate that the number of extrema again drops by the factor 0.4, the distance between extrema increases Table 1. Analysis of the average of the first six levels of an ITD decomposition of allextrema signals, length N = 512, with z i drawn from N (0, 4). See Figure 3. Average results from 50,000 experiments (with B 0 = Z of the same length and statistical distribution). NE denotes the average number of extrema, normalized to N . DE refers to the average distance between extrema. In the last column, subscript 2 denotes the ℓ 2 norm.   Figure 4. For an N = 2 16 random normally-distributed time series, with variance σ 2 = 4, as a function of j: (a) log 2 of the mean number of extrema, normalized to N ; (b) log 2 of the mean distance between the extrema of the baselines; (c) log 2 (mean B j 2 /mean Z 2 ); (d) log 2 (mean R j 2 /mean Z 2 ).
by the factor 2.55, and the normalized ℓ 2 of the baselines decreases by about 0.61. The analytical model developed in Section 3 yields a value of 0.55. The scaling pattern deteriorates as the baselines and rotations flatten.

Intrinsic Time-scale Decomposition of Random Signals: Universality
In this section, we will attempt to understand the scaling laws from the section 2.1 that were obtained numerically for ITD applied to random signals. We first propose a surrogate model for the baselines of a random ITD signal using the scaling/translation symmetries of the ITD process and intuition gained from numerical experiments. We then validate the surrogate model by comparing predictions of the surrogate model with the ITD of random Gaussian signals. This comparison also suggests ways to improve the surrogate model. Finally, we analyze one step of the ITD process applied to the surrogate baselines, and this analysis helps explain the observed self-similarity of the ITD baselines for random signals, and also provides estimates for the decay rates for the L 2 norm and the number of extrema in the baselines.

Surrogate model for the baselines
Associated with the ITD at level j, we define the set S j = {τ j 1 , τ j 2 , . . . , τ j m j } of cardinality m j = |S j |, the location of the extrema in B j , and the vector b j ∈ R m j , the values of the baseline B j at the extrema. We denote by E the operator that extracts the locations and values of the extrema of an arbitrary time series, so in particular, (4)). The ITD procedure therefore gives a reduced dynamics on the pairs {S j , b j } = E(B j ). The operator E is not one-to-one, and hence not invertible. In order to compare the reduced dynamics on {S j , b j } with the "full" ITD baselines B j , we define a surrogate baselineB j byB j = m j k=1 b j k e j k where e j k is a piecewise linear function (time-series) which is 1 at location S j k and 0 on every other S j ℓ . Then E(B j ) = E(B j ) = (S j , b j ) and since B j andB j are both monotone between their (common) extrema, we expectB j to be a good approximation to B j . The surrogate baseline is a "rough" analog of the IMFs in the EMD method; in that construction, the modes arise from cubic spline interpolations of the maxima and minima in the signal (Huang et al. 1998).
Certain extrema in S j "disappear" in S j+1 , so that the extrema at level j + 1 satisfy S j+1 ⊆ S j . There are two types of processes which decrease the number of extrema. These are illustrated in Fig. 5. In the top panel, neighboring extrema flip their relative positions; in dynamical systems language, this is a saddle-node bifurcation. In the bottom panel an extremum changes type and its two neighbors disappear; this is a pitchfork bifurcation. We ran the ITD process with 2 16 initial points and computed  Figure 5. The dark circles represent extrema, and the light squares are points which go from being extrema at level j to not being extrema at level j + 1.
the frequencies of occurrence of the two bifurcation types. After an initial transient (corresponding to j = 1 and 2) the saddle-node bifurcation occurs with probability γ ≈ 0.58, and the pitchfork bifurcation with probability β ≈ 0.21, and these probabilities are independent of the level j. Also, (1 − γ − β) ≈ 0.21 ≈ β, which corresponds to the probability of no (local) change in the nature of the extremum. This gives the (a priori unexpected) conclusion that every local maximum at level j remains a maximum or becomes a local minimum with roughly equal probabilities β at level j + 1. Further, these probabilities are independent of j.
Our numerical experiments suggest that after a few iterations, usually one or two, the extrema disappear independently of their neighbors. At that stage, the sets S j evolve by an independent random decimation process, so the "lifetime" for any given point x ∈ S 1 as an extremum, i.e, the maximal j such that x ∈ S j , has a geometric distribution with parameter γ (the probability of losing an extremum via the pitchfork bifurcation). The probability that the lifetime equals j is (1 − γ) j−1 γ.
The initial distribution of the inter-extremal spacings is given by the chosen initial conditions. E.g., the distribution is concentrated at l = 1 for the all-extremum signal Z(i) in (8). Evolution by independent random decimation at each extremum implies that each site at level j the inter-extremal spacings l k are a sum of a random number n j k of "initial" separations, where n j k is drawn from a geometric distribution. After an initial transient, n j k ≫ 1, so the law of large numbers will imply that l k ] is the average spacing between extrema in the initial condition. Consequently, l k is approximately geometrically distributed with a j-dependent mean denoted by λ k . This agrees qualitatively with numerical simulations of the ITD with random gaussian initial conditions, as shown in Fig.6. Numerical experiments also suggest that the initial transient is short, typically j = 1 or 2 ITD steps.  The ITD algorithm Eq. (4) which computes the extrema at level j + 1 can be written as where I is the identity matrix and M j is a matrix whose rows sum to zero; when we are only interested in baseline extraction, we omit the symbol S j+1 . (I + M j ) is thus a stochastic matrix, and the entries of M j are determined by S j via (4). In particular, for any vector where q j k ∈ (−1, 1) is given by (See Eq. (7)). The parameter q j k measures the asymmetry in the distances of the knot τ j k from the neighboring extrema. If the vector b is obtained by sampling a smooth function B(x), then M j b can be interpreted as sampling 1 4 B ′′ (x) + 1 2 q j (x)B ′ (x). We can thus interpret (I + M j )b j as the numerical solution at one time step of the forward-time, center-difference approximation to the solution to In what follows, we will be assuming periodic boundary conditions, so there are as many local minima as maxima, and the cardinality of S j is always even. If the underlying time signal is mean zero and stationary, then the expected value of a maximum is the negative of the expected value of a minimum. We now make two approximations to obtain a form for b j k . First, we assume that each maximum or minimum is a random Gaussian perturbation of the expectation. Second, we postulate that the values of the maxima and minima are independent random variables (for a test of this assumption, see below, and Figure 7 where µ j is the mean value of the maxima (or the negative of the minima) in B j , the n k are independent normal variates and α 2 j is the variance of the maxima (or also the minima). The fact that α j only depends on j and not on τ j k is a consequence of the underlying random process being stationary.
We can test this ansatz numerically by computing the auto-correlation R j (l) : Fig. 7a depicts the average over 100 runs of the normalized autocorrelation R(l)/R(0) for lags 0 ≤ l ≤ 31 for the first 6 levels of the ITD (the six curves are superimposed). Note that the autocorrelation is for the signal b j at level j which consists of only the extremal values (the signal sampled at τ j k and then exhibited as a function of k), and not the full baseline B j . In each run, the initial time series has 2 16 i.i.d normal variates. R(l)/R(0) = 1 for l = 0 (zero lag) and otherwise |R(l)/R(0)| ≤ 1. As one would expect, there is a high frequency oscillation in the auto-correlation corresponding to the alternation between maxima and minima. We can remove this oscillation by considering the absolute value of the autocorrelation |R(l)| = (α j ) 2 δ l + (µ j ) 2 . The assumed ansatz for b j k thus predicts that |R(l)|/R(0) should be a constant, less than 1, for all l = 0. Figure 7 (b) shows |R(l)/R(0)| for different levels j. After an initial transient, the normalized correlations collapse on to a single universal curve for j ≥ 3. Further, this universal curve is well described by a single, j independent, constant, except for persistent deviations at l = 1 and l = 2. This implies there is a universal self-similar description of b j k for large j, and there is indeed a short range correlation between the extrema (nearest neighbor l = 1 and next nearest neighbor l = 2). Our assumption, that b j k ≈ µ j (−1) k + α j n k where the n k are independent, can likely be improved by accounting for this correlations between the values of the extrema.

Analysis: Universality and decay rates
Given this approximate description of the signal b j , we can now compute the signal b j+1 and also the surrogate baselineB j , and thus study the evolution of the baselines as a function of the index j of the ITD. Since (I + M j )(−1) k = 0, the mean periodic oscillation between the maxima and the minima is in the null space of the matrix (1 + M j ). Therefore (1 + M j )b j = α j (1 + M j )n k and b j+1 = α j E((I + M j )n), where n = n k is a vector of independent normal variates. This motivates the consideration of the signal E((I + M j )n). If x 1 , x 2 and x 3 are consecutive entries of the vector (I + M j )n, we have where n 1 , n 2 , n 3 , n 4 and n 5 are independent normal variates and q k is defined in (12). For every given realization of q 1 , q 2 and q 3 , the entries x 1 , x 2 and x 3 are jointly Gaussian with mean zero and covariance Σ(q 1 , q 2 , q 3 ) = AA T = 1 16 The conditional joint density of x 1 , x 2 and x 3 is given by To proceed further, we now compute the joint density of q 1 , q 2 and q 3 . For this we need the (as yet unknown) distribution of the inter-extremal separations l k = τ k+1 − τ k .
A typical realization of the sets S j = {τ j 1 , τ j 2 , . . .} is shown in Fig. 6. As we argued earlier, at every level j the inter-extremal separations l k have a geometric distribution, with a parameter that depends on j. If the number of nodes is large, then we can ignore the discrete nature of the underlying sets S j and consider instead the exponential distribution which is the continuous analog of the discrete distribution. The probability density of the inter-extremal separation is then given by p j (l) = λ −1 j exp(−l/λ j ) where λ j is the mean inter-extremal spacing at ITD level j (See figure 6).
The variables q j 1 and q j 2 are defined by ratios of l 1 , l 2 and l 3 in Eq. (12), so their distribution does not depend on the parameter λ defining the mean of the exponential distribution. Alternatively, we are free to pick our unit for length for l 1 , l 2 and l 3 as the mean of the exponential distribution for l k , and this does not affect q k which are nondimensional. Without loss of generality, we can thus assume the mean inter-extremal spacing is 1. The probability P (y, z) = Prob((q 1 > y) and (q 2 < z)) is the probability of the event l 1 > 1+y 1−y l 2 and l 3 > 1−z 1+z l 2 , The joint density of q 1 and q 2 is given by We can also compute the marginal distribution of q 2 by Prob(q 2 > z) = P (−1, z) = 1 + z 2 , showing that q 2 is uniformly distributed in (−1, 1). This yields the conditional density Equation (12) shows that there are no common intervals l k in the definition of q 3 and q 1 , and by translation invariance, the joint distribution of q 2 and q 3 is identical in form to the computed joint distribution of q 1 and q 2 . The joint density of q 1 , q 2 and q 3 is therefore The joint density for (x 1 , x 2 , x 3 , q 1 , q 2 , q 3 ) is the product of the densities in Eqs. (13) and (14). An important observation is that the joint density is independent of the level j of the ITD so we should expect self-similar behavior in the ITD decomposition of a random signal. We will define an (approximate) marginal distribution on x 1 , x 2 and x 3 by positing that this distribution is still jointly Gaussian. We can compute the covariance of this We now obtain the joint density of x 1 , x 2 and x 3 by integration, The probability β that x 2 is a local maximum is the probability of the event x 1 < x 2 and x 2 > x 3 , i.e.
The probability that a given site is a local minimum is also β since by the symmetry of p, thus explaining the observation that an extremum in ITD level j + 1 was equally likely to be a maximum or a minimum independent of its type at level j. The decay rate for the number of extrema is given by which is in approximate agreement with the numerically determined decay rate of 0.4 for the number of extrema is ITD for a random i.i.d Gaussian signal (figure 4). We can also compute the mean and the variance of the distribution of the maxima of (I + M j )n by the conditional expectations From this, we obtain E((I +M j−1 )n) = (S j , b j ) where b j ≈ µ(−1) k +αn ′ = 0.48×(−1) k + 0.55n ′ , where n ′ is a vector of |S j | i.i.d normal variates. IfB j is the piecewise linear interpolating function defined by the extremal values b j k on the set S j = {τ j 1 , τ j 2 , . . . , τ j m j }, then a direct calculation shows that where L = l k is the total number of data points in the time series, and we are assuming that |S j | = m j ≫ 1 so we are justified in replacing the (random) sum by its expected value. The second approximation is replacing the sum by the corresponding integral which is valid for L ≫ 1. Observe that the ℓ 2 norm of the surrogate baselinẽ B j only depends on the two numbers µ and α, and not on the set S j !.
Since E((I + M j−1 )n) = (S j , b j ) and E(α(I + M j )n ′ ) = (S j+1 , b j+1 ), it immediately follows that B j+1 / B j = α = 0.55. This also gives a decay rate for the ℓ 2 norm equal to α ≈ 0.55 in comparison to the numerically obtained figure of 0.61 (See figure 4). This estimate, as well as the decay rate of the number of maxima in (15), both exceed the empirical parameters by about 17%. We saw that extrema disappear through nearest neighbor interactions, Figure 5, and it is plausible that the nearest-neighbor correlations seen in Figure 7 cause extrema to persist longer than is predicted when independence is postulated.
Finally, we observe that µ 2 /(µ 2 + α 2 ) ≈ 0.45 in good agreement with |R(l)|/R(0) for l = 0 in Fig. 7(b). We believe that this type of analysis can be extended to EMD and it is an interesting question whether this will explain the observed self-similar behavior in EMD for random signals (Wu & Huang 2004, Flandrin & Goncalves 2004.

End effects
The results of the ITD decomposition of a signal may depend strongly on the boundary conditions. Consider the time series f (t i ) = 0.5e t i + a cos(10t i ), t i = {0 : 0.01 : 2π}, where a = 10 or a = 50. Variation of the parameter a will cause significant changes in the baselines at the right endpoint. Supposing the knot conditions are free at both The signal is f (t i ) = 0.5e ti + 10 cos(10t i ); the tendency is shown as a thick line. It was computed with free boundary conditions at both ends; for comparison, the first ITD baseline computed using a free knot condition on the left, and a clamped knot condition on the right, is shown as a dashed line. This mixed boundary condition is designed to capture the end behavior. ends, Figure 8a shows the time series (thin), for a = 50, as well as one of the baselines that will become the tendency (thick), described in Section 5 below. Now, we decrease a = 10. We note that the right-most local extremum has moved away significantly from the end of the time interval. The tendency, with both end knots free, is the thick line in Figure 8b. The dashed line, on the other hand, was the result of a decomposition with the left end knot free, but the right one clamped: B j+1 K j = B j (N). The dashed-line tendency in this case is arguably more reasonable.
There are other end effects. We emphasize just one. As stated in the Introduction, the ITD decomposition of a signal of specified length will not be necessarily the same as the same signal with added points to the right, say. (This is also the case in the Empirical Mode Decomposition of (Huang et al. 1998)). Hence, the outcomes of these methods are by no means unambiguous when used for extrapolating or forecasting over a time range exceeding that of the time series.

The Tendency
Most notions of trend for a time series come equipped with a methodology that in itself defines the sense in which it captures a characteristic of the original signal. In order to distinguish our trend from other versions, we refer to our analysis as the process of computing the tendency The process of determining a tendency for a time series amounts to applying a set of criteria that we define a tendency to have to a collection of time series that are related to the original one. The tendency is thus independent of the manner used to obtain the collection of time series. We use the ITD decomposition process to generate this collection of time series. We use the ITD because it is adaptive, fast, robust, and because the application of a mulitscale diffusion process as a filter captures the spirit of the modeling enterprise, wherein one wants to find characteristics of the signal that are prominent and obtain a complement that could be conceivably well captured by a simple stochastic parametrization.
The tendency, in some informal way, should capture some essential elements of a time series: its inherent time scale structure and the most significant part of its histogram; the tendency of a strictly monotonic series is the series itself; and the tendency of a series of constant values is the series itself.
The multiscale structure of a signal can be ascertained qualitatively from the distribution of the locations of its local extrema, and the importance of these local extrema to the total density, estimated by the histogram. This is extracted from projections onto the time axis. The distribution of the data is encoded in a histogram of the projection of the signal on the vertical axis. Our two principal diagnostics extract certain quantitative information from these two aspects of the decomposition.

Horizontal Projection: the Correlation c j
The measure that most critically determines the choice of the baseline to be the tendency is the empirically determined correlation. We have found that the quantity is convenient for graphical depiction of numerical results. We think of the process of ITD iteration as extraction of the noise-like rotation components to reveal the part of the signal that carries inherent information found in the signal. When all this noise has been removed, the next baseline is declared to be the tendency. As can be inferred from the analysis in Section 3, the correlation between the signal and the first rotations should be low, and the correlation between the first baselines and the signal, high. Further iterations break up the meaningful component artificially, and those baselines and rotations should be more correlated. Indeed one can see in Figure 11, which is typical in this respect, that both B j and R j become flatter, and are therefore correlated for a trivial reason. In our experience, there is a noticeable downward jump of the parameter c j at a certain j, and after that, not much of a pattern. The tendency is the a baseline that is still highly correlated with the signal, but one that is not too highly correlated with the rotations. The choice of a suitable baseline for the tendency is made less ambiguous by the symmetry statistic s j described next.

Vertical Projection: the Symmetry Statistic s j
The other determining diagnostic is a measure comparing the symmetry of the baselines to the symmetry of the signal. Define the fluctuation time series of the signal Y with respect to a signal T (which will be the candidate tendency) by F (i) := Y (i)−T (i), i = 1, 2, ..., N. After the ITD decomposition is done, the histogram of the fluctuation is computed for each baseline. Each B j is considered to be a potential tendency. We employ an empirical measure of symmetry, namely, the x-percentiles, P r j x := P r x (F j ), from which we define the symmetry estimator s j for level j to be s j = P r j 75 − 2 P r j 50 − P r j 25 (P r j 75 − P r j 25 ) We pick baseline candidates whose associated fluctuation time series is the most symmetric, s j ∼ 0. In most instances we just compare the absolute values of s j ; however, we retain sign information as it is sometimes useful in further choices between potential tendencies.

Supplementary diagnostics
Symmetry and correlation are the most important properties of our tendency, but we look for confirmation to two other diagnostics (they are included in the examples below). The spread v j : For a given j, the spread v j is the unsigned difference between the standard deviation of the baselines and the rotations. These are normalized to the standard deviation of the signal Y . The standard deviation of the baseline is always decreasing. The spread will reflect certain qualities of the signal and its decomposition. A signal that is random and stationary will have a v j that remains small throughout the j range. This is especially so for a signal that has all scales, in the sense of having a dense and wide spectrum. If the signal is mutiscale, meaning that its spectrum contains several dense spectral ranges separated by gaps, the spread is large and it decreases as j increases. It is typical that the spread reaches zero before j reaches the last baseline index D. The Hellinger Distance: For two probability densities p 1 and p 2 on R, define This is a special case of the Hellinger distance, a measure of the difference between probability measures. Since the tendency should capture the most important, non-random, features of the time series, its Hellinger distance to the pdf of the original signal should be small.

Choosing B j * = T , the baseline that becomes the tendency
In the examples presented in the next section, we follow these steps. 1) By normalization, the correlation parameter is initially equal to 1. Typically, it decreases gradually as j increases, indicating that the correlations between baselines and rotations are small. Often, there is a j * beyond which c j drops significantly, meaning that the rotation and baseline become correlated. See Figure 11, first panel, below. The baseline B j * becomes the candidate tendency.
2) As in Figure 14, the correlation parameter may sometimes decrease gradually. In those cases, we may need to choose two or three candidates B j for the tendency, and use the symmetry criterion, i.e. that s j be close to zero, to single out j * .
3) The choice of a particular baseline as tendency can be further tested by examination of spread and Hellinger distance. Those quantities are shown in the examples below.
The recent papers (Moghtaderi et al. 2011, Moghtaderi et al. 2013) have proposed a definition of trend in terms of the intrinsic mode functions (IMFs) of the EMD. The number of zero crossings of IMFs decreases, on average, by a factor of 1/2 per step. At some point, this pattern breaks down. The trend is now defined to be the sum of all the subsequent IMFs. This criterion is probably related to our requirement of an increase in correlation between rotation and baseline. In both approaches, the oscillatory components extracted by the respective algorithms become spurious modes; they no longer represent noisy fluctuations.
6.1.1. A stochastic process We analyzed a signal consisting of 512 points from a fractional Brownian motion process with Hurst exponent of 0.7. We pretended that it consists of a non-random "carrier" perturbed by noise. This is of course not true; the signal is random, but nonetheless, our criteria combine to pick out the best prospect for a tendency. They are satisfied only approximately, but that is the typical situation.
The tendency appears as the heavy line in Figure 9(a). It was determined to be the baseline B 5 , based on the following observations (see Figure 10 and Figure 9(b)). The correlation parameter c j jumps down at j = 5. The symmetry measure s 5 is not close to zero, but Figure 9(b) shows that the fluctuation time series, Y − B 5 , nonetheless has a quite symmetric empirical pdf, with variance smaller than that of the original signal. Because the correlation jump at c 5 is so pronounced, we wound up choosing B j * = B 5 .
There are eight baselines; it is generally true that the first baseline is close to the original signal, and the last one is flat and uninformative. The first six baselines show significant large-scale structures with small-amplitude small scale structures, superimposed. This is clear in Figure 11.
The spread v j (the difference between the top and bottom curves in Figure 10 (b)) decreases more rapidly from j = 5 on, indicating that the variances of rotation and baseline become more equal; this suggests that they fluctuate on the same scale, and the noise has been removed.
The Hellinger distance of B 5 from the signal is small. The histograms of the signal and the baseline are similar (they are not pictured here).
It is true that the baseline B 5 is more or less the "tendency" one would draw by hand. However, here it is produced algorithmically by the ITD, and chosen from the list of candidates by reasonable quantitative measures.

Deterministic Signals .
Fully deterministic signals with strong multiscale character are particularly problematic for the estimation of trends, when nothing is known about the underlying structure of the signal. Here we consider data that have been carefully engineered to have multi-scale character. An example of a multiscale signal with challenging qualities is Y (i) = 1 1.5 + sin(2πt i ) cos[32πt i + 0.2 cos(64πt i )] + 1 (1.2 + cos(2πt i )) for t i ∈[0:0.0025:1]. This signal was investigated in (Hou & Shi 2011). It was designed to be a test of standard Fourier-based resolution or wavelet-based multi-resolution techniques. The result of the determination of the tendency appears in Figure 12.  Because the ITD algorithm is based on extraction of extrema, even if they are not equally spaced, it is capable of removing the faster oscillations more efficiently than a global spectral method, for example. For this example, the j = 1 baseline is the chosen tendency. We found that the ITD decomposition was sensitively dependent on the sampling rate. We did not pursue this issue further, other than to confirm its existence computationally. tendency (dark). The baseline chosen for the tendency corresponds to j * = 3. According to the correlation criteria, baselines j = 3 and j = 4 are suitable, but the symmetry is higher for j = 3. See Figure 14. The diagnostics indicate that the tendency is suitably close, with regard to the Hellinger distance, to the signal itself. The spread difference suggests that this is an inherently multiscale signal. The tendency choice leads to a fluctuation histogram shown in Figure 13b. Its symmetry and fast decay in the tails lead us to compare the fluctuation to a Gaussian. The fit appears in Figure 13c.  Figure 14. Diagnostics for the tendency, accompanying Figure 13. In lexicographic order: c j , v j baselines (dots), rotations (lines), s j , and the Hellinger distance of the baselines (dots) and the rotations (lines).

Arizona surface temperature anomalies
The annually-averaged temperature in Southwest Arizona for the period 1948-2011 appears in Figure 15a, with tendency in red. Figure 15b shows the corresponding empirical pdfs of the raw data (thin) and tendency fluctuation (heavy). The data can be obtained from the NOAA National Weather Service GISS system. When all 768 monthly records are used for the input time series we obtain the results portrayed in Figure 15c and d. In this case we note that the tendency is consistently below the data mean (dashed), and as a result the empirical pdf of the tendency fluctuation will shift to the right, as compared to the data pdf. The reason is that the tendency is more sensitive to the asymmetry of the input signal: the lower portion of the signal is far less uniform than the highs.

Moscow temperatures.
We now apply our method to the time series of July temperatures in Moscow from 1881 to 2011. The data are found at the NOAA web site and on the homepage of S. Rahmstorf. (Rahmstorf & Coumou 2011) define a nonlinear trend of this series, and conclude that the unusually high Moscow summer temperatures of 2010 were a result of a gradual increase in the global temperature, rather than being an exceptionally large but otherwise normal fluctuation of the weather. Since the data set was relatively short (131 points), their conclusion was also based on expert knowledge of Earth's climate, e.g., of climate scales, on which to base the windowing of a moving average calculation, and an assumption of an underlying Gaussian distribution, about a mean, for the temperature data.
Our goal here is not to focus on the authors' conclusions or methodology. Rather, we are interested in estimating the moving average and testing the Gaussianity using only the intrinsic structure of the time series.
In Figure 16a we show the filtered, or moving, average (which they call "nonlinear trend line") calculated by Rahmstorf and Coumou (light), and the tendency (heavy). We followed the procedure described in their paper. The empirical cumulative distribution functions associated with these data are shown in Figure 16b. The tendency was  calculated using only the 131 data points, without availing ourselves of knowledge about the underlying climate dynamics or statistics of the temperature distribution.

Discussion and Conclusions
With the aim of addressing the challenge of computing trends for multi-scale signals which are not amenable to law-of-large-number arguments we propose a notion of a signal trend which we call the tendency of the time series. This tendency has been designed to agree closely with an intuitive, rather than a statistical, notion of what a trend for a discrete time series could be. It is a time series, of lesser complexity than the original series, that conveys the most salient features of the histogram and the local time development of the series being analyzed. It emphasizes the importance of more frequent time series values and more uncommon extremal value locations.
The ITD process yields a decomposition that respects the inherent multiscale nature of complex signals. In this regard, the ITD and the EMD yield similar decompositions. The tendency is found by then applying a set of criteria that will identify one of the members of the ITD decomposition as a candidate for the tendency. Recently, (Moghtaderi et al. 2011) and (Moghtaderi et al. 2013) proposed criteria to determine a trend for a signal using an EMD decomposition. (Another alternative definition of trend, based upon the EMD decomposition, is found in (Wu et al. 2007)). When the method in (Moghtaderi et al. 2011, Moghtaderi et al. 2013) is applied to the Moscow temperature series, the trend is very similar to ours. The criteria proposed for the trend in connection with the EMD analysis consists of examining the ratio of the energy in the IMF's as well as the ratio of the number of zero crossings. For random signals it has been observed that the energy ratios of consecutive IMF's as well as the ratio of the number zero crossings of consecutive IMF's are very similar. On the other hand, a signal consisting spectrally-uniform random noise over a structured signal with long timescale features will yield a decomposition whose ratios will differ at the IMF level corresponding to when the decomposition method no longer picks out mostly noise. The EMD trend consists of the sum of the remaining IMF's. The tendency will qualitatively agree with the EMD trend for signals of this sort. The tendency and the EMD trend will differ when the underlying process that best describes the time series is a random, intermittent jump process (see (Branicki & Majda 2013)), and for signals that have underlying trends with significant jumps.
Until now, very few analytical properties of the products of the ITD process were known, and those were derived in the original paper (Frei & Osorio 2007). It was established that the decomposition method iteratively produces baselines that are guaranteed to have monotonicity when the signal or the adjacent lower baseline has local monotonicity. From this we can infer that the tendency responds to this by producing a notion of trend that has a high H 1 -like norm (a norm that combines the ℓ 2 -norm of the signal and that of its time-scaled difference values). Moreover, if a signal is globally monotonic the tendency would also be globally monotonic. We feel that this is a very strong characteristic of a raw signal that should be included in some notion of a trend for this signal. In this paper we made some headway in understanding the ITD process. We studied the decomposition of random stationary signals, numerically and analytically. The numerical results suggested existence of a certain scaling universality, and we propose a probabilistic model of the ITD algorithm that exhibits scaling of precisely the type observed experimentally. The scaling coefficients obtained by our method are reasonably close to the computed ones, but refinements of the model are needed. Because the number of extrema in the baselines scales geometrically, the number of baselines generated from a signal of length N is only of order log N, we suspect that a rigorous explanation will require an N → ∞ limit, together with some sort of renormalization. We observe that in a formal continuum limit, still for a random signal, the ITD steps amount to the solution of a diffusion equation; this feature should be exploited. It would be interesting to extend this analysis to the EMD.
The tendency, we believe, can find use in the analysis of data in which one would like to discern structure in a signal from aspects of the signal that might well be described as random noise of high frequency variability, beyond the standard examples from econometrics. This sort of analysis is commonly done in climate variability, where one wants to identify aspects of the signal that can be explained by physical models. Just as other notions of a trend, the tendency requires interpretation. This challenge is presumably one we are willing to accept.