Belief Propagation for Probabilistic Slow Feature Analysis

Slow feature analysis (SFA) is a time-series analysis method for extracting slowly-varying latent features from multi-dimensional data. A recent study proposed a probabilistic framework of SFA using the Bayesian statistical framework. However, the conventional probabilistic framework of SFA can not accurately extract the slow feature in noisy environments since its marginal likelihood function was approximately derived under the assumption that there exists no observation noise. In this paper, we propose a probabilistic framework of SFA with rigorously derived marginal likelihood function. Here, we rigorously derive the marginal likelihood function of the probabilistic framework of SFA by using belief propagation. We show using numerical data that the proposed probabilistic framework of SFA can accurately extract the slow feature and underlying parameters for the latent dynamics simultaneously even under noisy environments.


Introduction
Slow feature analysis (SFA) is a time-series analysis method for extracting slowly varying features from multidimensional data. 1) In recent years, the SFA has attracted much attention in theoretical neuroscience and has been used to establish receptive field models for complex cells in the visual system, place cells in the hippocampus, and grid cells in the entorhinal cortex. [2][3][4] These theoretical studies suggest that the extraction of slowly varying features plays an important role in the information processings in neural systems. Recent machine learning studies have also used the SFA to improve pattern recognition and feature extraction algorithms from multi-dimensional data. [5][6][7][8][9][10][11] A probabilistic version of SFA has been proposed using the framework of the state-space model; 12) a probabilistic perspective has been useful for many machine learning algorithms such as those for principal component analysis 13) and independent component analysis. 14) Probabilistic models have several merits, such as many powerful methods developed for learning and inference and so on. In the conventional probabilistic framework of SFA, 12) the marginal likelihood function is approximately derived by assuming that there exists no observation noise, although observed data would include noise in general. Therefore, it had been unclear whether the slow feature estimated by the conventional probabilistic framework of SFA is accurate for noisy data. From this point of view, a recent study showed that the conventional probabilistic framework of SFA cannot extract slow features accurately under noisy environments. 15) In this study, we propose a probabilistic version of SFA considering the effect of observation noise in order to extract the slow feature and its underlying parameters accurately. Here, we rigorously derive the marginal likelihood function of the probabilistic framework of SFA by using belief propagation. [16][17][18][19] The belief propagation for one-dimensional graphical models is also known as the transfer-matrix method in statistical physics, 16,17,[19][20][21] and it enables the rigorous derivation of the marginal distribution for graphical models with no loops, where we find that the graphical model of the probabilistic framework of SFA has no loop structure. We show using numerical data that our proposed method can accurately extract the slow feature even from noisy timeseries data.

Theory
In this section, we first review the conventional SFAs: the deterministic SFA and the conventional probabilistic framework of SFA. Next, we propose a novel probabilistic framework of SFA with a rigorously derived marginal likelihood function. We rigorously derive the marginal likelihood function by means of belief propagation in order to realize the accurate and robust estimation of the slow feature and parameters even under noisy environments.

Deterministic SFA
The original SFA is a deterministic algorithm for extracting the most slowly varying feature (called the "slow feature") from multi-dimensional time series data. 1) A schematic diagram of the deterministic SFA is shown in Fig. 1(a). For multi-dimensional input time-series data xðtÞ 2 R M , the output of SFA y j ðtÞ ( j ¼ 1; . . . ; N) is obtained using the transformation y j ðtÞ ¼ g j ðxðtÞÞ. This transformation g j ðxÞ is determined to minimize the following expression: where Áðy j Þ is called a Δ-value and hÁi t denotes an average with respect to time. Namely, in deterministic SFA, we perform the transformation g j ðxÞ, which minimizes the averaged square of the derivative of output yðtÞ with respect to time t. Within the outputs of deterministic SFA fy j ðtÞg, the output y j ðtÞ with the minimum Δ-value corresponds to a slow feature. The following three constraint conditions, are used to normalize the output signals, avoid trivial results, and guarantee that different output elements have different information.

Probabilistic SFA
A probabilistic version of SFA has been proposed recently based on Bayesian statistical framework. 12) However, in that version of the SFA, the marginal likelihood function used to estimate parameters of the probabilistic model is approximately derived by assuming no observation noise. 12) A more recent theoretical study has shown that the conventional probabilistic framework of SFA cannot estimate the slow feature accurately under noisy conditions. 15) In this study, we propose a novel probabilistic framework of SFA with rigorous derivation of the marginal likelihood function. We consider the extraction of N-dimensional latent variables y t from M-dimensional observed variables x t for discrete-time steps t ¼ 1; . . . ; T [ Fig. 1(b)]. As shown in Fig. 1(b), observed multi-dimensional time series data are transformed into multiple latent time series with different degrees of slowness.
The probabilistic framework of SFA 12) is based on the state-space model, consisting of a system model and an observation model, and is expressed by the graphical model shown in Fig. 2. As shown by the gray circles in Fig. 2, we assume that the latent variables y t ¼ ½y 1;t ; . . . ; y N;t T depend on those at the preceding time y tÀ1 . The dynamics of latent variables y t are described by the following system model: Here, is a diagonal matrix with elements n (n 2 f1; . . . ; Ng). Each element n determines the dynamics of the nth latent variable y n;t . Namely, n is an indicator of the slowness of each latent variable. Note that the latent variable with the largest value of n corresponds to the slow feature to be extracted in the probabilistic framework of SFA. The vector t describes a system noise obeying white Gaussian noise with average 0 and covariance AE, where AE is a diagonal matrix with elements 2 1:N . Here, the relation between and AE is assumed as 2 n þ 2 n ¼ 1, and each element n takes a value between 0 and 1. As shown in Fig. 1(b), the dynamics of y n;t is slow for large values of n and is fast for small values of n . Note that the probabilistic framework of the SFA proposed by Turner and Sahani 12) is based on a system model with first-order time-differences for each latent dynamics, and an observation model. Hereafter, we express the probabilistic framework of SFA based on the first-order difference as probabilistic SFA for simplicity.
The observed variables x t are assumed to be expressed using the following observation model: Namely, as shown in black circles in Fig. 2, the observed variables x t are assumed to be generated from the latent variables y t , which are converted by an M Â N matrix W À1 under observation noise " t . The observation noise " t is assumed to be white Gaussian noise with average 0 and covariance 2 x I, where I is the identity matrix. From Eqs. (5) and (6), the state space model of probabilistic SFA can be expressed by using probability density functions as follows: pð y t jy tÀ1 ; ; AEÞ The condition n < 1 assumed in the probabilistic version of the SFA would seem to indicate that the latent variables y n;t simply decay to zero with time. However, the framework of the probabilistic SFA is formulated not to suffer from this problem. 12) The mean of the hidden variable hy n;t jy n;tÀ1 i with respect to the conditional distribution given the hidden variable y n;tÀ1 at the preceding time pðy n;t jy n;tÀ1 Þ, x 1 x 2 x T-1 x T y 1,1 y N,1 y N,2 y 1,2 y N,T-1 y N,T y 1,T-1 y 1,T Fig. 2. A graphical structure of the probabilistic SFA. Each observed variable x t at time t depends on a latent variable y t at the same time t, whereas the latent variable y t at time t depends on the latent variable y tÀ1 at the preceding time t À 1. The graphical structure of the probabilistic SFA has a straight structure and no loops. Based on this graphical structure, we employ belief propagation in order to derive the marginal likelihood function rigorously. hy n;t jy n;tÀ1 i ¼ Z y n;t pðy n;t jy n;tÀ1 Þ dy n;t ¼ n y n;tÀ1 ; is smaller than the value at the preceding time y n;tÀ1 . On the other hand, the second moment of the hidden variable hy 2 n;t jy n;tÀ1 i can be larger than that at the preceding time y 2 n;tÀ1 as follows: hy 2 n;t jy n;tÀ1 i ¼ Z y 2 n;t pðy n;t jy n;tÀ1 Þ dy n;t ¼ 2 n y 2 n;tÀ1 þ 2 n ; if y 2 n;tÀ1 < 2 n 1À 2 n . This condition y 2 n;tÀ1 < 2 n 1À 2 n is satisfied when the hidden variable takes a small value (y 2 n;tÀ1 < 1) by setting 2 n þ 2 n ¼ 1. This condition 2 n þ 2 n ¼ 1 is an expression assumed in probabilistic SFA. Thus, the system model [Eqs. (5) and (7)] does not simply show that hidden variables y n;t decay to zero but, as shown by the numerical simulations in the present study (Fig. 1) and in a previous study by Turner and Sahani,12) also describes fluctuating responses including slowly varying features.

Rigorous derivation of marginal likelihood function by belief propagation
To estimate the latent variables y t , we need to estimate parameters in the state space model: ¼ fW À1 ; ; AE; 2 x g. For this purpose, we rigorously derive the marginal likelihood function of probabilistic SFA by means of the belief propagation. [16][17][18]21,22) The marginal likelihood function of the probabilistic SFA is expressed as follows: where x 1:T and y 1:T denote sets of vectors x 1:T ¼ fx 1 ; x 2 ; . . . ; x T g and y 1:T ¼ fy 1 ; y 2 ; . . . ; y T g. pð y 1 Þ expresses an initial condition obeying a Gaussian distribution. To evaluate this marginal likelihood function, the previous study by the Turner and Sahani 12) employed an approximation by assuming that the observation noise 2 x is zero. In their approximation, the probabilistic model of the observation model pðx t jy t ; W; 2 x IÞ is assumed to be Dirac's delta function ðx t À W À1 y t Þ instead of Eq. (8) in order to simplify the integration in the marginal likelihood function. However, this approximation assumes that there exists no observation noise. A recent study showed that the estimation accuracy in the conventional probabilistic SFA is lowered by noisy data. 15) In this study, we derive the marginal likelihood function rigorously. Equation (9) includes the high-dimensional integration of the joint probability distribution with respect to y t . As shown in Fig. 2, the graphical model of probabilistic SFA has a straight structure and no loops. This structure enables us to overcome the difficulty of high-dimensional integration by using the belief propagation. Thus, we can perform integration with respect to each latent variable y t consecutively from time t ¼ 1. Since y t depends on y tÀ1 , integrations after time t ¼ 2 can be performed by using the integration for the preceding time. By the belief propagation, a marginal distribution ð y t Þ can be propagated as a message from time t ¼ 1 to time t ¼ T using the following recursion relation: where the marginal distribution ð y t Þ obeys a Gaussian distribution: Detailed derivation of Eq. (10) is given in the Appendix. The coefficient c t here is the conditional distribution of the observation model as follows: where pðx t jx 1 ; . . . ; x tÀ1 Þ obeys a Gaussian distribution. Substituting the system model [Eq. (7)] into the integral in the right-hand side in Eq. (10), we obtain the following equation: Z dy tÀ1 ð y tÀ1 Þpð y t jy tÀ1 Þ where we put P tÀ1 ¼ AE þ V tÀ1 . By substituting this expression into Eq. (10), we obtain the following recursion relations: where K t is a Kalman gain matrix as follows: and Z t is defined as By conducting analytical integrations consecutively using the belief propagation, we obtain the rigorously derived expression for the marginal likelihood function of the probabilistic SFA as follows: The rigorously derived marginal likelihood function pðx 1:T jÞ is a product of c t . The average and covariance of each c t can be obtained from the matrices and vectors of the preceding time t À 1 by means of belief propagation. Using the rigorously derived marginal likelihood function and the state space model of the probabilistic SFA, the proposed method extracts the slow feature and its underlying parameters simultaneously. Note that the approximated marginal likelihood function used in the conventional method 12) can be obtained in the limit of 2 x ! 0.

Results
In this section, we evaluate the validity of proposed probabilistic SFA. Both latent and observed variables are generated numerically based on the probabilistic SFA. Namely, we assume that latent variables y t at time t are generated from those at the preceding time, and that the observed data x t is generated through noisy observation under linear transformation of the latent variables y t .
We estimate the latent variables y 1:T and the parameters ¼ fW À1 ; ; AE; 2 x g simultaneously from the observed data x 1:T . The latent variables y 1:T are estimated using the Kalman filter, whereas the parameters are estimated using the maximum likelihood method for the rigorously derived marginal likelihood function: ¼ arg max pðx 1:T jÞ. For simplicity, the dimension of the observed variables x t is set to be the same as that of the latent variables y t (M ¼ N ¼ 3), and the total number of time steps is set to be T ¼ 10000. The parameters of probabilistic SFA are set to be 1 ¼ 0:9, 2 ¼ 0:5, 3 ¼ 10 À5 , 2 1 ¼ 0:19, 2 2 ¼ 0:75, and 2 3 ¼ 0:99999.

Estimation of slow feature from multi-dimensional data
Here, we extract the slow feature from noisy observed data x t by using the proposed method. We use the observed variables x t generated under noisy observation ( 2 x ¼ 0:5; 1) or noiseless observation ( 2 Figure 3(a) shows the numerical data for 2 x ¼ 1. We find that each of the observed variables x t includes both slowly-varying and fast-varying elements, whereas each of the latent variables y t shows dynamics with different specific slowness. Note that the latent variable y 1;t with the largest i corresponds to the slow feature.
The upper graphs in Fig. 3(b) show the slow featureỹ t;1 estimated by the proposed method. We find that, even for noisy data, the estimated slow featureỹ t;1 (red dotted line) exhibits dynamical behaviors similar to those of the true slow feature y t;1 (black solid line). In contrast, the slow feature estimated from noisy data by the conventional method is less similar to the true one [bottom graphs in Fig. 3(b)]. These results show that the proposed probabilistic SFA with a rigorously derived marginal likelihood function gives better performance than the previous probabilistic SFA. 12)

Effect of observation noise on performance
To evaluate the effect of observation noise on estimation performance quantitatively, we perform estimation for different levels of observation noise. Figure 4 shows how the discrepancy between the estimated and the true slow feature changes with the observation noise. We find that, for finite values of observation noise, the discrepancy between the estimated and the true slow feature in the proposed method is smaller than that in the conventional method. 12) This result shows that our proposed method gives better performance than the previous method; even though the results of the two methods are similar when there exists no observation noise, the estimation errors for the proposed method are much smaller than those for the previous method when there exists observation noise. From these results, we find that our proposed method using probabilistic SFA with the rigorously derived marginal likelihood function extracts the slow feature more accurately.
We also investigate the estimation accuracy of underlying parameters. The dependence of estimated parameters i (i ¼ 1; 2; 3) on observation noise is shown in Fig. 5. We see that with the proposed method, the estimated parameters are similar to the true ones even for a noisy environment. With the conventional method, however, the discrepancy between the estimated and true parameters increases as the observation noise increases. These results show that the proposed probabilistic SFA with a rigorously derived marginal likelihood function accurately estimates the slow feature and its underlying parameters.

Concluding Remarks
In this paper, we have proposed probabilistic SFA with a rigorously derived marginal likelihood function for extracting the latent slow feature from multi-dimensional data. The marginal likelihood function in the probabilistic SFA has been derived rigorously by means of belief propagation in the proposed method, whereas the marginal likelihood function was approximately evaluated in the conventional probabilistic SFA by assuming that there is no observation noise. Furthermore, we have shown using numerical data that our proposed method can estimate the slow feature and its underlying parameters for latent dynamics accurately even under noisy conditions. The conventional and proposed probabilistic frameworks of the SFA assume a simple system model described by firstorder autoregressive models with some constraints. This assumption may cause inaccurate estimation in the case of latent dynamics including complex oscillations. To overcome this limitation, it would be important to extend the probabilistic frameworks of the SFA to a framework with system model including an orthogonal matrix with offdiagonal elements and to consider more general dynamical systems including higher-order autoregressive models. We leave this extension as a future work.  1; 2; 3) on the variance of observation noise 2 x . Estimated parameters in the proposed method are indicated by filled marks whereas those in the conventional method 12) are indicated by open marks. For both methods, estimated values of 1 , 2 , and 3 are indicated by circles, triangles, and squares, respectively. Note that the true parameter values are 1 ¼ 0:9, 2 ¼ 0:5, 3 ¼ 10 À5 . We find that parameters are accurately estimated even in noisy case in the proposed method whereas the discrepancy between the true and estimated parameters increases with the observation noise in the conventional method.   4. Discrepancy between the true slow feature y 1;t and the estimated slow featureỹ 1;t evaluated for different levels of observation noise 2 x . The discrepancies obtained from the proposed method and the conventional method 12) are indicated by circles and squares, respectively.