A Decentralized Approach for Nonlinear Prediction of Time Series Data in Sensor Networks

Wireless sensor networks rely on sensor devices deployed in an environment to support sensing and monitoring of quantities such as temperature, humidity, motion, and acoustic signals. Here, we propose a new approach to model physical phenomena and track their evolution by taking advantage of recent developments in pattern recognition for nonlinear functional learning. These methods are, however, not suitable for distributed learning in sensor networks as the order of the models scales linearly with the number of deployed sensors and measurements. To circumvent this drawback, we propose to design reduced-order models by using an easy-to-compute sparsification criterion. We also propose a kernel-based least-mean-square algorithm for updating the model parameters using data collected by each sensor. The relevance of our approach is illustrated by two applications: estimating a temperature distribution and tracking its evolution over time.


Introduction
Wireless sensor networks consist of spatially distributed autonomous sensors whose objective is to cooperatively monitor physical or environmental parameters such as temperature, humidity, concentration, and pressure. Starting with critical military applications, they are now used in many industrial and civilian areas, including industrial process monitoring and control, environment and habitat monitoring, and home automation. Some common examples are monitoring the state of permafrost and glaciers, tracking the spread of wildland and forest fires, detecting water and air pollution, and sensing seismic activity, to mention a few. Modeling the phenomena under consideration allows the extrapolation of present states over time and space. This can be used to identify trends or to estimate uncertainties in forecasts, and to prescribe detection, prevention, or control strategies accordingly.
Here, we consider the problem of modeling complex processes such as heat conduction and pollutant diffusion with wireless sensor networks, and of tracking their changes over time and space. This typically leads to a dilemma between incorporating enough complexity or realism on the one hand, and keeping the model tractable on the other hand. Due to computational resource limitations, priority is given to simplified models that separate the important from the irrelevant. Many approaches have been proposed to address this issue with collaborative sensor networks. In [1], an incremental subgradient optimization procedure has been applied in a distributed fashion for the estimation of a single parameter; see also [2] for an extension to clusters. It consists of passing the parameter from sensor to sensor, and updating it to minimize a given cost function locally. More than one pass over all the sensors may be required for convergence to the optimal (centralized) solution, and this number of cycles can be theoretically bounded. The main advantages of this method are the simple sensor-to-sensor scheme, with a short pathway and without lateral communication, and the need to communicate only the estimated parameter value between sensors. However, as explained in [3], such a technique cannot be used for functional estimation, since evaluating the subgradient in the vicinity of each sensor requires information related to the other sensors. Model-based techniques that exploit the temporal and spatial redundancy of data in order to compress communications have also been considered. For instance, in [4], the data captured by each sensor over a time interval are fitted by (cubic) polynomial curves whose coefficients are communicated between sensors. Since there is a significant amount of redundancy between measurements performed by two nearby sensors, spatial correlations are also modeled by defining the basis functions over both the spatial parameters and time. The main drawback of such techniques is their dependence upon the modeling assumptions. Model-independent methods based on kernel machines have recently been investigated. In particular, a distributed learning strategy has been successfully applied to regression in sensor networks [5,6]. Here, each sensor acquires information from neighboring sensors to solve the least-squares problem locally. This broadcast, unfortunately, leads to high energy consumption.
We take advantage of the strengths of some of the above-mentioned methods to derive our approach. In particular, we will require the following important properties to hold.
(i) Sensor-to-sensor scheme: each sensor has the same importance in the network at each updating cycle. Thus, failure of any sensor has a small impact on the overall model, as opposed to cluster-head failure in aggregation and clustering techniques. It should be noted that several such conventional methods have been investigated specifically for use in sensor networks. Examples are LEACH with data fusion in the cluster head [7], PEGASIS with data conveyed to a leader sensor [8], and (minimum) spanning tree and junction tree, to name a few.
(ii) Kernel machines: these model-independent methods have gained popularity over the last decade. Initially derived for regression and classification with support vector machines [9], they include classical techniques such as least-squares methods and extend them to nonlinear functional approximation. Kernel machines are increasingly applied in the field of sensor networks for localization [10], detection [11], and regression [3]. One potential problem of applying classical kernel machines to distributed learning in sensor networks is that the order of the resulting models scales linearly with the number of deployed sensors and measurements.
(iii) Spatial redundancy: taking the spatial correlation of data into account has been recommended by numerous researchers. See, for example, [12-14], where the relation between the topology of the network and measurement data is studied. In particular, the authors of [12] seek to identify a small subset of representative sensors which leads to minimal distortion of the data.
In this paper, we propose a new approach to model physical phenomena and track their evolution over time. The new approach is based on a kernel machine but controls the model order through a coherence-based criterion that reduces spatial redundancy. It also employs sensor-to-sensor communication, and is thus robust to single-sensor failures. The paper is organized as follows. The next section briefly reviews functional learning with kernel machines and addresses its limitations within the context of wireless sensor networks. It is shown how to overcome these limitations through a model-order reduction strategy. Section 3 describes the proposed algorithm and its application to instantaneous functional estimation and tracking. Section 4 addresses implementation issues in sensor networks. Finally, we report simulation results in Section 5 to illustrate the applicability of the proposed approach.

Functional Learning and Sensor Networks
We consider a regression problem whose goal is, for example, to estimate the temperature distribution over a region where n wireless sensors are randomly deployed. We denote by X the region of interest, which is supposed to be a compact subset of R^d, and by ‖·‖ its conventional Euclidean norm. We wish to determine a function ψ*(·) defined on X that best models the spatial temperature distribution. The latter is learned from information coupling sensor locations and measurements. The information from the n sensors located at x_i ∈ X and providing measurements d_i ∈ R, with i = 1, ..., n, is combined in the set of pairs {(x_1, d_1), ..., (x_n, d_n)}. The fitness criterion is the mean-square error between the model outputs ψ(x_i) and the measurements d_i, for i = 1, ..., n, namely,

min_ψ (1/n) ∑_{i=1}^{n} (d_i − ψ(x_i))².    (1)

Note that this problem is underdetermined since there exists an infinite number of functions that minimize this cost. To obtain a well-posed problem, one must restrict the space of candidate functions. The framework of reproducing kernels allows us to circumvent this drawback.

A Brief Review of Kernel Machines.
We consider a reproducing kernel κ : X × X → R. We denote by H its reproducing kernel Hilbert space, and by ⟨·,·⟩_H the inner product in H. This means that every function ψ(·) of H can be evaluated at any x ∈ X with

ψ(x) = ⟨ψ(·), κ(x, ·)⟩_H.    (2)

By using Tikhonov regularization, minimization of the cost functional over H leads to the optimization problem

ψ*(·) = arg min_{ψ ∈ H} ∑_{i=1}^{n} (d_i − ψ(x_i))² + η ‖ψ‖²_H,    (3)

where η controls the trade-off between the fit to the available data and the smoothness of the solution. Before proceeding, we recall that data-driven reproducing kernels have been proposed in the literature, as well as more classical and universal ones. In this paper, without any essential loss of generality, we are primarily interested in radial kernels. They can be expressed as a decreasing function of the Euclidean distance in X, that is, κ(x_i, x_j) = κ(‖x_i − x_j‖) with some abuse of notation. Radial kernels have a natural interpretation as a measure of similarity in X, the kernel value being larger the closer together two locations are.
Two typical examples of radial kernels are the Gaussian and Laplacian kernels, defined as

κ(x_i, x_j) = exp(−‖x_i − x_j‖² / 2β_0²)  and  κ(x_i, x_j) = exp(−‖x_i − x_j‖ / β_0),    (4)

where β_0 is the kernel bandwidth. Figure 1 represents these kernels. Other examples of kernels, radial or not, can be found in [15].
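For concreteness, the two kernels in (4) can be sketched as follows (a minimal NumPy sketch; the function names and the bandwidth convention are ours):

```python
import numpy as np

def gaussian_kernel(xi, xj, beta0):
    # Unit-norm radial kernel: kappa(x, x) = 1.
    r = np.linalg.norm(np.asarray(xi) - np.asarray(xj))
    return np.exp(-r**2 / (2.0 * beta0**2))

def laplacian_kernel(xi, xj, beta0):
    # Also unit norm; heavier tails than the Gaussian kernel.
    r = np.linalg.norm(np.asarray(xi) - np.asarray(xj))
    return np.exp(-r / beta0)
```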
It is well known in the machine-learning community that the optimal solution of the optimization problem (3) can be written as a kernel expansion in terms of the available data [16,17], namely,

ψ*(·) = ∑_{j=1}^{n} α_j κ(x_j, ·).    (5)

This means that the optimal function is uniquely identified by the weighting coefficients α_1, ..., α_n and the n sensor locations x_1, ..., x_n. Whereas the initial optimization problem (3) considers the infinite-dimensional hypothesis space H, we are now seeking the optimal vector α = [α_1 ··· α_n]^T in the n-dimensional space of coefficients. The corresponding cost function is obtained by inserting the model (5) into the optimization problem (3). This yields

α* = arg min_α ‖d − Kα‖² + η α^T K α,    (6)

where K is the so-called Gram matrix whose (i, j)th entry is κ(x_i, x_j), and d = [d_1 ··· d_n]^T is the vector of measurements. The solution to this problem is given by

α* = (K + η I)^{-1} d.    (7)

Note that the computational complexity involved in solving this problem is O(n³). Practicality of wireless sensor networks imposes constraints on the computational complexity of the calculations performed by each sensor, and on the amount of internode communication. To deal with these constraints, the optimization problem (6) may be solved distributively using a receive-update-transmit scheme. For example, sensor i gets the parameter vector α_{i−1} from sensor i − 1, and updates it to α_i based on the error e_i defined by

e_i = d_i − [κ(x_1, x_i) ··· κ(x_n, x_i)] α_{i−1}.    (8)

In order to compute [κ(x_1, x_i) ··· κ(x_n, x_i)], however, each sensor must know the locations of all the other sensors. This unfortunately imposes a substantial demand on both storage and computational time, as most practical applications require a large number of densely deployed sensors for coverage and robustness reasons. To alleviate these constraints, we propose to control the model complexity in order to significantly reduce the computational effort and communication requirements.
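As a point of reference, a minimal sketch of the centralized solution (7), whose O(n³) cost is precisely what rules it out in-network (variable names are ours; `kernel` is any function such as those sketched above):

```python
import numpy as np

def centralized_model(X, d, kernel, eta):
    # Gram matrix K with entries kappa(x_i, x_j), then solve (K + eta*I) alpha = d.
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    return np.linalg.solve(K + eta * np.eye(n), d)

def evaluate(x, locations, alpha, kernel):
    # Kernel expansion (5) evaluated at location x.
    return sum(a * kernel(xl, x) for a, xl in zip(alpha, locations))
```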

Complexity Control of Kernel Machines in Sensor Networks.
Consider the restriction of the kernel expansion (5) to a dictionary D_m composed of m functions κ(x_{ω_k}, ·) carefully selected among the n available ones, where {ω_1, ..., ω_m} is a subset of {1, ..., n} and m is several orders of magnitude smaller than n. This is equivalent to choosing m sensors denoted by ω_1, ..., ω_m whose locations are given by x_{ω_1}, ..., x_{ω_m}. The resulting reduced-order model is given by

ψ(·) = ∑_{k=1}^{m} α_k κ(x_{ω_k}, ·).    (9)

The selection of the kernel functions in the reduced-order model is crucial for achieving good performance. In particular, the removed kernel functions must be well approximated by the remaining ones in order to minimize the difference between the optimal model given in (5) and the reduced one in (9). A variety of methods have been proposed in recent years for deriving kernel-based models with reduced order. They broadly fall into two categories. In the first one, the optimization problem (6) is regularized by an ℓ_1 penalization term applied to α [18,19]. These techniques are not suitable for sensor networks due to their large computational requirements. In the second category, postprocessing algorithms are used to control the model order as new data become available. For instance, the short-time approach consists of including, as we visit each sensor, the newly available kernel function while removing the oldest one. Another technique, called truncation, removes the kernel functions associated with the smallest weighting coefficients α_i. These naive methods usually exhibit poor performance because they ignore the relationships between the kernel functions of the model. To efficiently control the order of the model (9) as the model travels through the network, only the less redundant kernel functions must be added to the kernel expansion. Several criteria have been proposed in the literature to assess the contribution of each new kernel function to an existing model. In [20-22], for instance, the kernel function κ(x_i, ·) is inserted into the model, and its order is increased by one, if the approximation error

min_γ ‖ ∑_{k=1}^{m} γ_k κ(x_{ω_k}, ·) − κ(x_i, ·) ‖_H    (10)

is greater than a given threshold, where κ is a unit-norm kernel (replace κ(x_i, x_j) by κ(x_i, x_j)/√(κ(x_i, x_i) κ(x_j, x_j)) if this is not the case). Note that this criterion requires the inversion of an m-by-m matrix, and thus demands high precision and a large computational effort from the microprocessor at each sensor. In this paper, we cut down the computational cost associated with this selection criterion by using an approximation which has a natural interpretation in the wireless sensor network setting. Based on recent work in kernel-based online prediction of time series by three of the authors [23,24], we employ the coherence criterion, which includes the candidate kernel function κ(x_i, ·) in the mth-order model provided that

max_{k=1,...,m} |κ(x_i, x_{ω_k})| ≤ ν,    (11)

where ν is a threshold in [0, 1[ that determines the sparsity level of the model. By the reproducing property of H, we note that κ(x_i, x_{ω_k}) = ⟨κ(x_i, ·), κ(x_{ω_k}, ·)⟩_H. The condition (11) then results in a bounded cross-correlation of the kernel functions in the model. Without going into details, we refer interested readers to our recent paper [23], where we study the properties of the resulting models, and connections to other sparsification criteria such as (10) or kernel principal component analysis.
We shall now show that the coherence criterion has a natural interpretation in the wireless sensor network setting. Let us compute the distance between two kernel functions in H:

‖κ(x_i, ·) − κ(x_j, ·)‖²_H = 2 (1 − κ(x_i, x_j)),    (12)

where we have assumed, without substantive loss of generality, that κ is a unit-norm kernel. Back to the coherence criterion and using the above result, (11) can be written as follows:

min_{k=1,...,m} ‖κ(x_i, ·) − κ(x_{ω_k}, ·)‖²_H ≥ 2 (1 − ν).    (13)

Thus, the coherence criterion (11) is equivalent to a distance criterion in H where kernel functions are discarded if they are too close to those already in the model. Distance criteria are relevant within the context of sensor networks since they can be related to signal strength loss [10]. We shall discuss this property further at the end of the next section when we study the optimal selection of sensors.
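In practice, the test (11) only involves kernel evaluations against the m dictionary locations; a sketch (the naming is ours):

```python
def should_insert(x_i, dict_locations, kernel, nu):
    # Rule (11): insert kappa(x_i, .) iff max_k |kappa(x_i, x_wk)| <= nu.
    if not dict_locations:
        return True  # an empty dictionary always accepts the first sensor
    return max(abs(kernel(x_i, x_w)) for x_w in dict_locations) <= nu
```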

Distributed Learning Algorithm
Let ψ(·) = ∑_{k=1}^{m} α_k κ(x_{ω_k}, ·) be the mth-order model where the kernels κ(x_{ω_k}, ·) form a ν-coherent dictionary determined under the rule (11). In accordance with the least-squares problem (3), the m-dimensional coefficient vector α* satisfies

α* = arg min_α ‖d − Hα‖² + η α^T K_ω α,    (14)

where H is the n-by-m matrix with (i, j)th entry κ(x_i, x_{ω_j}), and K_ω is the m-by-m matrix with (i, j)th entry κ(x_{ω_i}, x_{ω_j}). The solution α* is obtained as follows:

α* = (H^T H + η K_ω)^{-1} H^T d,    (15)

which requires O(m³) operations as compared to O(n³), with m ≪ n, for the optimal solution given by (7). We shall now cut down the computational cost further by using a distributed algorithm in which each sensor node updates the coefficient vector.
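Under the same conventions as above, a sketch of the reduced-order batch solution (15):

```python
import numpy as np

def reduced_order_model(X, d, dict_locations, kernel, eta):
    # H is n-by-m with entries kappa(x_i, x_wj); K_w is the m-by-m dictionary Gram matrix.
    H = np.array([[kernel(x, xw) for xw in dict_locations] for x in X])
    K_w = np.array([[kernel(xi, xj) for xj in dict_locations] for xi in dict_locations])
    # Solve (H^T H + eta*K_w) alpha = H^T d, an O(m^3) problem with m << n.
    return np.linalg.solve(H.T @ H + eta * K_w, H.T @ d)
```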

Recursive Parameter Updating.
To compute the solution (15) recursively, we consider an optimization algorithm based on the principle of minimal disturbance, as studied in our paper [25]. Sensor i computes α_i from the α_{i−1} received from sensor i − 1 by minimizing the norm of the difference between both coefficient vectors under the constraint ψ(x_i) = d_i. The optimization problem solved at sensor i is

α_i = arg min_α ‖α − α_{i−1}‖²  subject to  κ_i^T α = d_i,    (16)

where κ_i is the m-dimensional column vector whose kth entry is κ(x_i, x_{ω_k}). The model order control using (11) requires a different treatment for each of the two alternatives described next.

First alternative: max_{k=1,...,m} |κ(x_i, x_{ω_k})| > ν. Sensor i is close to one of the previously selected sensors ω_1, ..., ω_m in the sense of the norm in H. Thus, the kernel function κ(x_i, ·) does not need to be inserted into the model, whose order remains unchanged. Only the coefficient vector needs to be updated. The solution to (16) can be obtained by minimizing the Lagrangian function

L(α, λ) = ‖α − α_{i−1}‖² + λ (d_i − κ_i^T α),    (17)

where λ is the Lagrange multiplier. Differentiating this expression with respect to both α and λ, and setting the derivatives to zero, we get the equations

2 (α_i − α_{i−1}) − λ κ_i = 0,    (18)
κ_i^T α_i = d_i.    (19)

Assuming that κ_i^T κ_i is nonzero, these equations yield

λ = 2 (d_i − κ_i^T α_{i−1}) / (κ_i^T κ_i).    (20)

Substituting the expression for λ into (18) leads to the following recursion:

α_i = α_{i−1} + (ρ / ‖κ_i‖²) (d_i − κ_i^T α_{i−1}) κ_i,    (21)

where we have introduced the step-size parameter ρ in order to control the convergence rate of the algorithm.

Second alternative: max_{k=1,...,m} |κ(x_i, x_{ω_k})| ≤ ν. The topology defined by sensors ω_1, ..., ω_m does not cover the region monitored by sensor i. The kernel function κ(x_i, ·) is then inserted into the model, and will henceforth be denoted by κ(x_{ω_{m+1}}, ·). Now we have

ψ(·) = ∑_{k=1}^{m+1} α_k κ(x_{ω_k}, ·).    (22)

To accommodate the new entry α_{m+1}, we modify the optimization problem (16) as follows:

α_i = arg min_α ‖α_{[1:m]} − α_{i−1}‖² + α_{m+1}²  subject to  κ_i^T α = d_i,    (23)

where the subscript [1:m] denotes the first m elements of α. Note that κ_i now has one more entry, κ(x_i, x_{ω_{m+1}}). Writing the Lagrangian and setting to zero its derivatives with respect to α and λ, we get the following updating rule:

α_i = [α_{i−1}^T  0]^T + (ρ / ‖κ_i‖²) (d_i − κ_i^T [α_{i−1}^T  0]^T) κ_i.    (24)

The form of the recursions (21)-(24) is that of a kernel-based normalized LMS algorithm with an order-update mechanism. The pseudocode of the algorithm is summarized in Table 1.
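Putting the two alternatives together, one receive-update-transmit step at sensor i can be sketched as follows (the naming is ours; a unit-norm kernel is assumed so that ‖κ_i‖ never vanishes):

```python
import numpy as np

def knlms_step(x_i, d_i, alpha, dict_locations, kernel, nu, rho):
    # Kernel vector kappa_i against the current dictionary.
    k_i = np.array([kernel(x_i, x_w) for x_w in dict_locations])
    if len(dict_locations) == 0 or np.max(np.abs(k_i)) <= nu:
        # Rule (11) satisfied: grow the model, new coefficient initialized at 0 (24).
        dict_locations = dict_locations + [x_i]
        alpha = np.append(alpha, 0.0)
        k_i = np.append(k_i, kernel(x_i, x_i))
    # Normalized LMS correction, recursion (21).
    alpha = alpha + (rho / (k_i @ k_i)) * (d_i - k_i @ alpha) * k_i
    return alpha, dict_locations
```

A full updating cycle then amounts to folding this step over the sensors along the communication path, the message passed from node to node being the dictionary locations together with the coefficient vector.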

Algorithm and Remarks.
We now illustrate the proposed approach. We shall address the problem of optimally selecting the sensors ω_k in the next subsection. Consider the network schematically shown in Figure 2. Here, each sensor is represented by a node, and communications between sensors are indicated by one-directional arrows. The process is initialized with sensor 1, that is, we set ω_1 = 1 and m = 1. Let us suppose that sensors 2 and 3 belong to the neighborhood of sensor 1 with respect to criterion (11). As illustrated in Figure 2, this means that (11) is not satisfied for k = 1 and i = 2, 3. Thus, the model order remains unaltered when the algorithm processes the information at nodes 2 and 3. The coefficient vector α_i is updated using rule (21) for i = 2, 3. Sensor-to-sensor communications transmit the locations of the sensors that contribute to the model, here x_1, and the updated parameter vector. As information propagates through the network, it may be transmitted to a sensor that satisfies criterion (11). This is the case of sensor 4, which is then considered to be outside the area covered by the contributing sensors. Consequently, the model order m is increased by one at sensor 4, and the coefficient vector α_4 is updated using (24). Next, the sensor locations [x_1  x_4] and the parameter vector α_4 are sent to sensor 5, and so on.
Updating cycles can be repeated to refine the model or to track time-varying systems. (Though beyond the scope of this paper, one may assume a time-evolution kernel-based model in the spirit of [4], where the authors fit a cubic polynomial to the temporal measurements of each sensor.) For a network with a fixed sensor spatial distribution, the coefficient vectors α_i tend to be adapted using (21) with a fixed order m, after a transient period during which rules (21) and (24) are both used.
Note that different approaches may be used to derive the recursive parameter-updating equation. The wide literature available on adaptive filtering methods [26,27] can be used to derive different kernel-based adaptive algorithms that may have desirable properties for solving specific problems [23]. For instance, specific strategies may be used to tune the step-size parameter ρ in (21) and (24) for a better trade-off between convergence speed and steady-state performance. Note also that regularization is usually unnecessary in (21) and (24) for adequate values of ν. (Regularization would be implemented in (21) and (24) by using a step-size of the form ρ/(‖κ_i‖² + η), where η is the regularization coefficient.) Indeed, if sensor i is one of the m model-contributing sensors, then κ(x_i, x_i) is an entry of the vector κ_i. Assuming again without loss of generality that κ is a unit-norm kernel, this yields ‖κ_i‖ ≥ 1. Otherwise, sensor i does not satisfy criterion (11), which implies that there exists at least one index k such that κ(x_i, x_{ω_k}) > ν, and thus ‖κ_i‖ > ν.

On the Optimal Selection of Sensors ω_k.

In order to design an efficient sensor network and to perform a proper dimensioning of its components, it is crucial that the order m of the model be as small as possible. So far, we have considered a simple heuristic consisting of visiting the sensor nodes as they are encountered, and selecting on-the-fly with criterion (11) those to include in the model. We shall now formalize this selection as a minimum set cover combinatorial optimization problem.
We consider the finite set H_n = {κ(x_1, ·), ..., κ(x_n, ·)} of kernel functions, and the family of disks of radius ν centered at each κ(x_k, ·). Within this framework, a set cover is a collection of some of these disks whose union covers H_n. Note that we have denoted by D_m the set containing the κ(x_{ω_k}, ·)'s, with k = 1, ..., m. In the set-covering optimization problem, the question is to find a collection D_m with minimum cardinality. This problem is known to be NP-hard. The linear programming relaxation of this 0-1 integer program has been considered by numerous authors, starting with the seminal work [28]. Greedy algorithms have also received attention as they provide good or near-optimal solutions in a reasonable time [29]. Greedy algorithms make the locally optimal choice at each iteration, without regard for its implications on future stages. To introduce stochastic behavior, randomized greedy algorithms have been proposed [30,31]. They often generate better solutions than the pure greedy ones. To improve the solution quality further, sophisticated heuristics such as simulated annealing [32], genetic algorithms [33], and neural networks [34] introduce randomness in a systematic manner.
Consider, for instance, the use of the basic greedy algorithm to determine D_m. The greedy algorithm for set covering chooses, at each stage, the set which contains the largest number of uncovered elements. It can be shown that this algorithm achieves an approximation ratio of the optimum equal to H(p) = ∑_{k=1}^{p} 1/k, where p is the size of the largest set of the cover. To illustrate the effectiveness of this approach for sensor selection, and to compare it with the on-the-fly method discussed previously, 100 sensors were randomly deployed over a 1.6-by-1.6 square area. The variation of the number of selected sensor nodes as a function of the coherence threshold was examined. To provide numerical results independent of the kernel form, and thus to simplify the presentation, criterion (11) was replaced by its distance form: in the case where κ is a strictly decreasing function of the distance, criterion (11) can be rewritten as

min_{k=1,...,m} ‖x_i − x_{ω_k}‖ > ν_x,    (25)

where ν_x = κ^{-1}(ν). The results are reported in Figure 3, and illustrative examples are shown in Figure 4. These examples indicate that the greedy algorithm, which is based on centralized computing, performs only slightly better than the on-the-fly method. Moreover, it can be observed that m rapidly settles at moderate values in both cases. Our experience has shown that the application of either algorithm leads to a model order m which is at least one order of magnitude smaller than the number of sensors n. This property will be illustrated in Section 5, where the results obtained using the elementary decentralized on-the-fly algorithm indicate that there is room for further improvement of the proposed approach.
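A sketch of the greedy selection in the distance form (25), where each selected sensor covers every sensor within ν_x of it (implementation choices are ours):

```python
import numpy as np

def greedy_cover(X, nu_x):
    # Pairwise distances; covers[k, i] is True when sensor k covers sensor i.
    X = np.asarray(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    covers = dist <= nu_x
    uncovered = np.ones(len(X), dtype=bool)
    selected = []
    while uncovered.any():
        # Greedy step: pick the disk covering the most uncovered sensors.
        gains = (covers & uncovered).sum(axis=1)
        k = int(np.argmax(gains))
        selected.append(k)
        uncovered &= ~covers[k]
    return selected
```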

Resource Requirements
Algorithms designed to be deployed in sensor networks must be evaluated regarding their requirements for energy consumption, computational complexity, memory allocation, and communications. Most of these requirements are interrelated, and thus cannot be analyzed independently. In this section, we provide a brief discussion of several aspects regarding such requirements for the proposed algorithm, assuming for simplicity that each sensor requires similar resources to receive a message from the previous sensor, update its content, and send it to the next one.
Energy-Accuracy Trade-Off. The proposed solution allows for a trade-off between energy consumption and model accuracy, which can be adjusted according to the application requirements. Consider the case of a large neighborhood radius ν_x, that is, a small coherence threshold ν. Then, applying rule (11), each sensor will have many neighbors and the resulting model order will be low. This will result in a low computational cost and low power consumption for communication between sensors, at the price of a coarse approximation. On the other hand, a small value for ν_x will result in a large model order. This will lead to a small approximation error, at the price of a high computational load for updating the model at each sensor, and high power requirements for communication. This is the well-known energy-accuracy dilemma.
Localization. As each node needs to know its location, a preprocessing stage for sensor autolocalization is often required. The available techniques for this purpose can be grouped into centralized and decentralized ones; see, for example, [10, 35-37] and references therein. The former require the transmission of ranging information, such as distance or received-signal-strength measurements, from the sensors to a fusion center. The latter make each sensor location-aware using information gathered from its neighbors. The decentralized approach is more energy-efficient, in particular for large-scale sensor networks, and should be preferred over the centralized one. Note that the model (9) locally requires the knowledge of only m out of the n sensor locations, with m ≪ n.
Computational Complexity. The m-dimensional parameter updating presented in this paper uses, at each sensor node, an LMS-based adaptive algorithm. LMS-based algorithms are very popular in industrial applications, mainly because of their low complexity, namely O(m) operations per updating cycle and per sensor, and their numerical robustness [26].
Memory Storage. Each sensor node must store its coordinates, the coherence threshold ν, the step-size parameter ρ, and the parameters of the kernel. Unlike conventional techniques, the proposed algorithm does not require storing information about local neighborhoods. All a sensor node needs to know is whether it is a neighbor of the model-contributing sensors. This is determined by evaluating rule (11) using the locations x_{ω_k} transmitted by the last active node.
Energy Consumption. Communications account for most of the energy consumption in wireless sensor networks. The energy spent in communication is often dramatically greater than the energy consumed by in-sensor computations, although the latter is difficult to estimate accurately. Consider, for instance, the energy dissipation model introduced in [7]. According to this model, the energy required to transmit one bit between two sensors ℓ meters apart is given by E_amp ℓ² + E_elec, where E_elec denotes the electronic energy and E_amp the amplifier energy. E_elec depends on the signal processing required, and E_amp depends on the acceptable bit-error rate. The energy cost incurred by the reception of one bit can be modeled as well by E_elec. Therefore, the energy dissipation is quadratic in the routing distance, and linear in the number of bits sent. The proposed algorithm transmits information between neighboring sensors only, and requires the transmission of a small amount of information.
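Under this first-order radio model, the cost of one hop can be estimated along the following lines (a sketch; the constant values below are illustrative placeholders, not values taken from [7]):

```python
def hop_energy(bits, ell, E_elec=50e-9, E_amp=100e-12):
    # Transmit: bits * (E_elec + E_amp * ell^2); receive: bits * E_elec.
    # The constants here are placeholders, to be set from the radio datasheet.
    transmit = bits * (E_elec + E_amp * ell**2)
    receive = bits * E_elec
    return transmit + receive
```

In the proposed scheme, ℓ is a neighbor-to-neighbor distance and the message length grows only with m, which keeps both factors small.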
Evaluation. In a postprocessing stage of the distributed learning algorithm, the model is used to estimate the investigated spatial distribution at given locations. From (9), this requires m evaluations of the kernel function, m multiplications by the weighting coefficients, and m additions. This reinforces the importance of a reduced model order m, as provided by the proposed algorithm.

Simulations
The emerging world of wireless sensor networks suffers from a lack of real system deployments and publicly available experimental data. Researchers often evaluate their algorithms and protocols with model-driven data [38]. Here, we consider a classical application of estimating a temperature field simulated using a partial differential equation solver. Before proceeding, let us describe the experimental setup. Heat propagation in an isotropic and homogeneous medium can be modeled by the partial differential equation

μC ∂T(x, t)/∂t − k ∇²_x T(x, t) + h (T(x, t) − T_ext) = Q(x, t),    (26)

where T(x, t) is the temperature as a function of location and time, μ and C the density and the heat capacity of the medium, k the coefficient of heat conduction, Q(x, t) the heat sources, h the convective heat transfer coefficient, and T_ext the external temperature. In the above equation, ∇²_x denotes the Laplace spatial operator. Two sets of experiments were conducted. In the first experimental setup, we considered the problem of estimating a static spatial temperature distribution, and studied the influence of the different tuning parameters on the convergence of the algorithm. In the second experimental setup, we studied the problem of monitoring the evolution of the temperature over time.
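For readers who wish to reproduce model-driven data without the Matlab PDE solver, a crude explicit finite-difference step of (26) on a regular grid might look as follows (a sketch under our own discretization assumptions, with periodic boundaries for brevity):

```python
import numpy as np

def heat_step(T, Q, dt, dx, muC=1.0, k=1.0, h=0.0, T_ext=0.0):
    # Five-point Laplacian; explicit scheme, stable only for dt <= muC*dx**2/(4*k).
    lap = (np.roll(T, 1, 0) + np.roll(T, -1, 0) +
           np.roll(T, 1, 1) + np.roll(T, -1, 1) - 4.0 * T) / dx**2
    # Equation (26) rearranged: muC dT/dt = k*lap(T) - h*(T - T_ext) + Q.
    return T + dt * (k * lap - h * (T - T_ext) + Q) / muC
```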

Estimation of a Static Field.
As a first benchmark problem, we reduce the propagation equation (26) to a simpler partial differential equation of parabolic type. The region of interest is a 1.6-by-1.6 square area with open boundaries and two circular heat sources dissipating 20 W. In order to estimate the spatial distribution of temperature at any location, 100 sensors were randomly deployed according to a uniform distribution. The desired outputs T(x), generated by using the Matlab PDE solver, were corrupted by a measurement noise sampled from a zero-mean Gaussian distribution with standard deviation σ equal to 0.01 at first, and next equal to 0.1. This led to signal-to-noise ratios, defined as the ratio of the powers of T(x) and the additive noise, of 25 dB and 9.7 dB, respectively. These data were used to estimate a nonlinear model of the form T_n = ψ(x_n) based on the Gaussian kernel. Preliminary experiments were conducted as explained below to determine all the adjustable parameters, that is, the kernel bandwidth β_0 and the step-size ρ. To facilitate comparison between different settings, we fixed the threshold ν_x introduced in (25) rather than the coherence threshold ν presented in (11). The algorithm was then evaluated on several independent test signals. This led to the learning curves depicted in Figure 5, and to the performance reported in Table 2. An estimation of the temperature field is provided in Figure 6.

The preliminary experiments were conducted on sequences of 5000 noisy samples, which were obtained by visiting the 100 sensor nodes along a random path. These data were used to determine β_0 and ρ, for given ν_x. Performance was measured in steady state using the mean-square prediction error (1/1000) ∑_{n=4001}^{5000} (T_n − ψ_{n−1}(x_n))² over the last 1000 samples of each sequence, averaged over 40 independent trials. The threshold ν_x was varied from 0.20 to 0.40 in increments of 0.05. Given ν_x, the best performing step-size parameter ρ and kernel bandwidth β_0 were determined by grid search over the intervals (0.05 ≤ ρ ≤ 0.7) × (0.14 ≤ β_0 ≤ 0.26) with increments 0.05 and 0.02, respectively. A satisfactory compromise between convergence speed and accuracy was reached with β_0 = 0.24 and ρ = 0.3. The algorithm was tested over two hundred 5000-sample independent sequences, with the parameter settings obtained as described above and specified in Table 2. This led to the ensemble-average learning curves shown in Figure 5. Steady-state performance was measured by the normalized mean-square prediction error over the last 1000 samples, defined as

NMSE = E[∑_{n=4001}^{5000} (T_n − ψ_{n−1}(x_n))²] / E[∑_{n=4001}^{5000} T_n²],

where the expectation was approximated by averaging over the ensemble. Table 2 also reports the sample mean values m̄ of the model order m over the two hundred test sequences. It indicates that the prediction error decreased as m increased and ν_x decreased. Note that satisfactory levels of performance were reached with small model orders.
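The steady-state figure of merit above can be computed along these lines (a sketch; rows index independent trials, so that averaging over them approximates the expectations):

```python
import numpy as np

def steady_state_nmse(T_true, T_pred, last=1000):
    # Normalized mean-square prediction error over the last `last` samples,
    # with the ensemble average standing in for the expectation.
    err2 = (T_true[:, -last:] - T_pred[:, -last:])**2
    return np.sum(err2) / np.sum(T_true[:, -last:]**2)
```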
For comparison purposes, state-of-the-art kernel-based methods for online prediction of time series were also considered: NORMA [39], Sparse Sequential Projection (SSP) [40], and KRLS [22]. Like the KNLMS algorithm, NORMA performs stochastic gradient descent in the RKHS. The order of its kernel expansion is fixed a priori, since it uses the m most recent kernel functions as a dictionary. NORMA requires O(m) operations per iteration. The SSP method also starts with a stochastic gradient descent step to calculate the a posteriori estimate. The resulting (m + 1)th-order kernel expansion is then projected onto the subspace spanned by the m kernel functions of the dictionary, and the projection error is compared to a threshold in order to evaluate whether the contribution of the (m + 1)th candidate kernel function is significant enough. If not, the projection is used as the a posteriori estimate. In the spirit of the sparsification rule (10), this test requires O(m²) operations per iteration when implemented recursively. KRLS is an RLS-type algorithm with, in [22], an order-update process controlled by the condition (10). Its computational complexity is also O(m²) operations per iteration. Table 3 reports a comparison of the estimated computational costs per iteration for each algorithm, in the most usual situation where no order increase is performed. These results are expressed, for real-valued data, in terms of the number of real multiplications and real additions.
The temperature distribution T(x) considered previously, corrupted by a zero-mean white Gaussian noise with standard deviation σ equal to 0.1, was used to estimate a nonlinear model of the form T_n = ψ(x_n) based on the Gaussian kernel. The same initialization process used for KNLMS was followed to initialize and test NORMA, SSP, and KRLS. This means that preliminary experiments were conducted on 40 independent 5000-sample sequences to perform explicit grid search over the parameter spaces and, following the notations used in [22, 39, 40], to select the best settings reported in Table 3. For an unambiguous comparison of these algorithms, note that their sparsification rules were individually hand-tuned, via appropriate threshold selection, to provide models with approximately the same order m. In addition, the Gaussian kernel bandwidth β_0 was set to 0.24 for all the algorithms. Each approach was tested over two hundred 5000-sample sequences, which led to the normalized mean-square prediction errors displayed in Table 3. As shown in Figure 7, the algorithms with quadratic complexity performed better than the other two, with only a small advantage of SSP over KNLMS. Obviously, this must be balanced with the large increase in computational cost. This experiment also highlights that KNLMS significantly outperformed the other algorithm with linear complexity, namely NORMA, which clearly demonstrates the effectiveness of our approach.

Tracking of a Dynamic Field.

In this second set of experiments, we considered heat propagation governed by (26) in a partially bounded conducting medium. As can be seen in Figure 8, the region of interest is a 2-by-3 rectangular area with two heat sources that dissipate 2000 W when turned on. This area is surrounded by a boundary layer with a low conduction coefficient, except on the right side where an opening exists. The parameters used for the rectangular area were μC_r = 1, k_r = 10, and h_r = 0; a low conduction coefficient was used for the boundary layer. The heat sources were simultaneously turned on or off over periods of 10 time steps. In order to estimate the spatial distribution of temperature at any location, and track its evolution over time, 9 sensors were deployed in a grid. The desired outputs T(x, t) were generated using the Matlab PDE solver. They were corrupted by an additive zero-mean white Gaussian noise with standard deviation σ equal to 0.08, corresponding to a signal-to-noise ratio of 10 dB. These data were used to estimate a nonlinear model, based on the Gaussian kernel, that predicts temperature as a function of location and time. Preliminary experiments were conducted to determine the adjustable parameters of our algorithm, using 100 independent sequences of 360 noisy samples. Each sequence was obtained by collecting, simultaneously, the 9 sensor readings over 2 on-off source periods. Performance was measured with the mean-square prediction error, averaged over the 100 sequences. Due to the small number of available sensors, no condition on coherence was imposed via ν or ν_x. This led to models of order m = n = 9. The best performing kernel bandwidth β_0 and step-size parameter ρ were determined by grid search over the grid (0.3 ≤ β_0 ≤ 0.7) × (0.5 ≤ ρ ≤ 2.0) with increments of 0.05 for both β_0 and ρ. A satisfactory compromise between convergence speed and accuracy was reached by setting β_0 to 0.5, and ρ to 1.55.
The algorithm was tested over two hundred 360-sample sequences prepared as above. This led to the predicted temperature curves depicted in Figure 9, which shows, for each sensor, the predicted and measured temperatures, with both heat sources turned on over the intervals [0, 10] and [20, 30], and turned off over [10, 20] and [30, 40]; these curves demonstrate the ability of our technique to track local changes. Figure 8 provides two snapshots of the spatial distribution of temperature, at time instants 10 and 20, when the heat sources are turned off and on, respectively. We can observe the flow of heat from inside the container to the outside, through the opening in the right side.

Conclusion
Over the last ten years or so, there has been an explosion of activity in the field of learning algorithms utilizing reproducing kernels, most notably for classification and regression. The use of kernels is an attractive computational shortcut to create nonlinear versions of conventional linear algorithms. In this paper, we have demonstrated the versatility and utility of this family of methods in developing a nonlinear adaptive algorithm for time series prediction in sensor networks. A common characteristic of kernel-based methods is that they deal with models whose order equals the size of the training set, making them unsuitable for online applications. Therefore, it was essential to first propose a mechanism for controlling the increase in the model order as new input data become available. This led us to consider the coherence criterion. We incorporated it into a kernel-based normalized LMS algorithm with an order-update mechanism, which is the main contribution of this study. Our approach demonstrated good performance in the experiments.
Perspectives include the derivation of new sparsification criteria based also on sensor readings, rather than being limited to sensor locations. This would certainly result in better performance for the prediction of dynamic fields. Online optimization of this criterion, by adding or removing kernel functions from the dictionary, also seems interesting. Finally, in a broader perspective, controlling the movement of sensor nodes, when allowed by the application, to achieve improved estimation accuracy appears as a very promising subject for research.