Model-Free Distributed Reinforcement Learning State Estimation of a Dynamical System Using Integral Value Functions

One of the challenging problems in sensor network systems is to estimate and track the state of a target point mass with unknown dynamics. Recent improvements in deep learning (DL) have renewed interest in applying DL techniques to state estimation problems. However, in these works the process noise is typically absent, which implicitly restricts the target to be non-maneuvering, even though process noise is as significant as measurement noise when tracking maneuvering targets. In this paper, we propose a continuous-time (CT) model-free (or model-building) distributed reinforcement learning estimator (DRLE) for sensor networks that uses an integral value function. The DRLE algorithm is capable of learning an optimal policy from a neural value function that aims to provide the estimate of a target point mass. The proposed estimator consists of two high-pass consensus filters, in terms of weighted measurements and inverse-covariance matrices, and a critic reinforcement learning mechanism at each node of the network. The efficiency of the proposed DRLE is shown by a simulation experiment on a network of underactuated vertical takeoff and landing (VTOL) aircraft with strong input coupling. The experiment highlights two advantages of DRLE: i) it does not require the dynamic model to be known, and ii) it is an order of magnitude faster than the state-dependent Riccati equation (SDRE) baseline.


I. INTRODUCTION
In nonlinear control theory, distributed estimation and tracking of target dynamics is a fundamental task, and a challenging one due to the unobservable estimation errors of hidden states in such loosely coupled sensor networks. If the sensing model and the target dynamics are known and linear in the states, the distributed Kalman filtering (DKF) algorithm [1] can be applied. For instance, for a linear time-invariant (LTI) sensing model and a target observed in additive Gaussian noise, estimating the target reduces the DKF to two dynamic consensus problems, solved by two dynamic consensus filters (a low-pass filter and a band-pass filter). The published approaches on this subject can roughly be separated into two complementary categories: (A) derivative-building filters, which use Jacobians [1], i.e. truncated first-order Taylor series of the nonlinear functions, and (B) derivative-free filters, which avoid the use of Jacobians [2], [3]. Both concepts are generally justified for heterogeneous multi-sensor fusion. The task becomes more intriguing for target dynamics that cannot be easily modeled or that are unknown. In such a case, a learning mechanism turns out to be an attractive solution for controlling and estimating dynamic systems that interact with a physical environment, enabling more demanding tasks such as autonomous flying [4] and decision making [5], [6].
In practical implementations, both process and measurement noises appear in the dynamical system models, and it is crucial to consider the effect of measurement noise.
Adaptivity and optimality are central elements combined to control unknown dynamical system models. Reinforcement learning (RL) refers to a class of such methodologies [9]. In a standard RL setting, the dynamical system model is unknown, and the optimal control mechanism is learned through interaction with the system. The key idea is to learn and solve the value function that satisfies the Hamilton-Jacobi-Bellman (HJB) equation [10]. The celebrated policy iteration (PI) algorithm belongs to this class. The PI algorithm starts by evaluating the cost of a given admissible control policy and uses this evaluation to obtain an improved control policy. These two steps of policy evaluation and policy improvement are repeated until convergence to the optimal control is achieved [11]. Another closely related approach for learning continuous-time (CT) nonlinear systems is the neural network (NN) structure [12], which can be trained to yield an approximate solution of the value function at the policy evaluation step. This category of solutions is called model-building or model-free RL, and the term model-free is used to emphasize that no knowledge of the dynamical system is assumed.
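As a concrete illustration of the two PI steps, the sketch below alternates policy evaluation and greedy policy improvement on a small discrete MDP, a simple stand-in for the CT setting discussed here; all transition probabilities and rewards are made-up illustrative numbers, not taken from this paper.

```python
import numpy as np

# Illustrative 2-state, 2-action MDP (all numbers are invented).
P = np.array([  # P[a, s, s'] transition probabilities
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.1, 0.9], [0.7, 0.3]],   # action 1
])
R = np.array([[1.0, 0.0], [0.0, 2.0]])  # R[a, s] immediate reward
gamma = 0.9

policy = np.zeros(2, dtype=int)  # start from an arbitrary admissible policy
for _ in range(50):
    # Policy evaluation: solve (I - gamma * P_pi) V = R_pi for the value function.
    P_pi = P[policy, np.arange(2)]          # 2x2 transition matrix under policy
    R_pi = R[policy, np.arange(2)]
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
    # Policy improvement: act greedily with respect to the evaluated V.
    Q = R + gamma * P @ V                   # Q[a, s]
    new_policy = Q.argmax(axis=0)
    if np.array_equal(new_policy, policy):  # converged to the optimal policy
        break
    policy = new_policy
```

The same evaluate-improve loop underlies the CT algorithm of this paper, with the tabular solve replaced by a value-function approximation along measured trajectories.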
In all the aforementioned results, the state variable was assumed to be fully available [13]. However, such methods are known to perform poorly when the state variable is not fully observable, and the problem becomes even more challenging in distributed estimation for sensor-network setups [14]. This is particularly important in target tracking over a network, where only a noisy measurement of the output at each node is available for tracking. When a model-free methodology is employed, it is not possible to use a filter, e.g. the DKF, to estimate the state variable, because the dynamical system model is unknown. The existing distributed estimation algorithms for target tracking using sensor networks are limited; see, e.g., [15], [16].
In this paper, we derive a CT model-free distributed reinforcement learning estimator (DRLE) for sensor networks, based on PI and an integral value function. A DRLE is a network framework of interacting intelligent microfilters. Each microfilter is implemented as an embedded component in a sensor architecture. The DRLE algorithm is capable of learning an optimal policy from a neural value function that aims to provide the estimate of a target point mass. We take one step further and replace the state of the target point mass with a noisy measurement of the state at each node of the network. It should be noted that the DRLE algorithm belongs to the model-free family of classical RL. The only information used in the simulation is the outputs of a highly coupled nonlinear target point mass of known order; no known dynamical system model is assumed to generate the data. This paper makes three contributions. Our first contribution is to resolve the limitation of the existing derivative-building and derivative-free filters, which require the model dynamics, by introducing a novel model-free microfilter with two identical high-pass consensus filters. Our second contribution is to increase the performance by an order of magnitude. Our third contribution is to propose a baseline performance standard for distributed estimation and a comparison with the state-dependent Riccati equation (SDRE) algorithm [2], [17] on target point mass tracking.
The rest of the paper is organized as follows. In Section II, we describe briefly the motivation and problem statement. Our main results including the DRLE algorithm with two high-pass consensus filters are presented in Section III. In Section IV, we give the simulation result and compare our proposed DRLE algorithm with the SDRE approach. Finally, the conclusions are drawn.
Notation: Unless stated otherwise, all vectors in this paper are column vectors. The state variables are functions of time, e.g., x = x(t). The trace of an n × n square matrix A, denoted tr(A), is defined as the sum of the elements on the main diagonal of A.

II. MOTIVATION
We present the motivation behind the need for a novel design of a model-free distributed reinforcement learning estimator (DRLE) that uses an integral value function for the online generation of an optimal policy. This policy ultimately leads to the optimal observer gain without knowledge of the dynamical system model of the target point mass.

A. PROBLEM STATEMENT
Consider a network topology G = (V, E) with n nodes interconnected via an undirected graph. The nodes are denoted by the set V = {ν_1, ν_2, ..., ν_n}, and E ⊆ V × V is the set of edges. The objective is to perform distributed state estimation of a nonlinear target point mass. To be more precise, let

ẋ(t) = f(x(t), u(t)) + w(t)  (1)

be the dynamics of the target (e.g. a moving point mass) and

y_i(t) = h_i(x(t)) + v_i(t)  (2)

the observation model of node i in the considered network.
Here, x(t) ∈ R^m denotes the state of the target, y_i(t) ∈ R^r is the r-dimensional measurement vector of node i (nr measurements across the n sensors), u(t) ∈ R^m is the control input, and w(t) and v_i(t) are white noises with covariance matrices R_ww and R_vv,i, respectively. We also assume the h_i's differ across the network. The statistics of the target dynamics and the measurement noises are given by

E[w(t)w(τ)^T] = R_ww δ(t − τ),  E[v_i(t)v_j(τ)^T] = R_vv,i δ_ij δ(t − τ),

where δ(·) is the Dirac delta function. It is possible to transform (1) and (2) into a state-dependent coefficient (SDC) form as

ẋ(t) = A(x)x(t) + B(x)u(t) + w(t),  (5)
y_i(t) = H_i(x)x(t) + v_i(t).

Since we are trying to estimate the target point mass from the information data y_i(t) of each node of the network, we build the dynamics of the estimator as

x̂˙_i(t) = A(x̂_i)x̂_i(t) + B(x̂_i)u(t) + L_i(t)(y_i(t) − H_i(x̂_i)x̂_i(t)),  (7)

where A(x̂_i) − L_i(t)H_i(x̂_i) is the closed-loop dynamics of the estimator. The next step is to minimize the error covariance matrix for each node of the network,

Q_i(t) = E[(x(t) − x̂_i(t))(x(t) − x̂_i(t))^T].

Taking the derivative of the error covariance matrix Q_i(t) of each node, we get

Q̇_i = (A − L_i H_i)Q_i + Q_i(A − L_i H_i)^T + R_ww + L_i R_vv,i L_i^T.  (10)

Therefore, the optimal estimator gain L_i*(t) is the solution of the following equation:

∂ tr(Q̇_i)/∂L_i = −2 Q_i H_i^T + 2 L_i R_vv,i = 0,

which gives the optimal gain

L_i*(t) = Q_i(t) H_i^T R_vv,i^(−1).  (12)

It remains to determine the update rule for the error covariance matrix Q_i(t) of each node. Substituting (12) in (10), we have the Riccati equation

Q̇_i = A Q_i + Q_i A^T + R_ww − Q_i H_i^T R_vv,i^(−1) H_i Q_i.  (13)

The SDRE information filter [2] can be exploited to solve the Riccati equation (13). However, the SDRE approach suffers from two key weaknesses. First, the Riccati equation needs to be solved in real time, and its solution (if it exists) can only be represented numerically. Second, the dynamical system model of the target point mass (e.g. A(x)) must be known. These weaknesses motivate us to develop a novel DRLE for sensor networks with a broader range of aerospace applications. Furthermore, analytic closed-form expressions of the estimator gain L_i(t) of each node that approximate the numerical solutions offer several advantages: 1) They allow oversampling, providing a quick way to evaluate intermediate points. 2) In the context of optimal control, they can be used to express the observer feedback, allowing differentiation and integration for state estimation. 3) They are often smooth functions (e.g. polynomials), and smoothness of the observer feedback is a desired property. Furthermore, they can be easily implemented in microcontrollers.
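As a numerical aside, the covariance update and optimal gain just derived can be sketched for a single node. The code below Euler-integrates the Riccati equation (13) for an illustrative, hypothetical constant pair A, H (standing in for the SDC matrices, which in the full algorithm would be re-evaluated along the estimate trajectory) and then forms the optimal gain (12).

```python
import numpy as np

# Illustrative constant matrices; in the SDC setting these would be
# A(x_hat) and H_i(x_hat), re-evaluated along the estimate trajectory.
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
H = np.array([[1.0, 0.0]])
R_ww = 0.1 * np.eye(2)      # process-noise covariance
R_vv = np.array([[0.5]])    # measurement-noise covariance
R_vv_inv = np.linalg.inv(R_vv)

Q = np.eye(2)               # initial error covariance Q_i(0)
dt = 1e-3
for _ in range(40000):      # propagate the Riccati equation (13) by Euler steps
    Q_dot = A @ Q + Q @ A.T + R_ww - Q @ H.T @ R_vv_inv @ H @ Q
    Q = Q + dt * Q_dot

L_opt = Q @ H.T @ R_vv_inv  # optimal estimator gain (12): L* = Q H^T R_vv^(-1)
```

After the transient, Q settles at the steady-state solution of the Riccati equation, and the gain becomes constant; the SDRE approach repeats this numerical solve online, which is precisely the cost the DRLE avoids.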

III. DISTRIBUTED REINFORCEMENT LEARNING STATE ESTIMATION
In this section, we present a novel DRLE that does not rely on the dynamical system model of the target point mass. Substituting (12) in (7), the state propagation equation can be expressed as

x̂˙_i(t) = A(x̂_i)x̂_i(t) + B(x̂_i)u(t) + n Q_i(t)(p(t) − Z(t)x̂_i(t)),  (14)

with the aggregate measurement model

p(t) = Z(t)x(t) + (1/n) Σ_i H_i^T R_vv,i^(−1) v_i(t).  (15)

The terms p(t) and Z(t) in (14) and (15) are two network aggregate quantities and represent the average weighted measurements and the average inverse-covariance matrix, respectively. They can be written as follows:

p(t) = (1/n) Σ_{i=1}^n H_i^T R_vv,i^(−1) y_i(t),  (16)
Z(t) = (1/n) Σ_{i=1}^n H_i^T R_vv,i^(−1) H_i.  (17)

Both p(t) and Z(t) are time-varying quantities, and two dynamic consensus problems must be solved to compute the averages p(t) and Z(t). If each node can compute these averages, the estimator emerges: each node in the network calculates its consensus values p̂(t) and Ẑ(t), respectively. To perform distributed averaging, we use the high-pass consensus filter of [18].
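A minimal sketch of the two aggregate quantities (16) and (17), computed centrally for illustration; the observation matrices, noise covariances, and measurements below are synthetic stand-ins, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 2, 4                      # state dimension, number of nodes

# Illustrative per-node observation matrices and noise covariances.
H = [rng.standard_normal((1, m)) for _ in range(n)]
R_vv = [np.array([[0.5 + 0.1 * i]]) for i in range(n)]
x = rng.standard_normal(m)       # true target state (for synthetic measurements)
y = [H[i] @ x + 0.1 * rng.standard_normal(1) for i in range(n)]

# Network aggregates (16)-(17): average weighted measurement and
# average inverse-covariance ("information") matrix.
p = sum(H[i].T @ np.linalg.inv(R_vv[i]) @ y[i] for i in range(n)) / n
Z = sum(H[i].T @ np.linalg.inv(R_vv[i]) @ H[i] for i in range(n)) / n
```

In the distributed algorithm, no node forms these sums directly; each node only feeds its local summand into a consensus filter, whose output converges to the same averages.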
Let A denote the adjacency matrix of G and D its degree matrix, so that the graph Laplacian is given by D − A. The high-pass filter is of the following form:

ṡ_i = Σ_{j∈N_i} (s_j − s_i) + Σ_{j∈N_i} (ū_j − ū_i),  p_i = s_i + ū_i,

where ū_i is the input of node i, s_i is the state of the consensus high-pass filter, and p_i is its output. The inputs of node i for the two filters are H_i^T R_vv,i^(−1) y_i(t) and H_i^T R_vv,i^(−1) H_i, with zero initial states s_i(0) = 0. The outputs of the high-pass filters asymptotically converge to p(t) and Z(t) in (16) and (17), respectively. Up to this point, we have described a formal methodology to estimate the state of the nonlinear dynamics of a target point mass under the observation of nodes in a heterogeneous sensor network.
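A quick numerical check of this distributed averaging step, assuming the common state-space form of the high-pass dynamic consensus filter, ṡ = −L s − L ū with output p = s + ū (L the graph Laplacian), on an illustrative 4-node ring with constant scalar inputs, so every node's output should settle at the input average:

```python
import numpy as np

# 4-node ring graph: Laplacian Lap = D - Adj (illustrative topology).
Adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]], dtype=float)
Lap = np.diag(Adj.sum(axis=1)) - Adj

u_bar = np.array([1.0, 3.0, 5.0, 7.0])  # constant node inputs; average is 4.0
s = np.zeros(4)                         # filter states, s_i(0) = 0
dt = 1e-3
for _ in range(20000):
    # High-pass consensus filter: s_dot = -Lap s - Lap u_bar.
    s = s + dt * (-Lap @ s - Lap @ u_bar)
p_out = s + u_bar                       # each entry converges to the input average
```

Since the output obeys ṗ = −Lap p with p(0) = ū, and the Laplacian of a connected undirected graph preserves the mean, each node's output converges to the network average without any node ever seeing all inputs.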

A. MODEL-FREE DRLE USING AN INTEGRAL VALUE FUNCTION
In this subsection, we propose a model-free DRLE to estimate the state of a target point mass in a network, using the two separate dynamic consensus quantities in terms of weighted measurements and inverse-covariance matrices. The objective is to minimize the error covariance matrix Q_i(t) of each node without solving the Riccati equation (13), i.e. so that x̂(t) → x(t) for t ≥ 0. To proceed, consider the system model (14) and measurement model (15); then we define for each node the following integral value function:

V_i(x̃_i(t)) = ∫_t^∞ r_i(x̃_i(τ), ū_i(τ)) dτ,  r_i(x̃_i, ū_i) = x̃_i^T G_i x̃_i + ū_i^T R_i ū_i,  (23)

where x̃_i = x − x̂_i is the estimation error and G_i is a positive definite matrix for each node i. If we divide the integral in (23) into two terms, then

V_i(x̃_i(t)) = ∫_t^{t+Δt} r_i(x̃_i(τ), ū_i(τ)) dτ + V_i(x̃_i(t + Δt)).  (24)

For Δt → 0 sufficiently small, the first integral term is a rectangle of length Δt and width r_i(x̃_i(t), ū_i(t)), which results in

V_i(x̃_i(t)) = Δt r_i(x̃_i(t), ū_i(t)) + V_i(x̃_i(t + Δt)).  (25)

Dividing both sides of (25) by Δt and letting Δt → 0, we get

0 = r_i(x̃_i(t), ū_i(t)) + V̇_i(x̃_i(t)).  (29)

Equation (29) is the continuous-time (CT) Bellman equation [19]. Solving the CT Bellman equation requires full knowledge of the dynamical system model and yields poor generalization. To avoid this, we may write the integral form over a fixed interval T:

V_i(x̃_i(t)) = ∫_t^{t+T} r_i(x̃_i(τ), ū_i(τ)) dτ + V_i(x̃_i(t + T)).  (30)

Using the Kronecker product property, for a quadratic value function we can rewrite (30) as linear in the parameter vector Q̄_i(t) = vec(Q_i(t)) in the following form:

(x̄_i(t) − x̄_i(t + T))^T Q̄_i(t) = ∫_t^{t+T} r_i(x̃_i(τ), ū_i(τ)) dτ,  (31)

where x̄_i(t) is the quadratic polynomial vector containing the pairwise products of the m components of x̃_i(t). Since our target point mass in (1) is a nonlinear dynamical system, the value function V_i of each node requires higher-order nonlinearities. It is possible to approximate the proposed value function by a suitable approximator network in terms of unknown parameters, which can be trained to become the approximate solution of the Riccati equation (13) at the evaluation step. The rationale behind the proposed integral value function is formally discussed in the next subsection.
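The linear-in-parameters step can be verified numerically: for a symmetric kernel Q, the quadratic form x^T Q x equals the inner product of the Kronecker vector x ⊗ x with vec(Q), and the reduced basis of pairwise products has m(m+1)/2 entries. A small sketch with synthetic values:

```python
import numpy as np

def quad_basis(x):
    """Quadratic polynomial vector: pairwise products x_j * x_k, j <= k."""
    m = len(x)
    return np.array([x[j] * x[k] for j in range(m) for k in range(j, m)])

rng = np.random.default_rng(1)
m = 3
x = rng.standard_normal(m)
Qm = rng.standard_normal((m, m))
Qm = (Qm + Qm.T) / 2                   # symmetric value-function kernel

# For symmetric Q, x^T Q x is linear in vec(Q) via the Kronecker identity.
v_quadratic = x @ Qm @ x               # value of the quadratic form
v_kron = np.kron(x, x) @ Qm.flatten()  # same value, linear in vec(Q)
```

This is why each integral Bellman sample yields one linear equation in the unknown parameter vector, so the kernel can be identified from data without the system matrices.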

B. CRITIC NEURAL NETWORK
Given an unknown dynamical system model of the target, a relevant question is the system identification of the plant. A critic neural network (NN) can be exploited for this goal. Interestingly, the value function of each node can be approximated by a critic NN structure as

V_i(x̃_i(t)) ≈ W_i^T Φ_i(x̃_i(t)),  (32)

where W_i ∈ R^h is the vector of unknown parameters (weights) of node i, Φ_i(·) is the vector of activation functions, and h is the number of neurons in the hidden layer of the NN topology. Therefore, we can write (32) as

W_i^T (Φ_i(x̃_i(t)) − Φ_i(x̃_i(t + T))) = ∫_t^{t+T} r_i(x̃_i(τ), ū_i(τ)) dτ.  (33)

After convergence of the value-function parameters, the control policy is computed. This can be accomplished by modifying the value function to V_i(x̃_i(t), ū_i(t)), containing ū_i(t) as an argument, so that ∂V_i(x̃_i(t), ū_i(t))/∂ū_i(t) can be explicitly computed.
From the relation in (33), it is clear that the optimization problem is quadratic and solvable in real time by a recursive least-squares (RLS) technique. Note that the DRLE by itself is not enough to perform distributed state estimation of a target point mass; a distributed computation of p(t) and Z(t) is also required. The proposed approach provides the online generation of an optimal policy from the data measured by each node along the point-mass system trajectories. To facilitate the implementation of the DRLE algorithm, we report the pseudo-code in Algorithm 1.
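A sketch of the RLS update for the critic weights, on synthetic regression data standing in for the basis differences and integral costs of (33); the true weight vector here is invented purely to check convergence.

```python
import numpy as np

rng = np.random.default_rng(2)
h, N = 6, 40                      # number of critic weights, number of samples

# Synthetic regression data: in the DRLE, each sample pairs the basis
# difference Phi(t) - Phi(t+T) with the integral cost over [t, t+T].
W_true = rng.standard_normal(h)                   # unknown critic weights
Phi_diff = rng.standard_normal((N, h))            # basis differences per sample
integral_cost = Phi_diff @ W_true                 # noiseless integral costs

# Recursive least squares: update the weight estimate one sample at a time.
W = np.zeros(h)
P = 1e3 * np.eye(h)                               # inverse-correlation estimate
for phi, d in zip(Phi_diff, integral_cost):
    k = P @ phi / (1.0 + phi @ P @ phi)           # RLS gain
    W = W + k * (d - phi @ W)                     # innovation update
    P = P - np.outer(k, phi) @ P                  # covariance downdate
```

Each node runs such a recursion locally, so the least-squares problem never needs to be re-solved from scratch as new interval samples arrive.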
Based on Fig. 1 and Algorithm 1, each node sends a message of size O(m(m + 1)) to its neighbors. The message consists of the states and inputs of its consensus filters. The proposed DRLE offers a constructive learning methodology for designing a distributed state estimator in a sensor network; it handles nonlinearity even for unknown dynamical system models for which a unique solution for the Riccati matrix parameters in (13) does not exist. Indeed, learning involves numerical procedures that may introduce some challenges. First, a critical aspect is the NN: it is well known that its performance depends on the architecture design, weight analysis, and hyper-parameter tuning. Second, the DRLE converges to the optimal weights by collecting only a modest number of data points (Φ_i(t), Φ_i(t + T), ∫_t^{t+T} r_i(x̃_i(τ), ū_i(τ)) dτ) for each least-squares problem over the fixed time interval T.

IV. SIMULATION RESULTS
We assess the proposed DRLE approach, which does not require knowledge of the system matrix A(x), in the following close-to-real simulation scenario: an agnostic target in a multi-node setup in which each node is a highly coupled nonlinear dynamical system (see Fig. 2). In particular, we show a comparison with the SDRE algorithm [2], [17].

A. AGNOSTIC TARGET
The target point mass considered in this paper is an underactuated mechanical system of degree one, which is also a complex nonlinear system [20]. The proposed VTOL model consists of three mass particles with masses m_1, m_2, and m_3, as shown in Fig. 2, and its dynamics are expressed in the generalized coordinates q = [x_1, y_1, θ].¹ An external thrust vector f_1 is applied to m_1 in the directions of −x_1 and y_1, and f_3 to m_3 in the directions of −x_3 and y_3. For simplicity, we assume that all representative particle masses are equal (i.e., m_k = m for k = 1, . . . , 3). The detailed dynamical system model is introduced in [20]. The objective is to track the state of the target point mass. To proceed, the VTOL dynamical model is used for the nodes i = 1, 2, . . . , 4 and also for the target point mass, with the state vector x = [q_1, q̇_1, q_2, q̇_2, q_3, q̇_3].

¹ The inertial measurement unit (IMU) is installed at the center of the shaft (e.g., x, y, θ). Therefore, the following shift is desirable: x = x_1 + L cos(θ), y = y_1 + L sin(θ).
The network has n = 4 nodes with the topology shown in Fig. 2; the associated graph Laplacian follows from this topology.
The nodes (i.e. VTOLs) of the considered network make noisy measurements of the target VTOL along the x-axis, the y-axis, and the rotational θ-axis.
Moreover, R_vv,1 = 3I_6, R_vv,2 = 1.5I_6, R_vv,3 = 1.5I_6, R_vv,4 = 1.6I_6. The weights G_i and R_i in the value function are set with G_i = diag(120, 120, 180, 120, 200, 5). For the maneuvering of the target point mass, we consider an elaborate trajectory. We approximate the value function in (33) as quadratic in the states and inputs. Therefore, it is sufficient to choose the critic NN basis Φ_i(x) as the quadratic vector (h = 35) in the state and input components given in (36), shown at the bottom of this page. The DRLE time interval is set to T = 1 sec, and the total simulation time is set to T_f = 100 sec. The nonlinear control of the target point mass u(t) is also based on the proposed critic NN with the same activation functions; the resulting closed-form control policy for the dynamics of the target point mass (5) involves the Hadamard product •. The estimates show a cohesive set of VTOLs tracking the trajectory of the target point mass. One can see that convergence occurs after about 10 sec. After that, the nodes' estimates remain very close to the target's trajectories, as required, and a good approximation of the value function evolves. The results show that DRLE can learn the Riccati equation (13) online, without solving the differential equation, using data measured along the target trajectories.
In Fig. 4, we consider the same elaborate trajectory as before, but now we show the performance of the SDRE. The estimates of all nodes under the SDRE methodology are somewhat dispersed. These results also demonstrate the value of DRLE as a reference for judging the usefulness of distributed estimation.

B. EXPERIMENTAL RESULTS
We now compare DRLE to the SDRE method. The aim is to assess the ability of DRLE to provide accurate estimates. We randomize different components of the experiment: the initial condition of the target point mass x_0 ∼ N(0, σ_j² I_6) with σ_j = 2 for j = 1, . . . , 4 and σ_j = 0.5 for j = 5, 6; the initial estimates x̂_i,0 ∼ N(0, σ_j² I_6) with σ_j = 3 for j = 1, . . . , 4 and σ_j = 0.5 for j = 5, 6; the target trajectories; the VTOL masses (m_k = m for k = 1, . . . , 3); and the lever arm L. During the experiment, a perturbation is applied to the VTOL masses, m_k + ε_0, and to the lever arm, L + ε_1, with ε_0 ∼ N(0, 0.2²) and ε_1 ∼ N(0, 0.01²). Table 1 shows, for several simulation tests, a comparison in terms of the standard deviation between the actual state of the target point mass and the filter estimates (DRLE and SDRE) in the network of four VTOLs. As can be seen from Table 1, both SDRE and DRLE provide good and similar estimates over the sensor network of VTOLs. However, the DRLE results show the gain provided by the model-building, neural-based reinforcement learning estimator over the classical SDRE, achieved without knowing the system dynamics.
It is also interesting to observe the evolution of the nodes' estimation error over time, independent of the network topology. To do so, we define the measure e = Σ_{i=1}^n ||e_i||² with e_i = x̂_i − (1/n) Σ_j x̂_j. Fig. 5 shows the comparison of DRLE vs. SDRE. As expected, the DRLE performs significantly better than the SDRE in the asymptotic regime.
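This disagreement measure is straightforward to compute from a snapshot of the node estimates; a minimal sketch with invented numbers:

```python
import numpy as np

def disagreement(x_hat):
    """Network disagreement: e = sum_i ||x_hat_i - mean_j x_hat_j||^2."""
    center = x_hat.mean(axis=0)           # average estimate over the n nodes
    return float(((x_hat - center) ** 2).sum())

# Illustrative snapshot: 4 nodes, 2-dimensional estimates.
x_hat = np.array([[1.0, 0.0],
                  [1.2, 0.1],
                  [0.8, -0.1],
                  [1.0, 0.0]])
e = disagreement(x_hat)
```

Because the measure compares each node only to the network mean, it quantifies consensus among the estimators without reference to the true target state or the graph topology.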
As mentioned before, the SDRE method is an online strategy for solving optimal control problems. However, it is computationally heavier because it requires solving Riccati differential equations at each iteration. Furthermore, the dynamical system model of the target, A(x) and B(x), must be available. Another interesting observation is that the DRLE methodology is less sensitive to perturbations of the physical parameters of the VTOLs in the network compared to the SDRE approach.

C. DISCUSSION
The proposed DRLE algorithm offers a constructive learning methodology for designing a distributed estimator in sensor networks. It enables collaborative tracking even for target dynamics that are unknown or for which no physical model expression exists. In summary, one may consider different factors when choosing between the DRLE and SDRE approaches. The two approaches are comparable in terms of qualitative behavior. However, DRLE has two advantages: i) it does not require the dynamic model to be known, and ii) it is an order of magnitude faster.

V. CONCLUSION
In this paper, we have proposed a novel use of reinforcement learning, the distributed reinforcement learning estimator (DRLE), which solves the continuous-time model-free estimation problem in a sensor network for a dynamic target. The DRLE learns an optimal policy from a neural value function that aims to provide the estimate of a target point mass. The nodes of the network agree on two central consensus sums, realized as high-pass filters in terms of weighted measurements and inverse-covariance matrices, with a critic reinforcement learning mechanism at each node. At the cost of distributed tracking, we gain the additional benefit of the model-free RL setting: each node identifies a complex moving target. The details of the algorithm, from the computational to the communication architecture perspective, were discussed. DRLE was applied to a novel network of underactuated VTOL aircraft with strongly coupled dynamics, accomplishing simultaneous learning and collaborative estimation of the state of the target point mass. Our numerical evaluation shows that the DRLE algorithm produces faster policies than the classical SDRE approach. In future work, it would be interesting to extend the DRLE methodology in the direction of explainable machine learning, in which the learning process is motivated by an information-theoretic methodology.