Sparse Online Learning With Kernels Using Random Features for Estimating Nonlinear Dynamic Graphs

Online topology estimation of graph-connected time series is challenging in practice, particularly because the dependencies between the time series in many real-world scenarios are nonlinear. To address this challenge, we introduce a novel kernel-based algorithm for online graph topology estimation. Our proposed algorithm also performs a Fourier-based random feature approximation to tackle the curse of dimensionality associated with kernel representations. Exploiting the fact that real-world networks often exhibit sparse topologies, we propose a group-Lasso based optimization framework, which is solved using an iterative composite objective mirror descent method, yielding an online algorithm with fixed computational complexity per iteration. We provide theoretical guarantees for our algorithm and prove that it can achieve sublinear dynamic regret under certain reasonable assumptions. In experiments conducted on both real and synthetic data, our method outperforms existing state-of-the-art competitors.

Consider an oil and gas processing platform with multiple wells and separators, which uses hundreds of sensors and actuators to extract raw oil and separate it into oil, water, and gas (Fig. 1 shows a schematic of the processing stages). If an event occurs in a well, its impact will be reflected in the separators after a delay. Similarly, the oil level in separator-2 depends on the pressure, which is controlled by an actuator in separator-3. The data acquired from such a system form a multivariate time series, possibly having many directed, time-lagged interactions, which can be represented using a graph structure. Understanding these dependencies is crucial, as it aids in predicting the near-future evolution of the sensor variables and identifying appropriate control actions. While we use an oil and gas platform scenario for illustration purposes, similar interactions exist in many important networks, such as the brain, the stock market, and smart water networks (SWNs). Henceforth, we use the term topology identification to refer to the estimation of such dependencies.
One significant challenge associated with the aforementioned real-world graph-connected networks is the time-varying nature of the dependencies. However, extensive research in the field of online learning [4], [5] has led to the development of methods that outperform classical batch solutions in terms of both computational complexity and the ability to track changes. These methods can be applied to topology identification to mitigate the problem of time-varying dependencies. For instance, [6] proposes a sparse online solution for topology identification using proximal updates, while [7] introduces a prediction-correction algorithm based on a time-varying convex optimization framework that exhibits an intrinsic temporal regularization of the graph topology.
Real-world systems, such as the one shown in Fig. 1, are not only dynamic but also complex due to the nonlinear nature of their dependencies. In cyber-physical systems (CPSs) such as oil and gas platforms or SWNs, nonlinearity may arise from various sources, including actuator control mechanisms, nonlinear liquid flows (see, e.g., [8]), and saturation of tanks. Similarly, the interactions in stock market networks and in network-structured data from brain imaging techniques, such as electroencephalography (EEG), electrocorticography (ECoG), and positron emission tomography (PET), also exhibit a high level of nonlinearity, with multiple nonlinear effects contributing to the complexity of these systems [9]. Topology estimation based on simple linear models [6], [7] is inadequate for such applications, since many of the inherent nonlinear interactions within the system are discarded.
An effective way to deal with nonlinearity is to invoke kernel machines, which can approximate any nonlinear continuous function if enough training samples are available. For instance, [10] proposes a novel topology identification algorithm based on a nonlinear structural vector autoregressive (SVAR) model using kernels. Deep neural networks (DNNs) are also powerful alternatives to kernels for modelling nonlinear interactions. Nonlinear dependencies are estimated in [11] using a temporal convolutional neural network and an attention mechanism, while [12] uses a vector autoregressive (VAR) model with an invertible neural network approach to capture dependencies, and [13] applies a group-Lasso regularizer to neural weights to obtain sparse nonlinear dependencies. Although the abovementioned kernel- and DNN-based methods are powerful tools for modelling nonlinear dependencies, their batch-based (offline) nature makes them unsuitable for real-time applications that require online topology estimation with every new data sample to track changes in the system. Moreover, such batch-based approaches suffer from a high computational complexity since the algorithm must process the entire data batch at once.
The preceding discussion highlights the need for algorithms that can learn nonlinear and dynamic topologies. Kernels are an ideal choice because they offer the advantage of building interpretable models that can be learned online [14], [15], [16]. In kernel frameworks, the data points are transformed into a function space, where a linear relationship exists between them. However, working in a function space has some limitations in the context of online topology identification. First, standard online convex optimization techniques cannot be readily used, as the dimension of the optimization variables is not fixed and increases with every new data sample. Second, the number of parameters required to express the function grows with the number of data samples, and the computational complexity eventually becomes prohibitive; this is typically known as the curse of dimensionality [17]. In [16], the dimensionality growth is mitigated by discarding past data samples using a forgetting window. However, this approach can lead to suboptimal function learning because it discards data samples without assessing their significance in representing the functions to be learned.
Sparse kernel dictionaries and random feature (RF) approximation are two popular techniques for tackling the curse of dimensionality associated with kernels. However, existing algorithms have limitations for online topology identification of multivariate time series. For instance, the sparse functional stochastic gradient descent (FSGD) method in [18] requires multiple kernel orthogonal matching pursuit (KOMP) sub-iterations, resulting in high computational complexity. Additionally, in a multivariate setting with $N$ time series, the FSGD derivation in [18] results in identical functional dependencies between a time series $n$ and all other time series $n' = 1, 2, \ldots, N$ (as observed in [19]), which prevents distinguishing the different functional dependencies. An alternative approach, presented in [20], involves learning a sparse kernel dictionary based on coherence criteria. However, its convergence guarantees assume static optimal parameters (representing the topology), which is impractical for time-varying systems.
The RF approximation approach not only addresses kernel dimensionality growth but also provides greater mathematical flexibility for modelling and learning the nonlinear interactions among multivariate time series while enabling theoretical analysis. RF approximation was originally proposed in [21], and the idea has recently gained popularity in large-scale machine learning problems [22], [23], [24]. In addition to providing a computational boost on large-scale data sets, RF allows working in fixed, lower-dimensional spaces, making it convenient for online convex optimization routines. RF approximation in kernels can also be used to understand neural networks [25], [26], and researchers have shown an equivalence in function approximation between neural networks and RF approximations [25]. Multiple random Fourier feature sets can also be used to initialize the learning process, keeping the best one to avoid overfitting [27], [28].
In this work, we propose a kernel-based online nonlinear topology identification algorithm using RF approximation. We assume that the dependencies of the system can be modelled using a nonlinear, additive, sparse model. The sparsity assumption is motivated by the fact that real-world systems are often sparse due to dominant local interactions, and it helps to avoid overfitting during learning. The algorithm estimates nonlinear sparse topologies in an online manner at each time instant using a proximal optimization technique called composite objective mirror descent (COMID), and features incremental updates to the model upon the arrival of new data samples, making it suitable for applications characterized by topology drifts [29], [30]. Through theoretical guarantees based on dynamic regret analysis, along with numerical evidence, we show the effectiveness of our algorithm in tracking changes in topology.
The main contributions of this work are listed below: i) We propose an online nonlinear topology estimation algorithm with fixed computational complexity per iteration, called random feature-based nonlinear topology identification via recursive sparse online learning (RFNL-TIRSO). This algorithm differs significantly from our previous work in [31] by using a running average loss inspired by the recursive least squares (RLS) formulation instead of an instantaneous loss function, which is susceptible to noise and converges slowly. Compared with [31], RFNL-TIRSO achieves significantly faster convergence and improved robustness to input noise. ii) We provide theoretical guarantees for the convergence of RFNL-TIRSO, which were lacking in [31]. The article derives an upper bound on the dynamic regret of RFNL-TIRSO based on the strong convexity of the RLS loss function. Dynamic regret characterizes the tracking capability of an online algorithm [32], and we achieve a sublinear dynamic regret under certain reasonable assumptions applicable to real-world applications.
Our dynamic regret analysis encompasses three key elements: an online kernel-based nonlinear algorithm, a non-differentiable objective function, and a model with multiple decoupled functions for interpretable topology identification. Existing related works [33], [34], [35], [36], [37], [38], [39] do not cover all three of these elements. iii) The performance of the proposed algorithm is tested in extensive experiments using both real and synthetic data. The algorithm estimates interpretable topologies using time series data from sensors in an oil and gas plant. Furthermore, the algorithm demonstrates its effectiveness in detecting epileptic seizure events using EEG signals.

The remainder of the article is organized as follows: Section II presents the system model, kernel formulation, and RF approximation. In Section III, we develop the RFNL-TIRSO algorithm. Theoretical analysis of RFNL-TIRSO is performed in Section IV, followed by the numerical results in Section V. Finally, Section VI concludes the article.
Notations: Bold lowercase and uppercase letters denote column vectors and matrices, respectively. The operators $\nabla$, $(\cdot)^{\top}$, $\mathbb{E}$, $\Lambda_{\max}(\cdot)$, $\Lambda_{\min}(\cdot)$, and $\langle \cdot, \cdot \rangle$ respectively denote the gradient, transpose, expectation, maximum eigenvalue, minimum eigenvalue, and inner product. The symbols $\mathbf{1}_N$ and $I_N$ represent the all-one vector of dimension $N$ and the identity matrix of dimension $N \times N$, respectively.

A. System Model
Consider a collection of $N$ sensors (nodes) generating a multivariate time series denoted by $y[t] \in \mathbb{R}^N$, where $t = 0, 1, \ldots, T-1$ denotes the time index. We assume that the dynamics of the sensor network can be captured by a $P$-th order VAR model with additive nonlinear functional dependencies:

$y_n[t] = \sum_{n'=1}^{N} \sum_{p=1}^{P} f^{(p)}_{n,n'}\big(y_{n'}[t-p]\big) + u_n[t],$   (1)

where $y_n[t]$ is the value of the time series observed at node $n$ at time $t$, $f^{(p)}_{n,n'}$ is a nonlinear function that captures the influence of the $p$-lagged data point of node $n'$ on node $n$, and $u_n[t]$ is the process noise, which is assumed to be a zero-mean i.i.d. random process. With respect to model (1), we define topology identification as the estimation of the functional dependencies $\{f^{(p)}_{n,n'}(\cdot)\}_{p=1}^{P}$, $\forall n, n'$, from the observed time series.
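To make model (1) concrete, the following minimal sketch simulates a sparse additive nonlinear VAR process. The choice of nonlinearity (tanh), the edge-weight scale, and the noise level are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def simulate_nonlinear_var(N=10, P=4, T=2000, edge_prob=0.3, g=np.tanh, seed=0):
    """Simulate y[t] from the additive nonlinear VAR model (1):
    y_n[t] = sum_{n',p} w[n, n', p] * g(y_{n'}[t - p]) + u_n[t]."""
    rng = np.random.default_rng(seed)
    # Sparse binary adjacency: b[n, n', p] = 1 iff node n' influences node n at lag p.
    b = (rng.random((N, N, P)) < edge_prob).astype(float)
    w = 0.4 * rng.standard_normal((N, N, P)) * b  # weights on the active edges only
    y = np.zeros((T, N))
    y[:P] = rng.standard_normal((P, N))           # random initialization of the first P samples
    for t in range(P, T):
        for p in range(1, P + 1):
            y[t] += w[:, :, p - 1] @ g(y[t - p])  # additive nonlinear lag-p contribution
        y[t] += 0.1 * rng.standard_normal(N)      # process noise u[t]
    return y, b
```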

B. Kernel Representation
Assume that the functions $f^{(p)}_{n,n'}$ in (1) belong to a reproducing kernel Hilbert space (RKHS):

$\mathcal{H}^{(p)}_{n} := \big\{ f^{(p)}_{n,n'} \,\big|\, f^{(p)}_{n,n'}(y) = \sum_{t=0}^{\infty} \beta^{(p)}_{n,n',t}\, \kappa^{(p)}_{n}\big(y, y_{n'}[t-p]\big) \big\},$   (2)

where $\kappa^{(p)}_{n}: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is a positive definite kernel, which characterizes the RKHS. The kernel is a function measuring the similarity between the data points $y$ and $y_{n'}[t-p]$. The expression (2) follows from the fact that any function in an RKHS can be expressed as an infinite combination of kernel evaluations [40]. The inner product in this space satisfies the reproducing property $\langle \kappa(\cdot, x_1), \kappa(\cdot, x_2) \rangle_{\mathcal{H}} = \kappa(x_1, x_2)$. Such a Hilbert space with reproducing kernels is termed an RKHS, and the inner product described above induces the RKHS norm $\|f\|_{\mathcal{H}}$. For further reading on RKHS, we refer to [41].
The required functions $\{f^{(p)}_{n,n'}\}_{n',p}$ at a particular node $n$ can be obtained by solving the following non-parametric optimization problem in batch form, considering all the samples at once:

$\{\hat f^{(p)}_{n,n'}\} = \arg\min_{\{f^{(p)}_{n,n'} \in \mathcal{H}\}} \; \frac{1}{2} \sum_{t=P}^{T-1} \Big[ y_n[t] - \sum_{n'=1}^{N} \sum_{p=1}^{P} f^{(p)}_{n,n'}\big(y_{n'}[t-p]\big) \Big]^2 + \lambda \sum_{n'=1}^{N} \sum_{p=1}^{P} \Omega\big( \| f^{(p)}_{n,n'} \|_{\mathcal{H}} \big).$   (3)

For a non-decreasing function $\Omega$, the solution of (3), denoted $\{\hat f^{(p)}_{n,n'}\}_{n',p}$, can be obtained in terms of a finite number of kernel evaluations by invoking the Representer Theorem [42]:

$\hat f^{(p)}_{n,n'}(y) = \sum_{\tau=P}^{T-1} \beta^{(p)}_{n,n',\tau}\, \kappa^{(p)}_{n}\big(y, y_{n'}[\tau-p]\big).$   (4)

Although the solution (4) entails only a finite number (proportional to $T$) of kernel evaluations, its computational complexity becomes prohibitively high for large $T$. This is a major drawback of kernel formulations, commonly referred to as the curse of dimensionality. In alignment with [14], [24], we use the RF approximation to overcome it.
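The scaling issue can be seen in a minimal kernel ridge sketch, a simplified stand-in for (3) that drops the group penalty and uses $\Omega(z) = z^2$: the Gram matrix is $T \times T$, so memory is $O(T^2)$ and the solve is $O(T^3)$.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel evaluations, broadcast over arrays."""
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

def kernel_ridge_fit(x_train, targets, lam=0.1, sigma=1.0):
    """Batch kernel solution beta = (K + lam*I)^{-1} y, one coefficient per
    training sample as in (4). K is T x T, so the cost grows as O(T^3)."""
    K = gaussian_kernel(x_train[:, None], x_train[None, :], sigma)
    return np.linalg.solve(K + lam * np.eye(len(x_train)), targets)
```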

C. RF Approximation
As observed in Section II-B, the RKHS is characterized by an inner product. Resorting to the theory of RF approximation, the inner product can be expressed in a random Fourier space, which facilitates the approximation of an RKHS function to a function in a fixed low dimensional space, thereby preventing the dimensionality growth. By working in this fixed low-dimensional space, we can leverage standard convex optimization tools for solving the topology identification problem.
The RF approximation requires the kernel defining the RKHS to be shift-invariant, i.e., $\kappa^{(p)}_{n}(y, y') = \kappa^{(p)}_{n}(y - y')$. Many popular kernels are shift-invariant, such as the Laplacian, the Cauchy, and the Gaussian kernels. By Bochner's Theorem [43], every shift-invariant kernel can be expressed as the inverse Fourier transform of a probability density function. Following this theorem, the kernel evaluation can be expressed as

$\kappa^{(p)}_{n}(y - y') = \int \pi_{\kappa^{(p)}_{n}}(v)\, e^{jv(y - y')}\, dv = \mathbb{E}_{v}\big[ e^{jv(y - y')} \big],$   (5)
where $\mathbb{E}$ is the expectation operator, $\pi_{\kappa^{(p)}_{n}}(v)$ is the probability density function corresponding to the kernel under consideration, and $v$ is the random variable associated with that density. By using an adequate number $D$ of i.i.d. samples $\{v_i\}_{i=1}^{D}$, we can approximate the expectation in (5) by a sample mean (weak law of large numbers):

$\hat\kappa^{(p)}_{n}(y - y') = \frac{1}{D} \sum_{i=1}^{D} e^{jv_i(y - y')},$   (6)

irrespective of the distribution $\pi_{\kappa^{(p)}_{n}}(v)$. Notice that (6) is an unbiased estimator of the kernel evaluation in (5) [44]. In general, finding the probability distribution that is the inverse Fourier transform of a kernel is a challenging task. However, for a Gaussian kernel with variance $\sigma^2$, the Fourier transform is also Gaussian, with variance $\sigma^{-2}$. Hence, in this work, we restrict our choice of kernel to Gaussian kernels. Further, the real part of (6) is also an unbiased estimator of the kernel evaluation [22], and (5) can be expressed in vector form using only the real components as

$\hat\kappa^{(p)}_{n}(y, y') = z_{v}(y)^{\top} z_{v}(y'),$   (7)

where

$z_{v}(y) := \frac{1}{\sqrt{D}} \big[ \sin(v_1 y), \ldots, \sin(v_D y), \cos(v_1 y), \ldots, \cos(v_D y) \big]^{\top} \in \mathbb{R}^{2D}.$

Substituting (7) in (4) yields an approximation of the function $f^{(p)}_{n,n'}$ in a fixed dimension ($2D$):

$\hat f^{(p)}_{n,n'}(y) = \big(\alpha^{(p)}_{n,n'}\big)^{\top} z_{v}(y),$   (9)

where $\alpha^{(p)}_{n,n'} \in \mathbb{R}^{2D}$ collects the RF coefficients of the function.

The functional optimization (3) can now be reformulated as a parametric optimization problem using (9). First, we define the parametric form of the loss function in (3):

$l_n\big(\{\alpha^{(p)}_{n,n'}\}, t\big) = \frac{1}{2} \Big[ y_n[t] - \sum_{n'=1}^{N} \sum_{p=1}^{P} \big(\alpha^{(p)}_{n,n'}\big)^{\top} z_{v}\big(y_{n'}[t-p]\big) \Big]^2.$

For convenience, the variables $\{\alpha^{(p)}_{n,n'}\}$ and the RF vectors $\{z_{v}(y_{n'}[t-p])\}$ are stacked in the lexicographic order of the indices $p$, $n'$, and $d$ to obtain the vectors $\alpha_n \in \mathbb{R}^{2PND}$ and $z_{v}(t) \in \mathbb{R}^{2PND}$, respectively, and the loss function can be compactly rewritten as

$l_n(\alpha_n, t) = \frac{1}{2} \big[ y_n[t] - \alpha_n^{\top} z_{v}(t) \big]^2.$   (13)

Following [24], the original regularization term in (3) can be converted to an equivalent parametric form:

$\lambda \sum_{n'=1}^{N} \sum_{p=1}^{P} \Omega\big( \| \alpha^{(p)}_{n,n'} \|_2 \big).$   (14)

The function $\Omega$ in (14) is chosen as $\Omega(\cdot) = |\cdot|$, where $|\cdot|$ denotes the absolute value, in order to promote the group sparsity of $\alpha^{(p)}_{n,n'}$ [10]. Such regularizers are typically known as group-Lasso regularizers (see Fig. 2 for a visual representation of the Lasso groups). Note that $|\cdot|$ is non-decreasing, thereby satisfying the regularization criterion required to apply the Representer Theorem. Using (13) and (14), a parametric form of (3) can be constructed as follows:

$\hat\alpha_n = \arg\min_{\alpha_n} \; \sum_{t=P}^{T-1} l_n(\alpha_n, t) + \lambda \sum_{n'=1}^{N} \sum_{p=1}^{P} \| \alpha^{(p)}_{n,n'} \|_2.$   (15)

Although the topology can be estimated by solving (15), this approach has several drawbacks since it is a batch formulation: (15) requires the entire batch of time series samples $y_n[t]$, $t = 0, 1, \ldots, T-1$, from all the nodes. This batch formulation is not suitable for streaming data, where the data arrive sequentially and real-time tracking of time-varying topologies is required. Furthermore, the computational complexity of batch optimization can be prohibitively high, especially for large batch sizes. Motivated by these factors, we propose an online topology estimation strategy in the following section, which addresses these limitations with a lower computational complexity.
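A minimal sketch of the real-valued RF map for a one-dimensional Gaussian kernel: the inner product $z(x)^{\top} z(y)$ converges to $\kappa(x, y)$ as $D$ grows, since $\sin(vx)\sin(vy) + \cos(vx)\cos(vy) = \cos(v(x - y))$.

```python
import numpy as np

def rf_map(x, V):
    """Real random Fourier features for a scalar input x:
    z(x) = (1/sqrt(D)) [sin(v_1 x), ..., sin(v_D x), cos(v_1 x), ..., cos(v_D x)]."""
    D = len(V)
    return np.concatenate([np.sin(V * x), np.cos(V * x)]) / np.sqrt(D)

D, sigma = 200, 1.0
rng = np.random.default_rng(0)
V = rng.normal(0.0, 1.0 / sigma, D)  # v ~ N(0, sigma^{-2}): Fourier dual of the Gaussian kernel

x, y = 0.3, -0.8
approx = rf_map(x, V) @ rf_map(y, V)              # z(x)^T z(y)
exact = np.exp(-(x - y) ** 2 / (2 * sigma ** 2))  # kappa(x, y)
print(f"RF approximation: {approx:.4f}, exact kernel: {exact:.4f}")
```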

III. ONLINE LEARNING
To formulate an online optimization framework, we replace the batch loss function in (15) with a stochastic (instantaneous) loss function:

$\ell^{\,n}_t(\alpha_n) = \frac{1}{2} \big[ y_n[t] - \alpha_n^{\top} z_{v}(t) \big]^2.$   (16)

The loss function $\ell^{\,n}_t(\alpha_n)$ in (16) is analogous to a least mean squares (LMS) formulation. However, LMS estimates are prone to observation noise and can be unstable in practice. To overcome this problem, we reformulate (16) in a recursive least squares (RLS) sense, which provides the necessary stability in addition to faster convergence:

$\tilde\ell^{\,n}_t(\alpha_n) = \mu \sum_{\tau=P}^{t} \gamma^{\,t-\tau}\, \ell^{\,n}_\tau(\alpha_n).$   (17)

In (17), we replace the instantaneous loss with a running average loss using an exponential window. The parameter $\gamma \in (0, 1)$ is the forgetting factor of the window, and $\mu = 1 - \gamma$ is set to normalize the exponential weighting window. The RLS loss function can be expanded as

$\tilde\ell^{\,n}_t(\alpha_n) = \frac{1}{2}\, \alpha_n^{\top} \Phi[t]\, \alpha_n - r_n[t]^{\top} \alpha_n + \text{const},$

where

$\Phi[t] := \mu \sum_{\tau=P}^{t} \gamma^{\,t-\tau} z_{v}(\tau)\, z_{v}(\tau)^{\top}, \qquad r_n[t] := \mu \sum_{\tau=P}^{t} \gamma^{\,t-\tau} y_n[\tau]\, z_{v}(\tau).$

As in a typical RLS formulation, these quantities can be updated recursively as

$\Phi[t] = \gamma\, \Phi[t-1] + \mu\, z_{v}(t)\, z_{v}(t)^{\top},$   (20)
$r_n[t] = \gamma\, r_n[t-1] + \mu\, y_n[t]\, z_{v}(t).$   (21)

The gradient of the loss function is then

$\nabla \tilde\ell^{\,n}_t(\alpha_n) = \Phi[t]\, \alpha_n - r_n[t].$   (22)

Finally, using the RLS loss function, the topology can be estimated by solving

$\arg\min_{\alpha_n} \; \tilde\ell^{\,n}_t(\alpha_n) + \lambda \sum_{n'=1}^{N} \sum_{p=1}^{P} \| \alpha^{(p)}_{n,n'} \|_2.$   (23)

The cost function in (23) consists of a differentiable loss function and a non-differentiable group-Lasso regularizer. To solve (23) online, we could use methods such as online subgradient descent (OSGD) or mirror descent (MD), which linearize the entire objective function using a subgradient. However, linearizing the group-Lasso regularizer compromises its ability to induce sparsity, resulting in non-sparse estimates. To overcome this, we employ an alternative optimization technique: a modified version of the MD algorithm known as composite objective mirror descent (COMID) [45]. In COMID, the differentiable part of the objective function is linearized while the regularizer is kept intact, preserving the sparsity-inducing property. The online COMID update can be written as

$\alpha_n[t+1] = \arg\min_{\alpha_n} \; \big\langle \nabla \tilde\ell^{\,n}_t(\alpha_n[t]),\, \alpha_n - \alpha_n[t] \big\rangle + \frac{1}{a_t}\, B_{\psi}\big(\alpha_n, \alpha_n[t]\big) + \lambda \sum_{n'=1}^{N} \sum_{p=1}^{P} \| \alpha^{(p)}_{n,n'} \|_2,$   (24)

whose objective (25) consists of three parts: (i) the gradient of the loss function, given by (22); (ii) a Bregman divergence term with $a_t$ as the step size; and (iii) the sparsity-enforcing group-Lasso regularizer. The Bregman divergence [46] improves the stability of online algorithms by constraining the new estimate $\alpha_n[t+1]$ to lie within the proximity of the previous estimate $\alpha_n[t]$.

The Bregman function $\psi$ is selected such that the optimization problem (24) has a closed-form solution [46]; here we take $\psi(\cdot) = \frac{1}{2}\|\cdot\|_2^2$, for which $B_{\psi}$ reduces to the squared Euclidean distance. For notational convenience, we denote the gradient in (25) by $g_n[t] := \nabla \tilde\ell^{\,n}_t(\alpha_n[t])$. Expanding the objective function in (25) and omitting the constants leads to

$\alpha_n[t+1] = \arg\min_{\alpha_n} \; \frac{1}{2} \big\| \alpha_n - \big( \alpha_n[t] - a_t\, g_n[t] \big) \big\|_2^2 + a_t \lambda \sum_{n'=1}^{N} \sum_{p=1}^{P} \| \alpha^{(p)}_{n,n'} \|_2.$   (27)

A closed-form solution of (24) using (27) is obtained via the multidimensional shrinkage-thresholding operator [47]:

$\alpha^{(p)}_{n,n'}[t+1] = v^{(p)}_{n,n'}\, \max\Big( 0,\; 1 - \frac{a_t \lambda}{\| v^{(p)}_{n,n'} \|_2} \Big),$   (28)

where $v^{(p)}_{n,n'}$ denotes the sub-vector of $v_n := \alpha_n[t] - a_t\, g_n[t]$ corresponding to the $(n', p)$-th group.
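The following minimal sketch puts (20)-(22) and (28) together into one online iteration for node $n$, assuming the Euclidean Bregman choice $\psi(\cdot) = \frac{1}{2}\|\cdot\|_2^2$; the variable names and default hyperparameters are illustrative, not the paper's.

```python
import numpy as np

def rfnl_tirso_step(Phi, r, alpha, z, y_n, group_size, gamma=0.98, lam=0.01, step=0.1):
    """One RFNL-TIRSO-style update for node n.
    Phi:   (2PND x 2PND) exponentially weighted covariance of stacked RF vectors
    r:     (2PND,) exponentially weighted cross-correlation with y_n
    alpha: (2PND,) current RF coefficients, in blocks of group_size = 2D per (n', p) pair."""
    mu = 1.0 - gamma
    Phi = gamma * Phi + mu * np.outer(z, z)   # recursive update of Phi[t], cf. (20)
    r = gamma * r + mu * y_n * z              # recursive update of r_n[t], cf. (21)
    grad = Phi @ alpha - r                    # gradient of the RLS loss, cf. (22)
    v = alpha - step * grad                   # gradient step under the Euclidean Bregman
    # Multidimensional shrinkage-thresholding, cf. (28): shrink each group toward zero.
    alpha_new = np.zeros_like(alpha)
    for s in range(0, len(alpha), group_size):
        block = v[s:s + group_size]
        norm = np.linalg.norm(block)
        if norm > 0.0:
            alpha_new[s:s + group_size] = block * max(0.0, 1.0 - step * lam / norm)
    return Phi, r, alpha_new
```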

IV. THEORETICAL RESULTS
In this section, we present the performance analysis and convergence guarantees of RFNL-TIRSO using dynamic regret analysis. Regret is a popular metric for measuring the performance of an online algorithm [48]. Although regret was originally developed for static learning problems, numerous online algorithms involving dynamic regret analysis have been developed [33], [34], [35], [36] to solve problems in dynamic environments; however, all of them belong to the class of linear algorithms. Additionally, [33], [34], [35] assume differentiable objective functions, which is not applicable to RFNL-TIRSO. Dynamic regret bounds for nonlinear algorithms have been proposed in [37], [38], [39]. In [37], the problem under consideration is limited to positive functions, whereas our problem formulation has no such limitation. The regret analysis presented in [38] differs significantly from ours for several reasons. First, the objective function in [38] must be differentiable, whereas in our method the regularizer is non-differentiable. Second, unlike [38], our regret analysis involves multiple decoupled functions that represent interpretable topological connections. Although [39] provides a logarithmic regret bound using second-order information, the objective function in their analysis is also differentiable.
Our theoretical analysis is based on the following assumptions:
A1) The time series are bounded, i.e., there exists $B_y < \infty$ such that $|y_n[t]| \le B_y$ for all $n$ and $t$.
A2) The kernels are shift-invariant and bounded.
A3) There exists $\rho_l > 0$ such that $\Lambda_{\min}(\Phi[t]) \ge \rho_l$ for all $t$.
A4) There exists $L < \infty$ such that $\Lambda_{\max}(\Phi[t]) \le L$ for all $t$, where $\Lambda_{\max}(\cdot)$ denotes the maximum eigenvalue.

A1 is reasonable in practice, as signals from real-world applications are bounded. A2 holds for typical kernels such as the Gaussian and the Laplacian. Since $\Phi[t]$ is a sum of rank-one matrices formed from feature vectors, A3 holds as long as the feature vectors are linearly independent, which is a reasonable assumption in practice when a sufficient amount of data is available. Note that A3 underpins the strong convexity of the loss function, which is used in the sequel. A4 follows by combining A1 with the fact that the sum of the eigenvalues of $\Phi[t]$ equals its trace.

A. Dynamic Regret Analysis
As a preliminary step to the regret analysis, we define the optimum RKHS and RF coefficients.
Optimum RKHS coefficients: Using the batch-form solution (4), obtained from the Representer Theorem, a parametric autoregressive representation at time $t$ can be written as

$\hat y_n[t] = \bar\beta_n^{\top} \kappa_t,$   (29)

where $\bar\beta_n \in \mathbb{R}^{NPt}$ and $\kappa_t \in \mathbb{R}^{NPt}$ are obtained by stacking the variables $\bar\beta^{(p)}_{n,n',(\tau-p)}$ and the kernel evaluations in (4), respectively, in the lexicographic order of the indices $n'$, $p$, and the time index up to $t$. The optimum RKHS coefficients $\beta^*_n[t]$ for each node $n$ at time $t$ are obtained by solving

$\beta^*_n[t] = \arg\min_{\bar\beta_n} \; h^n_t(\bar\beta_n, \kappa_t),$   (30)

where the cost function $h^n_t(\bar\beta_n, \kappa_t)$ is composed of two terms:

$h^n_t(\bar\beta_n, \kappa_t) = \tilde\ell^{\,n}_t(\bar\beta_n) + \omega_n(\bar\beta_n),$   (31)

in which the first term is the RLS loss function and the second term is the group-Lasso regularizer $\omega_n(\bar\beta_n) = \lambda \sum_{n',p} \| \bar\beta^{(p)}_{n,n'} \|_2$.

Optimum RF coefficients: Following the same procedure, we define the optimum RF coefficients

$\alpha^*_n[t] = \arg\min_{\alpha_n} \; h^n_t(\alpha_n, z_{v}(t)),$

where $h^n_t(\alpha_n, z_{v}(t)) = \tilde\ell^{\,n}_t(\alpha_n) + \omega_n(\alpha_n)$, $\tilde\ell^{\,n}_t(\cdot)$ is the RLS loss defined in (17), and $\omega_n(\alpha_n) = \lambda \sum_{n',p} \| \alpha^{(p)}_{n,n'} \|_2$. Note that the optimum RF coefficients $\alpha^*_n[t]$ differ from the RFNL-TIRSO estimate $\alpha_n[t]$ obtained by the computationally light COMID algorithm, since RFNL-TIRSO makes only one COMID update per time instant.
Dynamic Regret: Dynamic regret is defined as the cumulative sum of the differences between the estimated cost and the optimal cost over all time instants. In our framework, it can be expressed as

$R_n[T] = \sum_{t=P}^{T} \Big[ h^n_t\big(\alpha_n[t], z_{v}(t)\big) - h^n_t\big(\beta^*_n[t], \kappa_t\big) \Big].$

Our aim is to find a theoretical bound for $R_n[T]$. Since our online algorithm works in the RF space, we conduct the regret analysis in relation to the optimal cost function in the RF space, i.e., $h^n_t(\alpha^*_n[t], z_{v}(t))$. This choice is not arbitrary: there exists a one-to-one mapping between the two spaces, ensuring no loss of generality. Adding and subtracting $h^n_t(\alpha^*_n[t], z_{v}(t))$ yields

$R_n[T] = R^{\mathrm{rf}}_n[T] + \xi_n[T],$   (33)

where $R^{\mathrm{rf}}_n[T] = \sum_{t=P}^{T} \big[ h^n_t(\alpha_n[t], z_{v}(t)) - h^n_t(\alpha^*_n[t], z_{v}(t)) \big]$ is the regret with respect to the optimal cost in the RF space and $\xi_n[T] = \sum_{t=P}^{T} \big[ h^n_t(\alpha^*_n[t], z_{v}(t)) - h^n_t(\beta^*_n[t], \kappa_t) \big]$ is the cumulative RF approximation error caused by the dimensionality reduction.
1) Bounding the Regret W.r.t. the Optimal Cost Function in RF Space: Theorem 1 bounds $R^{\mathrm{rf}}_n[T]$.

Theorem 1: Under assumptions A1, A3, and A4, and letting $a_t = \frac{1}{L}$, the dynamic regret of RFNL-TIRSO (Algorithm 1) w.r.t. the optimal cost function in the RF space satisfies

$R^{\mathrm{rf}}_n[T] \le C_0 + C_1\, W_n(T),$

where $W_n(T) := \sum_{t} \| \alpha^*_n[t+1] - \alpha^*_n[t] \|_2$ denotes the cumulative variation (path length) of the optimal RF coefficients and $C_0, C_1 < \infty$ are constants independent of $T$.

Proof: See Appendix A.

From Theorem 1, it can be readily seen that if $W_n(T)$ is sublinear, then the regret is also sublinear.
2) Bounding the Cumulative RF Approximation Error: Theorem 2 provides a bound for $\xi_n[T]$.
Theorem 2: Under assumptions A1 and A2, there exists $\epsilon \ge 0$ such that the cumulative approximation error $\xi_n[T]$ of RFNL-TIRSO (Algorithm 1) satisfies

$\xi_n[T] \le \epsilon\, L_h\, C\, T,$

where $L_h > 0$ is the Lipschitz continuity parameter of the cost function and $C$ is a constant.
Proof: See Appendix B. Finally, we bound the dynamic regret $R_n[T]$ using Theorems 1 and 2.

Theorem 3: Under assumptions A1-A4 and with $a_t = \frac{1}{L}$, the dynamic regret of RFNL-TIRSO satisfies

$R_n[T] \le C_0 + C_1\, W_n(T) + \epsilon\, L_h\, C\, T,$

where $C_0$ and $C_1$ are the constants in Theorem 1.

Proof: Theorem 3 follows directly by substituting Theorems 1 and 2 into (33).
It is important to note that by setting $\epsilon = O(\frac{1}{\sqrt{T}})$, we can achieve a dynamic regret of $O(W_n(T) + \sqrt{T})$. In such cases, if $W_n(T)$ is sublinear, the dynamic regret becomes sublinear as well. Ideally, an online algorithm should aim for a sublinear dynamic regret, indicating that $R_n[T]/T \to 0$ as $T \to \infty$, or, in the worst case, a linear regret, which implies that $R_n[T]/T$ tends to a constant known as the steady-state error. In our case, when $W_n(T)$ is sublinear, the steady-state error is $\epsilon L_h C$. By choosing a small value of $\epsilon$, we can ensure a small steady-state error. Appendix B demonstrates that $\epsilon$ can be made sufficiently small by increasing the number of random features $D$, while considering the trade-off with complexity [21].

V. EXPERIMENTAL RESULTS
In this section, we evaluate the performance of RFNL-TIRSO through extensive numerical experiments. We compare it with three state-of-the-art competitors: TIRSO [6], RFNL-TISO [31], and PDIS [20], [49]. TIRSO is a linear online topology algorithm, chosen to highlight the advantages of RFNL-TIRSO, which is a nonlinear algorithm. RFNL-TISO is another online nonlinear topology estimation algorithm that uses an instantaneous least mean squares loss function. However, RFNL-TIRSO is expected to outperform RFNL-TISO due to its utilization of an RLS-based loss function, as discussed in Section III. The third algorithm, PDIS [20], [49], is a recent online nonlinear topology identification algorithm that uses dictionaries of kernel functions with partial-derivative-imposed sparsity. To the best of our knowledge, these three algorithms serve as the best benchmarks for comparing the performance of RFNL-TIRSO. While other batch-based algorithms are available [10], [12], [13], they are not directly comparable to our algorithm as they operate offline. The experiments in this section involve both synthetic and real data sets. The synthetic data set consists of graph-connected time series data generated with different topology transition patterns to highlight the ability of the algorithms to track time-varying topologies. The real data sets used include (i) time series data collected from Lundin's offshore oil and gas platform (https://www.lundin-energy.com/) and (ii) epileptic seizure data [50].

A. Experiments Using Synthetic Data Sets

1) Piecewise Stationary Topology:
We generate a multivariate time series using the nonlinear VAR model (1) with $N = 10$ and $P = 4$. The nonlinear function in (1) is taken as $f^{(p)}_{n,n'}(\cdot) = b^{(p)}_{n,n'}\, g(\cdot)$, where the binary coefficient $b^{(p)}_{n,n'} = 0/1$ means that the $p$-th time-lagged dependency between $n'$ and $n$ is disabled/enabled. A graph-connected time series is generated by restricting the number of active edges to 30% of the total available edges. Further, we introduce abrupt changes in the topology after every 1000 time steps by randomly cutting off 30% of the active edges. In the experiments, the initial $P$ data samples are generated randomly, and the remaining data are generated using model (1). The hyperparameters of all the algorithms are tuned heuristically to maximize the area under the receiver operating characteristic curve, which is explained below. The hyperparameter settings for RFNL-TIRSO are $(\sigma_n, \lambda, a_t) = (2.5, 0.01, 0.1/\Lambda_{\max}(\Phi[t]))$ for $g_1$ and $g_2$, and $(1, 0.01, 0.1/\Lambda_{\max}(\Phi[t]))$ for $g_3$. It can be observed from Fig. 3 that the estimates of RFNL-TIRSO are very close to the ground truth, outperforming the other algorithms.
A numerical comparison of the algorithms is made using the probability of false alarm ($P_{FA}$) and the probability of detection ($P_D$). The probability of false alarm refers to the probability that the algorithm reports a dependency that is not actually present in the network, whereas the probability of detection refers to the probability that the algorithm detects a dependency that is truly present. In our experiment, we declare a detected edge for the $p$-th time-lagged dependency from node $n'$ to node $n$ whenever the normalized value of $\| \alpha^{(p)}_{n,n'}[t] \|_2$ is greater than a threshold $\delta \in [0, 1]$, and define

$P_D = \frac{\sum_{n,n',p} \mathbb{1}\{\|\alpha^{(p)}_{n,n'}[t]\|_2 > \delta\}\, b^{(p)}_{n,n'}}{\sum_{n,n',p} b^{(p)}_{n,n'}}, \qquad P_{FA} = \frac{\sum_{n,n',p} \mathbb{1}\{\|\alpha^{(p)}_{n,n'}[t]\|_2 > \delta\}\, \big(1 - b^{(p)}_{n,n'}\big)}{\sum_{n,n',p} \big(1 - b^{(p)}_{n,n'}\big)},$   (34)

where $\mathbb{1}\{x\} = 1/0$ if $x$ is true/false. From (34), it is clear that when $\delta = 0$, both $P_D$ and $P_{FA}$ equal one. As $\delta$ increases, both $P_D$ and $P_{FA}$ decrease, eventually reaching zero when $\delta$ equals one.
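A minimal sketch of how (34) can be evaluated by thresholding the estimated edge strengths against the ground-truth adjacency (array names hypothetical); sweeping $\delta$ from 0 to 1 traces the ROC curve discussed next.

```python
import numpy as np

def detection_rates(strength, truth, delta):
    """strength: normalized edge strengths ||alpha^{(p)}_{n,n'}[t]||_2, values in [0, 1]
    truth:    binary ground-truth adjacency of the same shape
    Returns (P_D, P_FA) for a given threshold delta, cf. (34)."""
    detected = strength > delta
    p_d = detected[truth == 1].mean()   # fraction of true edges that are detected
    p_fa = detected[truth == 0].mean()  # fraction of absent edges falsely reported
    return p_d, p_fa
```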
The receiver operating characteristic (ROC) curves of the different algorithms at time $t = 2990$ are plotted in Fig. 4 by varying $\delta$ from 0 to 1, with $P_{FA}$ on the x-axis and $P_D$ on the y-axis. The area under the ROC curve (AUC) is computed to evaluate performance. A topology identification algorithm with a high AUC value is characterized by a high $P_D$ and a low $P_{FA}$, indicating that it can accurately identify network topologies while minimizing the occurrence of false positives. From Fig. 4, it can be observed that the AUC of RFNL-TIRSO is substantially better than that of TIRSO and slightly better than that of RFNL-TISO for all three nonlinearity functions. These observations are more evident from Table I, where the computed AUC values are tabulated. We further analyze the AUC of RFNL-TIRSO for different RF space dimensions, i.e., $D \in \{20, 30, 50\}$, at different time instants in Table II, for the nonlinear function $g(x) = g_1(x)$. As expected, the AUC increases with $D$ and with the number of data samples. A similar AUC trend to that in Table II was obtained for the other two nonlinear functions, $g_2$ and $g_3$.
2) Lorenz Graph: We also conduct experiments using synthetic data generated from the Lorenz graph [51]. We consider a discretized version of the classic Lorenz system involving 3 time series that exhibit the following nonlinear dependencies:

$\dot y_1 = a\,(y_2 - y_1), \qquad \dot y_2 = y_1\,(b - y_3) - y_2, \qquad \dot y_3 = y_1 y_2 - c\, y_3.$   (35)

In comparison with the model used in Section V-A1, the Lorenz graph model (35) introduces only first-order dependencies ($P = 1$) among the nodes. Additionally, it is important to note that (35) incorporates non-additive nonlinear interactions among the nodes, which differs from the VAR assumption in (1). In this section, we compare the performance of the RFNL-TIRSO algorithm and the PDIS algorithm [49]. Since the TIRSO algorithm assumes $P > 1$ in its implementation, it is not included in this comparison. To ensure a fair comparison, we replicate the exact experimental setup described in [49]. The performance evaluation is based on the edge identification error rate (EIER), defined as

$\mathrm{EIER} = \frac{\| A - \hat A \|_0}{N (N - 1)} \times 100,$

where $A$ represents the true dependency matrix and $\hat A$ the estimated dependency matrix. For RFNL-TIRSO, $\hat A$ is computed using $b^{(1)}_{n,n'}$. The hyperparameters are heuristically tuned to minimize the EIER, resulting in the setting $(\sigma_n, \lambda, a_t) = (1, 0.3, 1/(t\, \Lambda_{\max}(\Phi[t])))$.
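A minimal Euler-discretized Lorenz simulator consistent with (35); the standard parameter values and step size are assumptions, and the paper's exact discretization may differ.

```python
import numpy as np

def simulate_lorenz(T=1750, dt=0.01, a=10.0, b=28.0, c=8.0 / 3.0, seed=0):
    """Euler discretization of the Lorenz system (35): three coupled time series
    with first-order (P = 1), non-additive nonlinear dependencies."""
    rng = np.random.default_rng(seed)
    y = np.zeros((T, 3))
    y[0] = rng.standard_normal(3)
    for t in range(T - 1):
        y1, y2, y3 = y[t]
        y[t + 1, 0] = y1 + dt * a * (y2 - y1)         # y1 depends on y1, y2
        y[t + 1, 1] = y2 + dt * (y1 * (b - y3) - y2)  # y2 depends on y1, y2, y3
        y[t + 1, 2] = y3 + dt * (y1 * y2 - c * y3)    # y3 depends on y1, y2, y3
    return y
```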
The estimated and true binary adjacency matrices (excluding self-dependencies) are shown in Fig. 5, and the EIER up to $t = 1750$ is plotted in Fig. 6. We remark that although the PDIS algorithm is designed under the assumption of non-additive nonlinear interactions, its performance lags behind that of the proposed RFNL-TIRSO algorithm, which assumes additive nonlinearities. This is because RFNL-TIRSO employs an RLS loss function, which results in an improved convergence speed compared with the LMS loss used in PDIS.
3) Numerical Evaluation of Dynamic Regret: In Section IV-A, we derived a theoretical bound for the dynamic regret $R_n[T] = R^{\mathrm{rf}}_n[T] + \xi_n[T]$. In this section, using experiments conducted on synthetic data, we numerically compute the dynamic regret of RFNL-TIRSO w.r.t. the optimal cost in the RF space, defined as $R^{\mathrm{rf}}_n[T] = \sum_{\tau} \big[ h^n_\tau(\alpha_n[\tau], z_{v}(\tau)) - h^n_\tau(\alpha^*_n[\tau], z_{v}(\tau)) \big]$, for $T = 1, \ldots, 1000$. This allows us to experimentally validate our theoretical results. Here, $\alpha_n[\tau]$ is the RF coefficient estimated using RFNL-TIRSO at time $\tau$, and $\alpha^*_n[\tau]$ is the optimum RF coefficient, computed by running a standard gradient descent algorithm until convergence. It is worth mentioning that the estimation of $\alpha^*_n[\tau]$ involves a much higher computational complexity than that of $\alpha_n[\tau]$. In Fig. 7, we plot $R^{\mathrm{rf}}_n[T]$ and $R^{\mathrm{rf}}_n[T]/T$. For this experiment, we use the same data generation mechanism with nonlinear dependencies $g_1$ and $g_2$ as explained in Section V-A1, with topology change points at $T = 250$ and $T = 500$. Fig. 7 shows that $R^{\mathrm{rf}}_n[T]$ is sublinear w.r.t. $T$ and that $R^{\mathrm{rf}}_n[T]/T$ is negligibly small, in agreement with the theoretical results stated in Theorem 1. It is important to note that numerically evaluating the second component of the dynamic regret, $\xi_n[T]$, is a complex task since it involves finding optimal parameters in a high-dimensional RKHS. However, as shown in Theorem 2, $\xi_n[T]/T$ is theoretically bounded by $\epsilon L_h C$, where $\epsilon$ is a user-controlled parameter. The value of $\xi_n[T]/T$ can thus be made small, so that $R_n[T]/T$ is upper bounded by a small constant as $T \to \infty$.
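A minimal sketch of this numerical evaluation, assuming the per-step costs have already been computed (names hypothetical): the online costs come from the RFNL-TIRSO estimates and the optimal costs from gradient descent run to convergence at each time.

```python
import numpy as np

def dynamic_regret(costs_online, costs_opt):
    """Cumulative regret R[T] = sum_t (h_t(alpha[t]) - h_t(alpha*[t])) and R[T]/T.
    costs_online[t]: cost h_t at the RFNL-TIRSO estimate alpha[t]
    costs_opt[t]:    cost h_t at the optimum alpha*[t]"""
    gaps = np.asarray(costs_online) - np.asarray(costs_opt)
    regret = np.cumsum(gaps)
    steps = np.arange(1, len(gaps) + 1)
    return regret, regret / steps  # sublinearity <=> regret/steps -> 0
```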

B. Experiments Using Real Data Sets

1) Oil and Gas Platform Data:
In this section, we describe experiments conducted using real data collected from Lundin's offshore oil and gas platform Edvard Grieg. We collected multivariate time series data from 24 nodes (numbered $n = 1, 2, \ldots, 24$) of the plant, corresponding to various temperature (T), pressure (P), and oil-level (L) sensors. The sensors are placed in the separators of decantation tanks separating oil, gas, and water. The time series are obtained by uniformly sampling the sensor readings at a sampling interval of 5 seconds. We assume that hidden logical dependencies are present in the network due to the various existing physical connections and control actuators. The data obtained from the sensors are preprocessed by normalizing them to zero-mean, unit-variance signals.
The dependencies are learned using RFNL-TIRSO ($D = 10$), RFNL-TISO, and TIRSO, assuming a VAR model of order $P = 12$. A Gaussian kernel with variance 1 is used in all the experiments, with hyperparameter setting $\lambda = 0.1$ and step size $a_t = 1/\Lambda_{\max}(\Phi[t])$ (tuned to obtain the minimum NMSE). The estimated dependencies are visualized in Fig. 8 using the $\ell_2$ norms $\|\alpha_{n,n'}[t]\|_2$. RFNL-TIRSO identifies interpretable connections; for instance, two pressure sensors in the same separator are connected, and the oil level in separator-1 is connected to the pressure variation in separator-2. As expected, most of the identified interactions are local (e.g., interactions inside a separator), with very few long-distance interactions (e.g., interactions between two separators). The strong local interactions among variables such as temperature, pressure, and oil level within a container are directly linked to the fluid dynamics of the oil and gas in the closed chamber, as governed by the underlying differential equations [52]. However, the various control mechanisms governing the whole platform and the physical connections across different chambers can also cause some longer-distance, non-trivial interactions, although these are typically not as predominant as the local interactions. For instance, the primary inlet separator and the electrostatic coalescer can interact despite not being physically connected: when the oil level within the coalescer changes, it can affect the head of the system, leading to changes in the pressure and oil level within the primary inlet separator, which operates based on gravity.
We wish to note that the estimated dependencies can be interpreted as an abstract graph representation of the various physics-based equations describing the spatio-temporal variation of the signals. However, since ground-truth dependencies are not available in this experiment, directly evaluating the estimated graph against the underlying physics-based differential equations is a complex and time-consuming task that is beyond the scope of this study. Instead, we assess the ability of the algorithms to learn the dependencies through the accuracy of time series forecasting using the learned VAR model; a high prediction accuracy implies that the estimated dependencies are close to the underlying unknown true dependencies. We measure the prediction accuracy using the normalized mean squared error (NMSE):

$\mathrm{NMSE} = \frac{\sum_{t} \big( y_n[t + t_{\mathrm{step}}] - \hat y_n[t + t_{\mathrm{step}}] \big)^2}{\sum_{t} \big( y_n[t + t_{\mathrm{step}}] \big)^2},$   (36)

where $\hat y_n[t + t_{\mathrm{step}}]$ is the estimate of the time series generated by the $n$-th node at time instant $t + t_{\mathrm{step}}$, based on the VAR model learned at time $t$. Fig. 9 shows the NMSE of the estimated signals corresponding to a particular sensor ($n = 8$) using the various algorithms. We exclude the PDIS algorithm from this experiment since it is not designed for signal prediction. The NMSE is calculated according to (36) with a prediction horizon of $t_{\mathrm{step}} = 12$, representing a one-minute-ahead prediction. For the RFNL-TIRSO and TIRSO algorithms, we conduct the experiments by varying the forgetting factor $\gamma \in \{0.1, 0.5, 0.7, 0.98\}$. The best NMSE performance is achieved by RFNL-TIRSO with $\gamma = 0.98$, surpassing all the competing algorithms. As $\gamma$ decreases, the performance of RFNL-TIRSO approaches that of RFNL-TISO, as expected from (17). Furthermore, we plot the dynamic regret and the cumulative variation of the optimal parameter estimates in Fig. 10, demonstrating that our algorithm is capable of tracking the topology even when the optimal topology undergoes variations.
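A minimal sketch of the NMSE evaluation in (36), assuming arrays of realized values and $t_{\mathrm{step}}$-ahead predictions are available (names hypothetical).

```python
import numpy as np

def nmse(y_true, y_pred):
    """Normalized mean squared error between realized sensor values and their
    t_step-ahead predictions, cf. (36)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.sum((y_true - y_pred) ** 2) / np.sum(y_true ** 2)
```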
In Section IV, we showed that RFNL-TIRSO converges if the learning rate is less than $1/L$, where $L$ is the upper bound of $\Lambda_{\max}(\Phi[t])$. The performance of RFNL-TIRSO under various learning rates is shown in Fig. 11. Intuitively, as the learning rate increases, RFNL-TIRSO converges faster; however, when the learning rate surpasses $1/L$, convergence is no longer guaranteed, as evidenced in Fig. 11. Note that if the data have a high variance, the value of $\Lambda_{\max}(\Phi[t])$ will also be high, necessitating a lower learning rate to ensure convergence of the algorithm.

2) Epileptic Data Set:
The data set used for this experiment [50] was collected at the Children's Hospital Boston and consists of EEG recordings from two pediatric subjects with intractable seizures, labelled P1 (age 11, female) and P2 (age 10, female). The subjects were monitored for several days after discontinuing anti-seizure medication in order to characterize their seizures and assess their candidacy for surgical intervention. The EEG recordings followed the electrode positions and nomenclature of the well-known International 10-20 system. The signals were sampled at a rate of 64 samples per second, and a total of 23 channels were recorded. The estimated brain topologies using RFNL-TIRSO ($P = 2$, $D = 20$) at different time instants (before, during, and after a seizure) are visualized in Fig. 12, based on the $\ell_2$ norms $\|\alpha_{n,n'}[t]\|_2$. It can be observed that the estimated topologies before and after the seizure are very similar, with connections concentrated across specific brain regions. During the seizure, however, the topologies become more disrupted, which aligns with the findings in [53]. This disruption can be attributed to increased pathogenic neural discharge during the seizure [54]. The brain can be divided into several regions, namely temporal, frontal, occipital, parietal, and central. Epilepsies are generally classified according to the region of the brain where they originate, with common classifications including temporal lobe (TL) epilepsy and frontal lobe (FL) epilepsy [55]. In TL epilepsy, more inter-region connections originate from the temporal region, whereas in FL epilepsy, such connections originate from the frontal region. To illustrate this, we present an experiment using the brain data of P1 and P2, who belong to the TL and FL epilepsy categories [56], respectively. To measure the activity level of different brain regions, we group all the channels connected to the temporal region into group-T and those connected to the frontal region into group-F; all connections between the frontal and temporal regions are excluded in this experiment. We define the activation level of a group as the sum of the degrees of all the nodes belonging to the group divided by the total number of nodes in the group, where the degree of a node is the number of edges connected to it (see the sketch below). The activation levels of group-T and group-F for P1 and P2 are shown in Fig. 13(a) and (b), respectively. It can be observed that for P1 and P2, the activation levels of group-T and group-F, respectively, exhibit an initial spike, after which the activation spreads across the other brain regions. These observations align with the characteristics of TL and FL epilepsies.
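A minimal sketch of the group activation level used in Fig. 13, assuming a binary adjacency obtained by thresholding the estimated edge strengths (the thresholding choice and names are hypothetical).

```python
import numpy as np

def activation_level(adj, group, excluded_group=()):
    """adj:            binary adjacency (N x N) from thresholded ||alpha_{n,n'}[t]||_2
    group:          channel indices in the group (e.g., temporal electrodes)
    excluded_group: indices whose edges to `group` are ignored (frontal-temporal
                    cross connections are excluded in the experiment)."""
    adj = adj.copy()
    for i in group:
        for j in excluded_group:
            adj[i, j] = adj[j, i] = 0             # drop cross-group connections
    degrees = adj[np.asarray(group)].sum(axis=1)  # degree of each node in the group
    return degrees.mean()                         # sum of degrees / number of group nodes
```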

VI. CONCLUSION
This article presents a novel online nonlinear topology identification algorithm called RFNL-TIRSO. The algorithm is designed to process multivariate time series data in a sequential manner and estimate time-varying nonlinear dependencies. The theoretical analysis demonstrates that RFNL-TIRSO follows a sublinear dynamic regret, guaranteeing its ability to track changes in the topology of the system in dynamic environments. To evaluate the performance of RFNL-TIRSO, both real and synthetic data sets are used, and the algorithm outperforms the state-of-the-art online topology estimation methods.
APPENDIX A
PROOF OF THEOREM 1

Inequality (42) is obtained by replacing the RF vector (whose entries are sinusoids) with an all-one vector of larger norm, (43) follows from assumption A1, (44) follows from $\mu = 1 - \gamma$, and (45) follows from $\gamma \le 1$. Lemma 1.1 is proved by substituting (45) in (41).

Lemma 1.2: Under assumptions A1, A3, and A4, the RFNL-TIRSO algorithm with step size parameter $a_t = \frac{1}{L}$ satisfies a uniform bound on $\|\nabla \tilde\ell^{\,n}_t(\alpha_n[t])\|_2$.

Proof: Invoke Lemma 1.1, set $a_t = a$, and let $\delta = (1 - a\rho_l)$ with $0 \le \delta \le 1$, to obtain the contraction recursion (46). The bound on $\|\alpha_n[t+1]\|_2$ in terms of the norm of the initial estimate $\|\alpha_n[P]\|_2$ is obtained by $t - P + 1$ recursions of (46), yielding (47). In (47), we assume that the RF coefficients are initialized with zeros, i.e., $\alpha_n[P] = \mathbf{0}_{2PND}$. The proof is completed using (48) and (45).

Lemma 1.3: Every subgradient $u_n$ of $\omega_n(\cdot)$ satisfies $\|u_n\|_2 \le \lambda\sqrt{PN}$.

Proof: We apply Lemma 2.6 from [4], which states that every subgradient of $\omega_n(\cdot)$ is bounded by its Lipschitz continuity parameter $L_{\omega_n}$. In the following, we show that $L_{\omega_n} = \lambda\sqrt{PN}$. Lipschitz continuity of $\omega_n$ means that there exists $L_{\omega_n} > 0$ such that

$|\omega_n(a) - \omega_n(b)| \le L_{\omega_n} \| a - b \|_2$   (52)

for all real vectors $a$ and $b$. The group-Lasso regularizer is

$\omega_n(\alpha_n) = \lambda \sum_{n'=1}^{N} \sum_{p=1}^{P} \| \alpha^{(p)}_{n,n'} \|_2.$   (53)

Expanding the left-hand side of (52) using (53) yields

$|\omega_n(a_n) - \omega_n(b_n)| = \lambda \Big| \sum_{n',p} \big( \| a^{(p)}_{n,n'} \|_2 - \| b^{(p)}_{n,n'} \|_2 \big) \Big| \le \lambda \sum_{n',p} \big| \| a^{(p)}_{n,n'} \|_2 - \| b^{(p)}_{n,n'} \|_2 \big| \le \lambda \sum_{n',p} \| a^{(p)}_{n,n'} - b^{(p)}_{n,n'} \|_2 \le \lambda \sqrt{PN}\, \| a_n - b_n \|_2,$

where the first inequality (57) follows from the triangle inequality, the second (58) from the reverse triangle inequality, and the third (59) from the basic inequality $\|q\|_1 \le \sqrt{M} \|q\|_2$, $q \in \mathbb{R}^M$. From (59), the required Lipschitz parameter is $\lambda\sqrt{PN}$.

Substituting the bounds on $\|\nabla \tilde\ell^{\,n}_t(\alpha_n[t])\|_2$ (Lemma 1.2) and $\|u_n\|_2$ (Lemma 1.3) into (40) completes the proof of Lemma 1. Finally, the proof of Theorem 1 is completed by substituting Lemma 1 and (39) into (38).

APPENDIX B PROOF OF THEOREM 2
The cumulative approximation error due to the RF approximation is

$\xi_n[T] = \sum_{t=P}^{T} \Big[ h^n_t\big(\alpha^*_n[t], z_{v}(t)\big) - h^n_t\big(\beta^*_n[t], \kappa_t\big) \Big].$   (60)

Using the triangle inequality, the Lipschitz continuity of the cost function (with parameter $L_h > 0$, yielding inequality (61)), and the Cauchy-Schwarz inequality (yielding (62)), the summands can be bounded in terms of the uniform kernel approximation error. As shown in [21], for a given shift-invariant kernel, the RF approximation satisfies $\sup_{y, y'} \big| \hat\kappa(y, y') - \kappa(y, y') \big| \le \epsilon$ with high probability for a sufficiently large number of random features $D$. Combining these bounds yields $\xi_n[T] \le \epsilon\, L_h\, C\, T$, where $C$ is a constant and (66) follows from assumption A1: since $y_n[t]$ is bounded, the optimal parameters are also bounded.