Nonlinear system identification with regularized Tensor Network B-splines

This article introduces the Tensor Network B-spline model for the regularized identification of nonlinear systems using a nonlinear autoregressive exogenous (NARX) approach. Tensor network theory is used to alleviate the curse of dimensionality of multivariate B-splines by representing the high-dimensional weight tensor as a low-rank approximation. An iterative algorithm based on the alternating linear scheme is developed to directly estimate the low-rank tensor network approximation, removing the need to ever explicitly construct the exponentially large weight tensor. This reduces the computational and storage complexity significantly, allowing the identification of NARX systems with a large number of inputs and lags. The proposed algorithm is numerically stable, robust to noise, guaranteed to monotonically converge, and allows the straightforward incorporation of regularization. The TNBS-NARX model is validated through the identification of the cascaded watertank benchmark nonlinear system, on which it achieves state-of-the-art performance while identifying a 16-dimensional B-spline surface in 4 seconds on a standard desktop computer. An open-source MATLAB implementation is available on GitHub.


Introduction
B-splines are basis functions for the spline function space [1], making them an attractive choice for approximating smooth continuous functions. For this reason, Bsplines have had numerous applications in system identification [2,3,4,5,6,7,8] and control [9,10,11,12]. The generalization of B-splines to multiple dimensions is done through tensor products of their univariate basis functions. The number of basis functions and weights that define a multivariate B-spline surface, therefore, increase exponentially with the number of dimensions, i.e. Bsplines suffer from the curse of dimensionality. Previous attempts to avoid this limitation include strategies such as dimensionality reduction, ANOVA decompositions and hierarchical structures [13]. The most effective method, i.e. hierarchical B-splines, relies on sparse grids [14] and reduces the storage complexity from O(k d ) to O(k log(k) d−1 ) [15]. This is still exponential in the number of dimensions d. A recently emerging way to alleviate the curse of dimensionality is through the concept of tensor networks. Originally developed in the context of quantum physics, tensor networks efficiently represent high-dimensional tensors as a set of sparsely inter-step n. The error ε t is assumed to be Gaussian white noise. The most common models used in approximating f are polynomials or neural networks [21]. The applicability of polynomial NARX is, however, often limited to weakly nonlinear systems due to computational complexity. Neural networks, on the other hand, require a lot of data to generalize well and can be time consuming to train. Under the reasonable assumption that f is sufficiently smooth, the Tensor Network B-splines model is a suitable candidate to approximate the function from observed input and output data. The contributions of this paper are: • Introduce the Tensor Network B-splines model.
The paper is structured as follows. Section 2 introduces relevant tensor and B-spline theory. Section 3 presents the TNBS model, the regularization technique and the NARX identification algorithm. Section 4 validates the TNBS-NARX approach through numerical experiments on a synthetic and a benchmark dataset. Section 5 concludes this paper and lists some recommendations.

Preliminaries
This section provides the basic terminology and definitions for tensors and tensor decompositions, followed by an introduction to B-splines. Most of the introduced tensor network definitions are based on [22,23,24,25]. A comprehensive treatment of B-splines is given in the book by de Boor [26].

Tensor basics
A tensor is a multidimensional array of real numerical values, e.g. A ∈ R k1×k2×···×k d . Tensors can thus be considered generalizations of vectors and matrices. The order d of the tensor is the number of dimensions of the array. Unless stated otherwise, subscript indices indicate a single element of a tensor, e.g. a = A i1,i2,...,i d . The size of each dimension is indicated by k p , p ∈ {1, 2, . . . , d}, such that i p ∈ {1, 2, . . . , k p }. In this paper, scalars are denoted by lowercase letters (a), vectors are denoted by bold lowercase letters (a), matrices are denoted by bold uppercase letters (A) and higher-order tensors are denoted by calligraphic letters (A). A convenient way of expressing tensors and their operations is using the graphical notation introduced by Roger Penrose in 1972 [23]. Figure 1 shows the representation of a scalar, vector, matrix and third-order tensor using this notation. Every node represents a tensor, the edges represent the indices and the number of edges, therefore, corresponds to its order. The vectorization of a tensor A ∈ R k1×k2×···×k d is the reordering of its elements into a column vector, denoted by vec(A) = a ∈ R k1k2···k d . The elements of a are denoted as: A tensor T ∈ R k1×k2×···×k d is of rank one if it can be decomposed into the outer product of d vectors b (p) ∈ R kp , e.g: where • denotes the outer product operation. The most essential operation in tensor algebra is contraction, which is the summing of elements over equalsized indices. Given the tensors A ∈ R k1×k2×k3 and B ∈ R k3×k4×k5 , contracting the index i 3 results in a tensor A × 1 3 B = C ∈ R k1×k2×k4×k5 whose elements are given by: Contraction is indicated by the left-associative m n -mode product operator [16], where n and m indicate the position of the indices of the first and second tensor respectively. In the graphical notation, contraction is indicated by connecting corresponding edges, as illustrated for (2) in Figure 2. An important equation [18] that relates contraction of a d-dimensional tensor with d matrices to a linear operation is the following: where ⊗ denotes the Kronecker product. The outer product operation is a special case of contraction where the contracted indices have singleton dimensions. The outer product is depicted in the graphical notation by a dashed line connecting two nodes. The inner product between two equal-sized tensors is the sum of their entry-wise products, equivalent to contraction of the tensors over  all pairs of indices. Given two tensors A ∈ R k1×k2×k3 and B ∈ R k1×k2×k3 , their inner product is given by: The Frobenius norm of a tensor is defined as the square root of the sum of squares of its entries:

Tensor trains
The tensor train (TT) decomposition is a widely used tensor network format, popular for its low parametric format and the numerical stability of related optimization algorithms [25]. A tensor train expresses a tensor W ∈ R k1×k2×···×k d of order d in terms of thirdorder tensors G (p) W ∈ R rp−1×kp×rp , also known as the TTcores. Figure 3 shows the TT-decomposition of a fourdimensional tensor in graphical notation. The dimensions of the contracted indices, r p , are called TT-ranks. The first and last TT-ranks, r 0 and r d , are by definition equal to one. Keeping in mind that the m n -mode product operator is left-associative, the tensor train in Figure 3 can be expressed as: There exists a set of TT-ranks r p = R p for which the decomposition is exact. When r p < R p , the tensor train represents an approximation of the original tensor. The lower the TT-ranks, the less accurate the decomposition, but the better the compression. When all r p and dimensions k p are equal, the storage complexity of the tensor train representation is O(kdr 2 ). A TT-decomposition with low TT-ranks can thus significantly reduce the memory footprint of high-dimensional data. For a prescribed set of TT-ranks or a prescribed accuracy, the TT-decomposition of a tensor can be computed with the TT-SVD [25] or the TT-Cross [27] algorithm. An important notion for TT-cores is orthogonality. A TTcore G (p) W is left-orthogonal if it can be reshaped into a matrix G (p) ∈ R rp−1kp×rp for which: A tensor train is in site-k-mixed-canonical form [28] when for for its TT-cores the following applies: For a site-k-mixed-canonical tensor train holds that its norm is contained in the k-th TT-core, i.e.:

B-splines
A univariate spline S is a piecewise polynomial function that maps values from an interval [a, b] to the set of real numbers, e.g. S : [a, b] ∈ R → R. Any spline of degree ρ can be expressed as a unique linear combination of Bsplines of the same degree: The B-spline basis functions B i,ρ (x) are defined by the knot sequence and degree ρ, and they are contained in the basis vector b. A knot sequence t = {t 0 , t 1 , . . . , t m−1 , t m } is defined as a non-decreasing and finite sequence of real numbers that define the par- ]. The number of B-spline basis functions k relates to the degree ρ and number of knots m + 1 by k = m − ρ. B-spline basis functions of arbitrary degree ρ can be recursively constructed by means of the Cox-de Boor formula [26]: If the knots are equidistantly distributed over the domain of the spline, the spline is called uniform. If the uniform knot sequence is also a subset of Z, i.e. a sequence of integers, the spline is referred to as a cardinal spline [29]. In this article, all knot sequences will be considered uniform, as they allow for efficient evaluation of b using a matrix expression [30] instead of (8). Figure 4 illustrates B-splines of degree 1 to 3 on the cardinal knot sequence t = {0, 1, 2, 3, 4, 5}. The dashed purple lines represent the sum of the basis functions. The shape of a B-spline curve S(x) is only fully adjustable within its natural domain D n = [t ρ , t m−ρ ], because the sum of the B-spline basis functions at any point within this domain equals one. It is desirable to have full control over the shape of the B-spline curve over the whole range of data samples. The knot sequences in this article will be chosen such that D n coincides with the unit interval [0, 1].

Multivariate B-splines
B-splines generalize to multiple input dimensions through tensor products of univariate basis functions. One can construct a d-dimensional spline S as a linear combination of multivariate B-splines: For notational convenience, we omitted the degrees ρ. The B-spline tensor B contains the multivariate basis functions and is defined as: where b (p) is the univariate basis vector of the p-th input variable, i.e.
We will assume equal knots and degree for each dimension, hence k p = k, ∀p. The representation of B-spline surfaces in (9) is severely limited by the exponential increase in the number of basis functions and weights, O((k) d ).

Tensor Network B-splines
For our purposes, the input variables x p are the lagged inputs and outputs of (1). For a large number of lags or inputs, it can therefore quickly become computationally infeasible to store or operate on the tensors B and W. Using tensor network theory, the multivariate B-spline surface can be represented in a low-parametric format. In this section, we derive the TNBS model and use it to approximate the function f in (1) from observed input and output data.

Model structure
We illustrate the model structure using a threedimensional Tensor Network B-spline surface as an example, which is derived as follows: ). (13) Figure 5 will be used as a visual reference to walk through these equations. Given the weight tensor W ∈ R k1×k2×k3 and B-spline tensor B ∈ R k1×k2×k3 in (11), their inner product is equal to the contraction over all pairs of indexes, as seen in Figure 5a. As B is a rank one tensor, it can be decomposed into the outer product of three Bspline vectors b (p) (Figure 5b). The outer product operation is a special case of contraction where the contracted indexes have singleton dimensions. Singleton contractions that close a loop in a tensor network are redundant, and hence omitted in Figure 5c. Now S(x 1 , x 2 , x 3 ) in (12) is simply the contraction of W with the B-spline basis vectors. Finally, W is decomposed into a tensor train in Figure 5d. A point (x 1 , x 2 , x 3 ) on the TNBS surface in (13) is evaluated by constructing the B-spline vectors b (p) , contracting them with the corresponding tensor train cores and finally multiplying the sequence of resulting matrices. Due to the constraints, r 0 = r d = 1, this results in a scalar output. Extending the TNBS model to l outputs can be realized by removing one of these constraints, e.g. r 0 = l. In general, a d-dimensional TNBS surface is represented by:

Identification algorithm
We illustrate, without loss of generality, the proposed identification algorithm by means of the following example. Suppose we have the following NARX system model: y n = f (u n , y n−1 , u n−1 ) + ε n .
We want to identify this model from a set of observed input and output data {(y n , u n )} N n=1 . We approximate the function f with the three-dimensional TNBS from Figure 5d, by minimizing the least-squared cost function: . . . Fig. 6. The tensor network written as a vector inner product.
To solve (16) directly for the TT-cores, we use the alternating linear scheme (ALS) [31]. The TT-ranks are chosen beforehand and the TT-cores are initialized randomly. ALS then iteratively optimizes one tensor core at a time while holding the others fixed. Optimizing one core is equal to solving a small linear subsystem. Suppose we wish to update the second core from Figure 5d. The idea is to contract everything in the network up until the nodes adjacent to G = a (2)T g (2) .
More generally, rewriting (14) for the n-th data sample as a linear function of the elements of the p-th core gives: >,n = 1. Computing (18) for all N data samples results in a system of linear equations. The subproblem for updating the p-th core thus becomes: where The optimum is found by solving the normal equation: Reshaping g (p) back into a third-order tensor results in the updated core G (p) W . The ALS algorithm sweeps back and forth, iterating from the first to the last core and back, until convergence. At each iteration, (21) is solved for g (p) . Numerical stability is ensured by keeping the tensor train in site-p-mixed-canonical form through an additional orthogonalization step. To illustrate, consider again the TNBS in Figure 5d. Assume that we are iterating from left to right and the tensor train is in site-2mixed-canonical form. After solving g (p) it is reshaped into a matrix G (p) ∈ R rp−1kp×rp , which is then decomposed through a QR decomposition. The tensor network is now in the form of Figure 7. Finally, Q is reshaped back into a third-order left-orthogonal tensor G (2) and R is contracted with the next core. The tensor train is now in site-3-mixed-canonical form, and the next iteration starts. More details about the orthogonalization step are given in [31]. The optimization with ALS converges monotonously, so a possible stopping criterion is: where J (1) h is the cost of the objective function in (19) during the first core update of the h-th sweep. A modified version of ALS method, MALS [31], updates two cores simultaneously and is computationally more expensive, but is able to adaptively determine the optimal TT-ranks for a specified accuracy. Another adaptive method is the tensor network Kalman filter [17], which can be used for online optimization of the cores.

Regularization
In addition to decreasing computational burden, the TT-rank constraints serve as a regularization mecha-nism. This regularization is however insufficient for highdimensional B-splines, as the volume of the domain of the TNBS increases exponentially. The available estimation data becomes sparse and scattered, which can lead to an ill-posed optimization problem. B-spline curves inherently possess the ability to regularize by adjustment of their degree or knot placement. The choice of knots has been a subject of much research [32], but due to lack of an attractive all-purpose scheme, we opt for a nonparametric approach known as P-splines [33]. P-splines induce smoothness by combining uniform B-splines with a discrete penalty placed on the α-th difference between adjacent weights. For univariate splines, the following penalty function is added to the cost function: The matrix D α ∈ R (k+1−α)×(k+1) is the α-th order difference matrix such that D α w = ∆ α w results in a vector of α-th order differences of w. This matrix can be constructed by using the difference operator α times consecutively on the identity matrix. For α = 0 this is equal to Tikhonov regularization and for α = 1 we get Total Variation regularization. For example, given are a weight vector and the first-order difference matrix: The penalty term then equals: We wish to extend the penalty in (23) to the TNBS format. Without loss of generality, Figure 8 visualizes the necessary steps in graphical notation for a threedimensional B-spline surface. In the multivariate case, the differences in adjacent weights in the weight tensor W have to be penalized along each dimension individually. This is done by contracting the second index of the difference matrix D α with the dimension of the weight tensor W along which the penalty is applied, then taking the norm of the result. For a B-spline curve with d inputs, the penalty on the α-th order differences along the j-th dimension is given by: This is illustrated in 8a, where the penalty is applied along the first dimension, e.g. j = 1. Decomposing W into a tensor train results in the network depicted in 8b.
To write this penalty again as a linear function of the p-th core, we contract everything in the network except these cores (Figure 8c). In this example, C >,1 and C −,1 are simply identity matrices. Then, using (3), the penalty function can be rewritten in the form of Figure 8d: where Ω The matrix Ω (26) is constructed for every dimension j. Due to the site-p-mixed-canonical form of the tensor train, the contraction of two out of the three matrices C <,j result in identity matrices. This knowledge can be utilized for efficient implementation. Adding the penalties to the cost function results in the following regularized optimization problem: s.t. TT-rank(W) = (r 1 , r 2 , . . . , r d−1 ).
The smoothing parameter λ j ≥ 0 controls the penalization of the roughness along dimension j. The subproblem for updating the p-th core becomes: The normal equation is then: The whole procedure of identifying a TNBS model from measured data is summarized as pseudo-code in Algorithm 1. Construct A (p) (20) and Ω Repeat the above 11: end for 12: end while Table 1 summarizes relevant computational complexities concerning the TNBS-NARX method. While the complexities scale only linearly in the dimensions, it is important to realize that high TT-ranks easily degrade the performance of optimization of the cores. There is, therefore, a tradeoff between accuracy and speed. The number of data samples N also appears linearly in the complexities but may become a bottleneck for large datasets. A modification for this scenario is to use a small random batch of the data when updating g (p) . This can speed up estimation time without significant loss of accuracy.

Experiments
In this section, we demonstrate the proposed system identification method. The algorithm is implemented in MATLAB and executed on a personal computer with a 4.2 GHz Intel Core i5-7600K processor and 16 GB of random access memory (RAM). An open-source MATLAB implementation can be found at https://github.com/ Ridvanz/Tensor-Network-B-splines.

Synthetic dataset
First, we validate the proposed methods through the identification of an artificial nonlinear dynamical system that is exactly representable in the TNBS-NARX format. The lagged inputs and outputs are chosen as u n−µ and y n−µ respectively, where µ ∈ (1, 2, 3, 4), such that the system equation is of the following form: y n = f (y n−1 , y n−2 , y n−3 , y n−4 , u n−1 , u n−2 , u n−3 , u n−4 ) (30) The nonlinear mapping f is modeled as an 8-dimensional TNBS. We choose the degree of the B-splines ρ = 2 and the number of knots per dimension m = 6. A random weight tensor W of size (m − ρ) d = 4 8 is generated of which the elements equal either w min = −4 or w max = 5 with equal probability. The generated tensor is decomposed using the TT-SVD algorithm, truncating the TTranks to a value of 5 uniformly. The resulting tensor train represents the true weights of our nonlinear system. For the input signal u we generate a random sequence of length 3000, with values uniformly distributed in the range [0, 1]. This sequence is smoothed with a Gaussian window of size 5 to dampen higher frequencies. We initialize the output signal y with 4 zeros and recursively evaluate the next output with (30), until we have a signal of length 3000. The first 200 samples of the input (blue) and output (red) signals are plotted in in We test the performance of our TNBS-NARX identification algorithm with different levels of Gaussian white noise on the estimation data. Noise is only added to the output signal. The variances for the white noise signals are chosen based on the desired signal to noise ratios SNR. The signal powers are determined after subtracting their means. For simplicity, we penalize the second difference (α = 2) of the weights equally for each dimension, e.g. λ p = λ, ∀p. The experiment is run using three different values for lambda. All other model parameters are set to the true values of the synthetic model. The TT-cores are estimated using Algorithm 1. For consistency, we simply choose a max number (16) of sweeps as  stopping criteria. The root mean squared error (RMSE) is used as the performance metric to evaluate the accuracy on the test set for both prediction and simulation. increasing SNR values, more regularization is needed to avoid overfitting to noise, so larger penalties give better performance. Overall, the TNBS is able to identify the system accurately, even for relatively noisy estimation data.

Cascaded tanks dataset
The cascaded tanks system is a benchmark dataset for nonlinear system identification. A detailed description of the system and the data is given in [34]. The system consists of two tanks, a water reservoir and a pump. The water in the reservoir is pumped in the upper tank, from which it flows to the lower tank through a small opening and then back into the reservoir. The system input u n is the pump voltage and the system output y n is the water level of the lower tank. If too much water is pumped into the upper tank it overflows, causing a hard saturation nonlinearity in the system dynamics.
The input signals are low-frequency multisine signals.
Both the estimation and test set have a length of N = 1024 samples. The major challenges of this benchmark are the hard saturation nonlinearity and the relatively small size of the estimation set. The performance metric used is again RMSE.
The original data is first normalized to the interval [0,1]. Both input and output lags are chosen as u n−µ and y n−µ respectively, where µ ∈ {1, 2, 3, 4, 8, 12, 16, 32}. The large lags are included to capture the relevant slow system dynamics. We choose the degree of the B-splines ρ = 3 and the number of knots m = 7. We penalize the first-order difference only, i.e. α = 1, and set the TT-ranks to 8 uniformly. We choose λ through 3-fold cross-validation on the estimation set. A total of 12 sweeps are performed in the optimization with Algorithm 1. After tuning lambda, the full identification set is used to identify the final model, which takes about 4 seconds. Using TNBS, the number of weights to represent the 16-dimensional B-spline surface is reduced from approximately 4.3 × 10 9 to 3648. The performance on prediction and simulation are listed and compared in table 2. To the best of our knowledge, the algorithm slightly outperforms the current state-of-the-art results on both prediction and simulation. Figure 11 shows the true and simulated output on the test set. It is apparent that the TNBS-NARX model was able to accurately capture the nonlinear system dynamics with relatively sparse estimation data.

Conclusions
This article presents a new algorithm for nonlinear system identification using a NARX model of which the nonlinear mapping is approximated using the introduced Tensor Network B-splines. Tensor Network theory enables to work with B-spline surfaces directly in a highdimensional feature space, allowing the identification of NARX systems with a large number of lags and inputs. The identification algorithm is guaranteed to monotonically converge and numerical stability is ensured through orthogonality of the TT-cores. The efficiency and accuracy of the algorithm is demonstrated through numerical experiments on SISO nonlinear systems. Extension of TNBS-NARX to multiple inputs is straightforward through the addition of input variables. Multiple outputs can be realized efficiently by adding an index to one of the TT-cores, as done in [18]. Future work includes the implementation of an online optimization scheme, as an alternative to ALS, and the development of control strategies for identified TNBS-NARX systems.