Sliding Window Evaluation of the Wiener-Hopf Equation

This paper presents an efficient method for solving the Wiener-Hopf equation in a sliding window by calculating the correlation matrices recursively. Furthermore, a novel algorithm is introduced for evaluating the inverse of the auto-correlation matrix – the Recursion with Splitting the Correlation matrix into 4 Blocks for Inversion (RSC4BI) – which can significantly reduce the computational requirements. The presented procedure is optimized for special cases to achieve an efficient implementation which allows faster real-time signal processing or to reduce the response time – e.g. the latency – by distributing the computations over the time. The proposed method is also validated through numerical simulations and hardware implementation.


Introduction
In this article, it is assumed that the system is a Finite Impulse Response (FIR) filter. This FIR filter is modelled as a Wiener Filter (WF) [1] which is able to track a system by linear time-invariant filtering. To perform the estimation of the filter coefficients, two statistical functions have to be calculated in matrix form: the auto-correlation of the input signal and the cross-correlation of the input and output signals. Using these matrices, the Wiener-Hopf (WH) equation can be formulated and its solution provides the optimum for the filter coefficients in a Minimum Mean Square Error (MMSE) sense [2]. These MMSE-filters -and their modified versions -are widely used in signal processing to identify a given dynamic system [3][4][5][6].
During the solution of the WH equation the auto-correlation matrix has to be inverted. Several algorithms can be applied to solve the inversion of matrices, depending on the properties of the matrix [7], [8]. The computational load of the inversion becomes critical, if a low complexity hardware with limited performance is applied. Furthermore, this operation is also critical, if real-time adaptive filtering has to be performed by the WF: during this adaptive method the filter coefficients adapt over time to track the internal changes of the dynamic system.
The challenges, that require this adaptive and real-time filtering are very diverse, such as compensation of current transformers [9], noise cancellation in microphones [10], or to calculate charge density on a dielectric surface [11]. Studying these examples, it can be stated that the matrix inversion is crucial step of these procedures due the O n 3 complexity. The applicability of this configuration is limited because of the real-time environment, as the applied matrices have to be small enough to fulfill the real-time conditions. Considering these circumstances, such algorithms have to be applied which can evaluate the WH equation in real-time and with low computational complexity. A common method for performing matrix inversion in a recursive manner is the split Levinson algorithm [12], [13]. Although the poor stability limits the usage of the method [14], but its behaviour can be predicted and compensated by computing the condition number [15].
This article presents a novel method for the efficient solution of the WH equation in a sliding window, the Sliding Wiener Filter (SWF). The direct calculation of the inverse of the auto-correlation matrix can be avoided by applying the Recursion with Splitting the Correlation matrix into 4 Blocks for Inversion (RSC4BI). Furthermore, not only the inverse, but the auto-correlation matrix and cross-correlation vector are also recursively evaluated using the results of the previous calculations.
The paper is organized as follows. Section 2 gives a short overview of adaptive filtering and the WH equation. The applied notations for the investigated signals are introduced as well. Section 3 describes the method of SWF, and the RSC4BI is presented as well. In Sec. 4 the complexity analysis of SWF is investigated and compared with the conventional WH solution. An optimized version of the RSC4BI algorithm is also presented for low complexity implementation. Simulation results for the proposed method in terms of runtime and quantization error are presented in Sec. 5. Finally, the conclusions are drawn.
Throughout the paper the following notations are used. The matrices are denoted by capital bold letters, vectors by small bold letters and scalars by plain letters. The superscript () T denotes matrix or vector transposition; () −1 denotes the multiplicative inverse of a matrix. The applied normal distributions are denoted by the common notation: N µ, σ 2 , where µ is the mean and σ 2 is the variance of the distribution, while the E {·} operator means the expected value of the parameter, and the average of the parameter is denoted by an overline: (·).

Adaptive Filtering
The adaptive filter is modeled as a linear time-invariant filter having a finite impulse response with K coefficients. As a result, the output signal of the adaptive filter can be expressed by convolution using the coefficients h(n) and the input signal x(n) as The previous equation can be expressed also as matrix multiplication in an observation window containing N samples: where vectors y(n) and h(n) are expressed using the signals y(n) and h(n), respectively as: and the data matrix X(n) with size N × K can be expressed using x(n) as The error signal vector ε is the expected value of the difference of the desired signal vector d(n) and the output signal vector y(n): The adaptive algorithm should minimize the cost function (C) as a function of the filter coefficients, considering (2) and (6) it can be expressed in MMSE sense as is the auto-correlation matrix with size K × K of the input signal, while p(n)=E X T (n)d(n) is the cross-correlation vector with size K × 1 of the input and output signals.
The optimal solution for the filter coefficients -know as the WH equation -can be given in MMSE sense by the derivative of C (h (n)) with respect to h (n). The optimal coefficientsĥ opt (n) with size K × 1 are given in form [2]:

Evaluation of the WH Equation
In this section the calculation steps for evaluating the WH equation in a sliding window of length N -by a sampleby-sample manner -are described. First, the matrix X(n+1) is expressed based on the previous matrix X(n) with the aid of a permutation matrix and an identity vector. Then, the correlation matrices R(n+1) and p(n+1) are expressed from the matrices elaborated in the previous step based on two different methods: using a permutation matrix and using dyadic products. Finally, the RSC4BI algorithm is introduced for the calculation of the inverse of the auto-correlation matrix R −1 (n+1). The algorithm calculates R −1 (n+1) using the previously calculated matrices R(n), R(n+1) and R −1 (n). The algorithm exploits higher efficiency by splitting the matrix into four submatrices.
For the representation of the column and row elements of matrix X(n) from (5), the following notation will be applied: where x 1 , x 2 , . . . are the column vectors, and x 1 , x 2 , . . . are the row vectors of X.

Calculation of X(n+1)
For the calculation of the matrix X(n+1) derived from the matrix X(n) the permutation matrix and the identity vector are defined. The K × K permutation matrix is given as The N × 1 identity vector -where only the i th element is 1is expressed as Finally, the matrix X(n+1) can be expressed as a permuted version of X(n) and compensated by the dyadic product of the difference of the time domain vectorsx 1 (n+1) and x K (n)and the identity vector:

Calculation of R(n+1)
The auto-correlation matrix R(n+1) can be expressed using R(n) in the following 2 different methods.

Using Permutation Matrix
Applying (14), the transposed version of X(n+1) can be expressed as: As a further step, the matrix R(n+1) can be expressed through (8), (14) and (15) as: (16) where the first term in (16) can be expressed with the of (8) as P T X T (n)X(n)P = P T R(n)P. (17) As a result, the updated auto-correlation matrix is expressed from the previous results using the permutation matrix and the identity vector.

Using Dyadic Product
The auto-correlation matrix can also be expressed also through dyadic product as [16], [17]: The updated version of the auto-correlation matrix can be derived in a similar manner: Due to the calculations in a sliding window, it can be noted that As a result, the updated auto-correlation matrix can be also expressed by subtracting the dyadic product of the oldest vector and adding the dyadic product of the newest vector to the previous matrix [16], [17]:

Calculation of p(n+1)
The cross-correlation vector p(n+1) can be similarly expressed using p(n) in the following 2 different ways.

Using Permutation Matrix
Similarly to the method presented in Sec. 3.2.1, d(n+1) can expressed using the permutation matrix and the identity vector as The updated cross-correlation vector can be derived based on (9), (15) and (22) as As a result the updated cross-correlation vector is expressed with the aid of the permutation matrix and the identity vector.

Using Dyadic Product
Following the derivations presented in Sec. 3.2.2 the cross-correlation vector can be also expressed by a dyadic product as The consecutive cross-correlation vector in the sliding window can be calculated using a dyadic product as Based on (20), (24) and (25), the updated cross-correlation vector can be expressed from the previous cross-correlation vector as

Calculation of R −1 (n+1) -the RSC4BI Algorithm
In this section an efficient algorithm for the calculation of the inverse of the updated auto-correlation matrix (R −1 (n+1)) is presented. The following supplementary matrix notations and their splitting into four submatrices are used for the auto-correlation matrices and their inverse: These notations can be seen in Fig. 1.
The derivation of inverse of such hypermatrices with 4 submatrices is presented in the Appendix. Based on (16) and (17) it can be shown that A 11 = B 22 . Furthermore, the submatrix B −1 22 can also be expressed using (A-3) and (A-7) as With the result of the previous equation, the reduced submatrix of B can be also expressed using (A-7) as As V = B −1 , the upper left corner element of V can also be easily expressed using (31), (32), and (A-3) as Similarly, the lower left submatrix of V can be also calculated The upper right submatrix of V can be calculated using (A-5) or based on (34) as As a final step, the submatrix V 22 can be expressed with the aid of (A-6) as Finally, the updated inverse auto-correlation matrix can be summed and expressed based on the matrices B,U and the calculated submatrices of V as The steps for calculating the inverse of the auto-correlation matrix -the RSC4BI algorithm -can be seen in Alg. 1.

Initialization
As it is presented in this section, R −1 (n+1) can be evaluated based on R −1 (n), therefore the initialization value of the recursion has to be defined prior to starting the algorithm. In this section two different methods are recommended to perform the initialization.
The first method is a simple solution through inversion of the auto-correlation matrix using the conventional procedures. However, this method requires all the samples to arrive in the observation window, thus the filter starts to follow the signal only after collecting the requested number of input samples.
The second method is the expansion of the inverse matrix according to the number of input samples. This solution is performed by applying the Levinson algorithm spreading the computational load across the incoming samples [12].

Zero Valued Input
During signal acquisition, it may occur that the analogto-digital converter records zero values if the signal level is too small. In such a case, X (n) contains zero value which may lead to the rows of X not being linearly independent anymore. Since R = X T X, the rank of the auto-correlation matrix becomes lower than the number of columns. As a result, R becomes singular, i.e. det (R) = 0, therefore the inversion of the matrix is not possible. To avoid this issue, the incoming signal should be perturbed by an additive white noise -e.g. dither -whose level has to be low enough not to notably alter the input signal, but still large enough to enable the inversion of the auto-correlation matrix. Without this perturbation, if the signal falls to zero then the auto-correlation matrix is also zero, so it is not invertible and the algorithm is divergent. After the signal is present again, the diverged algorithm can not find a proper solution, thus it can not be convergent furthermore.

Updating the Matrices X, R and p
Since the SWF requires initial values, it is assumed in the following subsections that X (n), R (n) and p (n) are already available at the beginning of the calculations.

Calculate X (n+1)
In order to update the matrix X, Equation (14) has to be evaluated. After detailed investigation of the equation it can be concluded that only memory operations are required to calculate the updated X(n+1) matrix.

Calculate R (n+1) by Permutation
To create the new auto-correlation matrix, R (n+1), Equation (16) has to be evaluated. The complexity of each term for addition is calculated separately.
The first term can be calculated according to (17) by simple memory operations: shifting columns and rows.
The first factor in the second term, P T X T (n) only contains memory operations. The remaining factor is a matrix generation, where only the first column differs from zero which is equal to x 1 (n+1) −x K (n). The multiplication of these factors is not a complete matrix multiplication; because of the zero columns, only the first column has to be evaluated. Considering the simple matrix identity for transposing the product of matrices, the third term is already given by transposing the second one.
The subtraction in the forth term is already given, so only the multiplication is taken into account. Due to the special form of matrices, the zero columns and rows, this operation can be reduced to a scalar multiplication of two vectors of length N. As a result, this term is a zero matrix except the element in the upper left corner.
After these calculations, the terms have to be added. As the last three terms contain only one row or column vector or a single element, and an auto-correlation matrix is symmetric, only a partial matrix addition is required. The summarized computational requirements of updating R by permutation are presented in Tab. 1.

Calculate p (n+1) by Permutation
To create the new cross-correlation matrix, Equation (23) has to be evaluated. After the memory operations of the permutation, a K×N matrix is multiplied by an N element long column vector.
The first factor of the second term, P T X T (n) is already discussed in Sec. 4.1.2. The remaining factor is a column vector generation, where only the last element differs from zero, which is d (n) −d (n+N). The multiplication by this vector takes the last column from the matrix to the result.
The first factor of the third term is a matrix generation, where only the first row (x 1 (n+1) −x K (n)) differs from zero, which is already calculated when updating R (Sec. 4.1.2). This matrix is multiplied by a permuted column vector, which operation can be reduced to a scalar multiplication due to the zero rows.
The subtractions in the forth term are already calculated, so only the multiplication is taken into account. The factors of the product are almost empty, only the first row of the matrix and last element of the column vector are not zero. Due to the structure of the factors, only a single multiplication is required.
Summarizing these computations, the terms have to be added considering that the last two terms contain a single element. The summarized computational requirements for updating p by permutation are presented in Tab. 2.

Calculate R (n+1) by Dyadic Product
According to (21), two matrices of size K×K are generated from x 1 and x N . As the multiplication of two vectors z T z results in a symmetric matrix, only the upper triangular parts have to be calculated, thus only K 2 +K /2 multiplications are required instead of K 2 . These considerations can be applied for the additions as well.

Operations
× + P T X T (n)X(n)P 0 0 In total: Tab. 1. Computational requirements to update R by permutation (16).

Permutation
Dyadic product Tab. 3. Number of additions to calculate p and R.

Permutation
Dyadic product Tab. 4. Number of multiplications to calculate p and R.

Calculate p (n+1) by Dyadic Product
According to (26), two column vectors of length K are generated from x 1 and x N . These vectors multiplied by d (n) and d (N) and finally, these three vectors are summed.

Comparison of Permutation-and Dyadic Product-Based Methods
The computational complexities of R (n+1) and p (n+1) are summarized and presented in Tab. 3 and Tab. 4. To choose the appropriate method, the computational requirements -in terms of additions and multiplications -have to be compared.
Considering the update procedure of the autocorrelation matrix, the following inequalities should be investigated based on the aforementioned tables: After rearranging (39), the following criterion can be formed: N ≤K. Also, Equation (38) can be expressed as The right side of (40) converges to 1 from below, therefore N has to be lower than K to fulfil the inequality. This is in correspondence with the result obtained from (39), thus the permutation method has lower computational requirements if N ≤K; otherwise the dyadic calculation is recommended.
Similar inequalities can be considered during the upgrade of the cross-correlation vector based on the aforementioned tables: If these equations are rearranged, the following conditions are generated: However Equation (43) could be fulfilled in some degenerate cases (e.g. N <3), but the right side of (44) converges to 1 which leads to the following non-satisfiable criterion: N <1. This requirement can never be satisfied, thus p (n+1) should be calculated by dyadic products in all cases.

Complexity of the RSC4BI Algorithm
Since the presented four blocks inversion can be applied for general cases as long as A 11 = B 22 is valid, therefore this case is investigated first. After that, the auto-correlation matrices are discussed which are symmetric, thus the computational requirements can be further reduced.

General Case
For the expression of the submatrices of V, two temporary matrices need to be calculated: B −1 11 according to (31) and b 11,r according to (32).
The second part of the subtraction in (31) is generated by multiplying two vectors. Before evaluating this multiplication, one of these vectors has to be modified by u −1 22 = 1/u 22 to minimize the number of operations performed.
According to the evaluation of B −1 22 , b 11,r is generated similarly. However, it is a single value, therefore only this scalar is generated from the components of B. The number of required operations are independent of the calculation order of the vector multiplications. If b 11,r is already given, then v 11 can be expressed through a division.

Operations
In total:  After having these temporary matrices, the remaining components of V are calculated. The vector component, v 21 is derived as Equation (34) shows: a column vector is generated by a multiplication, which result is normalized by b −1 11,r = v 11 . Now v 21 v 12 , therefore v T 12 has to be computed similarly to v 21 .
During the calculation of the last submatrix, V 22 , the order of the multiplications should be carefully selected. If the left to right order is applied, then the complexity increases up to K 3 because of the full matrix multiplication. Avoiding this issue -considering that the matrix multiplication is associative -, an outside to the inside approach should be applied as shown in (45) to decrease the complexity to K 2 (The proposed execution order of the operations are shown with circled numbers): Taking these considerations into account, the number of required operations can be decreased if the temporary results of v 12 and v 21 are applied.
In total, this RSC4BI-inversion of R (n+1) achieves an O K 2 complexity by utilizing the special structures of matrices (see in Fig. 1). The summarized computational requirements are presented in Tab. 5.

Reduced Case
Using the fact that R is an auto-correlation matrix, therefore R = R T . Considering these symmetry properties, it can be satisfactory in certain cases to evaluate only the upper triangular part instead of the whole matrix. Also, in (31), it can be assumed that u 12 = u 21 , thus calculating only the upper triangular part is satisfactory.
Due to the symmetry, However Equation (32) could be optimized because the scalar product of the vectors contains repetitive terms in the sum, but considering (37), the calculation of the temporary variable w = B −1 22 b 21 is useful, thus it is required for the calculations of v T 12 , v 21 and V 22 . Therefore the scalar value b T 12 B −1 22 b 21 is calculated in two steps, keeping the order of the operations. This result is subtracted from b 11 to obtain b 11,r = v −1 11 . Using these temporary results, the remaining parts of V are easily evaluated: v 12 (requiring (K − 1) multiplications) is equal to v 21 . Furthermore, the calculation of B −1 22 b 21 b −1 11,r b T 12 B −1 22 is reduced to the multiplication of the two already calculated vectors where only the upper triangular part has to be computed.
This reduced algorithm is presented in Alg. 2, while the computational requirements are summarized in Tab. 6. Comparing the results to Tab. 5, it can be stated that the complexity of the RSC4BI-algorithm can be reduced to less than half by taking into account the symmetry properties of the matrices.  • Gaussian white noise with N (0.1, 1),
In case of a noisy signal the previously described Gaussian white noise is added (N (0.1, 1)). The output signal of the observed filter is generated by convolution (1) to obtain a reference signal for the investigations.
To compare the RSC4BI to the conventional inversion methods, two further signals are also created by using different algorithms for matrix inversion. The first signal is generated by classic matrix inversion which is performed by Lower-Upper (LU) decomposition in M . This computation may suffer from round-off errors, therefore to avoid this defect and returning to that the initial problem is the solution of a system of linear equations -according to (2) -M recommends to use the so called "backslash division" (denoted by \) operator. This procedure generates the second signal solving the equations by a proper solver depending on the matrix properties. This algorithm is optimized to the computational time and it reduces the round-off errors as well.
The observed system is estimated and followed by a WF. Its taps are computed from the WH equation -according to (10) -which is evaluated by using the reduced RSC4BIalgorithm (Alg. 2), the classic matrix inversion, and the system of linear equations method. During the simulations, it is assumed that there is no a priori knowledge regarding the observed filter, it is considered as black box with a given output signal. Therefore the length of the filter (the observed one and the WF) may differ, so their taps can not be compared directly, only their output signals.

Runtime of the Procedures
The simulations are performed on a desktop computer containing an AMD Ryzen 7 2700X processor with 32 GB RAM and on a STM32F722 microcontroller. Each method is evaluated 10, 000 times as a single thread application and the runtime of every calculation is separately measured. The observation window includes N=20 samples, while the filter length is changed between the values K=(10, 20, 30), because the calculation complexity of R (n+1) is dependent only on K. The averages of the measured runtimes are presented in Fig. 2. The bar chart is completed by additional crosses to illustrate the medians and the 5 − 95% percentiles as well. The system of linear equations is a specialized algorithm in M therefore only two methods were investigated on the microcontroller (see Fig. 2(b)), while Fig. 2(a) contains the results of three algorithms. As Section 4 states, the RSC4BI algorithm has lower computational requirements, if the number of the filter taps increase, then the runtime increases slower than in the case of classic matrix inversion or solving the system of linear equations. However, it should be noted that the optimized built-in solution of M , using system of linear equations is the fastest if K is low, but it grows faster with increasing values of K than the proposed RSC4BI method.

Root Mean Square Deviation of SWF
The first simulation is made by Monte Carlo method to investigate the performance of the algorithm by applying different probe signals. The SWF contains K=20 taps, while the observation window encloses N=20 samples. As a result, the Root Mean Square Deviation (RMSD) of the reference and the modeled output signals is calculated. 100, 000 simulations are performed whose RMSD is evaluated as follows: The delay elements and the observation window are empty at the beginning therefore there is a decaying transient, so the first 15 ms of the signal is cut off to obtain the average over the time (displayed by dashed line).  These results are presented in Fig. 3, where the RMSD and its average are displayed in dB in case of the different probe signals. The deviation is the lowest if the input signal is only noise (Fig. 3(c)). However, it has to be noted that the signal level of noise is lower than the amplitude of the sine waves, which leads to lower RMSD if only noise is present.

Dependencies on N and K
As it is presented in Sec. 3, the SWF algorithm depends not only on K but on N as well, so the effect of these dependencies are investigated in Fig. 4 for the following cases: N <K, N=K and N >K. These simulations are similar to the former ones (Sec. 5.3), but the input signal is only the noise (N (0.1, 1)) and K is running from 5 up to 50 with a step size of 5. It can be seen that K should not be greater than N, because it leads to high RMSD. Using the same settings for K and N decreases the RMSD and gives a better result, but the K<N case provides the best precision.
To understand these deviances in RMSD values, the initial problem has to be taken into account: the y = Xh system of linear equations has to be solved (2). If K<N, then the system is overdetermined, therefore a best solution can be found in MMSE sense. The existence of the solution is not necessary to be investigated in this case, because the system (h (n)) exists. In the case K=N, the exact solution of the system is created, whose RMSD is limited by the numeric precision of the applied procedures. If K>N, then there is an underdetermined system, where the solution is not a unique solution, but a whole subspace. A global optimal solution -in MMSE sense -may be found in this subspace, but in certain cases only a local optimum is found, especially if the numeric precision is worse, than in case of matrix inversion, when the conditional number gets higher.

RMSD-Comparison of the Different Inversion Methods
The output of the RSC4BI algorithm is compared not only to the convolved reference signal, but to the two conventional methods as well. These results are presented in Fig. 5 for K= (20, 30) cases while N=20. If N=K is considered, the precision of the calculations depends on the algorithm, and as Fig. 5(a) shows, the RMSD of the system of linear equations is lower compared to the inversion and RSC4BI. Because of the underdeterminated system, if K=30 ( Fig. 5(b)), the inversion can not find the proper solution (Sec. 5.4), while the system of linear equations gives unchanged precision and the RMSD of RSC4BI is higher, but still at an acceptable level.

Conclusion
In this paper an efficient step-by-step algorithm, the SWF is presented for the solution of the WH equation in a sliding window. Not only an algorithm for calculating the inverse of the auto-correlation matrix is given -the RSC4BI algorithm -but a recursive update procedure for the matrices R and p is presented as well. Additionally, two solutions for the initialization of the recursive algorithm were also suggested.
The inverse calculation is deduced to matrix multiplications and simple inversion of singular elements. This procedure is optimized for low complexity execution which makes the RSC4BI algorithm preferable for real-time applications and cost-effective processors. As the theoretical and simulated results show, the complexity of the algorithm is O K 2 because the inversion of the auto-correlation matrix is deduced to the inversion of a single element, meanwhile the matrix inversion generally requires O K 3 operations.
It was also shown, that the reduced complexity degrades the calculation precision, but it does not influence significantly the stability and the convergence. Therefore, the proposed algorithms can be considered as a practical solution suitable for low complexity hardware for real-time signal processing or to reduce the response time by distributing the computational load over time.