Normalized stochastic gradient descent learning of general complex-valued models

The stochastic gradient descent (SGD) method is one of the most prominent first-order iterative optimisation algorithms, enabling linear adaptive filters as well as general nonlinear learning schemes. It is applicable to a wide range of objective functions, while featuring low computational costs for online operation. However, without a suitable step-size normalisation, the convergence and tracking behaviour of the stochastic gradient descent method might be degraded in practical applications. In this letter, a novel general normalisation approach is provided for the learning of (non-)holomorphic models with multiple independent parameter sets. The advantages of the proposed method are demonstrated by means of a specific widely-linear estimation example.

✉ Email: thomas.paireder@jku.at
Introduction: System identification and adaptive noise cancellation are two important tasks in signal processing that require the solution of possibly nonlinear and complex-valued optimisation problems. Many applications put stringent bounds on the computational complexity that is reserved for optimising the model parameters. Especially in dedicated hardware implementations for online estimation, sophisticated algorithms are often not applicable.
The versatile stochastic gradient descent (SGD) approach [11] is a well-known solution in many of those cases. It is related to the gradient-descent optimisation of the mean square error (MSE) objective function, but relies only on instantaneous training examples; both require differentiability of the objective function with respect to the parameters. The gradient obtained from the instantaneous squared error used by SGD is an approximation of the gradient of the MSE objective function. Under certain conditions, the SGD solution converges in the mean to the optimum in the MSE sense. Applied to a linear model, the SGD approach leads, for instance, to the well-known least-mean-squares (LMS) algorithm. The learning rate, or step-size, of the SGD method is crucial for its convergence rate and accuracy. Tuning the step-size is difficult in practical scenarios that require high robustness, since the length of the update steps depends on the dynamics of the involved signals. This issue can be circumvented by the so-called normalisation, where the step-size is chosen with the goal that the instantaneous error decreases in each step [3,4].
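As an illustration, applied to a linear FIR model with the instantaneous squared error, the SGD recursion reduces to the complex-valued LMS algorithm. The following minimal sketch (hypothetical system h_true, fixed step-size mu, noise-free data; all names and values are our own) identifies a complex impulse response:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical system-identification setup: estimate a complex FIR
# response h_true from input/output samples via complex LMS.
M = 8
h_true = rng.standard_normal(M) + 1j * rng.standard_normal(M)

N = 5000
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
w = np.zeros(M, dtype=complex)
mu = 0.01  # fixed (non-normalised) step-size

for n in range(M, N):
    u = x[n - M:n][::-1]          # regressor (most recent sample first)
    d = h_true @ u                # desired (noise-free) response
    e = d - w @ u                 # instantaneous a priori error
    w = w + mu * e * u.conj()     # SGD step along the conjugate gradient

err = np.linalg.norm(w - h_true) / np.linalg.norm(h_true)
```

Note that the fixed step-size must be matched to the input signal power; this is exactly the tuning burden that step-size normalisation removes.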
However, despite the wide coverage of the SGD approach in the literature, to our knowledge only little information is available on the learning of non-holomorphic models with multiple independent parameter sets. Hence, in the following, we cover three important combinations of real- or complex-valued parameters in (non-)holomorphic models and provide a novel general step-size normalisation. The robustness of the proposed method is demonstrated on a specific widely-linear estimation example with two cascaded adaptive sections. This problem arises from the field of self-interference cancellation in radio frequency (RF) transceivers.
Differential calculus of (non-)holomorphic functions: The complex-valued differential calculus is essential for the derivations in this letter. Hence, we start with a brief recapitulation. The definition of the real derivative ∂f(z)/∂z can be directly extended to the complex-valued case [9], provided that the function f satisfies the Cauchy-Riemann equations

∂f_r(z)/∂z_r = ∂f_i(z)/∂z_i,   ∂f_i(z)/∂z_r = −∂f_r(z)/∂z_i, (1)

with f(z) = f_r(z) + j f_i(z), z = z_r + j z_i and j² = −1. Then f is called holomorphic, which is sometimes used interchangeably with the term analytic. If (1) is not fulfilled, the standard complex-valued derivative is not defined. A popular, compatible concept is the Wirtinger (or CR) calculus [9], defined by

∂f(z)/∂z = ½ (∂f(z)/∂z_r − j ∂f(z)/∂z_i), (2)
∂f(z)/∂z* = ½ (∂f(z)/∂z_r + j ∂f(z)/∂z_i), (3)

where z* denotes the complex conjugate. The Wirtinger calculus allows the Cauchy-Riemann equations (1) to be rewritten compactly as ∂f(z)/∂z* = 0. It is important to note that, unlike the standard complex derivative, the terms ∂f(z)/∂z and ∂f(z)/∂z* in (2) and (3) merely represent a formal notation; in general, they shall not be understood as rules to calculate the derivatives. However, many functions of the form f(z, z*) fulfil the analyticity condition by Brandwood [1], i.e. they are holomorphic in z for (formally) constant z* and vice versa. Then the partial derivatives with respect to z and z* can be evaluated directly, with the variable z and its conjugate formally treated as independent. For the rest of the letter, unless otherwise noted, all derivatives are general Wirtinger derivatives.

The Wirtinger calculus also allows a Taylor series to be defined for non-holomorphic functions f(z), where we switch to a vector parameter z for generality. It is motivated by deriving the Taylor series of the corresponding real-valued function of z_r and z_i,

f(z + Δz) = f(z) + (∂f/∂z)ᵀ Δz + (∂f/∂z*)ᵀ Δz* + … (4)

To shorten the notation, we drop the function arguments when they are evident from the context.
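The formal nature of the Wirtinger definitions (2) and (3) can be checked numerically. The sketch below (our own example function) approximates the real partial derivatives of the non-holomorphic f(z) = z z* = |z|² by central differences and recombines them according to the definitions, recovering the analytic results ∂f/∂z = z* and ∂f/∂z* = z:

```python
import numpy as np

# Non-holomorphic example function f(z) = z * conj(z) = |z|^2.
# Brandwood-style evaluation gives df/dz = conj(z) and df/dz* = z.
def f(z):
    return z * np.conj(z)

z = 1.3 - 0.7j
h = 1e-6  # finite-difference step

# Real partial derivatives with respect to z_r and z_i (central
# differences); f is real-valued here, so taking .real is exact.
df_dzr = (f(z + h) - f(z - h)).real / (2 * h)
df_dzi = (f(z + 1j * h) - f(z - 1j * h)).real / (2 * h)

df_dz  = 0.5 * (df_dzr - 1j * df_dzi)  # Wirtinger derivative, definition (2)
df_dzc = 0.5 * (df_dzr + 1j * df_dzi)  # conjugate derivative, definition (3)
```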
Next, we focus on the linear approximation and expand the vector product by means of the identity a + a* = 2ℜ{a} to obtain a close relationship to the Wirtinger derivatives. For a real-valued f, this yields

f(z + Δz) ≈ f(z) + 2ℜ{(∂f/∂z)ᵀ Δz}. (5)

For completeness, we note that a similar rewrite is also possible for the quadratic term in (4), using the general Hessian matrix. In practical examples, f might be a composition of several functions; in this case, calculating the Wirtinger derivatives requires the chain rule.
The corresponding Wirtinger derivative is obtained by recombining the real and imaginary parts into a complex-valued vector and applying a suitable expansion involving the identity matrix I. For a scalar function f(g(z, z*)) with inner function g, the chain rule reads

∂f/∂z* = Σ_k [ (∂f/∂g_k) ∂g_k/∂z* + (∂f/∂g_k*) ∂g_k*/∂z* ]. (8)

The complementary chain rule for ∂f/∂z is obtained by replacing z* with z in (8).

General SGD update and step-size normalisation: The SGD approach used in adaptive learning applications typically utilises an approximated gradient based on the instantaneous system input and response. In Table 1, the gradients for three common cases are compiled, where we use the notation g_p[n] for the stochastic gradient with respect to the parameter vector p of the objective function J. The corresponding learning rule is

p[n+1] = p[n] − μ[n] g_p[n] (10)

with the time-varying step-size μ[n] > 0. In this letter, we focus on integer-order gradients due to their predominant use in adaptive learning. Similar to this classical approach, g_p[n] could also be defined via a fractional derivative of the objective function. For a comprehensive introduction to the underlying concept of differintegration and its application in fractional SGD methods, we refer to [6,8].
For optimum adaptation speed, μ[n] in (10) is chosen in accordance with the dynamics of the input signals. The common idea is to enforce a condition on the instantaneous error: |e[n+1]| < |e[n]| [4]. However, the a posteriori error e[n+1] is not known before the update, but it can be approximated by means of a linear Taylor expansion in p[n], similar to (5).

Joint normalisation of cascaded models: The expansion (11) or its approximation is adequate if a single set of related parameters, such as the coefficients of a finite impulse response (FIR) filter, has to be optimised. However, some estimation problems are best tackled by a cascade of multiple adaptive stages of different types. A prominent example are spline adaptive filters [2], where, for instance, an FIR filter is embedded between two adaptive nonlinear spline sections. Disaggregated structures also occur in widely-linear estimation [7]. In these cases, it is beneficial if the learning rate of each stage can be selected independently.
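For the single-stage linear FIR case, enforcing |e[n+1]| < |e[n]| via the linear expansion leads to the familiar normalised LMS step-size μ[n] = μ/(‖u[n]‖² + ξ). A minimal sketch, with a hypothetical input whose power changes abruptly (all parameter values our own), shows why the normalisation makes the tuning robust:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 8, 3000
h_true = rng.standard_normal(M) + 1j * rng.standard_normal(M)

# Input with strongly varying power: unnormalised LMS would need
# manual retuning, while NLMS rescales the step-size automatically.
scale = np.repeat([0.1, 10.0], N // 2)
x = scale * (rng.standard_normal(N) + 1j * rng.standard_normal(N))

w = np.zeros(M, dtype=complex)
mu, xi = 0.5, 1e-6  # normalised step-size and regularisation

for n in range(M, N):
    u = x[n - M:n][::-1]
    e = h_true @ u - w @ u
    w = w + (mu / (np.real(u @ u.conj()) + xi)) * e * u.conj()

err = np.linalg.norm(w - h_true) / np.linalg.norm(h_true)
```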
The linearity of this expression allows the step-size normalisation for p and q to be calculated individually, so that the results in Table 1 can be reused. The total normalisation is then a weighted combination of b_μ,p[n] and b_μ,q[n], with weighting factor α and overall step-size 0 < μ < 1.
This result is easily extended to a function of L parameter vectors, where the step-size for the adaptation of p_l is μ_l[n] = α_l μ[n].
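The cascaded normalisation can be illustrated on a simple widely-linear toy model with two parameter sets p and q. The combination rule below, a convex sum of the per-stage regressor energies weighted by α and 1 − α, is our own plausible reading of the weighted combination described above, not necessarily the letter's exact formulation; all signals and values are hypothetical:

```python
import numpy as np

# Toy widely-linear model d[n] = p^T u[n] + q * conj(x[n]) with two
# independently weighted parameter sets p (FIR) and q (conjugate-path
# gain). The joint normalisation uses one common denominator b.
rng = np.random.default_rng(2)
M, N = 8, 4000
p_true = rng.standard_normal(M) + 1j * rng.standard_normal(M)
q_true = 0.28 + 0.15j

x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
p = np.zeros(M, dtype=complex)
q = 0.0 + 0.0j
mu, alpha, xi = 0.5, 0.7, 1e-6

for n in range(M, N):
    u = x[n - M:n][::-1]
    d = p_true @ u + q_true * np.conj(x[n])
    e = d - (p @ u + q * np.conj(x[n]))
    # stage-wise normalisation terms: squared regressor norms
    b_p = np.real(u @ u.conj())
    b_q = np.abs(x[n]) ** 2
    b = alpha * b_p + (1 - alpha) * b_q + xi  # joint normalisation
    p = p + alpha * (mu / b) * e * u.conj()
    q = q + (1 - alpha) * (mu / b) * e * x[n]
```

Both stages share one normalised step-size budget, so neither can destabilise the other when the input power changes.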
Widely-linear estimation example: We conclude our investigations with an interference cancellation example from cellular communications [5], where the inference function f is non-holomorphic.
We omitted the factor 2 in the normalisation, since ∇_w* f_wl is small for small v[n] and ∂f_wl/∂v* = 0. Following the proposed joint normalisation (17), the combined step-size is obtained. The update (22), the normalisation (24) and the step-size (26) can be simplified for small v[n], which allows the corresponding terms to be dropped.

The advantage of the joint normalisation is obvious when comparing the adaptation performance of the standard and the proposed algorithm by means of the normalised mean square error (NMSE), where the expectations are approximated by ensemble averages over multiple simulation runs. The sequence x[n] is coloured noise with correlation factor a = 0.8, driven by the WGN ν[n]. The simulation is performed on 6 different complex-valued impulse responses h_main, which are based on measurements of the relevant circuit components in an RF transceiver. The filter length M is 16, h_img is set to 0.28 + j 0.15, and the signal-to-noise ratio (SNR) of y_intf with respect to η[n] is 15 dB.

The stability issues of the standard normalisation become apparent when tuning the algorithm for maximum adaptation speed, for instance by choosing μ_w = 0.5, μ_v = 0.35 and ξ = 10⁻⁶. The parameters in case of the proposed joint normalisation, μ = 0.5 and α = 0.7, are selected in a similar range; note that μα = 0.35 matches μ_v as chosen in the standard case. Figure 1 shows the resulting evolution of the estimation error for both approaches. The variant using the separate normalisation diverges for this choice of parameters, while the proposed method adapts in about 80 samples and turns out to be substantially more robust for high step-sizes. Further simulations demonstrated that this advantage is maintained even if μ[n] is chosen close to the limit of 1. This is a clear advantage in applications where a reconfiguration of the parameters for individual scenarios is not possible, so that a static setting has to be found which performs well in all cases.
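The coloured input can be reproduced, for instance, with a first-order autoregressive process; the letter does not state the exact colouring filter, so the AR(1) form below is an assumption consistent with a single correlation factor a = 0.8:

```python
import numpy as np

# Assumed colouring model: x[n] = a * x[n-1] + nu[n] with correlation
# factor a = 0.8 and complex white Gaussian noise nu[n].
rng = np.random.default_rng(3)
a, N = 0.8, 20000
nu = rng.standard_normal(N) + 1j * rng.standard_normal(N)
x = np.zeros(N, dtype=complex)
for n in range(1, N):
    x[n] = a * x[n - 1] + nu[n]

# For an AR(1) process the normalised lag-1 autocorrelation equals a,
# which the empirical estimate should reproduce.
r0 = np.mean(np.abs(x) ** 2)
r1 = np.mean(x[1:] * np.conj(x[:-1]))
rho = np.real(r1) / r0
```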
Conclusion: Starting from the basics of complex-valued derivatives, we applied the SGD approach to general complex-valued learning schemes. Based on this analysis, we compiled ready-to-use formulas for the gradients and a novel general step-size normalisation scheme, covering all important combinations of real- or complex-valued parameters and models. Additionally, we proposed an extended normalisation scheme for cascaded adaptive structures and validated its robustness in a specific widely-linear estimation problem.