Complex-Valued Adaptive Signal Processing Using Nonlinear Functions

We describe a framework based on Wirtinger calculus for adaptive signal processing that enables e ﬃ cient derivation of algorithms by directly working in the complex domain and taking full advantage of the power of complex-domain nonlinear processing. We establish the basic relationships for optimization in the complex domain and the real-domain equivalences for ﬁrst-and second-order derivatives by extending the work of Brandwood and van den Bos. Examples in the derivation of ﬁrst-and second-order update rules are given to demonstrate the versatility of the approach.


INTRODUCTION
Most of today's challenging signal processing applications require techniques that are nonlinear, adaptive, and with online processing capability. Also, there is need for approaches to process complex-valued data as such data arises in a good number of scenarios, for example, when processing radar and magnetic resonance data as well as communications data and when working in a transform domain such as frequency. Even though complex signals play such an important role, many engineering shortcuts have typically been taken in their treatment preventing full utilization of the power of complex domain processing as well as the information in the real and imaginary parts of the signal.
The main difficulty arises due to the fact that in the complex domain, analyticity, that is, differentiability in a given open set, as described by the Cauchy-Riemann equations [1] imposes a strong structure on the function itself. Thus the analyticity condition is not satisfied for many functions of practical interest, most notably for the cost (objective) functions used as these are typically real valued and hence nonanalytic in the complex domain. Definition of pseudogradients are used-and still not through a consistent definition in the literature-and when having to deal with vector gradients, transformations C N → R 2N are commonly used. These transformations are isomorphic and allow the use of real-valued calculus in the computations, which includes well-defined gradient and Hessians that can be at the end transformed back to the complex domain. The approach facilitates the computations but increases the dimensionality of the problem and might not be practical for functions that are nonlinear since in this case, the functional form might not be easily separable to real and imaginary parts.
Another issue that arises in the nonlinear processing of complex-valued data is due to the conflict between the boundedness and differentiability of complex functions. This result is stated by Liouville's theorem as: a bounded entire function must be a constant in the complex domain [1]. Hence, to use a flexible nonlinear model such as the nonlinear regression model, one cannot identify a complex nonlinear function (C → C) that is bounded everywhere on the entire complex domain. A practical solution to satisfy the boundedness requirement has been to process the real and imaginary parts (or the magnitude and phase) separately through bounded real-valued nonlinearities (see, e.g., [2][3][4][5][6]). The solution provides reasonable approximation ability but is an ad hoc solution not fully exploiting the efficiency of complex representations, both in terms of parameterization (number of parameters to estimate) and in terms of learning algorithms to estimate the parameters as we cannot define true gradients when working with these functions.
In this paper, we define a framework that allows taking full advantage of the power of complex-valued processing, in particular when working with nonlinear functions, and eliminates the need for either of the two common engineering practices we mentioned. The framework we develop 2 EURASIP Journal on Advances in Signal Processing is based on Wirtinger calculus [7] and extends the work of Brandwood [8] and van den Bos [9] to define the basic formulations for derivation of algorithms and their analyses in the complex domain. We show how the framework also naturally admits the use of nonlinear functions that are analytic rather than the pseudocomplex nonlinear functions defined using real-valued nonlinearities. Analytic complex nonlinear functions have been shown to provide efficient representations in the complex plane [10,11] and to be universal approximators when used as activation functions in a singlelayer multilayer perceptron (MLP) network [12].
The work by Brandwood [8] and van den Bos [9] emphasize the importance of working with complex-valued gradient and Hessian operators rather than transforming the problem to the real domain. Both contributions, though not acknowledged in either of the papers, make use of Wirtinger calculus [7] that provides an elegant way to bypass the limitation imposed by the strict definition of differentiability in the complex domain. Wirtinger calculus relaxes the traditional definition of differentiability in the complex domain-which we refer to as complex differentiability-by defining a form that is much easier to satisfy and includes almost all functions of practical interest, including functions that are C N → R. The attractiveness of the formulation stems from the fact that though the derivatives defined within the framework do not satisfy the Cauchy-Riemann conditions, they obey all the rules of calculus, including the chain rule, differentiation of products and quotients. Thus all computations in the derivation of an algorithm can be carried out as in the real case. We provide the connections between the gradient and Hessian formulations given in [9] described in C 2N and R 2N to the complex C N -dimensional space, and establish the basic relationships for optimization in the complex domain including first-and second-order Taylor-series expansions.
Three specific examples are given to demonstrate the application of the framework to complex-valued adaptive signal processing, and to show how they enable the use of the true processing power of the complex domain. The examples include a multilayer perceptron filter design and the derivation of the gradient update (backpropagation) rule, independent component analysis using maximum likelihood, and the derivation of an efficient second-order learning rule, the conjugate gradient algorithm for the complex domain.
Next section introduces the main tool, Wirtinger calculus for optimization in the complex domain and the key results given in [8,9], which we use to establish the main theory presented in Section 3. In Section 3, we consider both vector and matrix optimization and establish the equivalences for first-and second-order derivatives for the real and complex case, and provide the fundamental results for C N and C N×M . Section 4 presents the application examples and Section 5 gives a short discussion.

COMPUTATION OF GRADIENTS IN THE COMPLEX DOMAIN USING WIRTINGER CALCULUS
The fundamental result for the differentiability of a complexvalued function where z = x + j y, is given by the Cauchy-Riemann equations [1]: which summarize the conditions for the derivative to assume the same value regardless of the direction of approach when Δz → 0.These conditions, when considered carefully, make it clear that the definition of complex differentiability is quite stringent and imposes a strong structure on u(x, y) and v(x, y), the real and imaginary parts of the function, and consequently on f (z). Also, obviously most cost (objective) functions do not satisfy the Cauchy-Riemann equations as these functions are typically f : C → R and thus have v(x, y) = 0. An elegant approach due to Wirtinger [7] relaxes this strong requirement for differentiability, and defines a less stringent form for the complex domain. More importantly, it describes how this new definition can be used for defining complex differential operators that allow computation of derivatives in a very straightforward manner in the complex domain, by simply using real differentiation results and procedures.
In the development, the commonly used definition of differentiability that leads to the Cauchy-Riemann equations is identified as complex differentiability and functions that satisfy the condition on a specified open set as complex analytic (or complex holomorphic). The more flexible form of differentiability is identified as real differentiability, and a function is called real differentiable when u(x, y) and v(x, y) are differentiable as functions of real-valued variables x and y. Then, one can write the two real-variables as x = (z+z * )/2 and y = −j(z − z * )/2, and use the chain rule to derive the operators for differentiation given in the theorem below.
The key point in the derivation is regarding the two variables z and z * as independent from each other, which is also the main trick that allows us to make use of the elegance of Wirtinger calculus. Hence, we consider a given function f : C → C as f : R × R → C by writing it as f (z) = f (x, y), and make use of the underlying R 2 structure. The main result in this context is stated by Brandwood as follows [8].

function of real variables
x and y such that g(z, z * ) = f (x, y), where z = x + j y and that g is analytic with respect to z * and z independently. Then, can be computed by treating z * as a constant in g and z as a constant, respectively; (ii) a necessary and sufficient condition for f to have a stationary point is that ∂g/∂z = 0. Similarly, ∂g/∂z * = 0 is also a necessary and sufficient condition.
Therefore, when evaluating the gradient, we can directly compute the derivatives with respect to the complex argument, rather than calculating individual real-valued gradients as typically performed in the literature (see, e.g., [2,6,12,13]). The requirement for the analyticity of g(z, z * ) with respect to z and z * is independently equivalent to the condition on real differentiability of f (x, y) since we can move from one form of the function to the other using the simple linear transformation given above [1,14]. When f (z) is complex analytic, that is, when the Cauchy-Riemann conditions hold, g(·) becomes a function of only z, and the two derivatives, the one given in the theorem and the traditional one coincide.
The case we are typically interested in the development of signal processing algorithms is given by f : R × R → R and is a special case of the result stated in the theorem. Hence we can employ the same procedure-taking derivatives independently with respect to z and z * , in the optimization of a real-valued function as well. In the rest of the paper, we consider such functions as these are the costs used in machine learning, though we identify the deviation, if any, from the general f : R × R → C case for completeness.
As a simple example, consider the function g(z, z * ) = zz * = |z| 2 which we can also evaluate as ∂g/∂z * = z, that is, by treating z as a constant in g when calculating the partial derivative.
The complex gradient defined by Brandwood [8] has been extended by van den Bos to define a complex gradient and Hessian in C 2N by defining a mapping Note that the mapping allows a direct extension of Wirtinger's result to the multidimensional space through N mappings of the form (z R,k , z I,k ) → (z k , z * k ), where z = z R + jz I , so that one can make use of Wirtinger derivatives. Since the transformation from R 2 to C 2 is a simple linear invertible mapping, one can work in either space, depending on the convenience offered by each. In [9], it is shown that such a transformation allows the definition of a Hessian, hence of a Taylor series expansion very similar to the one in the real case, and the Hessian matrix H defined in this manner is naturally linked to the complex C N×N Hessian G in that if λ is an eigenvalue of G, then 2λ is the corresponding eigenvalue of H. The result implies that the positivity of the eigenvalues as well as the conditioning of the Hessian matrices are shared properties of the two matrices, that is, of the two representations. For example, in [15], this property has been utilized to derive the local stability conditions of the complex-valued maximization of negentropy algorithm for performing independent component analysis. In the next section, we establish the connections of the results of [9] to C N for first-and second-order derivatives such that efficient second-order optimization algorithms can be derived by di-rectly working in the original C N space where the problems are typically defined.

Vector case
We define ·, · as the scalar inner product between two matrices W and V as We define the gradient vector in order to write the first-order Taylor series expansion for a function g(z, z * ) : where the last equality follows because g(·, ·) is real valued.
Using the Cauchy-Schwarz-Bunyakovski inequality [16], it is straightforward to show that the first-order change in g(·, ·) will be maximized when Δz and the gradient ∇ z * g are collinear. Hence, it is the gradient with respect to the conjugate of the variable, ∇ z * g, that defines the direction of the maximum rate of change in g(·, ·) with respect to z, not ∇ z g as sometimes noted in the literature. Thus the gradient optimization of g(·, ·) should use the update as this form leads to a nonpositive increment given by Δg = −2μ ∇ z * g 2 , while the update using Δz = −μ∇ z g results in updates Δg = −2μRe{ ∇ z * g, ∇ z g }, which are not guaranteed to be nonpositive. Based on (6), similar to a scalar function of two real vectors, the second-order Taylor series expansion of g(z, z * ) can be written as [17] Next, we derive the same complex gradient update rule using another approach, which provides the connection between the real and complex domains. We first introduce the following fundamental mappings that are similar in nature to those introduced in [9].
where U is defined by z where U 2N×2N = diag{J, J, . . . , J} that satisfies (U ) −1 = (1/ 2)(U ) H [9]. Next, we can find a permutation matrix P such that where U Δ = PU that satisfies U −1 = (1/2)U H since P −1 = P T . Using the Wirtinger derivatives in (3), we obtain ∂g ∂z which establishes the first-order connection between the complex gradient and the real gradient. By applying the two derivatives (3) recursively to obtain the second-order derivative of g, we obtain Equality 1 is already proved in [18]. Equality 2 is obtained by simply rearranging the entries in ∂ 2 g/∂ z * ∂ z T to form ∂ 2 g/∂ z * ∂ z T . Therefore, the second-order Taylor expansion given in (8) can be rewritten as which demonstrates that the C 2N×2N Hessian in (15) can be decomposed into three C N×N Hessians in (8).
The mappings given in Proposition 1 are similar to those defined in [9]. However, the mappings given in [9] include redundancy since they operate in C 2N and the dimension cannot be further reduced. This is not convenient since cost function g(z) is normally defined in C N and the C 2N mapping as described by z cannot be always easily applied to define g( z), as observed in [18].
In the following two propositions, we show how to use the same mappings we defined above to obtain first-and second-order derivatives, and hence algorithms, in C N in an efficient manner.

Proposition 2.
Given functions g and f defined as in Proposition 1, one has the complex gradient update rule which is equivalent to the real gradient update rule where z and w are as defined in Proposition 1 as well.
Proof. Assuming f is known, the gradient update rule in the real domain is Mapping back into complex domain, we obtain The dimension of the update rule can be further decreased as Proposition 3. Given functions g and f defined as in Proposition 1, one has the complex Newton update rule which is equivalent to the real Newton update rule where Proof. The pure Newton method in the real domain takes the form given in (22). Using the equalities given in Proposition 1, it can be easily shown that the Newton update in (22) is equivalent to Using the definitions for H 1 and H 2 given in (23), we can rewrite (24) as If where and H − * 2 denotes (H * 2 ) −1 . Since ∂ 2 g/∂z * ∂z T is Hermitian, we finally obtain the complex Newton rule as The expression for Δz * is the conjugate of (28).

Matrix case
The extension from the vector gradient to matrix gradient is straightforward. For a real-differentiable g(W, W * ) : C N×N × C N×N → R, we can write the first-order expansion as where ∂g/∂W is an N × N matrix whose (i, j)th entry is the partial derivative of g with respect to w i j . By arranging the matrix gradient into a vector and by using the Cauchy-Schwarz-Bunyakovski inequality [16], it is easy to show that the matrix gradient ∂g/∂W * defines the direction of the maximum rate of change in g with respect to W. For local stability analysis, Taylor expansions up to the second order is also frequently needed. Since the first-order matrix gradient takes a matrix form already, here we only provide the second-order expansion with respect to every entry of matrix W. From (8), we obtain We can use the first-order Taylor series expansion to derive the relative gradient [19] update rule for the complex case, which is usually directly extended to the complex case without a derivation [5,13,20]. To write the relative gradient rule, we consider an update of the parameter matrix W in the invariant form (ΔW)W [19]. We then write the firstorder Taylor series expansion for the perturbation (ΔW)W as to determine the quantity that maximizes the rate of change in the function. The complex relative gradient of g at W is then written as (∂g/∂W * )W H to write the relative gradient update term as Upon substitution of ΔW into (29), we observe that Δg = −2μ (∂g/∂W * )W H 2 Fro is a nonpositive quantity, thus a proper update term. The relative gradient can be regarded as a special case of natural gradient [21] in the matrix space, but provides the additional advantage that it can be easily extended to nonsquare matrices. In Section 4.2, we show how the relative gradient update rule for independent component analysis based on maximum likelihood can be derived in a very straightforward manner in the complex domain using (32) and Wirtinger calculus.

APPLICATION EXAMPLES
We demonstrate the application of the optimization framework introduced in Section 3 by three examples. The first two examples demonstrate the derivation of the update rules for complex-valued nonlinear signal processing. In the third example, we show how the relationship for Newton updates given by Proposition 3 can be utilized to derive efficient update rules such as the conjugate gradient algorithm for the complex domain.

Fully complex MLP for nonlinear adaptive filtering
The multilayer perceptron filter-or network-provides a good example case for the difficulties that arise in complexvalued processing as discussed in the introduction. These are due to the selection of activation functions for use in the filter structure and the optimization procedure for deriving the weight update rule.
The first issue is due to the conflict between the boundedness and differentiability of functions in the complex domain. This result is stated by Liouville's theorem as: a bounded entire function must be a constant in the complex domain [1], where entire refers to differentiability everywhere. For example the sigmoid nonlinearity, which has been the most typically used activation function for real-valued MLPs, 6 EURASIP Journal on Advances in Signal Processing has periodic singular points. Since boundedness is deemed as important for the stability of algorithms, a practical solution when designing MLPs for the complex domain has been to define nonlinear functions that process the real and imaginary parts separately through bounded real-valued nonlinearities as in [2] f for acomplex variable z = x + j y using functions f : R → R. Another approach has been to define joint-nonlinear complex activation functions as in [3,4], respectively, As shown in [10], these functions cannot utilize the phase information effectively, and in applications that introduce significant phase distortion such as equalization of saturatingtype channels, are not effective as complex domain nonlinear filters. The second issue that arises when designing MLPs in the complex domain has to do with the optimization of the chosen cost function to derive the parameter update rule. As an example, consider the most commonly used MLP structure with a single hidden layer as shown in Figure 1. If the cost function is chosen as the squared error at the output, we have where y k = h( n w kn x n ) and x n = g( m v nm z m ). Note that if both activation functions h(·) and g(·) satisfy the property [ f (z)] * = f (z * ), then the cost function assumes the form J(V, W) = G(z)G(z * ) making it clear how practical the derivation of the update rule will be using Wirtinger calculus, since then we treat the two variables z and z * as independent in the computation of the derivatives. On the other hand, when any of the activation functions given in (33) and (34) are used, it is clear that the evaluation of the gradients will have to be performed through separate real and imaginary part evaluations as traditionally done, which can easily get quite cumbersome [2,10].
Any function f (z) that is analytic for |z| < R with a Taylor series expansion with all real coefficients in |z| < R satisfies the property [ f (z)] * = f (z * ). Examples of such functions include polynomials and most trigonometric functions and their hyperbolic counterparts. In particular, all the elementary transcendental functions proposed in [12] satisfy the property and can be used as effective activation functions. These functions, though unbounded, provide significant performance advantages in challenging signal processing problems such as equalization of highly nonlinear channels [10] in terms of superior convergence characteristics and better generalization abilities through the efficient representation of the underlying problem structure. The nonsingularities do not pose any practical problems in the implementation, except that some care is required in the selection of their parameters when training these networks. Motivated by these examples, a fundamental result for complex nonlinear approximation is given in [12], where the result on the approximation ability of the multilayer perceptron is extended to the complex domain by classifying nonlinear functions based on their singularities. To establish the universal approximation property in the complex domain, a number of elementary transcendental functions are first classified according to the nature of their nonsingularity as those with removable, isolated, and essential singularities. Based on this classification, three types of approximation theorems are given. The approximation theorems for the first two classes of functions are very general and resemble the universal approximation theorem for the real-valued feedforward multilayer perceptron that was shown almost concurrently by multiple authors in 1989 [22][23][24]. The third approximation theorem for the complex multilayer perceptron is unique and related to the power series approximation that can represent any complex number arbitrarily closely in the deleted neighborhood of a singularity. This approximation is uniform only in the analytic domain of convergence whose radius is defined by the closest singularity.
For the MLP filter shown in Figure 1, where y k is the output and z m the input, when the activations functions g(·) and h(·) are chosen as functions that are C → C as in [11,12], we can directly write the backpropagation update equations using Wirtinger derivatives.
For the output units, we have ∂y k /∂w * kn = 0, therefore We define δ k = −(d k − y k )h ( n w * kn x * n ) so that we can write ∂J/∂w * kn = δ k x * n . For the hidden layer or input layer, first we observe the fact that v nm is connected to x n for all m. Again, we have ∂y k /∂v * nm = 0, ∂x n /∂v * nm = 0. Using the chain rule once again, we obtain Thus, (36) and (37) define the gradient updates for computing the hidden and the output layer coefficients, w kn and v nm , through backpropagation. Note that the derivations in this case are very similar to the real-valued case as opposed to what is shown in [2,10] where separate evaluations with respect to the real and imaginary parts are carried out.

Complex maximum likelihood approach to independent component analysis
Independent component analysis (ICA) for separating complex-valued signals is needed in a number of applications such as medical image analysis, radar, and communications. In ICA, the observed data are typically expressed as a linear combination of independent latent variables such that x = As where s = [s 1 , s 2 , . . . , s N ] T is the vector of sources, x = [x 1 , x 2 , . . . , x N ] T is the vector of observed random variables, and A is the mixing matrix. We consider the simple case where the number of independent variables is the same as the number of observed mixtures. The main task of the ICA problem is to estimate a separating matrix W that yields the independent components through s = Wx. Nonlinear ICA approaches such as the maximum likelihood provide practical and efficient solutions to the problem. When deriving the update rule in the complex domain, however, the optimization is not straightforward and can easily become cumbersome [13,25]. To alleviate the problem, the relative gradient framework of [19] has been used along with isomorphic transformations C N → R 2N to derive the update equations in [25]. As we show next, Wirtinger calculus allows a much more straightforward derivation procedure, and in addition, provides a convenient formulation for working with probabilistic descriptions such as the probability density function (pdf) in the complex domain. We define the pdf of a complex random variable X = X R + jX I as p X (x) ≡ p XRXI (x R , x I ) and the expectation of g(X) is given by E{g(X)} = g(x R + jx I )p X (x)dx R dx I for any measurable function g : C → C. The traditional ICA problem determines a weight matrix W such that y = Wx approximates the source s subject to the permutation and scaling ambiguity. To write the density transformation, we consider the mapping C→ R 2N such that y = Wx = s, where Given T independent samples x(t), we write the loglikelihood function as [26] l (y, W) = log det (W) where p k is the density function for kth source. Maximization of l is equivalent to minimization of l where l = −l . Simple algebraic and differential calculus yields where ψ(y) is a 2N × 1 column vector with components We write log p s (y R , y I ) = log p s (y, y * ) and using Wirtinger calculus, it is straightforward to show ψ T (y)d y = ψ T y, y * dy + ψ H y, y * dy * , where ψ(y, y * ) is an N ×1 column vector with complex components Defining a 2N × 2N matrix P = (1/2) I jI jI I , we obtain Therefore, we can write (39) as dl = −tr dWW −1 − tr dW * W − * + ψ T y, y * dy + ψ H y, y * dy * .
Using y = Wx and defining dZ = (dW)W −1 , we obtain dy = (dW)x = dW W −1 y = dZy, By treating W as a constant matrix, the differential matrix dZ has components dz i j that are linear combinations of dw i j 8 EURASIP Journal on Advances in Signal Processing and is a nonintegrable differential form. However, this transformation greatly simplifies the expression for the Taylor series expansion without changing the function value. It also provides an elegant approach for the derivation of the natural gradient update for maximum likelihood ICA [26]. Using this transformation, we can write (44) as dl = −tr(dZ) − tr dZ * + ψ T y, y * dZy Therefore, the gradient update rule for Z is given by which is equivalent to by using dZ = (dW)W −1 .
Thus the complex score function is defined as ψ * (y, y * ), as in [27], which takes a form very similar to the real case [26], but with the difference that in the complex case the entries in the score function are defined using Wirtinger derivatives.

Complex conjugate gradient (CG) algorithm
The equivalence condition given by Proposition 3 allows for easy derivation of second-order efficient update schemes as we demonstrate next. As shown in Proposition 3, for a real differentiable function g(z, z * ) : C N × C N → R and f : R 2N → R such that g(z, z * ) = f (w), the update for the Newton method in R 2N is given by and is equivalent to in C N . To achieve convergence, we require that the search direction Δw is a descent direction when minimizing a cost function, which is the case if the Hessian ∂ 2 f /∂w∂w T is positive definite. However, if the Hessian is not positive definite, Δw may be an ascent direction. The line search Newton-CG method is one of the strategies for ensuring that the update is of good quality. In this strategy, we solve (49) using the CG method, terminating the updates if Δw T (∂ 2 f /∂w∂w T )Δw ≤ 0. When we do not have the definition of function f but only have the knowledge of g, we can obtain the complex conjugate gradient method with straightforward algebraic manipulations of the real CG algorithm (e.g., given in [28]) by using the three equalities given in (12), (13), and (14). We let s = ∂g/∂z * to write the complex CG method as shown in Algorithm 1, and thecomplex line search Newton-CG algorithm is given in Algorithm 2.
The complex Wolfe condition [28] can be easily obtained from the real Wolfe condition using a procedure similar to Given some initial gradient s 0 ; Set x 0 = 0, p 0 = −s 0 , k = 0; while |s k | / = 0 the one followed in Proposition 3. It should be noted that the complex conjugate gradient algorithm is a linear version such that the solution of a linear equation is considered. The procedure given in [28] can be used to obtain the version for a given nonlinear function.

DISCUSSION
We describe a framework for complex-valued adaptive signal processing based on Wirtinger calculus for the efficient computation of algorithms and their analyses. By enabling to work directly in the complex domain without the need to increase the problem dimensionality, the framework facilitates the derivation of update rules and makes efficient secondorder update procedures such as the conjugate-gradient rule readily available for complex optimization. The examples we have provided demonstrate the simplicity offered by the approach in the derivation of both componentwise update rules as in the case of the backpropagation algorithm for the MLP and direct matrix updates for estimating the demixing matrix as in the case of independent component analysis using maximum likelihood. The framework can also be used to perform the analysis of nonlinear adaptive algorithms such as ICA using the relative gradient update given in (48) as shown in [29] in the derivation of local stability conditions.

ACKNOWLEDGMENT
This work is supported by the National Science Foundation through Grants NSF-CCF 0635129 and NSF-IIS 0612076.