Robust and Sparse M-Estimation of DOA

—A robust and sparse Direction of Arrival (DOA) estimator is derived for array data that follow a Complex Elliptically Symmetric (CES) distribution with zero mean and finite second-order moments. The derivation allows the loss function to be chosen freely, and four loss functions are discussed in detail: the Gauss loss, which is the Maximum Likelihood (ML) loss for the circularly symmetric complex Gaussian distribution; the ML loss for the complex multivariate t-distribution (MVT) with ν degrees of freedom; and the Huber and Tyler loss functions. For Gauss loss, the method reduces to Sparse Bayesian Learning (SBL). The root mean square DOA error of the derived estimators is discussed for Gaussian, MVT, and ε-contaminated data. The robust SBL estimators perform well in all cases and nearly identically to classical SBL for Gaussian noise.


I. INTRODUCTION
Heavy-tailed sensor array data arise, e.g., due to clutter in radar [3] and interference in wireless links [4]. Such array data demand statistically robust array processing. There is a rich literature on statistical robustness [5]–[8]. Here we derive, formulate, and investigate a statistically robust approach to Sparse Bayesian Learning (SBL) based on a model for Complex Elliptically Symmetric (CES) array data.
Due to the central limit theorem, noise is most often modeled as Gaussian, and SBL was derived under a joint complex multivariate Gaussian assumption on source signals and noise [9]. Direction of arrival (DOA) estimation for plane waves using SBL is proposed in Ref. [10, Table I]. In some cases the noise contains non-Gaussian outliers; it is therefore important to have robust processing so that these unlikely events also give good estimates.
SBL provides DOA estimates based on the sample covariance matrix of the array data. The sample covariance matrix is a sufficient statistic under the jointly Gaussian assumption, but it is not robust against deviations from this assumption [11]. The SBL approach is flexible through the usage of various priors, although Gaussian priors are most common [12]. For Gaussian priors this has been approached based on minimization-majorization [13] and with expectation maximization (EM) [14]–[19]. We estimate the hyperparameters iteratively from the likelihood derivatives using stochastic maximum likelihood [20]–[22]. A numerically efficient SBL implementation is available on GitHub [23]. Recent investigations showed that SBL is lacking in statistical robustness [1], [11], [24]. A Bayes-optimal algorithm was proposed to estimate DOAs in the presence of impulsive noise from the perspective of SBL in [25]. In the following, we derive robust and sparse Bayesian learning, which can be understood as introducing a data-dependent weighting into the sample covariance matrix estimate.
A preliminary version of the proposed DOA M-estimation algorithm and obtained results were published in [1]. Code implementing the proposed DOA M-estimation is available on GitHub [2].
Previously, a direction of arrival (DOA) estimator for plane waves observed by a sensor array was studied based on a complex multivariate Student t-distribution array data model. A qualitatively robust and sparse DOA estimate was derived as a Maximum Likelihood (ML) estimate based on this model [8, Sec. 5.4.2], [11], [24].
Here, we solve the DOA estimation problem from multiple array data snapshots in the SBL framework [12], [14] and use the maximum-a-posteriori (MAP) estimate for DOA reconstruction. We assume a CES array data model with unknown source variances for the formulation of the likelihood function. To determine the unknown parameters, we maximize a Type-II likelihood (evidence) for this CES array data model and estimate the hyperparameters iteratively from the likelihood derivatives using stochastic maximum likelihood.
We propose an SBL algorithm for DOA M-estimation which, given the number of sources, automatically estimates the set of DOAs corresponding to non-zero source power from all potential DOAs. Posing the problem this way, the number of estimated parameters is independent of the number of snapshots, while the accuracy improves with the number of snapshots.
We accommodate array data with potentially strong outliers by allowing various loss functions in the formulation of M-estimators. This leads to a robust and sparse DOA estimator which is based on the assumption that the array data observations follow a centered (zero-mean) CES distribution with finite second-order moments.
The outline of the paper is as follows: We introduce the notation and the array data model in Sec. II. Thereafter, we formulate the objective function and the specific loss functions used for the DOA M-estimator in Sec. III and describe the proposed algorithm. Simulation results for DOA estimation are discussed in Sec. IV, and we report on the convergence of the algorithm and the associated run time in Sec. V.
II. COMPLEX ELLIPTICALLY SYMMETRIC ARRAY DATA MODEL

Narrowband waves are observed on N sensors for L snapshots y ℓ and the array data is Y = [y 1 . . . y L ] ∈ C N×L . We model the snapshots y ℓ by a scale mixture of Gaussian distributions, which have a stochastic decomposition of the form

y ℓ = √τ ℓ v ℓ , with v ℓ = A x ℓ + n ℓ ,   (1)

where τ ℓ > 0 is a random variable independent of v ℓ . The unknown zero-mean complex source amplitudes are the elements of X = [x 1 . . . x L ] ∈ C M×L where M is the considered number of hypothetical DOAs on the given grid {θ 1 , . . . , θ M }. The source amplitudes are independent across sources and snapshots, i.e. x ml and x m ′ l ′ are independent for (m, l) ≠ (m ′ , l ′ ). If K sources are present in the ℓth array data snapshot, the ℓth column of X is K-sparse and we assume that the sparsity pattern is the same for all snapshots. The sparsity pattern is modeled by the active set

M = {m ∈ {1, . . . , M } : x ml ≠ 0}.

The noise N = [n 1 . . . n L ] ∈ C N ×L is assumed independent identically distributed (iid) across sensors and snapshots, zero-mean, with finite variance σ 2 for all n, ℓ.
The M columns of the dictionary A = [a 1 . . . a M ] ∈ C N ×M are the replica vectors for all hypothetical DOAs. For a uniform linear array (ULA), the dictionary matrix elements are A nm = e −j(n−1)(2πd/λ) sin θ m (d is the element spacing and λ the wavelength). The K "active" replica vectors are aggregated in A M = [a m 1 . . . a m K ] ∈ C N ×K with kth column vector a m k , where m k ∈ M. The source and noise amplitudes are jointly Gaussian and independent of each other, i.e. x ℓ ∼ CN M (0, Γ) and n ℓ ∼ CN N (0, σ 2 I N ). It follows from (1) that v ℓ ∼ CN N (0, Σ) with

Σ = A Γ A H + σ 2 I N , Γ = diag(γ),   (4)

where γ = [γ 1 . . . γ M ] T is the K-sparse vector of unknown source powers. The matrix Σ is interpretable as the covariance matrix cov(v ℓ ) of the Gaussian component v ℓ , but v ℓ is not observable in this model and the sensor array only observes the scale mixture y ℓ . The matrix Σ is called the scatter matrix in the following.
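The dictionary construction above is easy to sketch in code. The following Python snippet (an illustrative sketch, not the paper's reference implementation [2]; the names `ula_dictionary` and `d_over_lambda` are chosen here) builds A for a ULA from the element definition A nm = e −j(n−1)(2πd/λ) sin θ m :

```python
import numpy as np

def ula_dictionary(N, thetas_deg, d_over_lambda=0.5):
    """Dictionary A of ULA replica (steering) vectors:
    A[n, m] = exp(-1j * n * 2*pi*(d/lambda) * sin(theta_m)), n = 0, ..., N-1."""
    n = np.arange(N)[:, None]                         # sensor index (n - 1 in the text)
    theta = np.deg2rad(np.asarray(thetas_deg))[None, :]
    return np.exp(-1j * 2 * np.pi * d_over_lambda * n * np.sin(theta))

# M-point DOA grid on [-90, 90] degrees for a half-wavelength-spaced ULA
M = 181
grid = np.linspace(-90.0, 90.0, M)
A = ula_dictionary(20, grid)                          # shape (N, M) = (20, 181)
```

Every column has unit-modulus entries, and the replica vector for broadside (θ = 0°) is the all-ones vector.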
Since y ℓ |τ ℓ ∼ CN N (0, τ ℓ Σ), the density of y ℓ is

p(y ℓ ) = π −N det(Σ) −1 g(y H ℓ Σ −1 y ℓ ),   (7)

where the so-called density generator g(·) is evaluated by averaging over the random scaling,

g(t) = E τ [ τ −N e −t/τ ].

The form of (7) shows that the distribution of y ℓ is CES with mean 0 [26]–[28]. If the random scaling √τ ℓ = 1 in (1) for all ℓ, then the commonly assumed Gaussian array data model

y ℓ = A x ℓ + n ℓ   (9)

is recovered, which relates the array data snapshot y ℓ to the source amplitude x ℓ by a linear regression model. This model results in Gaussian array data, y ℓ ∼ CN N (0, Σ). We will not use (9) for the derivation of DOA M-estimation using SBL. The scale mixture (1) is assumed instead. In array processing applications, the complex Multi-Variate t-distribution (MVT distribution) [29], [30] can be used as an alternative to the Gaussian distribution as array data model in the presence of outliers because the MVT-distribution has heavier tails than the Gaussian distribution. The MVT-distribution is a suitable choice for such data and provides a parametric approach to robust statistics [8], [24]. The complex MVT distribution is a special case of the CES distribution; for details see Appendix A.
For numerical performance evaluations of the derived M-estimator of DOA, three array data models are used in Sec. IV: Gaussian, MVT, and ε-contaminated. The Gaussian and MVT models are CES, whereas the ε-contaminated model is not.

A. Covariance matrix objective function
We follow a general approach based on loss functions and assume that the array data Y has a CES distribution with zero mean 0 and positive definite Hermitian N × N covariance matrix parameter Σ [31], [32]. An M-estimator of the covariance matrix Σ is defined as a positive definite Hermitian N × N matrix that minimizes the objective function [8, (4.20)]

L(Σ) = (1/L) Σ ℓ=1..L ρ(y H ℓ Σ −1 y ℓ )/b + log det Σ,   (11)

where y ℓ is the ℓth array snapshot and ρ : R + 0 → R + is called the loss function. The loss function is any continuous, non-decreasing function which satisfies that ρ(e x ) is convex in −∞ < x < ∞, cf. [8, Sec. 4.3]. Note that the objective function (11) is a penalized sample average of the chosen loss function ρ, where the penalty term is log det Σ. A specific choice of loss function ρ renders (11) equal to the negative log-likelihood of Σ when the array data are CES distributed with density generator g(t) = e −ρ(t) [33]. If the loss function is chosen, e.g., as ρ(t) = t, then (11) becomes the negative log-likelihood function for Σ for Gaussian array data.
The term b is a fitting coefficient, called the consistency factor, which renders the minimizer of the objective function (11) equal to Σ when the array data are Gaussian,

b = E[ψ(‖y‖ 2 )]/N , y ∼ CN N (0, I),   (12)
  = (1/N ) ∫ 0..∞ ψ(t/2) f χ 2 2N (t) dt,   (13)

where ψ(t) = t dρ(t)/dt and f χ 2 2N (t) denotes the pdf of the chi-squared distribution with 2N degrees of freedom. To arrive from (12) at (13) we used that 2‖y‖ 2 ∼ χ 2 2N for y ∼ CN N (0, I). Minimizing (11) with b according to (13) results in a consistent M-estimator of the covariance matrix Σ when the objective function is derived under a given non-Gaussian array data assumption (as in Sec. III-B) but the data are in fact Gaussian (y ℓ ∼ CN N (0, Σ)). A derivation of the consistency factor b for the loss functions used in this paper is given in Appendix B.
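The expectation in (12)–(13) can be checked numerically. The sketch below (illustrative, assuming scipy; `b_consistency` is a name chosen here) evaluates (13) by quadrature for the Huber ψ and compares against the closed form given in Appendix B; for Gauss loss, ψ(t) = t, it returns b = 1 as stated in the text:

```python
import numpy as np
from scipy import stats, integrate

def b_consistency(psi, N):
    """Consistency factor (13): b = (1/N) * E[psi(t/2)], t ~ chi^2_{2N}."""
    f = stats.chi2(df=2 * N).pdf
    val, _ = integrate.quad(lambda t: psi(t / 2.0) * f(t), 0, np.inf)
    return val / N

N = 20
q = 0.9
c2 = stats.chi2.ppf(q, df=2 * N) / 2.0      # c^2 = q-th quantile of (1/2) chi^2_{2N}

psi_huber = lambda t: np.minimum(t, c2)     # psi(t) = t dρ(t)/dt for Huber loss
b_num = b_consistency(psi_huber, N)

# closed form for Huber loss (cf. Appendix B)
F = lambda df, x: stats.chi2.cdf(x, df=df)
b_closed = F(2 * (N + 1), 2 * c2) + c2 * (1 - F(2 * N, 2 * c2)) / N
```

The quadrature and the closed form agree to numerical precision, and `b_consistency(lambda t: t, N)` evaluates to 1 (the Gauss-loss case).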

B. Loss functions
We discuss four different choices of the loss function ρ(·), chosen so that (11) becomes a negative log-likelihood function for the corresponding distribution. These loss functions are summarized in Table I. For each loss function, we also summarize the consistency factor b and the weight function u(t) = dρ(t)/dt associated with the loss function ρ.
1) Gauss loss: corresponds to the loss function of the (circular complex) Gaussian distribution,

ρ Gauss (t) = t,   (14)

in which case the consistency factor is b = 1 and the objective in (11) becomes the Gaussian (negative) log-likelihood function

L(Σ) = tr(Σ −1 S Y ) + log det Σ,

where S Y = (1/L) Y Y H is the sample covariance matrix, (·) H denotes Hermitian transpose, and the minimizer is Σ̂ = S Y . In this case, b in (13) becomes b = 1 as expected, since S Y is consistent for Σ without any scaling correction. For Gauss loss, u Gauss (t) = 1.
2) Huber loss: defined as

ρ Huber (t; c) = t for t ≤ c 2 , and c 2 (log(t/c 2 ) + 1) for t > c 2 ,   (16)

where the threshold c is a tuning parameter that affects the robustness and efficiency of the estimator. Huber loss specializes the objective function (11) to the negative log-likelihood of Σ when the array data are heavy-tailed CES distributed with a density generator of the form e −ρ Huber (t;c) . The squared threshold c 2 in (16) is mapped to the qth quantile of the (1/2)χ 2 2N distribution and we regard q ∈ (0, 1) as a loss parameter which is chosen by design, see Table I. It is easy to verify that b in (13) for the Huber loss function is [8, Sec. 4]

b = F χ 2 2(N+1) (2c 2 ) + c 2 (1 − F χ 2 2N (2c 2 ))/N ,   (17)

where F χ 2 2N (x) denotes the cumulative distribution function of the χ 2 2N distribution. For Huber loss (16) the weight function becomes

u Huber (t; c) = 1 for t ≤ c 2 , and c 2 /t for t > c 2 .   (18)

Thus, an observation y ℓ with squared Mahalanobis distance (MD) y H ℓ Σ −1 y ℓ smaller than c 2 receives constant weight, while observations with a larger MD are heavily downweighted.

3) MVT loss: corresponds to the ML-loss for the (circular complex) multivariate t (MVT) distribution with ν loss degrees of freedom,

ρ MVT (t; ν loss ) = ((2N + ν loss )/2) log(ν loss + 2t).   (20)

The parameter ν loss in (20) is viewed as a loss parameter which is chosen by design, see Table I. The consistency factor b MVT for ρ MVT (t; ν loss ) is computable by numerical integration.
For MVT-loss (20) the corresponding weight function is

u MVT (t; ν loss ) = (2N + ν loss )/(ν loss + 2t).   (21)

4) Tyler loss: defined as

ρ Tyler (t) = N log t,   (22)

which is the limiting case of ρ MVT (t; ν loss ) for ν loss → 0 and of ρ Huber (t; c) for c → 0. To obtain this limit using the Huber loss function, first note that we may replace ρ Huber (t; c) with a version ρ * shifted by a constant, which leaves the minimizer of (11) unchanged. Tyler loss is the ML-loss of the Angular Central Gaussian (ACG) distribution [8, Sec. 4.2.3], which is not a CES distribution. For Tyler loss (22) the weight function becomes u Tyler (t) = N/t. In this case, we cannot use (12) since Tyler's M-estimator estimates the shape of the covariance matrix Σ only. Namely, for Tyler loss and b = 1, if Σ̂ is a minimizer of (11) then so is bΣ̂ for any b > 0. Thus the solution is unique only up to a scale. However, a consistent estimator of the covariance matrix at the normal distribution can be obtained by multiplying any particular minimizer Σ̂ by τ̂ given in (49) or in (51). The latter approach is more robust as it uses medians (instead of the means) of the distances d̂ 2 ℓ . It also consistently outperformed the sample mean based estimate (49) in our simulations (not reported in this paper). Thus, we compute τ̂ using (51) and set the consistency factor as b = 1/τ̂ . More details of these estimators are given in Appendix B.
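The four weight functions and the classical fixed-point iteration for the scatter M-estimator can be sketched as follows. This is a minimal illustration (names `weight` and `m_scatter`, the fixed iteration count, and the keyword interface are assumptions, not the paper's reference code [2]); for Gauss loss the iteration reduces to the sample covariance matrix:

```python
import numpy as np

def weight(t, name, N, c2=None, nu=None):
    """Weight functions u(t) = dρ(t)/dt from Table I (illustrative names)."""
    if name == "gauss":
        return np.ones_like(t)
    if name == "huber":
        return np.where(t <= c2, 1.0, c2 / t)
    if name == "mvt":
        return (2 * N + nu) / (nu + 2 * t)
    if name == "tyler":
        return N / t
    raise ValueError(name)

def m_scatter(Y, name, b=1.0, n_iter=100, **kw):
    """Fixed-point iteration for the scatter M-estimator:
    Sigma <- (1/(b L)) sum_l u(y_l^H Sigma^{-1} y_l) y_l y_l^H."""
    N, L = Y.shape
    Sigma = np.eye(N, dtype=complex)
    for _ in range(n_iter):
        Si = np.linalg.inv(Sigma)
        t = np.real(np.einsum("nl,nk,kl->l", Y.conj(), Si, Y))  # squared MDs
        u = weight(t, name, N, **kw)
        Sigma = (Y * u) @ Y.conj().T / (b * L)                  # weighted SCM
    return Sigma
```

With `name="gauss"` and b = 1 the first sweep already yields S Y and further sweeps leave it unchanged, matching the Gauss-loss discussion above.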

C. Source Power Estimation
Similarly to Ref. [10, Sec. III.D], we regard (11) as a function of γ and σ 2 and compute the first-order derivative

∂L/∂γ m = a H m Σ −1 a m − (1/(bL)) Σ ℓ=1..L u(y H ℓ Σ −1 y ℓ ) |a H m Σ −1 y ℓ | 2 ,   (23)

where u(t) = dρ(t)/dt is the weight function associated with the loss function ρ. Equation (23) is identical to Ref. [10, Eq. (21)] except for the weight function u(y H ℓ Σ −1 y ℓ ). For the Gaussian array data model, the weight function is the constant function u Gauss (t) ≡ 1.
Setting (23) to zero gives

a H m Σ −1 R Y Σ −1 a m = a H m Σ −1 a m ,   (24)

where R Y is the weighted sample covariance matrix,

R Y = (1/(bL)) Σ ℓ=1..L u(y H ℓ Σ −1 y ℓ ) y ℓ y H ℓ .   (25)

We multiply (24) by γ m and obtain the fixed-point equation

γ new m = γ m (a H m Σ −1 R Y Σ −1 a m )/(a H m Σ −1 a m ),   (26)

which is the basis for an iteration to solve for γ numerically. The active set M is then selected as either the K largest entries of γ or the entries with γ m exceeding a threshold.
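One sweep of this fixed-point source-power update can be sketched in a few lines. The snippet below is an illustrative sketch under the stated model (the name `gamma_update` and the functional argument `u` are assumptions, not the paper's reference implementation [2]):

```python
import numpy as np

def gamma_update(Y, A, gamma, sigma2, u, b=1.0):
    """One sweep of the fixed-point update of the source powers:
    gamma_m <- gamma_m * (a_m^H Si R_Y Si a_m) / (a_m^H Si a_m),
    with Si = Sigma^{-1} and the weighted sample covariance R_Y."""
    N, L = Y.shape
    Sigma = A @ (gamma[:, None] * A.conj().T) + sigma2 * np.eye(N)
    Si = np.linalg.inv(Sigma)
    t = np.real(np.einsum("nl,nk,kl->l", Y.conj(), Si, Y))    # squared MDs
    R = (Y * u(t)) @ Y.conj().T / (b * L)                     # weighted SCM
    num = np.real(np.einsum("nm,nk,km->m", A.conj(), Si @ R @ Si, A))
    den = np.real(np.einsum("nm,nk,km->m", A.conj(), Si, A))
    return gamma * num / den
```

With the constant Gauss weight u(t) ≡ 1 and b = 1, R reduces to the ordinary sample covariance matrix and the sweep coincides with the classical SBL power update.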
D. Noise variance estimation

To stabilize the noise variance M-estimate (28) for non-Gauss loss, we define lower and upper bounds for σ̂ 2 and enforce σ 2 min ≤ σ̂ 2 ≤ σ 2 max . The original SBL algorithm [10], [23] does not use/need this stabilization because for Gauss loss, the weighted sample covariance matrix estimate (25) reduces to the sample covariance matrix S Y .

[Table II: Proposed DOA M-estimation algorithm. Inputs: array data Y ∈ C N×L to be analyzed; weight function u(·; ·) and loss parameter; dictionary matrix A ∈ C N×M ; number of sources K ∈ N with K < N ; small positive constant δ ∈ R + ; upper SNR limit SNRmax ∈ R + ; dynamic range γrange ∈ [0, 1] for DOA grid pruning; iteration count limit jmax ∈ N ; convergence criterion z ∈ N with z < jmax ; upper limit on σ̂ 2 .]

E. Algorithm
The proposed DOA M-estimation algorithm using SBL is displayed in Table II with the following remarks: 1) DOA grid pruning: To reduce the numerical complexity of the iterations, we introduce the pruned DOA grid P so as not to waste computational resources on those DOAs which are associated with source power estimates below a chosen threshold value γ floor , i.e. we introduce a thresholding operation on the γ new vector. The pruned DOA grid is formally defined as the index set

P = {m ∈ {1, . . . , M } : γ new m > γ floor },

where γ floor = γ range max γ new and we have chosen γ range = 10 −3 .
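The pruning rule is a one-liner; as a sketch (the name `prune_grid` is chosen here, not from the paper's code [2]):

```python
import numpy as np

def prune_grid(gamma_new, gamma_range=1e-3):
    """Pruned DOA grid P: indices whose estimated power exceeds
    gamma_floor = gamma_range * max(gamma_new)."""
    gamma_floor = gamma_range * np.max(gamma_new)
    return np.flatnonzero(gamma_new > gamma_floor)
```

For example, with gamma_new = [1.0, 1e-5, 0.5] and the default gamma_range = 1e-3, only indices 0 and 2 survive.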
2) Initialization: The algorithm requires initial values of the source signal powers γ and the noise variance σ 2 . The initial estimates are computed via the following steps: a) Compute S Y and the conventional beamformer (CBF) output powers. b) Compute the initial active set by identifying the K largest peaks in the CBF output powers. c) Compute the initial noise variance σ̂ 2 . d) Compute initial estimates of the source powers, where δ > 0 is a small number guaranteeing that all initial γ new m are positive.

3) Convergence Criterion:
The DOA estimates returned by the iterative algorithm in Table II are obtained from the active set M. Therefore, the active set is monitored for changes in its elements to determine whether the algorithm has converged. If M has not changed during the last z ∈ N iterations, then the repeat-until loop (lines 14-29 in Table II) is exited. Here z is a tuning parameter which allows trading off computation time against DOA estimate accuracy. To ensure that the iterations always terminate, the maximum iteration count is defined as j max with z < j max .
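The stopping rule can be sketched as a small helper that inspects the history of active sets (an illustrative sketch; `converged` and the list-of-sets interface are assumptions):

```python
def converged(history, z):
    """True if the active set M did not change during the last z iterations.
    `history` holds the active set (a Python set of grid indices) per iteration."""
    if len(history) < z + 1:
        return False
    last = history[-1]
    # the set must be identical over the last z+1 recorded iterations
    return all(s == last for s in history[-(z + 1):])
```

In the main loop one would append the current active set each iteration and exit when `converged(history, z)` is true or the iteration count reaches j max.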

IV. SIMULATION RESULTS
Numerical simulations are carried out to evaluate the root mean squared error (RMSE) of the DOA estimates versus the array signal-to-noise ratio (ASNR) based on synthetic array data Y . Synthetic array data are generated for three scenarios with K = 1, . . . , 3 incoming plane waves and corresponding DOAs as listed in Table III. The RMSE of the DOA estimates over N run = 250 simulation runs with random array data realizations is used for evaluating the performance of the algorithm,

RMSE = sqrt( (1/(N run K)) Σ r=1..N run Σ k=1..K min(|θ̂ r k − θ k |, e max ) 2 ),   (35)

where θ k is the true DOA of the kth source and θ̂ r k is the corresponding estimated DOA in the rth run when K sources are present in the scenario. This RMSE definition is a specialization of the optimal subpattern assignment (OSPA) metric when K is known, cf. [37]. We use e max = 10 • in (35). Thus the maximum RMSE is 10 • .
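The clipped, assignment-based RMSE can be sketched as follows (an OSPA-style sketch with known K; the name `doa_rmse` and the brute-force permutation assignment are choices made here for illustration):

```python
import numpy as np
from itertools import permutations

def doa_rmse(theta_true, theta_est, e_max=10.0):
    """RMSE over runs with per-source errors clipped at e_max degrees and the
    estimate-to-truth assignment chosen to minimize the squared error.
    theta_true, theta_est: (n_runs, K) arrays of DOAs in degrees."""
    theta_true = np.atleast_2d(theta_true)
    theta_est = np.atleast_2d(theta_est)
    K = theta_true.shape[1]
    sq = 0.0
    for tt, te in zip(theta_true, theta_est):
        # best assignment of estimates to true DOAs (K is small, so brute force)
        sq += min(
            sum(min(abs(a - b), e_max) ** 2 for a, b in zip(tt, te[list(p)]))
            for p in permutations(range(K))
        )
    return np.sqrt(sq / (theta_true.shape[0] * K))
```

A 3° error yields an RMSE of 3°, while a gross 50° error is clipped to the maximum of e_max = 10°.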

A. Data generation
The scaling √τ ℓ and the noise n ℓ in (1) are generated according to three array data models which are summarized in Table IV and explained below: a) Gaussian array data: In this model τ ℓ = 1 for all ℓ in (1) and y ℓ = v ℓ ∼ CN (0, Σ), where Σ is defined in (4).

b) MVT array data: The scaling τ ℓ is drawn from an inverse gamma distribution with shape and scale parameters α = β = ν data /2, so that y ℓ follows a complex multivariate t-distribution t ν data (0, Σ), cf. Appendix A and [8, Sec. 4.2.2].
c) ε-contaminated array data: This heavy-tailed array data model is not covered by (1) with the assumptions in Sec. II. Instead, the noise n ℓ is drawn with probability (1 − ε) from CN (0, σ 2 1 I) and with probability ε from CN (0, λ 2 σ 2 1 I), where λ is the outlier strength. Thus, y ℓ is drawn from CN (0, AΓA H + σ 2 1 I N ), using (4), with probability (1 − ε) and with outlier probability ε from CN (0, AΓA H + (λσ 1 ) 2 I N ). The resulting noise covariance matrix is σ 2 I N similar to the other models, but with

σ 2 = (1 − ε + ελ 2 ) σ 2 1 .   (36)

The limiting distribution of ε-contaminated noise for ε → 0 and any constant λ > 0 is Gaussian. The proposed DOA M-estimation algorithm using SBL is displayed in Table II. The convergence criterion parameter z = 10 is chosen for all numerical simulations and the maximum number of iterations was set to j max = 1200, but this maximum was never reached.
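Generating ε-contaminated noise and checking its total variance against (36) is straightforward. The sketch below is illustrative (the names `eps_contaminated_noise` and `sigma2` are chosen here); with ε = 0.05, λ = 10, σ 1 = 1 it reproduces σ 2 = 5.95 σ 2 1 as used in the single source scenario:

```python
import numpy as np

def eps_contaminated_noise(N, L, sigma1, eps, lam, rng):
    """Noise drawn with prob. 1-eps from CN(0, sigma1^2 I) and with
    prob. eps from CN(0, (lam*sigma1)^2 I)."""
    outlier = rng.random(L) < eps                     # snapshot-wise outlier flags
    scale = np.where(outlier, lam * sigma1, sigma1)
    w = (rng.standard_normal((N, L)) + 1j * rng.standard_normal((N, L))) / np.sqrt(2)
    return w * scale[None, :]

# total noise variance (36): sigma^2 = (1 - eps + eps*lam^2) * sigma1^2
sigma2 = lambda sigma1, eps, lam: (1 - eps + eps * lam ** 2) * sigma1 ** 2
```

An empirical check with many snapshots gives a per-sensor noise power close to `sigma2(1.0, 0.05, 10)` = 5.95.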
Additionally, the Cramér-Rao Bound (CRB) for DOA estimation from Gaussian and MVT array data are shown. The corresponding expressions are given in Appendix C for completeness. The Gaussian CRB shown in Figs. 1(a,d,g) is evaluated according to (53). The bound for MVT array data, C CR,MVT (θ), is just slightly higher than the bound for Gaussian array data, C CR,Gauss (θ). For the three source scenario, N = 20, L = 25, Gaussian and MVT array data model, the gap between the bounds C CR,Gauss (θ) and C CR,MVT (θ) is smaller than 3% in terms of RMSE in the shown ASNR range. The gap is not observable in the RMSE plots in Fig. 1 for the chosen ASNR and RMSE ranges. The MVT CRB is shown in Figs. 1(b,e,h). We have not evaluated the CRB for ε-contaminated array data. In the performance plots for ε-contaminated array data, we show C CR,Gauss (θ) for ASNR = N/σ 2 using (36), labeled as "Gaussian CRB (shifted)" in Figs. 1(c,f,i). We expect that the true CRB for ε-contaminated array data is higher than this approximation.

B. Single source scenario
A single plane wave (K = 1) with complex circularly symmetric zero-mean Gaussian amplitude arrives from DOA θ 8001 = −10 • . Here, ASNR = N/σ 2 , cf. [38, Eq. (8.112)]. Figure 1 shows the RMSE of the DOA estimates in scenarios with L = 25 snapshots and N = 20 sensors. The RMSE is averaged over 250 iid realizations of DOA estimates from array data Y . There are more snapshots L than sensors N , ensuring that R Y has full rank almost surely.
Simulations for Gaussian noise are shown in Fig. 1(a). For this scenario, the conventional beamformer (not shown) is the ML DOA estimator and approaches the CRB for ASNR greater than 3 dB. All shown M-estimators for DOA perform almost identically for Gaussian array data in terms of RMSE, due to the consistency factor b introduced in (16) and (20), and just slightly worse than the CBF. Figure 1(b) shows simulations for heavy-tailed MVT distributed array data with a small degrees-of-freedom parameter ν data = 2.1. We observe that the M-estimator for MVT-loss ρ MVT with loss parameter ν loss = 2.1 performs best, closely followed by the M-estimators for Tyler loss and Huber loss ρ Huber with loss parameter q = 0.9. Here, the loss parameter ν loss used by the M-estimator is identical to the MVT array data parameter ν data and thus is expected to work well for this array data model. In Fig. 1(b), the M-estimator for MVT-loss ρ MVT closely follows the Gaussian CRB for ASNR > 3 dB, although a small gap at high ASNR remains. Instead of comparing to the Gaussian CRB, we should compare to the CRB for CES distributed data derived in [39, Eq. (20)] and [40, Eq. (17)]. The assumption that ν data is known a priori is somewhat unrealistic. However, methods to estimate ν data from data are available, e.g., [41]. Gauss loss exhibits the largest RMSE at high ASNR in this case.
Results for ε-contaminated noise are shown in Fig. 1(c) for outlier probability ε = 0.05 and outlier strength λ = 10. The resulting noise variance (36) for this heavy-tailed distribution is σ 2 = 5.95σ 2 1 . The M-estimators for Tyler loss and MVT-loss ρ MVT perform identically and best in terms of their DOA RMSE, followed by the M-estimator for Huber loss ρ Huber with an ASNR penalty of about 2 dB. The poorest RMSE is exhibited by the (non-robust) DOA estimator for Gauss loss ρ Gauss , indicating the strong impact of outliers on the original (non-robust) SBL algorithm for DOA estimation [10].

C. Two source scenario
Next we consider the two source scenario (K = 2) in Table III. Simulations for Gaussian noise are shown in Fig. 1(d). Here, the M-estimate for Huber loss ρ Huber with loss parameter q = 0.9 performs equally well as the M-estimate for Gauss loss ρ Gauss , which is equivalent to the original (non-robust) SBL algorithm for DOA estimation [10]. They approach the CRB for ASNR greater than 9 dB. The DOA estimator for Tyler loss performs slightly worse than the previous two. Here, MVT-loss ρ MVT with loss parameter ν loss = 2.1 has the highest RMSE in DOA M-estimates. Figure 1(e) shows simulations for heavy-tailed MVT array data with small parameter ν data = 2.1. We observe that M-estimation with Tyler loss and MVT-loss ρ MVT with loss parameter ν loss = 2.1 perform best, closely followed by M-estimation with Huber loss ρ Huber with loss parameter q = 0.9. The non-robust DOA estimator for ρ Gauss performs much worse than the others, showing an ASNR penalty of about 6 dB at high ASNR.
Here, the loss parameter ν loss used by the M-estimator for MVT-loss ρ MVT is identical to the MVT array data parameter ν data used in generating the MVT-distributed array data and it is expected to work well for this array data model. In Fig. 1(e), the RMSE result for ρ MVT closely follows the Gaussian CRB for ASNR > 9 dB, although a small gap at high ASNR remains. Instead of comparing to the Gaussian CRB, a semiparametric stochastic CRB for DOA estimation under the CES data model would be more appropriate for MVT noise [28].
Results for ε-contaminated noise are shown in Fig. 1(f) for outlier probability ε = 0.05 and outlier strength λ = 10. M-estimation with Tyler loss and MVT-loss ρ MVT with loss parameter ν loss = 2.1 shows the lowest RMSE, followed by Huber loss with a slight ASNR penalty. The poorest RMSE is exhibited by the (non-robust) DOA estimator for Gauss loss ρ Gauss .

D. Three source scenario
Array data Y are generated for the three source scenario (K = 3) with complex circularly symmetric zero-mean Gaussian amplitudes from DOAs θ 8701 = −3 • , θ 9201 = 2 • , θ 16501 = 75 • according to the array data models in Table IV.

E. Effect of Loss Function
The effect of the loss function on RMSE performance at high ASNR = 30 dB is illustrated in Fig. 2. For Gaussian array data, all choices of loss function perform equally well at high ASNR. For MVT data in Fig. 2 (middle), we see that the robust loss functions (MVT, Huber, Tyler) work well, and approximately equally, whereas the RMSE for Gauss loss is a factor of 2 worse. For ε-contaminated array data in Fig. 2 (right), the Gauss loss again performs markedly worse than the robust loss functions. Huber loss has slightly higher RMSE than MVT and Tyler loss.

F. Effect of Outlier Strength on RMSE
For small outlier strength λ for MVT data, the Gauss loss performs fine, but as the outlier noise increases the robust processors outperform it, see Fig. 3. As λ increases, the total noise changes, see (36). We chose either to keep the total noise constant in Fig. 3 (left) by decreasing the background noise with increasing λ, or to keep the background noise constant in Fig. 3 (right), whereby the total noise increases. For large noise outliers, Tyler loss clearly has the best performance in Fig. 3 (left) and does not break down in Fig. 3 (right).

G. Effect of Dictionary Size on RMSE
Due to the algorithmic speedup associated with the DOA grid pruning described in Sec. III-E, it is feasible and useful to run the algorithm in Table II with a large dictionary size M , which translates to an angular grid resolution of δ = 180 • /(M − 1).
The effect of grid resolution is illustrated in Fig. 5 for a single source impinging on an N = 20 element λ/2-spaced ULA. The Gaussian array data model is used. Fig. 5 shows RMSE vs. ASNR for dictionary sizes M ∈ {181, 361, 1801, 18001, 180001}. In Fig. 5(a), the DOA is fixed at −10 • , cf. the single source scenario in Table III; the DOA is on the angular grid which defines the dictionary matrix A. In Fig. 5(b) the DOA is random: the source DOA is sampled from −10 • + U (−δ/2, δ/2), where δ = 180 • /(M − 1) is the angular grid resolution. The source DOA is not on the angular grid which defines the dictionary matrix A.
For source DOA on the dictionary grid, Fig. 5(a), the RMSE performance curve resembles the behavior of an ML-estimator at low ASNR up to a certain threshold ASNR (dashed vertical lines) where the RMSE abruptly crosses the CRB and becomes zero. The threshold ASNR is deduced from the following argument: Let a m be the true DOA dictionary vector and a m+1 be the dictionary vector for the adjacent DOA on the angular grid. Comparing the corresponding Bartlett powers, we see that DOA errors become likely if the noise variance exceeds 2(|a H m a m | − |a H m a m+1 |)/N = 2 − 2|a H m a m+1 |/N . For source DOA off the dictionary grid, Fig. 5(b), the RMSE performance curve resembles the behavior of an ML-estimator at low ASNR up to a threshold ASNR. In the random DOA scenario, however, the RMSE flattens with increasing ASNR. Since the variance of the uniformly distributed source DOA is δ 2 /12, the limiting RMSE is δ/√12 for ASNR → ∞. The limiting RMSE depends on the dictionary size M through the angular grid resolution δ and is shown as dashed horizontal lines in Fig. 5(b).

V. CONVERGENCE BEHAVIOR AND RUN TIME
The DOA M-estimation algorithm in Table II uses an iteration to estimate the active set M whose elements represent the estimated source DOAs. The required number of iterations for convergence of M depends on the source scenario, the array data model, and the ASNR. Figure 6 shows the required number of iterations for the three source scenario versus ASNR for all three array data models. Convergence is fast at high ASNR, and the number of iterations decreases with increasing ASNR. At ASNR < 5 dB, where the noise dominates the array data, the number of iterations is around 100 and approximately independent of the ASNR, as many iterations are needed to find the best matching DOAs. The peak number of iterations is near 160 at ASNR levels between 12 and 15 dB. The choice of loss function does not much influence the CPU time. At ASNR > 18 dB, MVT and Tyler loss consume just slightly less CPU time than Huber and Gauss loss. For low ASNR, the CPU time increases by a ratio approximately proportional to the dictionary size, 180001/181 ≈ 1000, but at high ASNR this ratio reduces to 50, due to the efficiency of the DOA grid pruning, cf. Sec. III-E. The M = 180001 dictionary is quite large for this scenario, but in other scenarios for localization in 3 dimensions, this is an expected dictionary size.

VI. CONCLUSION
Robust and sparse DOA M-estimation was derived based on array data following a zero-mean complex elliptically symmetric distribution with finite second-order moments. The derivation is based on loss functions which can be chosen freely subject to certain existence and uniqueness conditions. The DOA M-estimator is evaluated numerically by iterations and made available on GitHub [2]. The specific choice of loss function determines the RMSE performance of the resulting DOA M-estimate for different array data distributions. Four choices of loss function are discussed and investigated in numerical simulations with synthetic array data: the Gauss loss, the ML-loss function for the circular complex multivariate t-distribution with ν degrees of freedom, and the loss functions for the Huber and Tyler M-estimators. For Gauss loss, the method reduces to Sparse Bayesian Learning.
We discuss the robustness of these DOA M-estimators by evaluating the root mean square error for Gaussian, MVT, and ε-contaminated array data. The robust and sparse M-estimators for DOA perform well in simulations for MVT and ε-contaminated noise and nearly identically to classical SBL for Gaussian noise.

A. MVT array data model
Here we show how the MVT array data model relates to the scale mixture model (1) for the array data. By assuming that the distribution of τ ℓ in (1) is an inverse gamma distribution with shape parameter α and scale parameter β, the corresponding density generator evaluates to

g(t) = ∫ 0..∞ τ −N e −t/τ (β α /Γ(α)) τ −(α+1) e −β/τ dτ.

We substitute w = 1/τ and dτ = −(1/w 2 ) dw, giving

g(t) = (β α /Γ(α)) ∫ 0..∞ w N +α−1 e −(t+β)w dw = (β α /Γ(α)) Γ(N + α)/(t + β) N +α .

Finally, we specialize the shape and scale parameters to α = β = ν data /2, giving

g(t) ∝ (1 + 2t/ν data ) −(N +ν data /2) ,

which agrees with the density generator of the multivariate t-distribution t ν data (0, Σ) [8, Sec. 4.2.2, Ex. 11 on p. 107]. Thus, placing an inverse gamma prior over the random scaling τ ℓ results in the MVT array data model.
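The inverse-gamma scale mixture construction above translates directly into a sampler for MVT array data. The sketch below is illustrative (the name `mvt_snapshots` is chosen here); it uses the fact that if X ∼ Gamma(α, scale = 1/β), then 1/X is inverse gamma with shape α and scale β:

```python
import numpy as np

def mvt_snapshots(Sigma, L, nu, rng):
    """Draw y_l = sqrt(tau_l) v_l with v_l ~ CN(0, Sigma) and
    tau_l ~ inverse-gamma(nu/2, nu/2), giving complex MVT array data."""
    N = Sigma.shape[0]
    C = np.linalg.cholesky(Sigma)
    v = C @ (rng.standard_normal((N, L)) + 1j * rng.standard_normal((N, L))) / np.sqrt(2)
    tau = 1.0 / rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=L)  # inverse gamma
    return np.sqrt(tau)[None, :] * v
```

For ν > 2 the mean of τ ℓ is (ν/2)/(ν/2 − 1), so the per-element power of the MVT snapshots exceeds that of the Gaussian component v ℓ by exactly that factor, which can be verified empirically.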

B. Consistency Factor
Here the consistency factor b is evaluated for Huber loss (16), MVT loss (20), and Tyler loss (22). For elliptical distributions, an M-estimator is a consistent estimator of αΣ, where the constant α is a solution to [31, eq. (49)]

E[ψ(y H Σ −1 y/α)] = N ,   (44)

where ψ(t) = t u(t) = t dρ(t)/dt as defined below (13). Assuming y ∼ CN N (0, Σ), we scale the chosen loss function ρ(t) such that (44) holds for α = 1. Namely, for ρ b (t) = ρ(t)/b, where b is the scaling constant defined in (12), it clearly holds that E[ψ b (‖y‖ 2 )] = E[ψ(‖y‖ 2 )]/b = N . This implies that α = 1 and that the M-estimator with loss ρ b (·) is consistent for the covariance matrix Σ when the array data follow the CN N (0, Σ) distribution. For the Huber loss function (16), the b in (13) can be solved in closed form as [8, Sec. 4]

b = F χ 2 2(N+1) (2c 2 ) + c 2 (1 − F χ 2 2N (2c 2 ))/N ,

where F χ 2 n (x) = P{X ≤ x} is the cumulative distribution function of a central χ 2 distributed random variable X with n degrees of freedom.
For the MVT loss (20) we evaluate (13) by numerical integration.
For Tyler loss, ψ(t) ≡ N for all t, indicating that the consistency factor for Tyler loss cannot be found based on (44). The Tyler M-estimator is unique only up to a scale, and thus any particular solution Σ̂ is Fisher consistent for V = Σ/τ for some unknown scale τ > 0, where Σ = E[xx H ] is the covariance matrix. Although the information on τ is lost, it is possible to estimate τ from the distances d̂ 2 ℓ = x H ℓ Σ̂ −1 x ℓ /N , ℓ = 1, . . . , L. Two approaches are given below.
a) Using the sample mean of the d̂ 2 ℓ -s:

τ̂ = (1/L) Σ ℓ=1..L d̂ 2 ℓ .   (49)

This estimator is derived as follows. Note that

E[xx H ] = Σ = τ V.   (50)

Multiplying (50) by V −1 and taking the matrix trace gives τ = E[x H V −1 x]/N . The sample estimate of τ is then (49). Thus for any particular distribution (not just Gaussian) and for any solution Σ̂ to (11) using Tyler loss with b = 1, the scaled Tyler M-estimator τ̂ Σ̂ estimates the covariance matrix Σ. b) Using the (scaled) sample median of the d̂ 2 ℓ -s:

τ̂ = Median(d̂ 2 ℓ )/Median(χ 2 2N /(2N )),   (51)

where Median(χ 2 2N /(2N )) is the median of a χ 2 2N distributed random variable scaled by 1/(2N ). This estimator is derived as follows. Note that for a large enough sample size L,

Median(d̂ 2 ℓ ) ≈ Median(τ d 2 ℓ ) = τ Median(d 2 ℓ ).   (52)

Then note that d 2 ℓ = N −1 x H ℓ Σ −1 x ℓ ∼ χ 2 2N /(2N ) when x ℓ ∼ CN N (0, Σ). Thus, the chosen value τ̂ in (51) makes the sample median of the empirical distribution of the d̂ 2 ℓ -s on the left and the median of the scaled target distribution of d 2 ℓ on the right hand side of (52) coincide. The scaled Tyler M-estimator τ̂ Σ̂ then again estimates the covariance matrix Σ when the data follow a Gaussian distribution.
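The median-based scale fix (51) can be sketched as follows (illustrative; the name `tyler_scale` is chosen here, and the shape matrix is passed in explicitly rather than computed by the Tyler fixed-point iteration):

```python
import numpy as np
from scipy import stats

def tyler_scale(X, Sigma_shape):
    """Scale correction for Tyler's M-estimator via the sample median (51):
    tau_hat = median(d_l^2) / Median(chi^2_{2N} / (2N)),
    with d_l^2 = x_l^H Sigma_shape^{-1} x_l / N."""
    N, L = X.shape
    Si = np.linalg.inv(Sigma_shape)
    d2 = np.real(np.einsum("nl,nk,kl->l", X.conj(), Si, X)) / N
    med_chi = stats.chi2.median(df=2 * N) / (2 * N)
    return np.median(d2) / med_chi
```

As a sanity check: for Gaussian data x ℓ ∼ CN(0, 4 I) and shape matrix Σ̂ = I (the true shape up to scale), the returned τ̂ concentrates near 4, so τ̂ Σ̂ recovers the covariance matrix.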
The above estimates are consistent in the random matrix theory (RMT) regime, where L, N → ∞ with L > N and their ratio tending to a constant, N/L → γ ∈ (0, 1). Namely, let Σ̂ be the solution to (11) with Tyler loss and b = 1, normalized so that tr(Σ̂) = N , i.e., Σ̂ is an estimator of V = Σ/τ with τ = tr(Σ)/N . Then, it was shown in [42], [43] that max ℓ |τ /d̂ 2 ℓ − 1| → 0 almost surely. This result was derived for i.i.d. Gaussian data [42], while [43] extended it to more general distributions. Thus, since the d̂ 2 ℓ -s concentrate on τ , the estimate in (51) is a consistent estimator in the RMT regime as well. Note that Median(χ 2 2N /(2N )) → 1 as N → ∞, and thus this scaling constant has an effect only for small N .