A Machine Learning Perspective on Automotive Radar Direction of Arrival Estimation

Millimeter-wave sensing using automotive radar imposes high requirements on the applied signal processing in order to obtain the resolution necessary for current imaging radar. High-resolution direction of arrival estimation is needed to achieve the desired spatial resolution, which is limited by the total antenna array aperture. This work gives an overview of recent progress in the field of deep learning based direction of arrival estimation in the automotive radar context, i.e. using only a single measurement snapshot. Additionally, several deep learning models are compared and investigated with respect to their suitability for automotive angle estimation. The models are trained with model- and data-based approaches for data generation, including simulated scenarios as well as real measurement data from more than 400 automotive radar sensors. Finally, their performance is compared to several baseline angle estimation algorithms such as the maximum-likelihood estimator. All results are discussed with respect to the estimation error, the resolution of closely spaced targets, and the total estimation accuracy. The overall results demonstrate the viability and advantages of the proposed data generation methods, as well as the super-resolution capabilities of several architectures.


I. INTRODUCTION
Automotive radar is a key sensor for the reliable detection of obstacles and road users. It finds wide application in advanced driver assistance systems (ADAS) [1] and is also considered a key sensing technology for highly-automated driving (HAD) [2]. Many advantages, such as a small form factor, low cost compared to e.g. lidar, good robustness towards poor weather conditions, as well as accurate range, velocity and angle estimation, make radar the preferred choice for many ADAS functions like adaptive cruise control (ACC) or lane change assist (LCA) [3], [4]. Today's high-performance radar has evolved into an imaging sensor, visualizing the whole environment in multiple dimensions [4], without compromising privacy.
Although radar is a mature technology in the automotive domain, several challenges still exist in radar signal processing: multipath reflections arising from sensors positioned behind the vehicle bumper, detection and mitigation of mutual interference due to other cars' radar sensors operating in the same frequency bands, or resolving targets within one of the three measurement dimensions, range, velocity and angle. In radar signal processing, the physical angle (azimuth or elevation) of a target with respect to the radar sensor is estimated by appropriate direction of arrival (DOA) estimation techniques. Furthermore, the performance of automotive radar can be measured with respect to the unambiguous measurement range in all measurement dimensions, and its ability to separate targets in some dimension, referred to as resolution [5]. While the range and velocity resolution of radar sensors primarily rely on the waveform parameters, such as a high available bandwidth for high range resolution and a long coherent observation time for velocity resolution, the spatial or angular resolution is limited by the physical aperture of the sensor array.
Angular resolution becomes decisive in the many cases where targets cannot be resolved sufficiently in either the range or the velocity domain. Even better resolution is needed in highly dynamic scenarios, such as two cars on a highway 100 m to 200 m ahead of the ego vehicle, with a lateral spacing between two lanes of around 3.75 m. In order to distinguish the reflections in this particular situation, an angular resolution of approximately 1.1° is necessary (for a distance of 200 m). This is in line with the requirement of less than 1° resolution for both azimuth and elevation, specified for Level 4 and Level 5 autonomous driving [1]. To achieve this resolution with a single antenna array, around 100 antenna elements are needed.
This in turn leads to the rational consequence of using multiple-input multiple-output (MIMO) technology to synthesize a larger virtual aperture of the sensor with fewer physical antennas. MIMO concepts are used today in many commercial automotive radar sensors [3], [4], [6]; e.g., a typical automotive radar using 3 transmit and 4 receive channels yields 12 virtual antenna elements, or about 9° of angular resolution. As the previous example showcases, angular resolution is still a limiting factor for automotive radar, thus necessitating high-resolution or super-resolution techniques. In this work we focus explicitly on a setup with a small number of antennas, to emphasize the effects of super-resolution techniques.
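To make the aperture argument concrete, the diffraction-limited angular resolution of a uniform linear array can be sketched numerically. This is a rough Rayleigh-type estimate (resolution ≈ wavelength/aperture) assuming half-wavelength element spacing, not a statement about any specific sensor:

```python
import math

def ula_resolution_deg(n_elements, spacing_wavelengths=0.5):
    """Rough diffraction-limited angular resolution of a uniform linear
    array: delta_phi ~ lambda / aperture (in radians), returned in degrees."""
    aperture_wavelengths = n_elements * spacing_wavelengths
    return math.degrees(1.0 / aperture_wavelengths)

def elements_for_resolution(target_deg, spacing_wavelengths=0.5):
    """Half-wavelength-spaced elements needed for a desired resolution."""
    return math.ceil(1.0 / (math.radians(target_deg) * spacing_wavelengths))

print(ula_resolution_deg(12))        # ~9.5 deg for 12 virtual channels
print(elements_for_resolution(1.0))  # 115 elements for 1 deg
```

The two numbers reproduce the figures quoted in the text: roughly 9° for a 12-channel virtual array, and on the order of 100 half-wavelength-spaced elements for 1° resolution.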
A framework for high-resolution frequency estimation in automotive applications has been proposed and studied in [7], [8]. It focuses on the optimal selection of the resolution dimensions of targets, rather than solely utilizing the angular domain. Other techniques commonly used in array signal processing, like multiple signal classification (MUSIC) [9], estimation of signal parameters via rotational invariance techniques (ESPRIT) [10] or minimum variance distortionless response (MVDR) [11], either assume the availability of many measurement snapshots or need additional preprocessing, effectively reducing the total aperture. Subspace-based methods like MUSIC are sensitive to correlated or coherent sources, which degrades their performance [1], [12], [13]. While performing multiple measurements is often not practical in highly dynamic automotive environments, target reflections stemming from the same transmit pulse are always fully correlated. A two-stage processing approach using uniform circular arrays applies subarray processing to spatially filter target returns first and processes them independently later [14]. The performance of this particular approach is, however, found to be strongly dependent on the results of its preceding target detection stage. Model-based estimators, like the maximum likelihood estimator (MLE), can only perform well under correct assumptions ensuring the validity of the model.
Recently, deep learning (DL) [15], a subset of machine learning, has found increased attention in radar signal processing applications [16]. Compared to the well-studied traditional DOA estimation techniques for radio frequency (RF) signals, neural network based estimators are a younger line of work, having been proposed for roughly 25 years. In general, DL-based approaches can provide several key advantages over traditional methods, such as increased super-resolution capabilities in the case of a single snapshot [17], high flexibility with respect to antenna configurations, as well as good generalization capability regarding sensor array perturbations [18]. With the increased availability of computational power, data and machine learning frameworks, many approaches to deep learning based DOA estimation have been proposed [17]-[36].
This work gives a detailed overview of the fundamentals of automotive DOA estimation and the latest advances in deep learning based DOA estimation, as well as an analysis of state-of-the-art architectures in the context of automotive radar. It is organized as follows: We first introduce the data model of state-of-the-art MIMO radar in chapter II, since it represents the basis for formulating the DOA estimation problem as well as for synthetic data generation. A detailed survey of reported approaches to deep learning based DOA estimation is then provided in chapter III. Chapter IV presents our approach to generate valid training and test data based on synthetic, measurement-based, as well as hybrid approaches. Subsequently, three different DL architectures from the literature are selected, investigated in the context of automotive radar, and discussed in chapter V. Finally, after training on identical datasets, their performance and results are assessed in chapter VI, and a conclusion and outlook are given in chapter VII.

II. SIGNAL MODELS AND BASIC PROCESSING STEPS
A representative illustration of a typical MIMO radar system configuration as used in the automotive domain is shown in Fig. 1. The radar transceiver (TRX), today typically implemented as a single monolithic microwave integrated circuit (MMIC) or even a system on chip (SoC) including digital functions [4], generates the radar transmit (TX) waveform and radiates it via the antennas connected to its TX channels. The wave propagates through space until it hits one or several targets. A portion of the incident wave flux density proportional to the target radar cross section (RCS) is captured and re-radiated towards the radar, assuming isotropic radiation of the target. According to the common terminology, a target is hence assumed to be an isolated point-like scatterer. Physical objects such as cars, pedestrians, or the environment are composed of several point-like targets, and the sum of the individual target responses then forms the overall receive (RX) signal. The radar then compares TX and RX signals, from which the basic parameters of the sensing task can be estimated: the number of targets K_tgt as well as range r_{k_tgt}, radial velocity v_{r,k_tgt}, azimuth φ_{k_tgt}, and elevation ε_{k_tgt} for each target k_tgt. This collection of targets detected in each measurement cycle is referred to as the target list, and it can be interpreted as a point cloud obtained from radar measurements. Measurements are cyclically repeated with a period of T_cyc, for automotive applications typically in the range of T_cyc ≈ 30 ms . . . 70 ms. Subsequent processing steps in automotive radar would then be clustering, i.e., identifying targets belonging to the same physical object, and tracking, i.e. deriving information and statistics of the physically present objects from sequences of target lists, which however is out of the scope of this article.

FIGURE 1. Illustration of MIMO radar scenario, signal models and basic processing steps. A physically present object consists of several point-like scatterers, the targets. In each measurement cycle, the radar system outputs the target list, a collection of detected targets. Several processing steps are involved, which are described in detail in chapter II. Capital letters refer to sections containing the detailed block description and data model.

An important part of this work is to give an overview on how deep learning can be used to generate target lists with high resolution in azimuth and elevation, i.e. how the DOA estimation step can be implemented using DL approaches. Furthermore, the data models for generating training data are derived from the introduced radar signal models. Therefore, the basic processing steps in automotive radar and the involved data models are outlined in the following sections.

A. TIME-DOMAIN CHIRP-SEQUENCE RADAR
Although the DOA estimation methods described in this paper are not limited to a particular radar waveform, in this chapter we establish a relation to fast-chirp or chirp-sequence frequency-modulated continuous-wave (FMCW) radar, the predominant radar waveform used in the automotive domain today [4]. As illustrated in Fig. 2, the transmit signal s(t) here is given by a sequence of Q chirp waveforms s_c(t) spaced at the chirp-to-chirp distance T_cc,

s(t) = Σ_{q=0}^{Q−1} s_c(t − q T_cc), (1)

where each individual linear chirp waveform of duration T_c is given by

s_c(t) = exp(j 2π (f_l t + (μ/2) t²)), 0 ≤ t < T_c. (2)

The chirp rate, i.e., slope, is given by μ, the start (low) frequency by f_l and the stop (high) frequency by f_h = f_l + W, where W = μ T_c is the bandwidth of the chirp. The center frequency is defined as f_c = f_l + W/2. The length of the chirp sequence T_seq = Q T_cc − (T_cc − T_c) determines the total coherent processing interval (CPI). Each point-like target then contributes a delayed and weighted copy of the TX signal to the RX signal,

r_{k_tgt}(t) = a_{k_tgt} s(t − T_{d,k_tgt}(t)), (3)

where the target attenuation coefficient a_{k_tgt} takes into account weighting due to free-space propagation as well as the target RCS. The round trip time T_{d,k_tgt}(t) is time dependent, since it takes into account a possible movement of the targets. Typically the CPI is in the order of a few milliseconds or less, hence short compared to the target motion, and T_{d,k_tgt}(t) can then be well approximated by

T_{d,k_tgt}(t) ≈ T_{d,k_tgt}(0) + (2 v_{r,k_tgt} / c) t. (4)

It is the fundamental principle of FMCW radar that the TX signal s(t) and the RX signal r(t) are multiplied and the mixing product is lowpass-filtered to remove the component oscillating at the sum frequency (around twice the carrier frequency). The lowpass-filtered signal, also referred to as the IF or beat signal, is then sampled synchronously to the TX signal timing. For details see [3], [8] and the references therein. For each chirp, sampling results in a vector of P samples.
Sampling of Q chirps consequently results in Q vectors, each of length P, which are organized in a matrix B = (b_{p,q}), B ∈ C^{P×Q} with elements b_{p,q}. The index p is referred to as the fast time, the index q as the slow time.
Since we prefer an interpretation of the matrix elements as multi-dimensional signals, we use the notation b[p, q] = b_{p,q} in the remainder of this work. Assuming typical modulation parameters, this 2D sequence of elements of the radar data matrix in the fast-time/slow-time domain can then be expressed by [3], [8]

b[p, q] = Σ_{k_tgt=1}^{K_tgt} a_{k_tgt} e^{j(λ_{k_tgt} p + ζ_{k_tgt} q)} e^{j ϕ_{k_tgt}} + w[p, q]. (5)

The normalized radian frequency corresponding to the range dimension is given by

λ_{k_tgt} = 2π (2 μ r_{k_tgt}) / (c f_s), (6)

with the fast-time sampling rate f_s, the normalized radian frequency corresponding to the Doppler dimension by

ζ_{k_tgt} = 2π (2 f_c v_{r,k_tgt} / c) T_cc, (7)

and the target phase by

ϕ_{k_tgt} = 2π f_c T_{d,k_tgt}(0). (8)

From (5), (6), (7), and (8) it can already be seen how the essential parameters K_tgt, r_{k_tgt}, and v_{r,k_tgt} are present in the signal model.
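The fast-time/slow-time model, a sum of 2D complex exponentials, can be sketched numerically: after a 2D DFT, each target concentrates at its range-Doppler bin. The amplitudes and normalized frequencies below are illustrative values, not derived from a real sensor:

```python
import numpy as np

# Minimal synthesis of the 2D signal model: each point target contributes
# a 2D complex exponential over fast time (index p) and slow time (index q).
P, Q = 128, 64
targets = [  # (amplitude, lambda_k, zeta_k, phi_k), illustrative values
    (1.0, 0.8, 0.3, 0.0),
    (0.5, 1.9, -0.7, 1.2),
]
p = np.arange(P)[:, None]
q = np.arange(Q)[None, :]
B = sum(a * np.exp(1j * (lam * p + zet * q)) * np.exp(1j * phi)
        for a, lam, zet, phi in targets)
B += 0.01 * (np.random.randn(P, Q) + 1j * np.random.randn(P, Q))  # noise

# A 2D DFT concentrates each target at its (range, Doppler) bin;
# the strongest peak appears near l = lambda*P/(2*pi), m = zeta*Q/(2*pi).
X = np.fft.fft2(B)
l_max, m_max = np.unravel_index(np.argmax(np.abs(X)), X.shape)
print(l_max, m_max)
```

The strongest target (normalized frequencies 0.8 and 0.3) lands near bin (16, 3), illustrating how range-Doppler processing turns the parameter estimation into a peak search.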

B. TIME-DOMAIN ARRAY SIGNAL MODEL
To also establish a relation to target azimuth and elevation, an array of N_RX RX antennas can be used, which would be referred to as a single-input, multiple-output (SIMO) configuration. To further improve the angular resolution, several TX and RX antennas in a co-located coherent MIMO configuration are typically used in modern automotive radar sensors, resulting in a number of N = N_TX N_RX virtual receive antennas. A wide range of possible MIMO configurations exists, see [1], [3], [4]. Fig. 3 illustrates the radar sensor and its possible MIMO configuration used in this work. The single-channel model from (5) can then be extended by the antenna dimension n. Assuming targets in the far field of the co-located TX and RX antennas, the coordinate system origin defined in the vicinity of the antennas, and the narrowband condition, i.e. the targets appear with approximately the same λ and ζ at all antenna elements, the corresponding time-domain (TD) MIMO model is given by

b[p, q, n] = Σ_{k_tgt=1}^{K_tgt} a_{k_tgt} e^{j(λ_{k_tgt} p + ζ_{k_tgt} q)} e^{j ϕ_{k_tgt}} e^{j (2π/λ) (u_{k_tgt} · p_n)} + w[p, q, n]. (9)

Since the 3D sequence from (9) describes the elements of a 3D tensor B = (b_{p,q,n}), b_{p,q,n} = b[p, q, n], the data structure at this point is typically referred to as the TD radar data cube, see Fig. 1. Using spherical coordinates defined according to ISO 80000-2 [37], the unit-length direction vector (i.e. the DOA) pointing towards target k_tgt is given by

u_{k_tgt} = (cos φ_{k_tgt} sin θ_{k_tgt}, sin φ_{k_tgt} sin θ_{k_tgt}, cos θ_{k_tgt})^T, (10)

where φ_{k_tgt}, θ_{k_tgt} are the azimuth and inclination of the target, and its elevation is given by

ε_{k_tgt} = π/2 − θ_{k_tgt}. (11)

The positions p_n of the virtual antennas n = 0 . . . N − 1 are given by the pairwise sums of the physical TX and RX antenna positions,

p_n = p_{TX, ⌊n/N_RX⌋} + p_{RX, n mod N_RX}, (12)

which is illustrated in Fig. 3 b) and c) for a radar sensor with N_TX = 3 and N_RX = 4 antennas. It is obvious that the DOA of the target enters (9) via an additional phase (2π/λ)(u_{k_tgt} · p_n) dependent on the position of the virtual antenna. ϕ_{k_tgt} is the phase with which the target would appear for an antenna at the origin of the reference coordinate system.
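Under the ideal far-field model, the virtual array geometry and the resulting steering vector are straightforward to compute. The sketch below assumes a hypothetical 3 TX x 4 RX layout with half-wavelength RX spacing, chosen so the virtual elements form a uniform linear array; it is not the exact layout of Fig. 3:

```python
import numpy as np

def virtual_positions(p_tx, p_rx):
    """Virtual MIMO element positions as all pairwise sums of the
    TX and RX antenna positions (co-located coherent MIMO)."""
    return np.array([t + r for t in p_tx for r in p_rx])

def steering_vector(positions, wavelength, azimuth, inclination=np.pi / 2):
    """Ideal far-field steering vector for the direction u(phi, theta)."""
    u = np.array([np.cos(azimuth) * np.sin(inclination),
                  np.sin(azimuth) * np.sin(inclination),
                  np.cos(inclination)])
    return np.exp(1j * 2 * np.pi / wavelength * positions @ u)

# Illustrative geometry: 4 RX at lambda/2 spacing, 3 TX at 2*lambda spacing,
# which yields a 12-element uniform virtual array at lambda/2 spacing.
lam = 3e8 / 76.5e9
p_rx = [np.array([n * lam / 2, 0.0, 0.0]) for n in range(4)]
p_tx = [np.array([m * 4 * lam / 2, 0.0, 0.0]) for m in range(3)]
pos = virtual_positions(p_tx, p_rx)            # 12 virtual elements
a = steering_vector(pos, lam, np.deg2rad(10.0))
print(pos.shape, a.shape)
```

The design choice of TX spacing equal to the full RX aperture is a common way to obtain a filled uniform virtual array; other spacings yield sparse or redundant virtual geometries.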
In the remainder of this work, we do not consider any specific details of the MIMO radar implementation, e.g. how orthogonality between the TX waveforms is achieved [1], [3] or how effective parameters such as T_cc might be influenced by the MIMO principle used [38]. Instead, we simply work with the model that we observe the incident target echoes using N RX antennas at positions p_n.

C. RANGE-DOPPLER-DOMAIN ARRAY SIGNAL MODEL
Due to the regular sampling grid in the range and Doppler dimensions, as well as the comparably high inherent resolution in both (see section II-D), it is common practice to apply a 2D discrete Fourier transform (DFT), typically implemented by means of a hardware-accelerated fast Fourier transform (FFT) in modern automotive radar SoCs, along the fast-time and slow-time dimensions of the radar data cube. Using the discrete-time Fourier transform (DTFT), the model for the elements of the radar data cube in the range-Doppler domain (RDD) can then be expressed by [8]

x_n(λ, ζ) = Σ_{k_tgt=1}^{K_tgt} a_{k_tgt} e^{j ϕ_{k_tgt}} W_ft(λ − λ_{k_tgt}) W_st(ζ − ζ_{k_tgt}) e^{j (2π/λ) (u_{k_tgt} · p_n)} + η_n(λ, ζ). (13)

Note that (13) is expressed in a hybrid form, since the DTFT across the fast-time p and slow-time q indices yields a continuous-valued 2D function, but the antenna dimension n is still discrete. It is obvious that (13) is a sum of the transforms of the window functions W_ft(λ) and W_st(ζ), applied in fast time and slow time, shifted to the respective target frequencies λ_{k_tgt} and ζ_{k_tgt}. Both the common phase term ϕ_{k_tgt} (8) as well as the phase distribution e^{j (2π/λ) (u_{k_tgt} · p_n)} across the RX channels remain unchanged, and η_n(λ, ζ) is the frequency-domain noise. The transition to the discretized frequency indices l and m, as obtained from a DFT, is given by

λ_l = 2π l / P_FFT,   ζ_m = 2π m / Q_FFT. (14)

The discretized model for the elements of the radar data cube in the range-Doppler domain X = (x_{l,m,n}), again using our preferred notation x_{l,m,n} = x[l, m, n], can then be expressed by

x[l, m, n] = x_n(λ_l, ζ_m) = Σ_{k_tgt=1}^{K_tgt} a_{k_tgt} e^{j ϕ_{k_tgt}} W_ft(λ_l − λ_{k_tgt}) W_st(ζ_m − ζ_{k_tgt}) e^{j (2π/λ) (u_{k_tgt} · p_n)} + η[l, m, n]. (15)

The DFT grid in the range dimension is l = 0, . . . , P_FFT − 1 and in the Doppler dimension m = 0, . . . , Q_FFT − 1.

D. TARGET DETECTION, DETECTION LIST, AND RANGE-DOPPLER RESOLUTION
Targets are typically detected by applying a peak-search algorithm, referred to here as pks(·), to the magnitude square of (15) along the range and Doppler indices, i.e. the periodogram is used as the estimator for K_tgt, r_{k_tgt}, and v_{r,k_tgt}. Therefore the antenna dimension n is flattened, which could be realized by selecting a single channel n_det for detection, by an incoherent addition of all N or a subset of the available antenna channels, or by alternative approaches such as a coherent addition of the antenna channels [39]. The latter would correspond to a range-Doppler (RD) image obtained from RX beamforming, i.e. detection in a slice through the radar data cube in the range-Doppler-angular domain. In any case, the detection operation results in a set of K_det detected peaks, each represented by a tuple of their associated parameters. This set is referred to as the detection list in the remainder of this work. The mapping of the peak indices l_{k_det}, m_{k_det} to the corresponding physical target parameters such as r_{k_tgt} and v_{r,k_tgt} is given by (14) as well as (6) and (7). Peak interpolation methods [40] allow for a finer accuracy of λ, ζ and hence r, v_r than given by the DFT grid resolution, and strategies exist to resolve ambiguities occurring if the target range and/or speed exceed the unambiguous interval [41]. Without further elaborating on the details of the peak-search strategy pks(·), common methods in automotive radar are typically based on variants of constant false alarm rate (CFAR) detection principles, see [3], [8] and the references therein. Full 3D or 4D target detection in the range-Doppler-angular domain utilizing rectangular uniform antenna arrays and a 4D DFT has been reported [42], but is not considered in this work, where all angular processing should be implemented by means of machine learning approaches.
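Since the text only names the CFAR family of detection methods, a toy 1D cell-averaging CFAR illustrates the principle; the training/guard sizes and the scale factor below are illustrative choices, not tuned values from the literature:

```python
import numpy as np

def ca_cfar_1d(power, n_train=8, n_guard=2, scale=5.0):
    """Toy 1D cell-averaging CFAR: a cell is declared a detection if its
    power exceeds `scale` times the mean of the surrounding training
    cells (guard cells excluded). All parameters are illustrative."""
    detections = []
    n = len(power)
    for i in range(n):
        lo = max(0, i - n_guard - n_train)
        hi = min(n, i + n_guard + n_train + 1)
        train = np.concatenate([power[lo:max(0, i - n_guard)],
                                power[min(n, i + n_guard + 1):hi]])
        if train.size and power[i] > scale * train.mean():
            detections.append(i)
    return detections

rng = np.random.default_rng(0)
p = rng.exponential(1.0, 256)  # noise power (exponential for |CN|^2)
p[50] += 60.0                  # one strong target in cell 50
dets = ca_cfar_1d(p)
print(dets)
```

Comparing each cell against a threshold proportional to the locally estimated noise power keeps the false alarm rate roughly constant even when the noise floor varies over range.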
Resolution of targets at this stage is hence provided by resolution in range or Doppler, and for both it is dictated by the resolution of the DFT used for harmonic analysis [43].
Via (6) and (7) the DFT resolution can then be linked to the resolution

Δr ≈ c / (2 W) (19)

in the range dimension and

Δv_r ≈ c / (2 f_c T_seq) (20)

in the Doppler dimension. The high bandwidths W of up to 4 GHz at frequencies between 77 GHz and 81 GHz, together with a CPI duration of several ms, enable a range resolution in the order of Δr ≥ 37.5 mm and a velocity resolution in the order of Δv_r ≥ 0.3 m s^−1. Super-resolution frequency-estimation techniques could be applied in the range-Doppler domain [8] to further reduce those limits, but compared to a periodogram-based estimation they are computationally much more complex, have reduced robustness, or require more detailed models or preprocessing steps.
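A quick numeric check of the two resolution limits, using the parameter values quoted in the text (4 GHz bandwidth, a center frequency near 79 GHz, a CPI of a few ms; the exact CPI value is an assumption):

```python
# Numeric check of the limits delta_r ~ c/(2 W) and delta_v ~ c/(2 f_c T_seq).
c = 299792458.0  # speed of light in m/s
W = 4e9          # chirp bandwidth in Hz
f_c = 79e9       # center frequency in Hz (assumed, within 77-81 GHz)
T_seq = 6.3e-3   # coherent processing interval in s (assumed, "several ms")

delta_r = c / (2 * W)
delta_v = c / (2 * f_c * T_seq)
print(delta_r, delta_v)  # ~0.0375 m and ~0.3 m/s
```

The results reproduce the 37.5 mm range resolution and roughly 0.3 m/s velocity resolution stated above.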

E. DOA ESTIMATION AND TARGET LIST
In the DOA estimation step, the relevant angular information is first added to all tuples from the detection list (18). Therefore, each tuple from the detection list is assigned the corresponding receive vector

x_{k_det} = [ x[l_{k_det}, m_{k_det}, 0], . . . , x[l_{k_det}, m_{k_det}, N − 1] ]^T, (21)

which is extracted from the radar data cube in the range-Doppler domain (15) using the peak indices l_{k_det}, m_{k_det} obtained during detection (18), see also the inset of Fig. 1. Based on (13), this vector can be modeled by

x_{k_det} = Σ_{k_doa=1}^{K_doa} s_{k_doa} a(φ_{k_doa}, ε_{k_doa}) + ω, (22)

i.e., only a subset of K_doa targets contributes to that receive vector. The other targets are separated so far in the range-Doppler dimensions that their contribution is negligible. The complex weight s_{k_doa} of each contribution is given by

s_{k_doa} = a_{k_doa} e^{j ϕ_{k_doa}} W_ft(λ_{l_{k_det}} − λ_{k_doa}) W_st(ζ_{m_{k_det}} − ζ_{k_doa}), (23)

and takes into account both its attenuation coefficient a_{k_doa} as well as all effects associated with the fact that its range and Doppler frequency might not exactly correspond to the range-Doppler bin l_{k_det}, m_{k_det}. The steering vector

a(φ_{k_doa}, ε_{k_doa}) = [ e^{j (2π/λ) (u_{k_doa} · p_0)}, . . . , e^{j (2π/λ) (u_{k_doa} · p_{N−1})} ]^T (24)

models the distribution of the target signals to the elements of the RX array, and the noise term ω is commonly assumed to be spatially and temporally complex white Gaussian distributed, independent for all receive channels [8], [44]. It is obvious how (21)-(24) relate the particular case of automotive radar processing to the common array signal model [45]. This relation can be established similarly for different transmit waveforms, eventually resulting in the general signal model from (22), where s_{k_doa} then includes any waveform- and target-specific signal contributions. In a general sense, DOA estimation can be referred to as estimating K_doa, φ_{k_doa}, and ε_{k_doa} given the receive vector x_{k_det} of that peak and the known array geometry p_n.
Two important subclasses can be distinguished: DOA estimation without resolution capability, based on the a-priori assumption of K_doa = 1 and returning a (possibly erroneous) pair (φ̂_1, ε̂_1) even if the data comes from a scenario with K_doa > 1, and DOA estimation with resolution capability, which can resolve up to K_doa ≤ K_doa,max targets. In the framework of automotive radar signal processing, the former can then be formulated by the mapping

x_{k_det} ↦ (φ̂_{k_tgt}, ε̂_{k_tgt}), (25)

assigning a single azimuth φ̂_{k_tgt} and elevation ε̂_{k_tgt} estimate to each peak from the detection list, i.e. mapping a single peak to a single target. A common representative method for this is, e.g., the phase-monopulse or phase-matching algorithm [46, sec. 1.3.3], seeking a solution purely based on the RX phases from (24) for the case K_doa = 1. The latter, DOA estimation with resolution capability, can be described by

x_{k_det} ↦ { (φ̂_{k_doa}, ε̂_{k_doa}) | k_doa = 1, . . . , K̂_doa }, (26)

i.e. a single peak from the detection list is spatially resolved into multiple targets. At this stage, the description closes the loop to what is illustrated in Fig. 1: we have conceptually modeled how a point cloud, the target list, is generated from each radar measurement.
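The phase-monopulse idea for the K_doa = 1 case can be sketched for a uniform linear array: with known element spacing, the azimuth follows directly from the phase difference between adjacent elements. This is a minimal illustration of the principle, not the exact algorithm of [46]:

```python
import numpy as np

def monopulse_azimuth(x, d_over_lambda=0.5):
    """Toy phase-monopulse estimate for a single target (K_doa = 1) on a
    uniform linear array: sin(phi) = dphi / (2*pi*d/lambda), where dphi
    is the phase difference between two adjacent elements."""
    dphi = np.angle(x[1] * np.conj(x[0]))
    return np.arcsin(dphi / (2 * np.pi * d_over_lambda))

# Simulate a single noiseless target at 20 deg on a 4-element
# half-wavelength ULA and recover its azimuth.
phi_true = np.deg2rad(20.0)
n = np.arange(4)
x = np.exp(1j * 2 * np.pi * 0.5 * n * np.sin(phi_true))
print(np.rad2deg(monopulse_azimuth(x)))  # ~20
```

With noise, the pairwise phase differences would typically be averaged over all adjacent element pairs; the single-pair version keeps the principle visible.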

F. TRADITIONAL DOA ESTIMATION METHODS AND ANGULAR RESOLUTION
A variety of methods is known as specific implementations of the DOA task with resolution described by (26). Those range from diffraction-limited but simple Bartlett beamforming to highest-resolution but computationally demanding full maximum-likelihood searches [3]. Several approaches in between have been proposed, typically trading off resolution against computational complexity; examples range from MVDR, MUSIC, ESPRIT and min-norm [47] to more recent ideas such as the application of compressive sensing [48]. As illustrated before, DOA estimation is typically performed individually for each detection. The index k_det is consequently dropped from the receive vector in the remainder of this work and we only work with a receive vector x. An important quantity in array processing and a fundamental input for many DOA estimation techniques is the spatial covariance matrix

R_xx = E{ x x^H }, (27)

i.e., the second-order statistics across the antenna array element signals. Subspace-based DOA estimation algorithms like MUSIC determine the signal and noise subspaces based on the eigendecomposition of R_xx [1]. The signal sources in automotive radar are however correlated or coherent, R_xx becomes rank deficient, and the signal and noise subspaces can thus no longer be separated. Preprocessing techniques such as spatial smoothing [49] can be applied to particular array geometries to resolve coherence, but they reduce the number of available array elements at the output of the preprocessor. For the rather small antenna configurations found in automotive radar systems, those techniques are consequently not attractive. Instead, DOA estimation techniques which can directly deal with coherent signals are of interest for the automotive radar use case.
As the true covariance matrix is unknown in real applications, the sample covariance matrix

R̂_xx = (1/N_s) Σ_{s=1}^{N_s} x_s x_s^H (28)

is typically used as an estimate for R_xx, using N_s observation snapshots x_s [44]. Clearly, for R̂_xx ≈ R_xx numerous snapshots are required. However, as presented before, all of the multiple snapshots obtained in a radar measurement B, i.e. the fast-time and slow-time samples, are utilized for range-Doppler processing. Only a single snapshot of the receive vector x remains, rendering automotive DOA estimation a typical single-snapshot task.
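The single-snapshot limitation can be made explicit with a few lines of linear algebra: the outer product of one snapshot is always rank one, which is exactly what breaks the subspace separation. A minimal sketch:

```python
import numpy as np

# With a single snapshot x, the sample covariance x x^H has rank one,
# so signal and noise subspaces cannot be separated.
rng = np.random.default_rng(1)
N = 8
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)  # one snapshot
R1 = np.outer(x, x.conj())
print(np.linalg.matrix_rank(R1))  # 1

# With many independent snapshots, the estimate approaches full rank.
X = rng.standard_normal((N, 100)) + 1j * rng.standard_normal((N, 100))
R = X @ X.conj().T / 100
print(np.linalg.matrix_rank(R))  # 8
```

This is why techniques that operate directly on the single receive vector, rather than on an estimated covariance matrix, are attractive for automotive radar.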

G. ARRAY PERTURBATIONS
The model from (22) is also referred to as the ideal array model, since it assumes ideal point-like sensors spatially sampling the incident wavefield. However, real antenna arrays add various perturbations to this model: the antenna elements are non-isotropic, adding a spatially dependent amplitude and phase characteristic to the signals. Even if identical antenna elements are used, mutual coupling between the elements of the array or individually different element environments result in a possibly different amplitude and phase characteristic for each element. Especially in the case of automotive radar sensors, those effects are further influenced by the radome used to cover the sensors, as well as by further perturbations emerging from the integration of the sensors behind a bumper. Consequently, (22) is extended to

x = Σ_{k_doa=1}^{K_doa} s_{k_doa} g(φ_{k_doa}, ε_{k_doa}) ⊙ a(φ_{k_doa}, ε_{k_doa}) + ω, (29)

where the complex vector g(φ, ε) includes all perturbations of the ideal array response.

III. DEEP LEARNING FOR DOA ESTIMATION: A SURVEY
The idea of applying deep learning to DOA estimation is to replace the model-based estimation of K_doa, φ_{k_doa}, and ε_{k_doa} from (29) by a data-driven approach. In general, there are several ways to employ DL methods in this context, where models can be used for regression, classification or model order estimation (MOE) tasks. Although 2D DOA estimation is possible, the main direction of interest for automotive use cases is high resolution in the azimuth angle φ_{k_doa}. Similar to the majority of the reviewed literature, we consider azimuth-only DOA estimation for the remainder of this work. As illustrated in Fig. 4, the different approaches all utilize the complex receive vector x as a starting point, while the actual DL model implements different stages of the processing pipeline. In the following, we discuss the available literature on DL-based DOA estimation in detail, with respect to the performed tasks, the network architectures used and the selected model parameters.
All examined literature is also summarized and compared in Tab. 1, to highlight the similarities and differences of the state-of-the-art.

A. REGRESSION BASED DOA ESTIMATION
The estimation of the signal incident directions φ_{k_doa} from (29) can be framed as a regression problem for the vector φ̂. Therefore, the K_doa variables φ̂_{k_doa} shall be estimated from the single-snapshot covariance matrix R̂_xx, where φ̂_{k_doa} denotes the estimate of the k_doa-th signal DOA φ_{k_doa}. A straightforward deep neural network design directly uses the receive vector x or the covariance matrix R̂_xx as input, one or multiple hidden layers, and K_doa outputs to represent the estimates φ̂. Earlier neural network designs for DOA estimation were based on so-called radial basis function networks (RBFNs), utilizing a Gaussian kernel for each neuron [19], [20]. While the RBFN of Zooghby, Christodoulou, and Georgiopoulos (1997) [19] was able to outperform MUSIC with respect to estimation accuracy, actual target resolution capabilities (compare section II-D) were not assessed in these experiments. Another commonly known DL-based universal function approximator is the multi-layer perceptron (MLP) [50], also referred to as a fully-connected neural network (FCN) or multi-layer feed-forward network (FFN). Milovanovic et al. (2012) [20] analyze different antenna configurations as well as network architectures like RBFNs, MLPs and probabilistic neural networks (PNNs). Finally, they combine different networks to form a two-stage sectorization network to obtain high-resolution DOA estimates. Training an MLP-based deep neural network with simulation and measurement data from automotive radar sensors has successfully been demonstrated in [17]. The regression-based DOA estimation of two targets is performed based on a single snapshot, and the overall performance of this approach has been found to be close to that of the deterministic maximum likelihood estimator.
During training, an objective function, e.g. the mean squared error (MSE) between the estimates φ̂ and the true DOAs φ, is usually minimized to adapt the weights of the network according to the gradient of the objective function [15]. The target angles should therefore obey a deterministic order to ensure convergence of the model during training; otherwise the calculated gradient values depend on the order of the target angles and slow down convergence. In order to avoid this, the vectors φ̂ and φ can be sorted prior to calculating the error. For two targets, one can also apply a surjective transform to the vector φ = {φ_1, φ_2} to obtain an equivalent representation φ*(φ_1, φ_2) = {m, d}. The representation in the equivalent space is then independent of the sequence of the two angles, such that φ*(φ_1, φ_2) = φ*(φ_2, φ_1) holds [17]. The transform proposed in [17] is given by the mean and absolute distance of the two angles, m_12 = (φ_1 + φ_2)/2 and d_12 = |φ_1 − φ_2|.
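The mean/absolute-distance transform from [17] is simple to state in code; the sketch below also shows that it is order-invariant and invertible up to the ordering of the pair:

```python
def to_mean_distance(phi1, phi2):
    """Order-invariant representation of a two-target angle pair,
    as proposed in [17]: mean and absolute distance."""
    return (phi1 + phi2) / 2.0, abs(phi1 - phi2)

def from_mean_distance(m, d):
    """Recover the (sorted) angle pair from the (m, d) representation."""
    return m - d / 2.0, m + d / 2.0

# The representation does not depend on the ordering of the two angles ...
assert to_mean_distance(5.0, -3.0) == to_mean_distance(-3.0, 5.0)
# ... and is invertible up to ordering.
print(from_mean_distance(*to_mean_distance(5.0, -3.0)))  # (-3.0, 5.0)
```

Training on (m, d) instead of (φ_1, φ_2) therefore removes the permutation ambiguity from the loss without requiring sorting or permutation-invariant cost functions.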

B. MODEL ORDER ESTIMATION (MOE)
In regression-based approaches [17], [19], [20] the ANNs require exactly K_doa output nodes for the DOA estimates φ̂. Naturally this presumes prior knowledge of K_doa for a particular network design. While the number of sources is also a common prerequisite for several DOA estimation algorithms like MUSIC (since it relies on separating the signal and noise subspaces), the outputs of a neural network are usually fixed, and thus the number of targets to be estimated cannot be changed during inference. Hence, a prior estimation of the model order K_doa is necessary to serve as an additional input to an appropriate estimation algorithm or model, as outlined in Fig. 4 a). The model order can e.g. be estimated using a generalized likelihood ratio test (GLRT) algorithm, or a simplified variant in order to reduce computational complexity [8].
Deep learning based MOE has also been reported in the literature, as Tab. 1 indicates. Gardill, Fuchs et al. [21], [22] introduced an MLP-based MOE framework, where superior performance compared to the GLRT was demonstrated in simulations and validated with automotive radar measurement data. Rogers, Ball and Gurbuz (2019, 2021) [23], [24] propose an FFN classifier to estimate the number of sources and compare it with state-of-the-art methods like the Akaike information criterion (AIC) or minimum description length (MDL). Additionally, different combinations of input features are tested, with the best accuracy obtained using the spatially smoothed covariance matrix and its eigenvalues as inputs. The authors of [25] also study model order selection based on an FFN, while the actual DOA estimation is done with an MLE, using the regression-based estimates from a second FFN as initialization. They also suggest a cost function which minimizes the error over all possible permutations of the target estimates φ̂. While this adds significant computational load to the training procedure, further investigations also showed convergence of the network to ordered estimates when reverting to the element-wise error. Bialer, Garnet and Tirer (2019) [26] compare the computational complexity as well as the performance of deep neural networks with the MLE for joint estimation of the number of sources and their DOAs. Cong, Wang, et al. (2021) [27] propose a DL autoencoder to achieve spatial filtering of the signals with separate source number detection. Several directed acyclic graph (DAG) networks then perform regression of the DOA estimates, while the necessary MOE is implemented using a simple FFN.

C. SPECTRUM BASED DOA ESTIMATION
Another approach replaces the regression and preceding MOE block with a spectrum estimation, combining both model order and DOA estimation. The building blocks of a spectrum based approach are highlighted in Fig. 4 b). Similar to the spectrum formed by conventional digital beamforming or the MUSIC algorithm, the network output represents a discrete grid o ∈ R 1×L o with a cardinality of L o and each gridpoint relating to a particular angle from the angular range of interest φ FOV = [φ min , φ max ]. Thus, the networks response at the output node o n can be interpreted as the probability of a target being present at the specific angle φ FOV,n [29]. Accordingly, the network is trained using the targets likelihood at a given angle as a ground truth where φ FOV,i is the angle at grid index n within the field of view (FOV) and φ tgt is the set of existing target angles. For off-grid angles, the value of the bin closest to the actual DOA can simply be set to o n = 1. A different possibility is to apply peak interpolation, and describe off-grid targets via interpolated values at neighboring grid points [18], [33].
This vector o is also referred to as the multiple-hot encoded ground-truth vector, whereas the estimated spectrum is denoted by ô. While the regression based approach does not need an explicit target detection step, a peak search is necessary to extract the DOA estimates from the estimated spectrum, similar to traditional grid based DOA estimation.
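The nearest-bin multiple-hot encoding described above can be sketched in a few lines of numpy (a minimal illustration; function and parameter names are hypothetical):

```python
import numpy as np

def multi_hot_ground_truth(target_angles_deg, fov=(-60.0, 60.0), n_grid=120):
    """Build the multiple-hot encoded ground-truth vector o on a uniform
    angular grid: the bin closest to each true target DOA is set to 1."""
    grid = np.linspace(fov[0], fov[1], n_grid)      # discrete output grid phi_FOV
    o = np.zeros(n_grid)
    for phi in target_angles_deg:
        o[np.argmin(np.abs(grid - phi))] = 1.0      # nearest-bin (off-grid) encoding
    return grid, o

grid, o = multi_hot_ground_truth([-10.0, 10.0])     # two-target example scene
```

Peak interpolation, as in [18], [33], would instead distribute fractional values over the neighboring grid points.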
Ozanich, Gerstoft, and Niu (2019) [28] propose a spectrum based FFN for underwater acoustic localization of two sources using a single snapshot. They compare the performance of their network with a compressive sensing technique, sparse Bayesian learning (SBL), as well as conventional beamforming for both coherent and incoherent sources. Results indicate a performance of their network similar to the SBL approach, with a much faster prediction time. Spectrum based DOA estimation for a single snapshot in automotive radar is demonstrated in [29], [30], where different output transformations yield superior performance compared to a deterministic maximum likelihood (DML) estimator. Moreover, the performance and robustness of the model were increased by including array perturbations, as in (29), into the data model.
A popular multi-stage DL approach for DOA estimation has been proposed by Liu, Zhang, and Yu (2018) [18] (compare Fig. 4 d)). Multiple spatial filters are realized with an autoencoder (AE) structure, while a set of FFNs is trained to perform the final DOA estimation in each subregion. The performance is evaluated with respect to array imperfections, where the results demonstrate robustness of the network towards different array perturbations like gain, phase or sensor position errors. Further analysis of the mentioned autoencoder approach has been done, e.g. for coherent sources by Ahmed, Eissa, and Sezgin (2020) [31], and for low signal-to-noise ratios (SNRs) and sparse arrays by Papageorgiou and Sellathurai (2020) [32]. Papageorgiou and Sellathurai, however, use the AE output for DOA estimation with MUSIC, taking advantage of the denoised sample covariance matrix R_xx and highlighting the denoising capabilities of autoencoders in general.
Another ANN architecture, commonly known from image processing, is the convolutional neural network (CNN). As the name suggests, a convolutional layer performs a convolution of the input matrix with learned kernel weights, typically smaller than the input matrix [15]. This concept has been adapted and applied to DOA estimation in [33], comparing the performance for low SNRs to the MUSIC and Root-MUSIC algorithms via simulations. The proposed CNN architecture maps the three-dimensional tensor inputs to a one-dimensional output spectrum, as visualized in Fig. 4 c). A direct comparison of a simple FFN with different CNN architectures for N_s = 1 and N_s = 10 snapshots and coherent sources is presented by Ozanich, Gerstoft, and Niu (2020) [34]. Additionally, the authors compare different network design parameters, highlighting the importance of hyperparameter tuning using validation data sets when designing particular networks.
Yao, Lei and He (2020) [35] demonstrate accurate DOA estimation under coherent signals and unknown model order, using a convolutional-recurrent neural network (CRNN). Therein, CRNN blocks are used as spatial filters and subsequent multi-label classifiers, to obtain the combined DOA spectrum estimate. Additionally, a Toeplitz matrix reconstruction of the covariance matrix R_xx is embedded in the network and used to estimate the model order. Similarly, the authors of [36] propose an approach to DOA estimation utilizing DL for reconstruction of the array covariance matrix. They train a multi-layer FFN to learn the mapping of the sample covariance matrices of sequentially sampled subarrays to the covariance matrix of the full array.

D. DL DATASETS FOR DOA ESTIMATION
Deep learning techniques rely heavily on the availability of large amounts of data to train and evaluate the models. In 2016 it was suggested that DL models need to be trained with about 5 million labeled data points to match human performance [15]. While automatic feature learning during training requires significant amounts of data, in RF applications explicit feature engineering and domain knowledge can relax this constraint to some extent [16]. However, human expertise is often indispensable for dataset labeling, which makes the creation of RF datasets costly and time-consuming. Generating labeled data for automotive radar scenarios, e.g. by using a camera image for reference, is not a trivial task, even for human experts. One reason is that a target, e.g. as indicated in Fig. 1, cannot be retraced effectively. However, in the context of DOA estimation, the true parameters of each target k_tgt, i.e. its range r_ktgt and angle φ_ktgt, must be known to generate a ground truth. From the direct comparison of the described literature in Tab. 1, it is evident that the majority of approaches to DL based DOA estimation have been presented and evaluated on simulations only. Training and evaluating models using simulated data only requires a very careful selection of the signal model used to generate the data. If either the dataset or the signal model itself contains some fundamental issue (e.g. a dataset imbalance or inherent assumptions on target locations, like a symmetric or regular spacing), training will result in non-generic network performance, while results on similar testing data would still look superior. Potential solutions are training or testing on measurement data, or a combination of measurements and simulation data.
To create labeled measurement datasets for DOA estimation, conventional antenna measurements with a single corner reflector and a rotary positioner for the radar sensor can be used in a laboratory environment (e.g. an anechoic chamber), as described in [22]. The combination of several measurements at different angular positions enables the flexible creation of multi-target scenes. This yields datasets with certain advantages, since e.g. the true antenna perturbations are inherently captured by the measurement data according to (29). More sophisticated data samples can be generated by artificially altering model parameters, like the target signal strength or the total SNR of a measurement snapshot [30].
In terms of practicability, training distinct networks for individual sensors is not an option for commercial automotive radar. Thus, using sufficient measurement data from several sensors and antenna arrays is vital to incorporate multiple sensor models into the DL model. Previous works [17], [22], [30] have shown successful training of DL networks for single-snapshot DOA estimation, with hybrid approaches using real measurements for testing and combining simulations and measurements to generate large training datasets. These networks are able to generalize well on different radar sensor data and achieve super-resolution performance.

IV. EXPERIMENTAL SETUP AND DATA GENERATION
In order to compare the performance of different approaches to DL based DOA estimation with respect to our own dataset, we employ several data generation methods utilizing simulations and measurements. First, data is generated from the ideal signal model according to (22), without adhering to any sensor specific effects like individual antenna patterns, phase distortions or mutual coupling. Afterwards, single target measurements from 414 commercial automotive radar sensors are used to artificially generate scenes containing multiple targets. We then combine the model- and the data-based approach into a hybrid data synthesis, where data points are sampled from a hybrid model taking into account sensor specific effects like antenna gain and phase distortions. Finally, normalization and feature preprocessing of the data are performed to create the ANN inputs.

A. EXPERIMENTAL SETUP
All measurement data used in this work is obtained from antenna array calibration measurements of commercial automotive radar sensors in an anechoic chamber. The measurements were conducted using a single transmit channel N_TX = 1 and a total of N_RX = 4 receive channels for all sensors. The sensors' receive antennas form a uniform linear array (ULA) with half-wavelength inter-element spacing d_ULA = λ/2. Each sensor is placed on a rotary positioner, while a corner reflector is put at boresight at the other end of the chamber (compare Fig. 5 a)). Overall, the antenna array response of 414 different sensors is evaluated within the horizontal FOV of ±60°. Note, however, that not every sensor measurement was conducted with the same granularity with respect to the step size φ_step of the positioner. In order to realize a uniform dataset across all measurements, we interpolate gain and phase characteristics using piecewise polynomials to obtain the finest measured grid of φ_step = 0.2°. The most important sensor parameters used for the measurements are summarized in Tab. 2.

B. SYNTHETIC DATA GENERATION
In order to generate purely synthetic training data for DOA estimation, we can simply sample data points from the ideal signal model according to (22). Assuming an angular FOV of ±60° and a uniform grid with steps of φ_step = 1° yields 7,260 unique angular combinations for two targets. Likewise, three targets on the same grid already result in 287,980 unique combinations. Further decreasing the step size, as well as considering the other variable parameters from (22), quickly leads to enormous datasets, rendering a data generation in advance impracticable. For these reasons, we employ online data generation, where a specific input vector x is sampled from the model using uniform distributions (unless specified otherwise) for all parameters, with their minimum and maximum values outlined in Tab. 3. Data is generated for one, two and three targets, in order to ensure the network can be utilized as a reliable replacement for the DOA estimation step without the need of a prior model order estimation. In [51], the estimation performance of the DML estimator based on a simplified signal model has been shown to degrade heavily when disparate signal strengths are dismissed, i.e. when the amplitudes and DOAs of the targets are not estimated jointly. The unequal signal contributions to a single RD bin can also be interpreted as a displacement of two targets in range and/or velocity [51]. Thus, each target's signal attenuation a_k_doa is randomly varied, to simulate different target RCS and range-Doppler displacements. Since the DOA estimation use-case only considers targets in the same RD bin, we can limit each individual contribution a_k_doa to [0.5, 1], corresponding to a target at the border and at the center of the RD bin, respectively. Furthermore, the SNR is varied between 10 dB and 40 dB, in order to train all networks on a wide range of SNRs. In this work, we define the SNR similar to [52] as the ratio of the weakest target signal power to the noise power.
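For illustration, one such online sampling step for an N = 4 element λ/2-ULA might look as follows; the parameter ranges follow Tab. 3, but function names and sampling details are assumptions, and array perturbations are omitted:

```python
import numpy as np

def sample_snapshot(rng, n_ant=4, fov=(-60.0, 60.0), snr_db=(10.0, 40.0)):
    """Draw one single-snapshot sample from the idealized ULA signal model:
    random target count, DOAs, attenuations a_k in [0.5, 1], phases and SNR."""
    k = rng.integers(1, 4)                               # 1 to 3 targets
    phis = np.deg2rad(rng.uniform(fov[0], fov[1], size=k))
    amps = rng.uniform(0.5, 1.0, size=k)                 # per-target attenuation a_k
    phases = rng.uniform(0.0, 2 * np.pi, size=k)         # uniform signal phases
    n = np.arange(n_ant)
    x = np.zeros(n_ant, dtype=complex)
    for a, ph, phi in zip(amps, phases, phis):
        x += a * np.exp(1j * ph) * np.exp(1j * np.pi * n * np.sin(phi))
    # SNR defined w.r.t. the weakest target signal power, as in [52]
    snr = rng.uniform(snr_db[0], snr_db[1])
    sigma = amps.min() / np.sqrt(10.0 ** (snr / 10.0))
    noise = sigma / np.sqrt(2) * (rng.standard_normal(n_ant)
                                  + 1j * rng.standard_normal(n_ant))
    return x + noise, np.rad2deg(phis)

rng = np.random.default_rng(0)
x, phis = sample_snapshot(rng)
```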
In general, the minimum training SNR is expected to limit the overall performance of the estimation on low SNR data [33]. On the other hand, the estimation accuracy on high SNR data is not limited by the training SNR, which results in a trade-off between training complexity and prediction accuracy [29].

C. DOA MEASUREMENTS OF MULTIPLE TARGETS
While it is simple to generate infinite samples using simulations, acquiring a sufficient amount of measurements with ground truth is usually much harder. As described in section IV-A, we use the single-target measurements from 414 automotive radar sensors to create the multi-target data. The normalized magnitude and phase responses from all measurements are shown for the N = 4 channels in Fig. 5 b) and c), respectively. The magnitude of the complex receive signal at a specific angle corresponds to the angle dependent antenna gain of each receive channel. Evaluating the relative phase of each measurement with respect to the first antenna element essentially yields the phase of the complex term g(φ_k_doa, 0)a(φ_k_doa, 0) from (29), thus combining steering vector and antenna array perturbations. In Fig. 5 c) the mean phase value over all measurements for each channel is plotted as a solid line, while the transparent area highlights the range of phase values within the measurements, attributed to the antenna array perturbations of different sensors.
As the DOA signal model from (29) is a linear combination over all targets, a summation of different single-target azimuth measurements x^meas_k_doa will result in a valid multi-target scene. The resulting signal x^meas for a particular multi-target scene with targets at azimuth angles φ = [φ_1, . . . , φ_K_doa] is thus obtained through the sum x^meas = Σ_{k_doa=1}^{K_doa} x^meas_k_doa. This will reduce the effective SNR by 10 log10(K_doa) dB, but in general should not limit the estimation performance, as the anechoic chamber measurements already yield a very high SNR. For instance, the average SNR over multiple anechoic chamber measurements of a single target at φ_k_doa = 0° was determined as 78.2 dB.
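In code, the superposition of measured single-target snapshots and the associated SNR loss can be sketched as follows (with toy vectors standing in for real measurement data):

```python
import numpy as np

def combine_single_target_snapshots(snapshots):
    """Superimpose K_doa single-target measurement snapshots into one
    multi-target scene; each snapshot carries its own noise, so the
    effective SNR drops by 10*log10(K_doa) dB."""
    x_meas = np.sum(snapshots, axis=0)                 # linear signal model
    snr_loss_db = 10.0 * np.log10(len(snapshots))
    return x_meas, snr_loss_db

x1 = np.ones(4, dtype=complex)                         # placeholder snapshots
x2 = np.ones(4, dtype=complex)
x_meas, loss_db = combine_single_target_snapshots([x1, x2])
```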

D. HYBRID DATA SYNTHESIS
The application of DL models in real world scenarios, e.g. integrating them into the sensor digital signal processor (DSP) or similar on-board processing units, necessitates a sensor specific training to incorporate real world effects into the models. Additionally, the models require a certain robustness with respect to SNR or other model and antenna variations, e.g. as described in (29). Consequently, we propose a combination of measurements and simulated data to create synthesized RF data. Doing so yields the combined benefits of a flexible and infinite data generation, while ensuring that models can be validated by means of real measurement data. Combining the previous approaches, the antenna array perturbations from (29) are substituted with measurements of the real antenna patterns depicted in Fig. 5. Accordingly, we obtain the hybrid model for multiple targets, x = Σ_{k_doa=1}^{K_doa} s_k_doa a^meas(φ_k_doa) + ω, where s_k_doa = a_k_doa e^{jϕ_k_doa} models the complex target signal with uniformly distributed phase and ω is modeled as complex white Gaussian noise. Following (29), the measured steering vector a^meas(φ_k_doa) = g(φ_k_doa, 0)a(φ_k_doa, 0) captures both the measured antenna array perturbations from Fig. 5 b) and c), as well as the signal distribution for a target at φ_k_doa. The simulation parameters for the hybrid data synthesis are kept unchanged from the ideal signal model, compare Tab. 3.

E. FEATURE SELECTION AND PREPROCESSING
Similar to many model-based DOA estimation methods, the majority of the reviewed literature uses the snapshot covariance matrix R̂_xx, or quantities derived from R̂_xx, as inputs for the DL estimator (compare Tab. 1). As the covariance matrix is a complex-valued quantity, a transformation from C^{N×N} to R^{N×N×2} is performed first. Depending on the network implementation, either the tensor of dimension R^{N×N×2} or its flattened vector representation of size R^{2N²×1} can then be used as input vector i. Liu et al. (2018) [18] proposed to use only the upper triangular part of the real and imaginary values of R̂_xx, due to the Hermitian nature of the covariance matrix, i = [R{triu(R̂_xx)}; I{triu(R̂_xx)}] (34). In (34), triu denotes the operation of taking the off-diagonal upper right part of the matrix and vectorizing the result. The authors state that the diagonal elements are excluded on purpose because of the unknown noise variance. Subsequent publications [31], [32], [53], utilizing the same autoencoder concept, have also adopted this input format. In contrast, convolutional networks use images with multiple channels (i.e. the third image dimension), also referred to as input tensor I. Ozanich et al. (2020) [34] directly input the already square covariance matrix into the network by using real and imaginary parts in the channel dimension, [R{R̂_xx}; I{R̂_xx}] ∈ R^{N×N×2}. The authors of [33] use an input tensor with three channels, appending the complex argument to real and imaginary part, I = [R{R̂_xx}; I{R̂_xx}; arg(R̂_xx)] ∈ R^{N×N×3} (35). In the single snapshot scenario, the covariance matrix representation does not yield any additional information compared to the receive signal x. Thus, the authors of [29] directly use real and imaginary parts of the receive vector x as network inputs, i = [R{x}; I{x}] (36). To ensure a specific range of input values or remove bias from the input data, feature preprocessing or standardization methods are used.
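The input formats discussed above can be illustrated with numpy for a four-element array (a sketch; helper names are hypothetical):

```python
import numpy as np

def triu_features(R):
    """Input of [18]: off-diagonal upper-triangular part of the Hermitian
    covariance matrix, stacked as [Re; Im] (diagonal excluded due to the
    unknown noise variance)."""
    iu = np.triu_indices(R.shape[0], k=1)          # strictly upper triangle
    return np.concatenate([R[iu].real, R[iu].imag])

def cnn_input_tensor(R):
    """Input of [33]: real part, imaginary part and complex argument of
    R_xx stacked as a three-channel image-like tensor."""
    return np.stack([R.real, R.imag, np.angle(R)], axis=-1)

x = np.array([1.0 + 0j, 1j, -1.0, -1j])            # example snapshot, N = 4
R = np.outer(x, x.conj())                          # single-snapshot covariance
i_vec = triu_features(R)                           # 2 * N(N-1)/2 = 12 features
I_in = cnn_input_tensor(R)                         # shape (4, 4, 3)
```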
A normalization to achieve zero mean and unit variance is often used in DL applications, since it has been shown to speed up training convergence and to avoid unwanted emphasis of certain features [54]. For the automotive use case, this is equivalent to removing the influence of the total receive energy from the signals. However, only a small subset of the reviewed literature applies explicit normalization of the input data. Liu et al. (2018) [18] perform a normalization of the real and imaginary parts of the covariance matrix by dividing R̂_xx by its Frobenius norm ‖R̂_xx‖_F. Similarly, the authors of [29] use the L2 norm of x for normalization, to achieve a range of i_n ∈ [−1, 1] for all input values.
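Both normalizations amount to a one-line operation each (a sketch; the exact placement in the preprocessing chain is an assumption):

```python
import numpy as np

def normalize_frobenius(R):
    """Normalize the sample covariance matrix by its Frobenius norm [18]."""
    return R / np.linalg.norm(R, 'fro')

def normalize_l2(x):
    """Normalize the receive vector by its L2 norm, bounding all real and
    imaginary input features to [-1, 1] [29]."""
    return x / np.linalg.norm(x)

R_n = normalize_frobenius(np.array([[2.0, 1j], [-1j, 2.0]]))
x_n = normalize_l2(np.array([3.0 + 4.0j, 0.0]))
```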

V. DEEP LEARNING BASED DIRECTION-OF-ARRIVAL ESTIMATION
Now, we use the described datasets and methods in order to evaluate three selected deep neural network designs from the literature for spectrum based DOA estimation in the complex and challenging automotive environment. The architectures (compare Fig. 6) are analyzed with respect to their inputs, building blocks, contained layers, output representations and the hyperparameters used for training. All parameters and configurations are adopted directly from the respective literature, without altering or optimizing them where possible. This is done explicitly to emphasize a comparison ''as is,'' and to analyze the differences, advantages and drawbacks between the different architectures and approaches.

A. SPATIAL FILTERING AUTOENCODER (AE)
The first network in Fig. 6 a) corresponds to the multistage approach from Liu et al. (2018) [18], consisting of a multitask autoencoder followed by a series of parallel MLP classifiers. Autoencoders are DL networks which perform a reconstruction task, typically trained by unsupervised learning. After reducing the inputs to a compressed representation, the latent variables, a decoder network reconstructs the input from this intermediate representation [15]. This can e.g. be used for data denoising, or to find efficient compressed data representations. Additional supervised training with noisy inputs and an undistorted ground truth will explicitly enhance the denoising capability of AEs [55]. The analyzed spatial filtering AE uses the upper right part of R̂_xx (compare (34)) as input to the encoder. A decoder reconstructs the signal components from P distinct spatial regions, i.e. the covariance matrices R̂^{p_i}_xx containing only signal components from a specific spatial subregion p_i. The output of subregions not containing any target signal should consequently be zeroed. Since the autoencoder uses linear activations to achieve an additive mapping AE(z_a + z_b) = AE(z_a) + AE(z_b), it effectively extracts the principal components of the input; indeed, linear autoencoders have been shown to be equivalent to a principal component analysis (PCA) [56]. Thus, the spatial filtering process creates more concentrated distributions within each subregion, which enables an easier generalization for subsequent DOA estimators.
The actual DOA estimation is done by P parallel classifiers, each implemented as an MLP with two hidden layers with tanh activation functions and an output layer with linear activation. The covariance matrices R̂^{p_i}_xx of the subregions are used as inputs, while each classifier outputs a subset of the spectrum grid. Concatenating all outputs then yields the estimated spatial spectrum ô for the full FOV.
A detailed overview of all layers in the autoencoder and the parallel classifiers can be found in [18]. The particular implementation used in this work is adjusted to match the requirements for a single snapshot DOA estimation using the radar parameters from Tab. 2. Due to the much coarser Rayleigh resolution limit of the used four-element ULA, we only use P = 4 subregions and MLPs instead of 6. The AE part of the framework is then trained separately on the reconstruction and spatial filtering task in advance. After finishing the training of the first stage, the full network is assembled and trained again, while all weights of the AE are frozen so that they are not adjusted further during training of the second stage. Both the AE and the full network are trained with a squared L2 loss ‖o − ô‖²₂ for optimization. The final trained network has only 13,657 trainable parameters in total.

B. DEEP CONVOLUTIONAL NEURAL NETWORK (CNN)
The second architecture in Fig. 6 b) shows a convolutional neural network trained for DOA estimation as multilabel classification on a discrete grid. Papageorgiou et al. (2021) [33] designed a CNN with 24 layers in total, where the convolutional part consists of 4 blocks with convolutional layers, batch normalization and relu(z) = max(0, z) activations. A succeeding flatten layer converts the multidimensional tensors into a vector representation that is fed to an MLP with 4 fully connected layers with relu activations. The output layer uses a sigmoid activation to obtain the resulting output spectrum ô with o_i ∈ [0, 1]. As described in (35), real part, imaginary part and angle of R̂_xx are stacked to form a 3D input tensor, handed to the first convolutional layer. Compared to the AE, where the first stage performs spatial filtering of the input signal, the CNN approach lacks interpretability of the convolutional operations within the architecture. This is a common problem in DL, which has recently been gaining more attention; accordingly, research is being conducted towards explainable and interpretable machine learning [57], [58].
A detailed overview and explanation of all layers within the CNN architecture can be found in [33]. Our implementation of the CNN differs from the literature only in the first layer, since the number of elements in R̂_xx is much smaller for a four-element ULA. As a result, it is no longer possible to use the first convolutional layer with a kernel size of (3 × 3) and strides of 2 in combination with an input of dimension R^{4×4×3}. Consequently, we adjust the kernel size to (1 × 1) and the strides to (1 × 1) in the first layer, to yield exactly the same dimensions as the model from [33] after the first layer. In total, the number of trainable parameters is smaller compared to the original work, due to the smaller input dimensions and thus fewer weights in the first layer. With 12,454,776 total parameters, the CNN is still the largest of the examined network architectures.

FIGURE 6. Overview of the different evaluated DL architectures for spatial spectrum estimation. The autoencoder based architecture (a) from [18] performs spatial filtering and constructs the spectrum from several parallel MLPs. The deep CNN based architecture (b) from [33] uses several convolutional layers, followed by MLPs to extract the spectrum estimate. The third approach consists of a single MLP (c) from [29] to obtain the spectrum estimate directly from the inputs. Different network input representations and the total number of parameters in each network are shown on the left.

C. DEEP MULTI-LAYER PERCEPTRON (MLP)
The third evaluated network is a deep fully-connected MLP presented in [29], [30], with the network architecture as displayed in Fig. 7. It features an input layer directly fed with the receive signal vector according to (36), followed by three hidden layers (HL) with 256, 512, and 1024 nodes, respectively. The exponential linear unit (elu) and tanh are used as activations, where the elu function is given by elu(z) = z for z > 0 and elu(z) = α(e^z − 1) for z ≤ 0. A fourth hidden layer with the same dimensions as the output layer uses a sigmoid activation to obtain likelihood values within the range [0, 1]. Ultimately, the output layer (OL) uses a softmax activation, which converts the individual probabilities into an actual probability density spectrum (PDS). A custom loss function combining the categorical cross-entropy with the focal loss, as proposed in [59], is used for optimization. The focal loss balances the influence of positive examples (i.e. an existing target at the specific output grid angle) and negative examples (i.e. no target at the specific grid angle), since there are always significantly more grid points without targets than grid points containing targets. Since the predicted output spectrum ô in the single-snapshot case will not contain extremely sharp peaks, the multiple-hot encoded ground truth o will not be matched very closely. In other words, an estimate very close to the actual DOA would still yield the same penalty as a completely wrong guess, which could impede the convergence of the model or produce a very noisy spectrum. To mitigate this effect, a so-called ''bell shape'' is sampled from the sinc function and used instead of the sharp peaks from (31), resulting in less penalization of estimates very close to the true target DOA [30].
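A possible construction of such a sinc-based bell shape is sketched below; the lobe width and the combination of overlapping lobes are assumptions, as the exact parametrization from [30] is not reproduced here:

```python
import numpy as np

def bellshape_ground_truth(target_angles_deg, fov=(-60.0, 60.0),
                           n_grid=120, width_deg=3.0):
    """Replace the sharp multiple-hot peaks by a sinc-shaped 'bell' around
    each true DOA, so that near-misses are penalized less during training."""
    grid = np.linspace(fov[0], fov[1], n_grid)
    o = np.zeros(n_grid)
    for phi in target_angles_deg:
        bell = np.abs(np.sinc((grid - phi) / width_deg))   # main lobe at phi
        o = np.maximum(o, bell)                            # combine all targets
    return grid, o

grid, o = bellshape_ground_truth([0.0])                    # single-target example
```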

VI. PERFORMANCE EVALUATION OF DEEP LEARNING BASED ESTIMATORS
In the following section, we train the aforementioned DL based DOA estimators on datasets created according to the methods explained in section IV. After introducing the training and testing conditions, the models are evaluated for each scenario with respect to the achievable accuracy, as well as to one of the most important quantities for automotive radar, namely the resolution of targets. The results are interpreted taking into account the differences in their DL architectures, while comparing with benchmarks of high-resolution techniques like the DML estimator.

A. TRAINING AND TESTING SCENARIOS
For training the networks on spectrum estimation, the output grid o ∈ R^{1×L_o} needs to be fixed, essentially limiting the achievable estimation accuracy of the network without explicit post-processing. Liu et al. (2018) [18] use L_o = 120 output nodes, while the authors in [33] use L_o = 121 nodes. The authors of [30] utilize the finest grid with L_o = 1201 output nodes for the MLP. Since the output of the MLP has been found to be very noisy and requires special post-processing [30], while the AE architecture by design needs a number of output nodes divisible by P, we chose to employ a common compatible output grid for all three estimators. Therefore, the grid with L_o = 120 grid points and the FOV φ_FOV = [−60°, 60°] is used, resulting in output bins of size Δφ_o = (φ_max − φ_min)/(L_o − 1). Accordingly, we set the minimum separation of two targets to the minimum detectable separation within the output grid, φ_min = Δφ_o ≈ 1.01°. Since we effectively have a ULA with N = 4 antenna elements, it is well known that the maximum number of targets which can be resolved in the case of single-snapshot DOA estimation of coherent sources is K_doa = 3 [13]. This limits the maximum number of targets we are considering. The number of targets is varied randomly for each generated sample. In order to emphasize the hard examples with K_doa > 1 in the training set, the probabilities are adjusted such that a sample with K_doa = 2 is twice as likely and a sample with K_doa = 3 is three times as likely as K_doa = 1. The angle of the first target is always sampled uniformly from φ_FOV, while the angles of the other targets need to comply with the restrictions imposed by φ_min and are therefore not uniformly distributed. Additionally, the SNR is sampled uniformly from 10 dB to 40 dB. Models are trained using a batch size of 32 for 200 epochs with 3,200 steps performed during each epoch; consequently, each model has been trained with a total of 20,480,000 data samples.
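The described sampling of the target count and angles can be sketched as follows (rejection sampling for the minimum separation is an assumption about the implementation):

```python
import numpy as np

def sample_targets(rng, fov=(-60.0, 60.0), min_sep=1.01):
    """Sample K_doa with probabilities 1:2:3 for K = 1, 2, 3, then draw
    DOAs that respect the minimum separation within the FOV."""
    k = rng.choice([1, 2, 3], p=np.array([1.0, 2.0, 3.0]) / 6.0)
    angles = [rng.uniform(fov[0], fov[1])]          # first target: uniform
    while len(angles) < k:
        cand = rng.uniform(fov[0], fov[1])
        if all(abs(cand - a) >= min_sep for a in angles):
            angles.append(cand)                     # accept only if separated
    return np.sort(np.array(angles))

rng = np.random.default_rng(0)
scene = sample_targets(rng)
```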
The Adam optimizer with default parameters (α lr = 0.001, β 1 = 0.9, β 2 = 0.999) is used for all scenarios since it shows a robust performance and has been used in the reference literature [18], [30], [33]. All training parameters are summarized in Tab. 4.
Due to the large number of variables and the online data generation process, the possibility of a network memorizing the dataset, and thus over-fitting, can be ruled out. However, the networks are still tested with different configurations (e.g. different SNRs, measurement data or off-grid DOAs), which have not been seen during training. Using identical training data sets as well as matching output layers ensures that all performance metrics can be directly compared.

1) SCENARIO A
The first evaluation scenario considers a training procedure with the parameters from Tab. 4 and pure simulation data as described in section IV-B and Tab. 3. All performance evaluations of scenario A are also performed with purely synthetic data. This scenario is therefore used to evaluate the general performance of the models with respect to the single snapshot signal model, ignoring array and sensor specific perturbations.

2) SCENARIO B
The second scenario is used to investigate the influence of realistic gain and phase perturbations of real automotive radar sensor arrays. For this purpose, the networks trained in scenario A are tested on the hybrid datasets derived from real measurement data as described in section IV-D. Here, mainly the effects of unknown sensor perturbations on the pre-trained network responses are evaluated.

3) SCENARIO C
Finally, we employ the training procedure with the parameters from Tab. 4 and data generated according to the hybrid model, using data from 414 different sensors (compare (33) and Tab. 2). The measurement data is split into training and testing sets, such that 85% (352 individual sensor measurements) are used for the data generation during training and the remaining 62 sensor measurements are used for the evaluation of the performance metrics.

FIGURE 8. Exemplary normalized angular spectra (blue), predicted by the three different ML architectures for two targets of equal strength. The true azimuth angles (φ_1 = −10° and φ_2 = 10°) are depicted as green triangles; the actual estimates, determined from the spectrum maxima, are marked with red crosses.

B. PERFORMANCE METRICS
An important measure for automotive super-resolution DOA estimation is the resolution of targets. It is measured in terms of the probability of resolution over 1000 Monte Carlo trials for a specific source separation Δφ of two targets. The two targets are placed symmetrically around the origin (zero azimuth) with a total separation of Δφ. The following condition for a scene being counted as successfully resolved is used: φ_1 ≤ φ̂_1 ≤ 0° and 0° ≤ φ̂_2 ≤ φ_2, i.e. a scene is counted as successfully resolved only if both estimates lie between their respective true DOA φ_1 or φ_2 and the center point 0°. The second and third metric evaluate the mean squared error for K_doa = 2 and K_doa = 3 targets with a constant separation, calculated and evaluated over a range of SNR values. Finally, we use the accuracy of the predictions, where we define a scene with an arbitrary number of targets K_doa to be estimated correctly if |φ̂_k − φ_k| < 3° holds for all k = 1, . . . , K_doa. Thus, only estimates with less than 3° absolute error from the true DOA are counted as correct.
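The two decision rules can be written down directly (a sketch mirroring the criteria above; function names are hypothetical):

```python
import numpy as np

def is_resolved(est1, est2, delta_phi):
    """Resolution criterion for two targets at +/- delta_phi/2 around 0 deg:
    each estimate must lie between its true DOA and the center point."""
    phi1, phi2 = -delta_phi / 2.0, delta_phi / 2.0
    e1, e2 = sorted((est1, est2))
    return (phi1 <= e1 <= 0.0) and (0.0 <= e2 <= phi2)

def is_accurate(estimates, true_doas, max_err_deg=3.0):
    """Accuracy criterion: every sorted estimate within 3 deg of its true DOA."""
    err = np.abs(np.sort(estimates) - np.sort(true_doas))
    return bool(np.all(err < max_err_deg))
```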
In order to provide a baseline for DOA estimation, we compare the networks' performance with the conventional (Bartlett) beamformer, the Root-MUSIC algorithm and the DML estimator [12], [60], [61]. The specific DML implementation is based on the ideal signal model from (22) and does not take into account the perturbations described in (29). Note that due to its extreme computational complexity, we only use the DML for comparison in scenes with two targets. The grid-based estimators (Bartlett and DML) are evaluated on the same grid as the networks' output layer. The actual extraction of target estimates from all estimated spectra is performed using a simple 1D search for local maxima, which need to be separated by at least Δφ_min. In this work, we apply the find_peaks function from the scipy.signal Python package, since our focus lies in the direct comparison of the DL algorithms using the proposed data generation methods. Some performance metrics require the correctly estimated number of targets for their calculation (e.g. MSE and accuracy). If fewer peaks than the true number of targets K_doa are detected, the scene is counted as ''not resolved,'' which means it is excluded from the MSE metric, but still counted as p_acc,i = 0 for the accuracy metric. If more peaks are detected, only the K_doa largest peaks are used. Final estimates are then sorted in ascending order for all metric evaluations.
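This extraction step can be sketched as follows; the grid spacing and the default minimum separation are illustrative assumptions, not values from the evaluation setup:

```python
import numpy as np
from scipy.signal import find_peaks

def extract_doas(spectrum, grid, k_doa, sep_min_deg=5.0):
    """Extract up to k_doa DOA estimates from an angular (pseudo-)spectrum
    sampled on a uniform grid (degrees). Peaks closer than sep_min_deg are
    suppressed via find_peaks' `distance` constraint; only the k_doa
    largest peaks are kept, returned in ascending angular order."""
    step = grid[1] - grid[0]
    min_dist = max(1, int(np.ceil(sep_min_deg / step)))
    peaks, _ = find_peaks(spectrum, distance=min_dist)
    if peaks.size == 0:
        return np.array([])  # scene counted as 'not resolved'
    largest = peaks[np.argsort(spectrum[peaks])[::-1][:k_doa]]
    return np.sort(grid[largest])
```

If fewer than k_doa estimates are returned, the scene is treated exactly as described above: excluded from the MSE, counted as zero for the accuracy.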

C. EVALUATION OF SCENARIO A
The predictions of all three networks for a two-target scene with φ1 = −10°, φ2 = 10° and SNR = 40 dB are compared in Fig. 8 a)-c). Since Δφ = 20° > Δφ_3dB holds, all networks achieve super-resolution in this specific case and identify the targets more or less correctly. The CNN in particular has the narrowest spectrum response, which is a desired property in spectrum estimation. Both angles are estimated almost perfectly, with only a small residual error resulting from the off-grid DOAs. The MLP also correctly identifies both targets; however, the peaks are somewhat wider and the first target estimate shows some bias. Finally, the AE network's response is found to be very noisy, which results in an unreliable peak detection and extraction of the DOA estimates [30]. This can be attributed to the large difference in the total number of trainable parameters between the networks (compare Fig. 6): the AE is unable to fully adapt to the single-snapshot signal model for multiple targets while also providing robustness to varying noise levels.
The performance with respect to resolving two targets of different angular separation is shown in Fig. 10. The CNN even surpasses the DML estimator for small source separations; both achieve a probability of 77% for a separation of Δφ = 12°. In comparison, the MLP reaches 80% at around Δφ = 16°, while the AE barely surpasses 80% for any target separation. The Root-MUSIC estimator fails for the majority of the trials up to 35°, due to its dependence on multiple snapshots or spatial smoothing. The Bartlett beamformer reaches 100% resolution probability at around 46.6°, which is exactly the diffraction limit of the 4-element ULA [29]. Another observation is the dips in probability present for all three networks: the probability breaks down to some point and then rises again, around 30° to 40° for the CNN and MLP, and between 40° and 50° for the AE. A similar pattern has already been observed in [30], mostly independent of the evaluated SNR. As a possible reason, Gall et al. found a high co-linearity between two-target and multiple three-target scenes, causing an ambiguity which the networks might be unable to resolve. An additional effect might arise from scenes with higher source separation occurring less frequently in our training data set, since there are fewer possibilities to place two targets with Δφ = 60° separation than with Δφ = 10° on the same grid.
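The single-snapshot Bartlett baseline referred to throughout can be sketched in a few lines. This is a minimal illustration; the noise-free snapshot and half-wavelength element spacing are assumptions for the example, not the evaluated sensor model:

```python
import numpy as np

def steering(grid_deg, n_elem=4, d=0.5):
    """Steering matrix of an n_elem-element ULA with spacing d (in
    wavelengths), one column per grid angle in degrees."""
    n = np.arange(n_elem)
    phi = np.deg2rad(np.asarray(grid_deg))
    return np.exp(2j * np.pi * d * np.outer(n, np.sin(phi)))

def bartlett_spectrum(x, grid_deg):
    """Conventional (Bartlett) beamformer spectrum |a(phi)^H x|^2
    evaluated for a single snapshot x."""
    A = steering(grid_deg, n_elem=len(x))
    return np.abs(A.conj().T @ x) ** 2

# noise-free single snapshot of one target at 10 deg on a 4-element ULA
x = steering([10.0]).ravel()
grid = np.arange(-90.0, 90.05, 0.1)
spec = bartlett_spectrum(x, grid)
```

For a single target this peaks at the true DOA; the wide mainlobe of the 4-element aperture is what limits two-target resolution to roughly the diffraction limit quoted above.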
Next, the average errors of the estimates are compared in Fig. 11 for a two-target scenario with arbitrary off-grid angles (φ1 = 5.7°). Bartlett and Root-MUSIC both lack any dependence on the SNR and also have the largest MSE of all estimators. For lower SNR up to 15 dB, the AE slightly outperforms the other networks as well as the DML, which is most likely a result of the noisy output spectrum producing random detections close to the actual DOAs. CNN, MLP and DML all saturate in MSE after around 30 dB, with the MLP slightly outperforming both DML and CNN. In comparison, the AE does not reach an MSE below 10^−2, since the noisy spectrum does not improve for higher SNR.
In terms of estimation accuracy (compare Fig. 12), the AE, Bartlett and Root-MUSIC achieve 'accurate' estimates in less than 7% of all trials. As the evaluated scene with a target separation of Δφ = 20° is well below the diffraction limit, this is to be expected for the Bartlett beamformer. Both the MLP and the CNN achieve around 65% to 70% accuracy for SNR values above 30 dB, while the DML falls slightly short with a maximum of 61% accuracy. This clearly demonstrates the difficulty of obtaining highly accurate DOA estimates for multiple targets in the single-snapshot case. Note that the authors in [33] explicitly train the CNN with ideal data and evaluate in the low-SNR region, achieving much better performance. All evaluations in [33] are, however, performed with N_s ∈ [100, 10000] total snapshots (compare Tab. 1), which effectively mitigates the challenge of low SNR, but is not feasible for an automotive use case. In Fig. 13 a)-d), different two-target scenes across the FOV are evaluated, with a fixed target separation of Δφ = 30° and SNR = 40 dB. The true target locations are moved on a grid with φ_step = 0.1°, depicted by the black dashed line, while the two estimates φ̂1 (yellow) and φ̂2 (green) are shown as dots and squares. The Bartlett beamformer (Fig. 13 a)) clearly fails to obtain correct estimates for angles towards the edge of the FOV. Even for smaller DOAs there remains some ambiguity, resulting in alternating estimates between the correct DOA and an erroneous estimate offset by 40° in the wrong direction. The AE in Fig. 13 b) once again produces very noisy estimates, where the estimation mostly only works correctly within a ±20° FOV. Both the CNN (Fig. 13 c)) and MLP (Fig. 13 d)) obtain fairly accurate estimates with some remaining outliers. The MLP also fails to learn accurate two-target scenes if one angle lies outside ±40°, indicated by the dots forming lines with constant DOA estimates.
An additional effect is evident from this plot: the estimation variance increases towards the edge of the FOV for both networks. This is to be expected due to the increasing attenuation of the included antenna pattern (compare Fig. 5 b)) as well as the vanishing phase differences between RX channels for larger DOAs. Finally, we also assess the performance with respect to scenes containing three targets. Fig. 14 depicts the MSE for three-target scenes at various SNRs. The three networks' MSE decreases until around 20 dB to 23 dB, after which it stays constant, indicating the estimation limits imposed by the networks' output grid resolution, similar to the two-target case. However, the Bartlett beamformer in this case shows the best MSE of all estimators, which seems counterintuitive. Further investigation shows that the beamformer actually fails to resolve all three targets in 70 to 80% of all trials. This is also in line with the low resolution probability of Bartlett in the two-target case at SNR = 30 dB and Δφ = 20° from Fig. 10. Since the MSE for unresolved cases (i.e. |φ̂| < K_doa) cannot be calculated, and the remaining 20 to 30% of trials yield reasonably good estimates, the overall MSE does not reflect the issues with resolving targets. Thus, the MSE metric should only be used in conjunction with other metrics, such as the resolution probability or accuracy, to reliably judge the performance of an estimator.
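The caveat described here, namely that unresolved trials silently drop out of the MSE, can be made explicit in a small sketch (the function name and trial layout are assumptions):

```python
import numpy as np

def mse_over_trials(trials, true_doas):
    """MSE over Monte Carlo trials of DOA estimates (degrees). Trials
    with fewer estimates than true targets ('not resolved') are excluded,
    so the returned count n_used should always be reported alongside the
    MSE to avoid the misleading effect discussed above."""
    true_sorted = np.sort(true_doas)
    k = len(true_sorted)
    sq_errs = []
    for est in trials:
        if len(est) < k:
            continue  # unresolved: excluded from the MSE
        e = np.sort(est)[:k]
        sq_errs.append(np.mean((e - true_sorted) ** 2))
    if not sq_errs:
        return float('nan'), 0
    return float(np.mean(sq_errs)), len(sq_errs)
```

An estimator that resolves only its easiest 20% of trials can thus report a deceptively low MSE, which is exactly the Bartlett behavior observed in Fig. 14.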

D. EVALUATION OF SCENARIO B
Now we evaluate the three already trained networks on the datasets generated using the sensor measurements. Fig. 15 shows the resolution probability for scenario B. The performance of Bartlett and Root-MUSIC is mostly identical to scenario A with pure simulation data. However, all three networks, as well as the DML, show degraded performance, where the 80% mark is only reached for a higher source separation, or not at all for the AE. In general, the DML still outperforms the networks in scenario B. Like the DML estimator, the networks do not have any prior information on the sensor perturbations introduced by the hybrid data generation approach. The dips around 30° to 40° (CNN and MLP) as well as 40° to 50° (AE) are still present. Compared to scenario A, they are less pronounced, presumably due to the stronger impact of the sensor gain and phase deviations. In terms of the MSE of the estimates, very similar results as in scenario A can be observed in Fig. 16. The DML now additionally outperforms all three networks for SNR > 25 dB. The performance of Bartlett and Root-MUSIC is identical to scenario A, while the lowest achievable MSE increased for all networks and the DML estimator. This is to be expected, since the included measurement data imposes more degrees of freedom on the data model and the estimators are not adapted to it.
The two-target scenes across the FOV with a separation of Δφ = 30° and SNR = 40 dB are shown in Fig. 17. Compared to Fig. 13, the variance of all three network estimates increases significantly, while the beamformer performs similarly on the measurement data. In some cases, the green and yellow predictions even overlap, meaning the two targets could not be separated successfully. Additionally, Fig. 17 makes evident that the accuracy of 3° for all estimates, as defined in (41), cannot be met by any estimator; this metric is therefore omitted for scenario B. As expected, all networks also perform worse in the three-target scenario compared to scenario A (Fig. 14), with now negligible differences between the three DL networks. The results and the explanation of the beamformer outperforming the other estimators are identical to scenario A.

E. EVALUATION OF SCENARIO C
In the last scenario, we evaluate the advantages of training the networks with data from the hybrid signal model, while testing and evaluating all metrics with completely different, unknown sensor data. The resolution probability versus varying source separation is shown in Fig. 19. Compared to the previous scenario, all three networks show an increase in their resolution capabilities for two targets. Notably, the MLP outperforms the other estimators, even surpassing the DML to some extent. The AE now also shows performance superior to the Bartlett beamformer for separations of up to 40°. The performance of DML, Bartlett and Root-MUSIC matches their performance from scenario B, since they do not take into account any prior information about sensors and measurements. The results of the two-target MSE for scenario C show the successful adaptation of the DL estimators to the measurement model. As shown in Fig. 20, the CNN again reaches the original MSE from scenario A for high SNR, while the MLP's performance falls slightly behind the DML estimator. The CNN, as the largest DL model, has the greatest potential for learning the additional degrees of freedom of the measurement data model, resulting in superior performance. Accordingly, the AE performance decreases compared to scenario A, a clear sign of the network having difficulties capturing the signal model, presumably due to its low number of network parameters.
Again, we visualize the estimation performance for two targets with a separation of Δφ = 30° and SNR = 40 dB. In Fig. 21 (a), the DML is shown as a reference, since the beamformer results do not change from scenario B to C. As the implementation of the DML estimator is based on the ideal signal model, its estimates exhibit high variance, especially for higher DOA angles, where the antenna gain decreases significantly. This is also in line with the observations from [51], where the DML estimates degrade for targets with varying separation or amplitudes. Compared to Fig. 17, the variance of all three network estimates is decreased. Thus, the DL models are able to inherently capture the sensor signal models and generalize even to unseen sensor data. Distortions and individual outliers towards the edge of the FOV are visible for all models, similar to scenario A. Some individual trials remain in which the two targets are not separated successfully. Evaluating the estimation accuracy according to (41) once more reveals the issues of the Bartlett beamformer, the Root-MUSIC algorithm and the AE network. As Fig. 22 indicates, all three fail to provide accurate estimates independent of the SNR. Both the MLP and the CNN surpass the benchmark provided by the DML. Note, however, that the total probability of accurately estimating the DOAs with respect to the 3° threshold barely surpasses 60%. Finally, the performance with respect to three-target scenes is evaluated in scenario C. Fig. 23 depicts the MSE evaluated for three-target scenes and various SNRs. This time, the performance of the DL models is equal (AE and CNN) or even slightly decreased (MLP) compared to Fig. 18 from scenario B. Again, only the Bartlett beamformer fails to resolve most of the trials, while the networks' performance does not increase compared to the previous scenario.

VII. SUMMARY AND DISCUSSION OF OUR FINDINGS
To summarize the outcomes of this study, Tab. 5 lists the source separation Δφ at a probability of resolution p_res = 0.8 for all three testing scenarios. Estimators are roughly ordered from best to worst source separation; no value (−) means the estimator never reached 80% in that scenario. In general, the Bartlett beamformer reaches p_res = 0.8 shortly before the Rayleigh resolution limit, while Root-MUSIC fails to reach it in all scenarios. The AE model slightly exceeds the results of the beamformer in scenarios A and C. The MLP and CNN both almost reach or even outperform the DML in some scenarios, and they clearly benefit from the additional training on measurement data in scenario C. Additionally, we computed the average execution times of all estimators on an Intel Core i5-9600K (3.7 GHz) over 1000 estimation cycles. The large disparity in execution times is directly evident: a single DML estimate takes over half a second on a current CPU. Consequently, the DML estimator is practically irrelevant for most automotive scenarios, since its computational complexity outweighs its super-resolution benefits. The network estimates take around 30 ms to compute, which is within the range of typical automotive radar sensor update rates [1]. Overall, our experiments prove the viability of deep learning based super-resolution DOA estimation for automotive radar with a single snapshot. The importance of large and meaningful datasets for training these models for single-snapshot estimation is demonstrated via the different scenarios. Training solely with synthetic data results in degraded performance when testing with real sensor data (scenario B). The deterministic maximum likelihood estimator outperforms all networks in fair comparison scenarios (scenarios A and B), but can be matched in scenario C, where the networks possess prior knowledge of the measurement and sensor data model through training.
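A wall-clock benchmark in the style of this timing comparison can be sketched as follows; the estimator interface and the toy FFT-based estimator are assumptions for illustration, not the evaluated implementations:

```python
import time
import numpy as np

def avg_runtime(estimator, snapshots):
    """Average per-call wall-clock time of a DOA estimator over a list of
    single-snapshot inputs (mirroring the 1000-cycle timing setup)."""
    t0 = time.perf_counter()
    for x in snapshots:
        estimator(x)
    return (time.perf_counter() - t0) / len(snapshots)

def fft_doa(x, n_fft=256):
    """Toy estimator: coarse DOA from the peak of a zero-padded FFT
    spectrum of a half-wavelength ULA snapshot (u = sin(phi))."""
    spec = np.abs(np.fft.fftshift(np.fft.fft(x, n_fft)))
    u = (np.argmax(spec) - n_fft // 2) / (n_fft / 2)
    return np.rad2deg(np.arcsin(np.clip(u, -1.0, 1.0)))

# noise-free 4-element snapshots of a single target at 10 deg
snaps = [np.exp(2j * np.pi * 0.5 * np.arange(4) * np.sin(np.deg2rad(10.0)))
         for _ in range(1000)]
t_avg = avg_runtime(fft_doa, snaps)
```

Using time.perf_counter avoids the coarse resolution of time.time and makes the millisecond-scale differences between estimators measurable.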
A major advantage of all evaluated DL models over the DML is their lower computational complexity, resulting in roughly 20 times faster estimation.

VIII. CONCLUSION AND OUTLOOK
This work provides an overview of recent advances and an investigation of different approaches to deep learning based direction of arrival estimation. A model- and data-based data generation approach is demonstrated, utilizing simulated, hybrid and real-world sensor data. Three promising DL models are trained for single-snapshot scenarios to estimate the DOAs of up to three targets. Their performance is explicitly compared using the hyperparameters from the reference literature, without further optimization. The findings prove that super-resolution can be achieved with varying network sizes, while the overall estimation accuracy increases for larger networks. Fast inference times and sufficient generalization are among the most important benefits, e.g. when compared to the deterministic maximum likelihood estimator. Furthermore, the importance of using real-world data is successfully demonstrated by training and testing networks on sensor data from more than 400 different real automotive radar sensors. Given the very limited availability of public RF datasets, our findings clearly highlight the advantages of the proposed hybrid data generation approach.
Automotive radar has evolved into an imaging sensor, requiring high spatial resolution. The trend towards higher frequencies, as well as new concepts like antenna-in-package, will result in smaller absolute dimensions. Accordingly, the aperture of future automotive MIMO radar sensors is limited, and high-resolution techniques remain necessary to achieve the required angular resolution. Deep learning techniques are already used in several fields of radar signal processing, opening up new fields of research. This work provides an entry point into the basics of automotive radar DOA estimation, proposes new concepts for hybrid RF dataset generation, and highlights the different opportunities and challenges of DL based DOA estimation. The performance of DL estimators on embedded hardware, e.g. with limited precision or fixed-point arithmetic, remains an open question. In particular, hardware-specific optimization and the integration of DL based DOA estimation on embedded systems have to be evaluated in future research. Additionally, with the rising threat of mutual radar interference, an investigation of the impact on and robustness of DL based DOA estimation techniques is of major interest to the community.