Characterization of Moving Sound Sources Direction-of-Arrival Estimation Using Different Deep Learning Architectures

Sound source localization is an important task for several applications, and the use of deep learning for this task has recently become a popular research topic. While a number of previous works have focused on static sound sources, in this article, we evaluate the performance of a deep learning classification system for the localization of moving sound sources. In particular, we evaluate the effect of key parameters at the levels of feature extraction (e.g., short-time Fourier transform (STFT) parameters) and model training (e.g., neural network (NN) architectures). We evaluate the performance of different settings in terms of precision and F-score, in a multiclass multilabel classification framework. In our previous work on the localization of moving sound sources, we investigated feedforward NNs (FNNs) under different acoustic conditions and STFT parameters and showed that the presence of some reverberation in the training dataset can help in achieving better detection of the direction of arrival of the sources. In this article, we extend that work to show that the window size does not affect the performance for static sources but strongly affects the performance for moving sources, that the sequence length has a significant effect on the performance of recurrent architectures, and that a temporal convolutional NN can outperform both recurrent and feedforward networks for moving sound sources.

Such methods were developed under a free-field propagation model, which can lead to severe performance degradation in reverberant and noisy scenarios. To overcome these degradations, different algorithms using various neural network (NN) architectures and different sets of features have been considered in recent years [3].
A common practice in processing complex-valued features (such as STFT features) using NNs is to represent the features as two arrays of real numbers. However, it is also possible to adapt the NN model to work directly with complex numbers as in [7]. Another difference among the algorithms lies in how the direction-of-arrival (DOA) detection problem is formulated. For instance, Guirguis et al. [8], Adavanne et al. [12], Diaz-Guerra et al. [13], and Huang et al. [14] formulated the SSL problem as a regression task. A more popular option has been to group DOAs into "classes" and formulate the DOA detection as a classification problem. Takeda and Komatani [7], Guirguis et al. [8], Adavanne et al. [12], Diaz-Guerra et al. [13], Huang et al. [14], Me et al. [16], Pang et al. [17], Vecchiotti et al. [21], and Xiao et al. [22] considered scenarios where only one speaker is active, while Chakrabarty and Habets [5], [6], Ma et al. [15], and Grumiaux et al. [19] have considered multiple-speaker scenarios.
The use of CNNs has been prevalent in image processing for their exceptional ability to detect local patterns. A popular use of this NN architecture in acoustic DOA detection is to analyze spectrograms generated by STFTs and other similar methods [5], [6], [17], [20], [21]. While these works share a similar NN model, authors have focused on optimizing different aspects of their algorithms. For example, Chakrabarty and Habets [5] studied the effect of training the CNN model using synthesized noise signals, while Chakrabarty and Habets [6] focused on reducing the computational cost of CNNs. Jung et al. [9] combined a convolutional bidirectional recurrent NN (CBRNN) with transfer learning, to deal with the issue of overfitting that occurred from increased model complexity. Guirguis et al. [8] introduced a more robust and hardware-friendly architecture based on a temporal convolutional network (TCN). Adavanne et al. [4], Cakir et al. [10], Zinemanas et al. [11], Huang et al. [14], and Grumiaux et al. [19] combined a CNN with a recurrent NN (RNN). Hammer et al. [24] used a fully convolutional network (FCN), while in some other work, simple fully connected feedforward NNs (FNNs) have been used [7], [8], [9], [10], [23].
Most of the previous work has been for the DOA estimation of static acoustic sources. DOA estimation of moving acoustic sources can be required for different applications in real-life conditions, e.g., DOA estimation of a moving drone based on a microphone array, DOA estimation of people talking while moving near a listener wearing binaural hearing aids, and so on. Knowing the DOA information then allows further multisensor signal processing, for example, beamforming and multichannel signal enhancement for noise and interference reduction. For static sources, the stationarity of the input data is not a limitation, i.e., it is possible to use as many available past input samples as needed in order to predict the sources' DOAs. In the case of moving sources, the nonstationarity of the input signal makes the problem more challenging, as only recent samples are relevant to predict the current sources' DOAs. This limits the size of the different signal windows that can be used as inputs to an ML-based DOA estimation system, as well as the sequence length that can be used in ML-based systems using sequences of inputs.
Some work on DOA estimation of moving sources has recently been performed [8], [12], [13], [19], [24]. However, the works in [8], [12], and [13] only considered the case of a single moving source. Grumiaux et al. [19] used a high angular tolerance to obtain good performance. Also, the work in [24] only considered noise-free scenarios, with no recurrent network architecture considered.
Most of the research done in this field is concerned with finding the best NN model capable of mapping a certain set of features, measured from the directional sound source, to the corresponding DOA. The process of finding a good model implicitly involves another important process: finding a good training dataset. Designing a good dataset requires a lot of domain-specific expertise. However, the effect of different parameters is not always understood because NN models are black-box systems. This is why most of the research work in this field focuses on reporting a working setup that leads to good performance, without providing an in-depth analysis of the effect of different hyperparameters, including the ones that did not work.
This article is an extension of our previous conference proceedings paper [25], and the objective of these papers is to provide a systematic study of the effect of a wide range of parameters for DOA estimation of moving acoustic sources. Given the simplicity and ease of implementation of FNN models, as they do not require the use of sequences of data, feedback, or recurrent architectures, the investigation of the performance of FNNs for the case of moving acoustic sources is still of interest, despite the availability of more sophisticated architectures, and it was carried out in the previous paper. The previous paper also evaluated the effect of different hyperparameters and acoustic conditions for DOA classification of moving acoustic sources, such as the performance for different signal-to-noise ratios (SNRs), reverberation times (RTs), static and moving sources, different angular speeds for moving sources, and STFT parameters such as frequency bins and hop length. In this article, we extend the previous work by evaluating the effect of the key window size hyperparameter in the STFT analysis. Also, in addition to feedforward networks, we consider RNN and TCN architectures. We conducted a direct comparison of these architectures and examined their performance under identical acoustic conditions. Different sequence lengths, different datasets featuring stationary and moving sound sources, different levels of diffuse-like background acoustic noise (i.e., different SNRs), and different source angular velocities have also been studied.
We formulate the problem of DOA estimation of moving acoustic sources as a classification problem and evaluate the performance of different hyperparameters and model architectures based on metrics including precision and F-score. It is worth noting that we do not attempt to predict the distances of the sources in this article. This is a challenging task because both the distance and the original level of the sources affect the level of the signals received at the microphone array. By estimating the DOA of the sources at different times, it would be possible to estimate the angular speed of the sources, if required.
The rest of this article is organized as follows. Section II presents our system design, followed by evaluation of the results in Section III. This article is then concluded in Section IV.

II. SYSTEM DESIGN
The proposed system consists of three main steps shown in Fig. 1. First, we generate scenarios of multichannel directional sources in a simulated noisy room. The tunable parameters in the configuration file represent different acoustic conditions. The two main parameters are the RT (which is determined by the room size and wall reflectivity) and the background noise level. The result of this step is referred to as the "sample-based database" in Fig. 1, to emphasize that the processing is done in the time domain. Second, we extract STFT features and process them so that they can be used by our NN models. The choice of STFT hyperparameters, such as window size, hop length, and the number of frequency bins, determines the "quality" of the dataset, which affects the performance of the subsequent classification step. Finally, we train our deep learning models to perform the DOA classification on the acoustic scenarios dataset and study the effect of different parameters and configurations. The details of each step are further explained in the following.

A. Signal Generator
The signal generator is used to generate multichannel directional sound sources in a reverberant environment. Our simulator combines three publicly available tools in a unified pipeline and implements the necessary utility functions that allow experimentation for research. In particular, we used the room impulse response (RIR) generator from [27], the spatial noise generator from [28], and the STFT function from the nnAudio library [29]. In the following, we explain these components in more detail.
We first generate synthetic white noise signals to create mono sound sources x(t) with a desired total length T (in s) and a sampling rate f_s (in Hz). These signals become multichannel directional sources later, after convolution with room RIRs.
Each source with signal x_i(t) can be stationary or moving. As shown in Fig. 2, in the case of moving sources, we define the angular velocity ω_i (in rad/s) and the movement trajectory in a configuration file. We assume that each source is moving at a constant angular speed ω_i on an arc-like path. The path is defined at a constant radial distance ρ (in meters) from the center of the microphone array. Once the source position is calculated at every time step in spherical coordinates, we convert it to Cartesian coordinates as required by the RIR generator. Note that for a given value of ω_i, it is possible that a source traverses the whole path in a time shorter than the desired length T. In this case, we support two options: either to keep going back and forth along the same path until the desired length is achieved or to keep selecting new end angles ϕ_e,i randomly and keep moving the source with respect to the microphone.
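As an illustration of this trajectory step, the following sketch samples source positions along an arc at a constant angular speed and converts them to Cartesian coordinates. The function name, parameters, and the 32-sample update period are assumptions for illustration, not the simulator's actual API.

```python
import numpy as np

def arc_positions(phi_start, phi_end, omega, rho, center, fs, update_period=32):
    """Sample source positions along an arc of radius rho (m) around the
    array center, moving at a constant angular speed omega (rad/s).
    Positions are updated every `update_period` samples, matching the
    RIR update step of 32/fs seconds mentioned in the text."""
    dt = update_period / fs                      # time between RIR updates (s)
    n_steps = int(abs(phi_end - phi_start) / (omega * dt)) + 1
    phis = phi_start + np.sign(phi_end - phi_start) * omega * dt * np.arange(n_steps)
    # spherical (fixed radius, fixed elevation) -> Cartesian, planar case
    x = center[0] + rho * np.cos(phis)
    y = center[1] + rho * np.sin(phis)
    z = np.full_like(phis, center[2])
    return np.stack([x, y, z], axis=1)

# a source sweeping 0 -> pi/2 rad at 0.5 rad/s, 0.5 m from the array center
traj = arc_positions(0.0, np.pi / 2, 0.5, 0.5, (2.5, 2.0, 1.5), fs=16000)
```

The constant radial distance shows up as a constant planar norm of each position relative to the array center.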
For each source, the multichannel RIR h_i(t) was generated using the image method [31]. The image method models the propagation of a source in a rectangular, rigid-wall room of dimension D meters. For simplicity, in this section, D is assumed to be constant for all the room (x, y, z) dimensions. The resulting impulse response h_i(t), as given in (1), simulates the room reverberation when convolved with any source signal x_i(t). This is done by modeling all the possible paths that a sound can follow, which is achieved by replacing each reflected path with a virtual source on a line-of-sight with the receiver, where its direction is the direction of the reflection and its position produces the same level of attenuation. For a given room, each impulse response in a multichannel RIR depends on the position r(t) of a source and the position of each microphone r_m. Allowing the source to move results in a time-varying RIR expressed as [31]

h(t; r(t), r_m) = \sum_{p \in P} \sum_{q \in Q} \frac{\beta^{|q_x - p_x| + |q_x| + |q_y - p_y| + |q_y| + |q_z - p_z| + |q_z|}}{4 \pi d_{pq}} \, \delta(t - \tau_{pq})    (1)

where β is the reflection coefficient of each wall, d_pq is the distance between a source image and a microphone, t is the time, τ_pq = d_pq/c is the time delay of arrival of a reflected unit impulse δ(t) for a source image, P is the set of all (eight) binary triples (p_x, p_y, p_z) (i.e., if any p_i = 1, then the image of the source in that direction i is considered), Q is the set of all desired integer triples (q_x, q_y, q_z) determining the reflection orders, c is the speed of sound, and N is the length of the RIR in samples. Note that for a given N, this algorithm computes 8(N/D + 1)^3 different paths [31].
To simulate a scenario with multiple moving sources, the output signal s_m(t) as measured by a microphone m positioned at r_m is computed by convolving each anechoic source signal x_i(t) moving along an arc-like path with the corresponding h(t; r_i(t), r_m) as follows:

s_m(t) = \sum_i x_i(t) * h(t; r_i(t), r_m), \quad m = 1, \ldots, M    (2)

where the sum runs over all active sources i, M is the number of microphones in the microphone array, and "*" represents a linear convolution sum. It is worth noting that while the simulator [27] supports microphones with different beam patterns, we decided to work with omnidirectional sensors, for generality and simplicity. In order to evaluate the robustness against acoustic noise when performing DOA detection with moving sources, a multichannel spatial noise v_m(t) is added to the generated directional sounds s_m(t) at a desired SNR level (in dB). We call the resulting noisy signal y_m(t). We consider cylindrical spatially isotropic noise, where the ceiling and the floor are considered to be covered with an absorbing material. For computational efficiency, the noise generator from [28] creates the signals directly in the frequency domain and then transforms them back to the time domain. The process combines two ideas. First, the reference microphone receives a noise signal v_o(t) composed of a superposition of 64 uncorrelated waves uniformly distributed on the surface of a cylinder. Second, the noise components received by the other microphones v_m(t) are calculated from v_o(t) in a way that preserves the correct spatial coherence (which is given by a Bessel function in this case [27]). The labels (DOA directions ϕ) for each time step are based on the true path taken by each source. We selected the RIR update period (time step) as Δt = 32/f_s seconds, as in [27].
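As a simplified, single-channel illustration of the noise-mixing step, the following sketch scales a noise signal so that the mixture reaches a desired SNR. The actual generator [28] produces spatially coherent multichannel noise in the frequency domain; the function name here is an assumption for illustration.

```python
import numpy as np

def add_noise_at_snr(s, v, snr_db):
    """Scale the noise v so that mixing it with the directional signal s
    yields the requested SNR (in dB), then return y = s + scaled noise."""
    p_s = np.mean(s ** 2)                              # signal power
    p_v = np.mean(v ** 2)                              # noise power before scaling
    gain = np.sqrt(p_s / (p_v * 10 ** (snr_db / 10.0)))
    return s + gain * v

rng = np.random.default_rng(0)
s = rng.standard_normal(16000)                         # 1 s of "directional" signal
v = rng.standard_normal(16000)                         # 1 s of raw noise
y = add_noise_at_snr(s, v, snr_db=10.0)                # noisy mixture at 10 dB SNR
```

The resulting mixture satisfies 10 log10(P_s / P_noise) = snr_db by construction.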

B. Short-Time Fourier Transform
It is possible to train the deep learning model on raw signals. However, it is more efficient to find a feature representation that better summarizes the data. While the use of other transforms could certainly be of interest, in this work, we chose to use a common window size to compute STFT components at all frequencies, leading to a constant physical frequency resolution across frequencies, as opposed to having variable window sizes and variable resulting frequency resolutions across frequencies (as in wavelet transforms or nonuniform filter banks). This was chosen for two reasons: 1) the use of STFT has been fairly widespread in recent years for NN-based DOA estimation of static acoustic sources and, to simplify comparison, we wanted to evaluate the performance of the same features under dynamic/moving sources and 2) in free field, phase differences between the STFTs of different sensors are known to be directly related to the time difference of arrival (TDOA), which is a well-known criterion for azimuth angle-of-arrival estimation.
Unlike a Fourier transform computed over the entire available signal length, where the frequency components of a signal are extracted but with no explicit information on when they occur, the STFT provides information about the different frequency components across time. It can be computed from the received signal at each microphone as follows:

Y_m(l, k) = \sum_i H(l, k; r_i(l), r_m) X_i(l, k) + V_m(l, k)

where l and k are the time frame and frequency indices, respectively, the index i represents each active source, X_i(l, k) and V_m(l, k) are the STFTs of the source and noise signals, and H(l, k; r_i(l), r_m) is the frequency response of each corresponding RIR. As opposed to training the NN end-to-end on raw datasets, using the STFT as input features for the NN introduces a few hyperparameters that require careful tuning. Some of these parameters are the window size W (in s), the hop H (shift) between windows (in s), and the number of frequency bins K. One of the main objectives of our previous paper [25] and this extended article is to study the effect of these parameters on NN performance, for different acoustic sources' angular speeds.
Another challenge of using the STFT as features is that the magnitude and phase components are signal-level dependent. One way to overcome this issue is to "normalize" the STFT by using the magnitude ratios and phase differences relative to one of the microphones. The phase difference is then also more directly related to the TDOAs, which are known to be a useful indicator for a source DOA.
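A minimal sketch of this normalization, using a plain NumPy STFT in place of nnAudio and hypothetical function names and window settings, could look as follows:

```python
import numpy as np

def stft(x, win, hop):
    """Plain NumPy STFT with a Hann window (illustrative stand-in for nnAudio)."""
    w = np.hanning(win)
    n_frames = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop:i * hop + win] * w for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)             # shape: (frames, win//2 + 1)

def relative_features(x_ref, x_m, win=512, hop=256, eps=1e-8):
    """Magnitude ratios and phase differences of microphone m relative to a
    reference microphone: a level-independent feature pair whose phase term
    relates to the TDOA."""
    X0, Xm = stft(x_ref, win, hop), stft(x_m, win, hop)
    mag_ratio = np.abs(Xm) / (np.abs(X0) + eps)    # eps avoids division by zero
    phase_diff = np.angle(Xm) - np.angle(X0)
    return mag_ratio, phase_diff

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
mr, pd = relative_features(x, x)                   # identical signals as a sanity check
```

For identical inputs, the magnitude ratio is close to one and the phase difference is exactly zero, confirming the level independence of the representation.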
To produce ground-truth labels for each STFT frame, we average the sample-by-sample labels using a 1-D convolution with a kernel size equal to the STFT window length and a stride corresponding to the overlap factor. Note that the nnAudio library also uses a 1-D convolution operator to calculate the STFT. This makes it easier to generate the frame labels with the correct shape. Once the labels are generated for each frame, we quantize the labels to produce the different NN classes and then perform a one-hot-encoding step. This last step is the standard way of representing the ground truth for a multiclass classification problem when multiple labels are expected to be active at the same time, which is the case in this work.
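The labeling step can be sketched as follows, using a simple per-frame average in place of the 1-D convolution. The function name, window settings, and single-source case are illustrative assumptions.

```python
import numpy as np

def frame_labels(doa_per_sample, win, hop, resolution_deg=5.0, n_classes=37):
    """Average per-sample DOA labels (in degrees) over each STFT frame,
    quantize to `resolution_deg` classes, and one-hot encode. With multiple
    sources, one such row would be produced per source and then combined
    into a multilabel target."""
    n_frames = 1 + (len(doa_per_sample) - win) // hop
    onehot = np.zeros((n_frames, n_classes))
    for l in range(n_frames):
        mean_doa = doa_per_sample[l * hop:l * hop + win].mean()
        cls = int(round(mean_doa / resolution_deg))   # 0..36 for 0..180 degrees
        onehot[l, cls] = 1.0
    return onehot

# a source sweeping 0 -> 180 degrees over one second at fs = 16 kHz
labels = frame_labels(np.linspace(0.0, 180.0, 16000), win=512, hop=256)
```

For a monotonically moving source, the active class index increases monotonically across frames, as expected.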

C. Deep Learning Models for DOA Estimation
An FNN consists of multiple layers of computational units, where each layer applies a nonlinear transformation to its inputs x to produce an output y that is fed to the next layer. The transformation at any layer is given by

y = σ(W x + b)

where W and b are learnable parameters representing the weights and biases, respectively, and σ(.) is a nonlinear function called the activation function. A shallow FNN can perform well on simple tasks but is limited in what it can learn and achieve. Using deep NN models allows the network to learn many complex patterns that are useful for the task at hand. However, deep FNN models typically suffer from two well-known problems that hinder their ability to learn: the vanishing gradient and the exploding gradient problems. To overcome these issues, we adopted the architecture proposed in [30]. This FNN model uses scaled exponential linear units (SELUs) as the activation function σ(.) to self-normalize its activations around zero mean and unit variance. This behavior is guaranteed as long as the initial NN weights are properly initialized. Following this approach makes it possible to use FNNs beyond two or three layers while avoiding the vanishing gradient and the exploding gradient problems and without the need for strong regularization or dropout levels.
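A minimal NumPy sketch of such a SELU network's forward pass (untrained, with illustrative layer sizes; not the actual model of [30]) is:

```python
import numpy as np

def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    """Scaled exponential linear unit, the self-normalizing activation of [30]."""
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def fnn_forward(x, layers):
    """Forward pass y = selu(W x + b) applied layer by layer. Weights are
    drawn with variance 1/fan_in ("lecun normal"), which the self-
    normalization property requires."""
    for W, b in layers:
        x = selu(W @ x + b)
    return x

rng = np.random.default_rng(0)
sizes = [30, 500, 500, 37]                       # input, two hidden layers, output
layers = [(rng.standard_normal((m, n)) / np.sqrt(n), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
y = fnn_forward(rng.standard_normal(30), layers)
```

In a real classifier, the final layer would instead feed a sigmoid per class for the multilabel output; the SELU output here is only to keep the sketch short.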
One big disadvantage of FNNs is that they cannot use previous NN states when the current set of data is being processed. Recurrent models, on the other hand, have an architecture that allows them to memorize short-term dependencies via a discrete hidden state. In RNNs [32], the output at time t not only depends on the current inputs and weights W but also on the hidden state, which summarizes all previous inputs using a general function as follows:

s_t = f(s_{t-1}, x_t)

where x_t is the input at time t, s_{t-1} is the hidden state that summarizes the previous inputs up to the previous time step, and s_t is the current output. Studies have shown that simple RNNs can be used effectively up to a small number of time steps [32]. The gated recurrent unit (GRU) was proposed in 2014 [33] to allow recurrent units to adaptively capture dependencies at different time scales. Both "vanilla" RNNs and GRUs have feedback in their structure, and both have the same number of ports (number of inputs and outputs). However, the GRU differs from the RNN mainly in two aspects. First, it has gating units that modulate the flow of information inside the unit. Second, it has a free path that allows the parameters to be transmitted to the next computation step without being affected/multiplied by the activation function. These differences have two advantages. First, the gates are more powerful in modeling: they can decide which values from the past should be transmitted to the new state and which should be erased. Second, this addition effectively creates shortcut paths that bypass multiple temporal steps. These shortcuts allow the error to be backpropagated easily without vanishing too quickly as a result of passing through multiple, bounded nonlinearities, thus reducing the difficulty due to vanishing gradients. The equations of the GRU process can be expressed as

z_t = σ(W_z x_t + U_z s_{t-1} + b_z)
r_t = σ(W_r x_t + U_r s_{t-1} + b_r)
s̃_t = tanh(W x_t + U (r_t ⊙ s_{t-1}) + b)
s_t = (1 − z_t) ⊙ s_{t-1} + z_t ⊙ s̃_t,  y_t = s_t

where W and U are learnable parameters representing the weights, b represents the biases (offsets), σ(.) is a nonlinear activation function, and ⊙ denotes elementwise multiplication.
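A single GRU update step, following the standard gate equations, can be sketched in NumPy as follows (with untrained, randomly drawn weights and illustrative sizes):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, s_prev, p):
    """One GRU update. p holds the weight matrices W*, U* and biases b*."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ s_prev + p["bz"])   # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ s_prev + p["br"])   # reset gate
    s_cand = np.tanh(p["W"] @ x + p["U"] @ (r * s_prev) + p["b"])
    return (1.0 - z) * s_prev + z * s_cand                  # new state = output

rng = np.random.default_rng(1)
n_in, n_h = 30, 64
p = {k: rng.standard_normal((n_h, n_in)) * 0.1 for k in ("Wz", "Wr", "W")}
p.update({k: rng.standard_normal((n_h, n_h)) * 0.1 for k in ("Uz", "Ur", "U")})
p.update({k: np.zeros(n_h) for k in ("bz", "br", "b")})

s = np.zeros(n_h)
for t in range(19):                     # unroll over a 19-frame input sequence
    s = gru_step(rng.standard_normal(n_in), s, p)
```

Since each new state is a convex combination of the previous state and a tanh-bounded candidate, the hidden state stays within [-1, 1] regardless of sequence length.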
Here, r_t and z_t are the gate outputs, s_t is the updated state, and y_t is the final output.

TCNs [34] are a family of CNNs introduced in 2016 as a strong alternative for capturing long-term dependencies. The main principles of this NN architecture are: 1) the architecture can take an input of any sequence length and produce an output of the same length and 2) there is no information leakage from future to past. The first principle can be ensured by using a 1-D FCN, where each hidden layer has the same length as the input layer, and asymmetric zero padding is added to preserve the sequence length. The second principle is achieved by using a causal convolution. The causality property is important for real-time applications, in which there is no access to future samples. In order to ensure a large receptive field, a dilation factor is used for the causal convolutions

y_t = \sum_{i=0}^{k-1} f_i x_{t-di}

where x is the input, y is the output, f_i are the filter coefficients (weights), d is the dilation factor, k is the filter size, and t − di indexes past time values. With dilated convolutions, the network can look back far in time without increasing the number of weights. However, this results in extremely sparse hidden connections, which prevents seeing input values between x_t and x_{t−d}. To overcome this issue, several stacked convolution layers are used with different dilation factors.
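A minimal sketch of one dilated causal convolution layer (illustrative, not the article's TCN implementation):

```python
import numpy as np

def dilated_causal_conv(x, f, d):
    """Compute y_t = sum_i f_i * x_{t - d*i}: a 1-D causal convolution
    with dilation d. Left zero-padding keeps the output length equal to
    the input length and prevents any leakage from future samples."""
    k = len(f)
    pad = np.concatenate([np.zeros((k - 1) * d), x])   # causal (left-only) padding
    return np.array([sum(f[i] * pad[t + (k - 1) * d - d * i] for i in range(k))
                     for t in range(len(x))])

x = np.arange(8, dtype=float)
y = dilated_causal_conv(x, f=np.array([1.0, 1.0]), d=2)   # y_t = x_t + x_{t-2}
```

Stacking such layers with growing d (1, 2, 4, ...) yields a receptive field that grows exponentially with depth while the weight count grows only linearly.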
Since TCNs do not have recurrent connections, they do not use backpropagation through time, and therefore, they can be trained in parallel. This makes training faster than for recurrent models. Despite bringing some improvements, the TCN architecture also has two main disadvantages. First, TCNs need a deep network to achieve a long effective history size. Second, the receptive field is fixed a priori, which means that input values outside the receptive field cannot be considered in the calculation of the output at a particular position.

III. EVALUATION RESULTS
In this section, we present our experiments and evaluate the performance of the acoustic source DOA estimation system.

A. Experimental Setup
We consider a room of size (L_x, L_y, L_z) = (5, 4, 6) m, with an equal reflectivity coefficient β for all walls, ceiling, and floor. This value of β is calculated from the RT60 reverberation time as in [16]. The simulator assumes normal temperature and pressure (NTP) conditions of 20 °C and 1 atm, respectively. A microphone array with M = 2 microphones separated by d = 0.05 m is placed at the center of the room, as shown in Fig. 2. By avoiding a location close to a wall (which could lead to strong short echoes) and remaining far from the sources (which increases the importance of late echoes/reverberation and reduces the importance of direct-path propagation), we reduce the effect of specific microphone locations on the results. The vertical position of the microphone array is chosen to be 1.5 m.
We allowed two active sources to move for T = 50 s at a radial distance ρ = 0.5 m from the center of the microphone array at different angular speeds. All the acoustic scenarios were generated with mono sound files sampled at f_s = 16 kHz, with a length of 50 s for each scenario. Regarding the RIR parameters, we defined the filter length as N = 1024 samples and set the reflection order to −1 (i.e., the maximum order supported in [27] will then be used). The outputs of the sample-based stage then have the following shape: [two channels (microphones), T × f_s = 50 × 16000 = 80000 samples].
For the frame-based stage, L frames were generated based on the window size W × f_s (in samples), the hop length, and the padding. We kept the resulting STFT output frequencies from 50 to 6000 Hz and divided them into K frequency bins distributed uniformly. Since the STFT outputs produce a complex ratio between the frequency content of the two microphones, we squeezed the first dimension and reshaped the features as follows: [L, 2K], where 2 represents the magnitude ratios and phase differences.
In our previous work [25], we discretized the DOA range [0°, 180°] into 13 classes with 15° resolution, and we studied the effect of changing the RTs, the number of frequency bins, and the hop length, using an FNN architecture. In this work, in addition to measuring the impact of the key window size hyperparameter, we also consider RNN and TCN architectures, and we improve the resolution to 5°, which results in 37 DOA classes for the same range of [0°, 180°]. We set the value of RT to 0.2 s and vary the SNR level to include [∞, 5, 10, 15] dB.
Note that for each case under investigation, we used different training datasets with increasing size, as shown in Table I. For speeds below 15°/s, the initial position was chosen such that the distance between the sources in the multisource scenarios varied between [20°, 180°], with 15° increments. For higher speeds, the distance was selected from [20°, 45°, 60°, 95°, 120°] to limit the dataset size and to include all speeds up to the desired value. The two sources were moving

B. Models
We chose a self-normalized FNN as well as RNN, GRU, and TCN models, with a comparable network size of approximately 785 k parameters. The FNN model, shown in Fig. 3, consists of four layers, each with 500 neurons. The inputs (shown as circles) are flattened along the time axis if a sequence of inputs is provided. The RNN and GRU networks consist of two layers with 500 and 300 neurons, respectively, as shown in Fig. 4. Both networks process the data sequentially, in a many-to-one fashion, where only the output of the last step is used. The TCN network consists of two residual blocks, as shown in Fig. 5, with a total of four dilated causal convolution layers, each with 500 filters, a dilation factor d = 1, filter size k = 3, and stride s = 1. The input shape of the TCN is similar to that of the RNN. However, the time information is processed by the 2-D convolutional layer.
All networks are optimized using the Adam optimizer [26] and are trained for 25 epochs on a single GPU (Nvidia RTX 3080 Ti). A dropout rate of 0.2 is used in all models. A ReLU activation function is used in the RNN, GRU, and TCN, whereas the FNN uses the SELU activation function.

C. Evaluation Metrics
We tested the effect of the NN model type (network architecture), with different window sizes, sequence lengths, source speeds, and SNR levels. For this work, we did not use cross validation or "leave-one-out" training and testing because we had no limitation in the generation of the synthetic acoustic datasets used for training, validation, and testing, i.e., we could use large separate training and testing datasets.
The performance was evaluated in terms of precision and F-score. The F-score (more specifically, the F1-score here) is a well-known combination (harmonic mean) of the precision and recall metrics. Precision measures the fraction of positive predictions that are correct, whereas recall measures the fraction of the positive cases in the data that are correctly detected. Recall results are not provided here due to lack of space, but they can be obtained from the provided precision and F-scores.
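For concreteness, micro-averaged precision, recall, and F1 over multilabel frame predictions can be computed as follows (an illustrative implementation, not necessarily the exact evaluation code used here):

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 for multilabel one-hot
    arrays of shape (frames, classes)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))   # correctly detected DOA classes
    fp = np.sum((y_pred == 1) & (y_true == 0))   # spurious detections
    fn = np.sum((y_pred == 0) & (y_true == 1))   # missed detections
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, f1

y_true = np.array([[1, 0, 1], [0, 1, 0]])        # two frames, three DOA classes
y_pred = np.array([[1, 0, 0], [0, 1, 1]])        # one miss, one false alarm
p, r, f = precision_recall_f1(y_true, y_pred)
```

With one missed and one spurious detection out of three true positives' worth of predictions, all three metrics evaluate to 2/3 in this toy case.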

D. Results
In this section, we investigate the effect of changing the window size, sequence length, SNR, and source angular speed.
1) Window Size and Sequence Length: The window length (and the sequence length in the case of models processing a sequence of input windows) determines the total length of the signal (in seconds) considered by the proposed DOA estimation system. For moving sources, the input signal is nonstationary, and using a signal length that is too long leads to the use of obsolete past samples, which do not help to predict the current sources' DOAs. On the other hand, using a window length that is too small leads to very poor frequency resolution (i.e., an overly smoothed spectrum) and to a lack of sufficient information for the ML-based DOA estimation system to make a good prediction. Therefore, when varying the window lengths (and sequence lengths for models using sequences of inputs), it can be expected that the performance will decay if the resulting total signal length is too short or too long.
Figs. 6-9 present the effect of changing the window size for different sequence lengths, given the following parameters: K = 15 frequency bins, overlap of 50%, reverberation RT = 0.2 s, and SNR = 15 dB. We consider sequence lengths of 1, 4, 9, and 19 frames. For a given history of T_s seconds, the sequence length L_s in frames is calculated as follows:

L_s = ⌊(T_s − W)/H⌋ + 1

where W and H are the window size and hop length, respectively. Note that the FNN model does not have memory. Therefore, to study the effect of the sequence length, all past frames were concatenated and then fed as an input to the model. From Fig. 6, we observe that different sequence lengths have no effect on the FNN model. In other words, the model did not benefit from having access to past frames. We notice that the FNN was able to detect DOAs at low speeds (i.e., 0°/s–5°/s). However, it barely detected sources at medium speeds (i.e., 15°/s–45°/s) and failed to make any detection at high speeds (i.e., 95°/s–140°/s).
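The frame count for a given history can be computed in integer (sample) units as follows (an illustrative helper, not part of the article's code; integer arithmetic avoids floating-point pitfalls):

```python
# Number of STFT frames (sequence length L_s) covering a history of T
# samples, with window W and hop H also given in samples:
# L_s = (T - W) // H + 1.
def seq_length(T, W, H):
    return (T - W) // H + 1

# ~2 s of history at fs = 16 kHz, with a 0.2 s window and 50% overlap
L_s = seq_length(32000, 3200, 1600)    # 19 frames
```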
Figs. 7-9 present the performance of the RNN, GRU, and TCN models, respectively, which are equipped with memory of previous states. We notice that a larger sequence length significantly improves the performance for all three models, and the best sequence lengths were found to be 9 and 19 frames. For example, when the sequence length is increased from 1 to 19 frames, the best achieved performance for the RNN, GRU, and TCN was increased by 12%, 14%, and 8%, respectively, for low-speed scenarios and by 30%, 17%, and 16%, respectively, for high-speed scenarios. In static scenarios, most of the improvement was achieved by increasing the sequence length from one frame to four frames. Any further increase had a negligible effect. For low-speed scenarios, the TCN differed from the recurrent models (RNN and GRU) in two ways: the total improvement from using past frames is lower despite comparable results, and most of the improvement can be obtained using only the four past frames. This means that the TCN is more efficient at learning from recent history. On the contrary, recurrent models mostly benefit from a sequence length of nine past frames. However, the performance drops at 19 frames. We believe that this degradation in performance at longer sequence lengths is due to the nonstationarity of the input signal within the sequence.
Regarding the effect of the window size, for static cases, Figs. 6-9 show that, for a given sequence length, there is a benefit to increasing the window size. While the best values for the sequence length and window size in the GRU and RNN models are similar, we can see from Figs. 7 and 8 that the GRU model outperforms the RNN model by around 5%. This suggests that DOA detection does not require a long history, which is the main difference between the RNN and GRU models. By looking at Figs. 8 and 9, we notice that the performance of the TCN and GRU models is quite similar; however, the TCN slightly outperforms the GRU by around 4% at high speed. This is possibly because the TCN was more efficient at extracting features from short sequence lengths.
From the figures, we see two differences between TCN and recurrent models (RNN and GRU).
1) Beyond a window size of 0.4 s, the performance of the TCN on moving sources drops faster. We believe that this is because the TCN is not recurrent and lacks a hidden state that can serve as an extra input, so a larger sequence length cannot compensate for the low-quality features (due to nonstationarity) produced by a larger window size.
2) The gap between performance for sequence lengths of one frame and 19 frames is smaller in TCN, but the improvement is consistent regardless of window size or sources' angular speed.
2) Window Size and Speeds: When comparing the performance across window sizes at different speeds using the four models (FNN, RNN, GRU, and TCN), we notice that the TCN performs slightly better than the GRU, followed (in decreasing order of performance) by the RNN and FNN models. In particular, the TCN model was able to maintain its performance at high speed despite being slightly worse at low speed. Fig. 10 shows that the best performance is achieved with a window size in the interval [0.1, 0.2] s. With a window size of 0.2 s, it is possible to achieve a performance above 85% using the RNN, GRU, and TCN models for speeds up to 60°/s, 95°/s, and 130°/s, respectively. However, the performance collapses for moving sources when the FNN model is used.
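The window-size limit for moving sources can be related to how far a source travels within a single analysis window; the following sketch (the helper name is ours) makes the arithmetic explicit:

```python
def intra_window_motion_deg(speed_deg_per_s: float, window_s: float) -> float:
    """Angular displacement of a source within one analysis window."""
    return speed_deg_per_s * window_s

# At the highest speed considered (140 deg/s), a 0.4 s window smears the
# source over about 56 deg, while a 0.2 s window limits the smear to
# about 28 deg -- consistent with the best window sizes lying in [0.1, 0.2] s.
print(round(intra_window_motion_deg(140, 0.4), 1))  # 56.0
print(round(intra_window_motion_deg(140, 0.2), 1))  # 28.0
```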
3) Signal-to-Noise Ratio: Fig. 11 presents the effect of changing the sequence length at different SNRs, given the following parameters: K = 15 frequency bins, overlap of 50%, reverberation time RT = 0.2 s, and W = 0.2 s. We clearly notice a drop in the performance of all models when the SNR drops from 15 to 5 dB. We also notice that using a longer sequence length improves the performance but does not improve the model's robustness against noise; we measure robustness by keeping the sequence length fixed at a certain value and reducing the SNR. For speeds below 95°/s and an SNR of 5 dB, only the TCN was able to maintain a performance above 80%, using a sequence length of at least nine frames. At lower noise levels (e.g., SNR above 10 dB), the GRU matches the performance of the TCN and outperforms the RNN. The degradation in the RNN model's performance may be due to the feedback connection propagating noise from the noisy features, whereas the TCN uses convolutions, which can smooth or filter out some of the noise in the signal.
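The SNR sweep can be reproduced by scaling a noise signal against the clean microphone signal before feature extraction; a minimal sketch, assuming additive white Gaussian noise and a synthetic tone as the clean signal (the dataset's actual signals are not restated here):

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so that clean + noise has the requested SNR in dB."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s tone at 16 kHz
noisy = mix_at_snr(clean, rng.standard_normal(16000), 5.0)

# Check the achieved SNR against the 5 dB target:
achieved = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(round(achieved, 2))  # 5.0
```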

IV. CONCLUSION AND FUTURE WORK
In this work, we have introduced a source DOA detection system (or SSL system) that works with moving sources. A comprehensive study was conducted on the effect of different window sizes, sequence lengths, SNRs, and source angular speeds for four deep learning NN architectures. We found that the TCN and GRU models have comparable performance; however, the TCN is better at utilizing short sequences and maintaining good performance at high source speeds. On the other hand, the GRU outperforms the TCN at lower speeds, but it is slightly more sensitive to noise and requires a longer sequence length for best results. Future work can involve testing the proposed approach with different types of datasets (e.g., moving speech sources), different numbers of microphones and microphone-array geometries, different acoustic environments (e.g., room acoustics and array location), and variable source angular speeds.

Jana Rusrus received the B.Eng. degree in biomedical engineering from Palestine Polytechnic University, Hebron, Palestine, in 2018. She is currently pursuing the M.A.Sc. degree in electrical and computer engineering with the University of Ottawa, Ottawa, ON, Canada.
She is working on the direction-of-arrival estimation of moving sources using deep learning. Her research interests include applied deep learning for acoustic source localization and speech enhancement.
Shervin Shirmohammadi (Fellow, IEEE) received the Ph.D. degree in electrical engineering from the University of Ottawa, Ottawa, ON, Canada, in 2000.
After spending three years in the industry as a Senior Architect and a Project Manager, he joined the University of Ottawa as an Assistant Professor, where he has been a Full Professor with the School of Electrical Engineering and Computer Science since 2012. He is currently the Director of the DISCOVER Lab, doing research in applied artificial intelligence (AI) and measurement methods for multimedia systems and networks. The results of his research, funded by more than $28 million from public and private sectors, include over 400 publications; three best paper awards; over 70 researchers trained at the post-doctoral, Ph.D., and master's levels; 30 patents and technology transfers to the private sector; and a number of awards.
Dr. Shirmohammadi is an IEEE Fellow for contributions to multimedia systems and network measurements and a Senior Member of the Association for Computing Machinery (ACM). He has been an AdCom Member of the IEEE Instrumentation and Measurement Society (IMS) since 2014, currently serves as its Executive Vice President, and was a member of the IEEE

Martin Bouchard (Senior Member, IEEE) received the B.Ing., M.Sc.A., and Ph.D. degrees in electrical engineering from the Université de Sherbrooke, Sherbrooke, QC, Canada, in 1993, 1995, and 1997, respectively. In January 1998, he joined the School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON, Canada. In 1996, he cofounded SoftdB, Inc., Quebec City, QC, Canada, which is still active today with over 100 employees. Over the years, he has conducted research and consulting activities with over 20 private sector and governmental partners, supervised over 50 graduate students and post-doctoral fellows, and authored or coauthored over 45 journal articles and 95 conference papers. He is an inventor of over 25 patents in different countries. His current research interests include signal processing and machine learning methods and their applications. Dr