Source Detection With Multi-Label Classification

Radio source detection with conventional algorithms has proven unreliable when resolving a large number of sources under low signal to interference plus noise ratio (SINR) and with few snapshots. We address this by reformulating source detection as a multi-class classification problem solved using deep learning frameworks. Incoming waveforms are sampled by a centro-symmetric linear array of omni-directional elements, and the normalized upper triangle of the autocorrelation matrix is extracted as the input feature to a modified convolutional neural network with uni-dimensional filters, trained to detect sources in the presence of both uncorrelated and correlated signals. Two detection algorithms, referred to as CNNDetector and RadioNet, are introduced and benchmarked against conventional source detection algorithms. By including forward-backward spatial smoothing as a pre-processing step, RadioNet can also resolve the number of uncorrelated sources in the presence of correlated paths. Finally, the algorithms are stress-tested under challenging operational conditions, and extensive evaluations are presented showing the efficacy and contributions of the introduced predictive models. To the best of our knowledge, this is the first time the source detection problem has been solved for L − 1 sources with an antenna array of L elements using a deep learning framework.


I. INTRODUCTION
Radio signals permeate the world around us. We live in an increasingly connected environment fueled by hand-held devices, household appliances, autonomous vehicles, and wearables; this ease of connectivity is provided by multi-input multi-output (MIMO) systems and the fast data rates of modern communication protocols. As a result, radio source detection has received significant attention over the years [1], [2], [3], [4], [5] and has made its way into applications across various domains of science and engineering [2], [6], [7], [8], [9], [10]. High-resolution direction of arrival (DoA) estimation algorithms such as MUSIC, root-MUSIC, ESPRIT, and several others, including non-parametric machine learning and deep learning methods, require knowledge of the number of sources to compute a viable localization estimate [1], [2], [11], [12], [13], [14]. Maximum likelihood estimation (MLE) under a Gaussian assumption is considered the optimal solution for estimating the number of sources [2], [15], [16]. However, MLE is computationally expensive and is not a viable solution for real-time source detection. The conventional approach is to apply an information theoretic criterion such as the minimum description length (MDL) or the Akaike information criterion (AIC) [17], [18], [19]. However, these approaches suffer degraded detection performance in the presence of low SINR, a small number of samples, and an increasing number of sources to resolve [20]. Eigen-based methods involving the eigenvalue decomposition (EVD) of the autocorrelation matrix have also been employed to estimate the number of sources with good success [21].
More recently, deep neural network (DNN) architectures have been applied to source detection from the sampled waveform [22], [23], [24], [25], [26]. Early works include vanilla networks such as [22], which used a fully connected neural network with 3 hidden layers. A similar model was used in [23] but extended to 8 hidden layers. However, although these models were designed for 10- and 16-element arrays respectively, they could detect the presence of no more than 4 source signals [22], [23]. In [24], a two-layer convolutional neural network (CNN) was trained to predict the number of sources from raw signal waveforms; with an 8-element antenna array, it successfully resolved only up to 4 sources. This approach also ignored the possibility of the autocorrelation matrix being rank-deficient, which directly affects detection performance [27]. This drawback was resolved in [25], where a neural network with one hidden layer was trained on the eigenvectors of the autocorrelation matrix. This improved generalization and the overall detection performance of the algorithm, and the model could also resolve correlated sources by performing forward-backward spatial smoothing (FBSS) on the autocorrelation matrix [27]. However, computing the EVD adds computational complexity that can be avoided [1], [3], and the model used a ten-element array to resolve only up to 5 sources.
In this paper, we introduce CNN-based deep frameworks that detect the maximum number of sources without computing the EVD. Furthermore, we introduce residual layers to improve on the performance of vanilla CNNs. Instead of computing the EVD, we extract the upper triangular elements of the autocorrelation matrix as the input feature. Note that the autocorrelation matrix is Hermitian, so this reduction preserves all the spatial-temporal cues obtained from the waveform [1], [28]. We then stress-test the methods and frameworks in the presence of both uncorrelated and correlated sources to find the performance threshold of the algorithms. The major contributions of this paper are summarized below:
1) We introduce a novel detection framework and evaluate the models up to the operational threshold of L − 1 sources present in the sampled waveform, L being the number of elements in the array. The literature on this problem considers at most four or five sources to resolve [22], [23], [24], [25], [26]. We consider a ten-element array in our experiments and show our models resolving up to 9 sources.
2) A comprehensive detection analysis is performed by studying the sensitivity of the algorithms to a varying number of correlated and non-correlated sources. Section IV presents experiments investigating the interactions between the number of correlated and uncorrelated sources and their effects on the detection probability.
We also study the effect that an increase in the size of the training data has on model performance in that section. We have not found any work in the literature that carefully controls and investigates the effects of correlated sources. The rest of the paper is organized in four parts. Section II describes the problem formulation in terms of the signal model and feature extraction. Section III introduces our first algorithm, the CNNDetector, and investigates its strengths and weaknesses. In Section IV, we introduce residual learning in the form of RadioNet to improve on the CNNDetector, and we demonstrate the improvement in generalization it provides over existing models and methods. Finally, conclusions are drawn in Section V.

II. PROBLEM FORMULATION
A. SIGNAL MODEL
We begin with the ideal case of source detection in the absence of correlated or coherent sources and then extend this to the more complex scenario of source detection in the presence of correlated sources. We consider a centro-symmetric linear array of L identical antennas placed along an axis with the center of the array coinciding with the origin of the coordinate system. The aperture of this array is illuminated by M non-coherent signals, N snapshots are captured, and the source locations θ_i, i = 1, …, M, are chosen arbitrarily from −60° ≤ θ_i ≤ 60°. The inter-element spacing within the array is denoted by d_i. The complex envelope of the received signal X(t) ∈ C^{L×N} can be modeled as

$$X(t) = A(\theta)\,S(t) + n(t), \tag{1}$$

where A(θ) = [a(θ_1), a(θ_2), …, a(θ_M)] ∈ C^{L×M} is the array manifold matrix, which contains steering vectors of the form

$$a(\theta_m) = \left[\,1,\; e^{j2\pi \frac{d_1}{\lambda}\sin(\theta_m)},\; \dots,\; e^{j2\pi \frac{d_{L-1}}{\lambda}\sin(\theta_m)}\,\right]^{T}. \tag{2}$$

S(t) ∈ C^{M×N} consists of the independent signal vectors denoted by s_1, …, s_M. These independent signals are generated by applying quadrature phase shift keying (QPSK) modulation to a random bit sequence. n(t) ∈ C^{L×N} is an additive white Gaussian noise (AWGN) term [29], and λ is the signal wavelength.
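To make the signal model concrete, the following NumPy sketch generates one frame under (1). This is an illustrative sketch, not the authors' code: the half-wavelength spacing, the function names, and the way the SNR sets the noise power are our assumptions.

```python
# Illustrative sketch of the Section II-A signal model (assumptions noted above).
import numpy as np

rng = np.random.default_rng(0)

def steering_matrix(thetas_deg, L, d=0.5):
    """A(theta) for a uniform linear array; d is element spacing in wavelengths."""
    thetas = np.deg2rad(np.asarray(thetas_deg))
    positions = np.arange(L) * d                       # d_i / lambda
    return np.exp(1j * 2 * np.pi * np.outer(positions, np.sin(thetas)))  # L x M

def qpsk_signals(M, N):
    """M independent QPSK streams of N snapshots from random bit sequences."""
    symbols = rng.integers(0, 4, size=(M, N))
    return np.exp(1j * (np.pi / 4 + np.pi / 2 * symbols))

def sample_frame(L=10, M=3, N=256, snr_db=10):
    """One received frame X(t) = A(theta) S(t) + n(t) of shape L x N."""
    thetas = rng.uniform(-60, 60, size=M)              # arbitrary DoAs in [-60, 60] deg
    A = steering_matrix(thetas, L)
    S = qpsk_signals(M, N)
    sigma = 10 ** (-snr_db / 20) / np.sqrt(2)          # per-component noise std
    n = sigma * (rng.standard_normal((L, N)) + 1j * rng.standard_normal((L, N)))
    return A @ S + n
```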
After collecting N snapshots, the autocorrelation matrix can be calculated as

$$R_{xx} = \frac{1}{N}\sum_{t=1}^{N} x(t)\,x^{H}(t). \tag{3}$$

This matrix is then used for feature extraction. The assumption of uncorrelated sources, however, is often too simplistic when modeling radio frequency (RF) environments in dense urban areas. To model such environments accurately, we introduce a varying number of correlated sources to create a second dataset. If one of the original transmitted signals is denoted by s_1(t), then the kth correlated copy of s_1(t) can be written as

$$s_k(t) = \rho_k\, e^{j\phi_k}\, s_1(t), \tag{4}$$

where ρ_k is the amplitude fading factor and φ_k is the phase change caused by multi-path fading. The received signal S(t) can then be modeled as a collection of both independent and correlated signals, closely replicating a modern communication environment: S(t) ∈ C^{M×N} contains both the zero-mean independent signals and the correlated signals generated using (4). The presence of correlated signals renders the autocorrelation matrix rank-deficient [27]. To avoid this, we use FBSS to smooth the autocorrelation matrix before extracting the feature vectors. For FBSS, the array is divided into K overlapping subarrays with K = L − L_0 + 1, L_0 being the dimension of each subarray. The smoothed autocorrelation matrix is given by

$$R_{fb} = \frac{1}{2K}\sum_{k=1}^{K}\left(R_{ff}(k) + R_{bb}(k)\right), \tag{5}$$

where R_ff(k) and R_bb(k) are the forward and backward autocorrelation matrices constructed from the kth subarray and R_fb is the spatially smoothed autocorrelation matrix. The entire smoothing process is denoted in steps 5-15 of Algorithm 2. The resulting smoothed matrix is then used to obtain the feature input.
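A minimal sketch of the autocorrelation (3) and the FBSS smoothing (5) follows. The subarray indexing and the use of the exchange matrix for the backward covariance follow the textbook FBSS construction; they are our assumptions about how Algorithm 2 realizes these steps.

```python
# Hedged sketch of R_xx (3) and FBSS (5); builds on the frame generator above.
import numpy as np

def autocorrelation(X):
    """Sample autocorrelation R_xx = (1/N) * X X^H from an L x N frame."""
    return X @ X.conj().T / X.shape[1]

def fbss(R, L0):
    """Forward-backward spatial smoothing over K = L - L0 + 1 subarrays."""
    L = R.shape[0]
    K = L - L0 + 1
    J = np.eye(L0)[::-1]                          # exchange (flip) matrix
    R_fb = np.zeros((L0, L0), dtype=complex)
    for k in range(K):
        R_ff = R[k:k + L0, k:k + L0]              # forward covariance, k-th subarray
        R_bb = J @ R_ff.conj() @ J                # backward counterpart
        R_fb += R_ff + R_bb
    return R_fb / (2 * K)
```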

B. FEATURE EXTRACTION
Due to the Hermitian symmetry of the autocorrelation matrix, extracting the upper triangular elements along with the diagonal provides sufficient information for learning algorithms [1]. Thus, in the preprocessing step of our algorithm, we first extract only the upper triangular elements; this is denoted in steps 6-12 of Algorithm 1 and steps 17-23 of Algorithm 2. For each extracted entry, the real and imaginary parts are separated and then normalized to obtain a column vector of dimension L(L + 1); this is shown in steps 14-19 of Algorithm 1 and steps 25-30 of Algorithm 2. This way, the upper triangle of the computed R_xx, or R_fb in the case of correlated signals, becomes the input feature for the learning algorithms. Since we train the model on synthetic data generated by realistic simulations, we control the number of signals in the sampled waveform for a given time instance, which becomes the ground truth or target variable [1]. This scalar value is defined in our algorithms as the variable y_label and is randomly generated between 1 and M.
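A short sketch of this preprocessing step is given below; the choice of normalizer (maximum absolute value) is our assumption, since the text does not specify it.

```python
# Sketch of the feature extraction: upper triangle of R -> normalized real vector.
import numpy as np

def extract_features(R):
    """Return the L(L+1)-dimensional real feature vector from R_xx (or R_fb)."""
    iu = np.triu_indices(R.shape[0])              # upper triangle incl. diagonal
    upper = R[iu]                                 # L(L+1)/2 complex entries
    feat = np.concatenate([upper.real, upper.imag])   # L(L+1) real values
    return feat / np.max(np.abs(feat))            # assumed max-abs normalization
```

For a 10-element array this yields the 110 × 1 feature vector referenced in Appendix A.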

III. DETECTION FRAMEWORK I
We begin with the premise that information on the number of sources is contained within the autocorrelation matrix obtained from the sampled waveform [1], [2], and with the observation that a well-designed neural network of sufficient depth should be able to learn the mapping from this information to the number of sources present in the sampled waveform, as demonstrated in [30], [31], [32].
In this section, we propose a detection framework with a CNN as the core learning unit. CNNs are a class of neural network architectures that use filters to uncover underlying information within data and can extract highly abstract features from structured data such as images [1], [33], [34], [35]. As such, they have been adopted and reformulated for various computer vision objectives such as object detection, localization, and semantic segmentation. We start the architecture design with a relatively simple stacked CNN framework referred to as the CNNDetector.

A. CNNDETECTOR
The CNNDetector architecture is composed of a stack of convolutional layers followed by a fully connected layer that generates the discrete outputs in the form of a one-hot encoded target vector. To account for the uni-dimensional (1-D) feature vector obtained from feature extraction, we employ 1-D convolutions with 1-D filters, as visualized in Fig. 1. The kernel size is kept constant at [1 × 3], and batch normalization is performed after each layer to normalize the contribution to a layer for every mini-batch [36]; batch normalization fixes the distributions of layer inputs and limits internal covariate shift. Pooling is performed in the last four convolutional layers. We optimize the number of layers along with the number of filters used in each layer (as shown in Appendix A). We found that 5 stacked CNN layers provide enough depth for detection without overfitting on the data. The number of filters is 128, except in the fourth layer, where it is increased to 256. The fourth convolutional layer also includes a dropout rate of 40% to aid generalization. The output layer is kept constant at [1 × (L − 1)], L being the number of elements in the array. We minimize a categorical cross-entropy loss defined as

$$\mathcal{L} = -\sum_{i} y_i \log \tilde{y}_i, \tag{6}$$

where ỹ_i is the ith scalar value in the model output and y_i is the corresponding target value [37]. The categorical cross-entropy measures the difference between the discrete distributions corresponding to each possible class in the problem.
Minimizing this loss drives the predicted class distribution toward the target distribution; the loss shrinks as the two distributions converge. The network parameters are initialized using Xavier initialization so that the variance of the activations remains the same across every layer [38]. The weights are optimized using the Adam algorithm with a learning rate of 0.001. Once the network is trained, it is evaluated using the detection algorithm for uncorrelated sources presented in Algorithm 1.
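A hedged Keras sketch of the architecture and training configuration described above follows; the ReLU activations, pooling placement, and output width are assumptions where the text is ambiguous (the stated [1 × (L − 1)] output and the ten classes of Fig. 2 differ, so the class count is left as a parameter).

```python
# Sketch of the CNNDetector (assumptions noted above), in Keras.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_detector(input_len=110, num_classes=10):
    m = models.Sequential([layers.Input(shape=(input_len, 1))])
    for i, filters in enumerate([128, 128, 128, 256, 128]):
        m.add(layers.Conv1D(filters, kernel_size=3, padding="same",
                            kernel_initializer="glorot_uniform"))  # Xavier init
        m.add(layers.BatchNormalization())
        m.add(layers.Activation("relu"))
        if i == 3:
            m.add(layers.Dropout(0.4))            # 40% dropout in the fourth layer
        if i >= 1:
            m.add(layers.MaxPooling1D(2))         # pooling in the last four layers
    m.add(layers.Flatten())
    m.add(layers.Dense(num_classes, activation="softmax"))
    m.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
    return m
```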

B. EXPERIMENTS AND EVALUATION
We begin by evaluating the framework in the context of source detection in the presence of uncorrelated sources. Since we extract the features from the autocorrelation matrix, the theoretical maximum number of sources this approach can resolve is capped at L − 1. We use a 10-element array for the studies in this paper; without loss of generality, the theories and methods can be readily extrapolated to arrays of any size and shape. The data for training and testing the networks were generated using the equations introduced in Section II. 110,000 frames were generated and partitioned into 90,000 samples for training and 10,000 each for the validation and test sets; the distribution of the dataset is shown in Appendix B. Each frame is composed of 256 snapshots, which are used to compute the autocorrelation for that frame. The test set is quarantined while the model is trained on the training set and validated on the validation set repeatedly. Once loss minimization saturates, training is stopped, and the test set is used to evaluate the model and generate the evaluation statistics. The number of sources to resolve varies between 0, i.e., no source present in the sampled waveform, and 9, the maximum detection capability of a ten-element array.

The detector achieved 89% classification accuracy on the test set. However, classification accuracy alone is not the ideal metric for a multi-class classification problem [33], [39]. To properly understand the performance of the model, we compute and tabulate the precision, recall, and F1 score in Table 1. Precision is formally defined as the positive predictive value, recall is the percentage of true positives that are correctly classified, and the F1 score is the harmonic mean of the two [40]. Finally, the performance of the framework on the test set is visualized using a confusion matrix in Fig. 2, where each row corresponds to the actual occurrences of a class in our test set and each column to the respective classification by the algorithm [41]. In Fig. 2, we observe that classification accuracy decreases as the number of sources increases, which is intuitive. The model detects a low number of sources (0-2) correctly almost every time. In agreement with our precision and recall values, the classification error increases with an increasing number of sources to resolve. For a large number of sources, such as 8, the CNNDetector correctly classifies only 616 out of 931 (66.2%) instances of that class in the test data. It fares slightly better for the class of 9 sources, classifying 723 out of 887 (81.5%) samples correctly.
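For reference, the reported metrics can be reproduced along the following lines (a sketch; `model`, `X_test`, and `y_true` stand for the trained detector, the quarantined test features, and the integer test labels).

```python
# Sketch of the evaluation: per-class precision/recall/F1 and confusion matrix.
from sklearn.metrics import classification_report, confusion_matrix

def evaluate(model, X_test, y_true):
    y_pred = model.predict(X_test).argmax(axis=1)   # most probable class
    print(classification_report(y_true, y_pred, digits=3))  # precision/recall/F1
    print(confusion_matrix(y_true, y_pred))         # rows: actual, cols: predicted
```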
Next, we evaluate the CNNDetector from an operational point of view under varying SINR conditions. In any realistic scenario the SINR varies, and this shifting variance strongly affects any prediction model. We vary the SINR between 0 dB and 20 dB and compare the detection performance to AIC and MDL. The accuracy for each discrete SINR threshold probed in this work is presented in Fig. 3. We observe that at 0 dB SINR, the accuracy of the CNNDetector is comparable to the conventional MDL and AIC methods. With even a slight increase in signal power over the noise floor, the CNNDetector performs considerably better than the conventional methods across a wide range of SINR, until MDL catches up with the CNNDetector at 20 dB. The accuracy shown in Fig. 3 is obtained by averaging over all classes.
The frame size was fixed at 256 snapshots for the results presented above. Detection can also be performed with fewer snapshots per frame, but the exact threshold is not obvious. To find it, we train different models while varying the frame size between 10 and 256 snapshots and analyze the change in performance. The accuracy for each snapshot size investigated is presented in Fig. 4. The results are revealing: conventional methods such as MDL and AIC saturate at a relatively small number of snapshots, around 20, whereas the CNNDetector leverages the increasing number of snapshots available to it and keeps improving until it reaches the maximum of 256. In agreement with deep learning practice, we gain substantial improvement by increasing the data size alone.
To evaluate the model in the presence of correlated sources, we consider the dataset containing correlated sources. In each frame, the number of non-correlated sources was varied from 0 to 5 and the number of correlated sources from 0 to 4, and a 5-element subarray was used for FBSS on the sampled signals. Instead of training the network from scratch, we apply transfer learning: we resume training the model and fine-tune it on the more complicated scenario of detection in the presence of correlated sources (a sketch of this step follows below). The presence of correlated sources brings the test accuracy down to 56.7% from the 89% reported for non-correlated sources only, and increasing the depth or width of the architecture does not raise it. We hypothesize two possible approaches for improvement: extending the data size to increase generalization, if we have not already hit the generalization threshold of the problem; and introducing residual layers to make the network deeper and improve generalization on unseen data. In the following sections we implement both approaches. Since we achieved satisfactory performance in the absence of correlated sources, we focus from here on only on detection in the presence of correlated sources.
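A minimal sketch of the fine-tuning step is given below; the checkpoint name, epoch budget, and the reduced learning rate are illustrative assumptions, not reported settings.

```python
# Hedged sketch of transfer learning onto the correlated-source dataset.
import tensorflow as tf

def fine_tune(checkpoint_path, X_train, y_train, X_val, y_val):
    """Resume from the uncorrelated-source weights and fine-tune."""
    model = tf.keras.models.load_model(checkpoint_path)   # e.g. "cnn_detector.h5"
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              epochs=50, batch_size=64)                   # assumed budget
    return model
```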

IV. DETECTION FRAMEWORK II
A. RADIONET
The basic building block of RadioNet is the ResNet module introduced in [35]. ResNets have been shown to perform better than traditional CNNs because it is easier to optimize the residual mapping than the original, more complex desired mapping. Residual blocks form the backbone of the ResNet architecture and have outperformed conventional CNNs on a host of machine learning and computer vision applications [35], [42]. Increasing the number of layers in a conventional network leads to a saturation in learning beyond a threshold, and networks of even greater depth become very difficult to optimize. ResNets largely solve this issue with residual blocks: instead of learning the unreferenced function, the network learns the residual function with reference to the layer inputs. The output of the residual block can be written as

$$y = \mathcal{F}(x) + x, \tag{7}$$

where x is the input to the block and y is the output. The block is trained to learn F(x), which is less complex than the desired mapping F(x) + x; at the output, an identity mapping adds the input to F(x). This forms the residual block. Stacking multiple such residual blocks was shown to decrease the training error faster while being easier to optimize than a traditional CNN of the same depth [35]. RadioNet consists of 16 residual blocks performing identity mappings, with average pooling before the final fully connected layers. Small changes are made to the traditional ResNet-34 architecture to accommodate the 1-D input data: the filter size is [1 × 7] in the first layer and [1 × 3] thereafter, and the number of filters varies over 64, 128, 256, and 512. These hyperparameters are optimized using a grid search. We use the same categorical cross-entropy loss (6) to train the network. The computed errors are backpropagated during training, and the weights are updated with a learning rate of 0.0001 using stochastic gradient descent (SGD) with a Nesterov momentum of 0.9 and batches of size 64 [43]. Once the network is trained, it is tested on correlated sources as shown in Algorithm 2.
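A minimal 1-D residual block in the spirit of this description might look as follows; the ReLU placement and the projection shortcut used when the channel count or stride changes are assumptions based on the standard ResNet design [35].

```python
# Sketch of a 1-D residual block, y = F(x) + x, as in (7).
import tensorflow as tf
from tensorflow.keras import layers

def residual_block_1d(x, filters, stride=1):
    shortcut = x
    y = layers.Conv1D(filters, 3, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv1D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    if stride != 1 or shortcut.shape[-1] != filters:
        # 1x1 projection so the shortcut matches F(x) in shape
        shortcut = layers.Conv1D(filters, 1, strides=stride, padding="same")(shortcut)
        shortcut = layers.BatchNormalization()(shortcut)
    return layers.Activation("relu")(layers.Add()([y, shortcut]))

# Training configuration stated in the text: SGD with Nesterov momentum.
sgd = tf.keras.optimizers.SGD(learning_rate=1e-4, momentum=0.9, nesterov=True)
```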

B. EVALUATION
We evaluate the model with the same experimental setup as in Section III. Each frame contains both correlated and non-correlated sources: the number of non-correlated sources was varied from 0 to 5 and the number of correlated sources from 0 to 4. The size of the subarray for FBSS is kept at 5, while every other parameter is unchanged. The accuracy achieved by RadioNet is plotted in Fig. 6 and compared to the other algorithms in Fig. 5. We witness a significant improvement in detection accuracy with the introduction of the residual layers. To benchmark the improvements, we tabulate the precision, recall, and F1 scores in Table 2 and compare with the CNNDetector from Section III. RadioNet, with its residual layers, comprehensively outperforms the CNNDetector in the presence of correlated sources on all metrics and for every class in the dataset. We also see that the F1 scores decrease sharply beyond 3 sources. Hence, we conclude that RadioNet can successfully classify up to 3 sources in the presence of correlated signals; as the number of signals increases above 3, performance deteriorates. Both deep learning models introduced in this paper consistently perform better than MDL and AIC with FBSS across all SINR thresholds, as shown in Fig. 6. At SINR values above 10 dB, RadioNet performs far better than MDL, AIC, and the CNNDetector, and hence appears to be the best choice for source detection due to its ability to generalize in the presence of both correlated and non-correlated sources.

C. VARYING NUMBER OF SOURCES
We proceed to study the effect of varying the number of sources (both correlated and uncorrelated) on model performance. Datasets with varying numbers of correlated and uncorrelated signals, each consisting of 3000 samples, are generated and tested on the trained models; the same data are also tested with FBSS-MDL and FBSS-AIC. The results for each class are summarized in Fig. 7. As seen in this plot, the CNNDetector is the most suitable model when only 1 or 2 non-correlated signals are considered, with accuracy comparable to RadioNet. However, it is surpassed by RadioNet when the number of non-correlated signals increases to 3. For the most complex scenario, where the received signal is sampled with 3 non-correlated and 3 correlated sources, RadioNet obtains an accuracy of 72% while the CNNDetector obtains only 56%. This can be understood by noting that simpler networks suffice for low-complexity scenarios; the efficacy of residual layers becomes apparent in complex scenarios where the traditional CNN performs poorly. We also see that the performance of our models tends to deteriorate as the number of correlated signals approaches the number of non-correlated signals. This trend is evident even in the simplest scenario of one non-correlated and one correlated signal, for which the CNNDetector and RadioNet obtain accuracies of only 72% and 71%, respectively. For the case of 2 non-correlated and 2 correlated signals, these improve to 87% and 86%, while the accuracy metrics degrade again for 3 non-correlated and 3 correlated sources. However, both of the deep learning solutions we propose consistently achieve better accuracy than conventional techniques such as MDL and AIC in all cases. From the results, it can be summarized that the deep learning based models perform source detection with much more certainty when the number of non-correlated sources to resolve is less than the number of correlated sources.

D. INCREASING DATA
Increasing the data size has been shown to improve model performance in deep learning [1], [33]. We therefore generate an extended dataset of 1 million frames containing varying numbers of correlated and uncorrelated sources to quantify the improvement in generalization that a larger dataset provides. We split the dataset into training, validation, and test sets containing 800,000, 100,000, and 100,000 frames, respectively. We take the pre-trained CNNDetector and RadioNet models and train them on the extended dataset; note that there is no overlap between the two training datasets in this transfer learning. The optimally trained model is then evaluated on the test set. The accuracy, along with the precision, recall, and F1 scores for every possible source scenario between 0 and 5, is detailed in Table 3.
The tenfold increase in training data results in a significant improvement in model performance. We witness an overall improvement in accuracy and F1 scores across the source scenarios for both deep learning models introduced in this paper. The mean accuracy of the CNNDetector increased from 56.7% with 80,000 training instances to 79.3% with 800,000 training instances. Mean accuracy across all classes for RadioNet jumped from 85.9% to 89.4%, and the mean F1 score across all classes improved from 0.87 to 0.88. The RadioNet trained on the extended dataset outperforms every other model, including a RadioNet trained on the smaller dataset. Both deep models generalize well for a small number of sources to resolve, and there is little to choose between them. As the number of sources to resolve increases above 2 in the presence of correlated sources, RadioNet starts to comprehensively outperform the vanilla detector: it achieves a minimum accuracy of 79% with an F1 score above 0.75 across all scenarios, compared to an accuracy of 70% with an F1 score of 0.69, validating the introduction of the architecture.

V. CONCLUSION
In this paper, we introduced two deep learning frameworks that detect the presence of radio sources using multi-label classification. We began with the relatively simple scenario of source detection in the presence of uncorrelated sources only and proposed an optimized CNN architecture for this task. We conducted an extensive evaluation of the framework and compared it to existing methods in the literature. Our findings suggest that a properly trained neural network can accurately detect the presence of up to L − 1 sources with an array of L elements operating at an SINR at or above 5 dB. We then progressed to the complex scenario of source detection in the presence of both uncorrelated and correlated sources in the sampled waveform and introduced architectural changes in the form of residual layers to improve the detection capability of a deep CNN. We evaluated the proposed framework in realistic scenarios, such as varying numbers of correlated and non-correlated sources, and benchmarked its performance under various SINR conditions. We demonstrated how the proposed detection framework improves on existing methodologies and detailed the expected performance as well as the remaining scope for improvement in source detection. We do note that the proposed RadioNet has limitations as the number of correlated sources increases.

APPENDIX A
This section describes the experiments performed to optimize the number of layers in the CNNDetector. The size of the data as it passes through the network is shown at the top of each layer in Fig. 1. After five layers of convolution, the feature map of size 110 × 1 is brought down to a single scalar value; hence, we cannot add more than five convolutional layers in this experiment. We can instead stop after 3 or 4 layers of convolution; in those cases, the size of the final fully connected layer is adjusted to the size of the data at that stage, and the model is then trained on the dataset. The results of this experiment are shown in Table 4, from which we can see that using the maximum number of convolutional layers is most beneficial for source detection. All other hyperparameters, such as the learning rate, stride, and batch size, are optimized in a grid search.

APPENDIX B
This section describes the data distribution of the datasets involved. Fig. 8 shows the number of samples per class when only non-correlated sources are considered, and Fig. 9 shows the per-class counts for the dataset involving correlated sources.
We can see from both figures that each class contains almost the same number of samples; the models are thus trained on balanced datasets.

REPEATABILITY
The datasets, code, and results can be found in the GitHub repository dedicated to this work: https://github.com/jkrishnan95v/Signal_detector