An End-to-End Deep Learning Framework for Wideband Signal Recognition

Successful management of the radio spectrum requires, as a first step, detailed information about spectrum occupancy. In this work, we present an end-to-end deep learning (DL) based framework to obtain information from wide spectrum bands through signal detection, localization, and modulation classification. By visually representing the radio signals in spectrograms, we formulate the wideband detection problem as an object detection task from the computer vision field. To this end, the proposed framework consists of two cascaded modules: an object detection network repurposed to detect and classify distinctive signals in wideband spectrograms, and a convolutional neural network (CNN) designed to extend the classification capabilities to support a wide range of analog and digital modulation schemes. To evaluate our framework, we use a public wideband recognition dataset, which we carefully analyze and curate through a series of preprocessing techniques. To tackle the challenges of insufficient training data and class imbalance observed in the dataset, we suggest a training strategy that includes data mixing and transfer learning. Our experimental results on a general test set demonstrate that the proposed approach can detect and classify a variety of narrowband signals with simultaneously high precision (77.1%), recall (81.8%), and localization accuracy, as indicated by an average Intersection over Union (IoU) of 86%.


I. INTRODUCTION
Gathering and analyzing information from the radio spectrum is critical to ensuring its proper utilization. Regulatory authorities continuously monitor the spectrum to enforce transmitters' compliance with the licenses and standards that define legitimate spectrum usage [1]. By detecting and classifying active emissions within a frequency band, regulators are able to spot anomalies, hunt interference, and detect and potentially geolocate intrusive devices in order to protect authorized users' transmissions. Authorized users themselves are also interested in acquiring and exploiting such information to protect their transmissions and reduce their spectrum license costs. Furthermore, understanding the typical spectrum occupancy is important for regulatory agencies to maintain or redesign smart spectrum allocation strategies that can ensure more efficient utilization of frequency resources.

(The associate editor coordinating the review of this manuscript and approving it for publication was Chengpeng Hao.)
In addition to fixed spectrum allocation, spectrum analysis is particularly necessary for dynamic spectrum sharing. The latter is the fundamental principle of cognitive radio (CR), in which secondary (unlicensed) users transmit in frequency bands that are not utilized by their assigned primary (licensed) users. Secondary users must sense the spectrum and attempt to detect the presence of transmitted primary signals in order to ascertain the availability of unoccupied bands before transmission [2].
With the rapid expansion of wireless networks, such techniques for efficient spectrum management are becoming indispensable. The recently developed 5G standard supports diverse technologies and high data rate services [3], while the emerging Internet-of-Things (IoT) requires dense connectivity among numerous wireless devices [4]. To accommodate these technologies and their ever-growing spectral demand, intelligent utilization of the resulting congested spectrum is required, which encourages further research into efficient, automated spectrum monitoring solutions.

A. PROBLEM FORMULATION
Extracting information from the spectrum can be realized through different tasks, including detecting the presence of signals in frequency bands of interest, estimating radio frequency (RF) parameters of the detected signals (bandwidth, carrier frequency, power level, etc.), classifying their modulation scheme or wireless technology, identifying and localizing interfering transmissions, detecting anomalies, etc. In this work, we focus particularly on the tasks of signal detection, localization, and modulation classification. Specifically, we consider a radio monitoring receiver that is tuned to monitor the spectral activity in a wide frequency band. At a given time interval, all active signals transmitted in any frequency range within this band should be jointly detected by the receiver. Hence, the overall observed RF signal $r_{\mathrm{RF}}(t)$ can be modeled as a superposition of an arbitrary number $K$ of arriving signals and receiver noise $n(t)$, i.e.,

$$r_{\mathrm{RF}}(t) = \sum_{k=1}^{K} r_k(t) + n(t). \qquad (1)$$

Here, each arriving signal $r_k(t)$ is subject to an individual channel and can be generally expressed in the RF domain as

$$r_k(t) = \Re\left\{ \left[ \int h_k(\tau, t)\, s_k(t - \tau)\, d\tau \right] e^{j 2\pi F_{c_k} t} \right\}, \qquad (2)$$

where $\Re\{\cdot\}$ stands for the real part of a complex number. Furthermore, $s_k(t)$ represents the equivalent complex baseband (ECB) signal generated at the $k$-th transmitter via modulation and pulse shaping. The modulated baseband signal is upconverted to the carrier frequency $F_{c_k}$ and transmitted through the wireless channel, which is characterized by the ECB weight function $h_k(\tau, t)$. Based on the observed signal $r_{\mathrm{RF}}(t)$, the goals of wideband signal detection and classification are to: 1) identify the presence of the $K$ active signals, 2) determine the bandwidth each narrowband signal occupies by predicting its lowest and highest frequency, 3) determine the time segment during which each narrowband signal is received by estimating its start time and duration, and 4) classify the detected narrowband signals based on their modulation schemes.
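For illustration, the superposition model in (1) can be simulated directly; the following pure-Python sketch (helper names are ours, and a flat unit-gain channel is assumed in place of the general $h_k(\tau, t)$) synthesizes a wideband observation from $K$ baseband signals:

```python
import cmath
import math
import random

def upconvert(s_bb, fc, fs):
    """Real RF signal from ECB samples: r[n] = Re{s_bb[n] * exp(j*2*pi*fc*n/fs)}."""
    return [(s * cmath.exp(2j * math.pi * fc * n / fs)).real
            for n, s in enumerate(s_bb)]

def observed_signal(baseband_signals, carriers, fs, noise_std=0.1, seed=0):
    """Model r_RF(t) = sum_k r_k(t) + n(t) from Eq. (1). A flat unit-gain
    channel is assumed for every h_k(tau, t); the noise is white Gaussian."""
    rng = random.Random(seed)
    num = len(baseband_signals[0])
    r = [0.0] * num
    for s_bb, fc in zip(baseband_signals, carriers):
        for n, v in enumerate(upconvert(s_bb, fc, fs)):
            r[n] += v
    return [v + rng.gauss(0.0, noise_std) for v in r]
```

Such a generator is only a sketch of the model, not the dataset-generation pipeline used in this work.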

B. RELATED WORK
There exists an extensive body of work focused on solving the tasks listed above. Traditional algorithmic approaches are mainly based on processing the observed signals, analyzing their statistics, and extracting from them manually designed features that distinguish signals of interest from noise, interference, or other signal types. General methods like energy-based detectors are too simplistic and their results are not sufficiently informative, whereas feature-based methods are highly specific and cannot generalize to a wide range of signal types or transmission scenarios [5]. Hence, traditional methods cannot be expected to scale well with the massive amount of spectral data and the complexity of the radio spectrum in current and future wireless networks. In light of these limitations, and considering the proven capability of machine learning (ML) and especially deep learning (DL) in dealing with such complex settings, researchers have recently focused on designing DL based methods to learn from the spectrum. The most extensively investigated topic is that of automatic modulation classification, where neural networks with various architectures are employed to classify independent narrowband signals based on their modulation schemes [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19]. The more general task of wideband signal detection has been the subject of fewer research works. Among them, some of the proposed methods focus solely on detecting the presence of narrowband signals in wide frequency bands [20], [21], others enable both detection and localization in the time and frequency domains [22], [23], [24], while other works offer classification capabilities on top of detection and localization. In the latter category, research efforts have mainly focused on attributing signals to specific wireless technologies.
In particular, in [25] and [26], object detection networks from the computer vision field have been applied for discovering IoT and WiFi signals, respectively, in wide frequency bands. Similarly, a deep convolutional learning framework is proposed in [27] for detecting Morse signals from wideband spectrum data.

C. OUR WORK
In this work, we aim to jointly address the tasks of signal detection, localization, and modulation classification in wide frequency bands through an end-to-end learning framework. In the following, we use the term ''wideband signal recognition'' to refer to all three tasks jointly, according to the notation used in [28]. Such a framework, which can be implemented e.g. in a dedicated spectrum monitoring system, should support a general pool of signals modulated with various analog and digital schemes, instead of focusing on specific types of transmission. In line with our objective of localizing signals both in time and frequency domains, we use a time-frequency representation (spectrogram) to visualize the spectral content of a wide frequency band. This representation allows us to reformulate our problem as an object detection task from the computer vision field, whereby detecting signals in a spectrogram is analogous to detecting objects in an image. Such a task can be readily solved by employing one of the recent, remarkably successful DL based object detection networks. This approach was followed in [29] and, independently, in [31], whereas, in our work, we provide a joint assessment of the overall framework.
To evaluate the proposed approach, we use a public dataset of wideband signals, which was recently introduced by West et al. [28]. The dataset contains synthetically generated wideband recordings characterized by diverse layouts of a variety of narrowband emissions, which makes it suitable for a fair assessment of our method. Along with the dataset, the authors of [28] propose using U-Net to perform semantic segmentation of spectrogram images in order to detect, localize, and classify signals. However, the performance evaluation of their approach only demonstrates its detection capabilities for a single wideband signal from the test set; it does not provide sufficient information on the performance over the entire test set, nor does it describe the localization and classification capabilities of U-Net. Furthermore, to the best of our knowledge, no other works have proposed wideband recognition solutions for this public dataset, even though the latter was introduced for the IEEE SPAWC 2021 Wideband Radio Signal Recognition Challenge [32].
We conducted a detailed analysis of the dataset, which brought to light several challenges, mainly related to the lack of sufficient training data and a serious class imbalance. To address these challenges, we curated the dataset using a series of data processing, data sampling, and data augmentation techniques. Furthermore, we generated our own dataset to compensate for the lack of sufficient training data and employed a training strategy that includes data mixing and transfer learning to boost the detection performance of the YOLOv5 signal detector.

D. CONTRIBUTIONS AND OUTLINE
The contributions of our work can be summarized as follows:
• We propose an end-to-end DL framework for the detection, localization, and modulation classification of a variety of signals in wide frequency bands. The framework is adaptable and can support a wide range of digital and analog modulation classes.
• We use a public wideband recognition dataset to evaluate the proposed framework. The dataset is carefully analyzed and curated through a series of data processing techniques.
• To adjust the framework to the considered representative public dataset and its challenges, we employ suitable training strategies that include data mixing and transfer learning. We believe that these strategies, along with the applied data processing techniques, can be extended to other learning based schemes in communications that are affected by a lack of sufficient or high-quality training data.
• The performance is assessed in detail by adapting object detection metrics that quantify the sensitivity, robustness to noise, and classification and localization accuracy of the proposed method.

This paper is organized as follows. Section II provides a thorough description of the proposed wideband recognition framework. Here, we explain in detail the functionality and design of every component of the framework, including the adopted wideband signal representation, the selected YOLOv5 network for the wideband signal recognition step, and, finally, the designed DL module for PAM signal classification. Subsequently, Section III provides a detailed analysis of the dataset and a description of the applied data processing techniques. The conducted experiments and the obtained results are presented in Section IV, along with a discussion of our main findings. Finally, Section V concludes the paper.

II. WIDEBAND RECOGNITION FRAMEWORK
In this section, we describe in detail the proposed wideband recognition framework depicted in Fig. 1, explaining and motivating the design of all its components.

A. WIDEBAND SIGNAL REPRESENTATION
The wideband signal recognition problem involves the localization of signals in both time and frequency domains; hence, it is only natural to rely on a time-frequency representation (TFR) of the observed signal $r_{\mathrm{RF}}(t)$. In particular, in accordance with the classical principle of energy detection for signal sensing, quadratic TFRs are of most interest because they provide information about the energy distribution of the signal in the time-frequency plane [33]. In practice, the most extensively used quadratic TFR is the spectrogram, which displays the squared magnitude of the Short-Time Fourier Transform (STFT) of the signal. The spectrogram shows how the spectral content of the wideband signal varies with time, though with a limited representation quality resulting from the fundamental trade-off between time resolution and frequency resolution [33]. A similar representation is offered by the scalogram [34], which equals the squared magnitude of a wavelet transform performed on the wideband signal. The scalogram suffers from the same resolution limitations as the spectrogram, but, in contrast to the spectrogram, which offers the same time-frequency resolution throughout the analyzed frequency band, the resolution of the wavelet transform is frequency dependent: higher frequencies are analyzed with poorer frequency resolution [33]. If all analyzed frequencies are equally important, as is the case for a spectrum monitoring problem, such frequency-dependent resolution is not desirable.
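To make the spectrogram computation concrete, a minimal sketch follows (a direct DFT over Hann-windowed frames for clarity, with illustrative frame length and hop; an FFT would be used in practice), together with the mapping of power values to an 8-bit image, as an image-based detector expects:

```python
import cmath
import math

def spectrogram(x, nfft=64, hop=32):
    """Power spectrogram: squared magnitude of the STFT over Hann-windowed
    frames. Larger nfft improves frequency resolution at the cost of time
    resolution, reflecting the fundamental trade-off."""
    win = [0.5 - 0.5 * math.cos(2 * math.pi * n / nfft) for n in range(nfft)]
    frames = []
    for start in range(0, len(x) - nfft + 1, hop):
        seg = [x[start + n] * win[n] for n in range(nfft)]
        frames.append([abs(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / nfft)
                               for n in range(nfft))) ** 2 for k in range(nfft)])
    return frames  # time x frequency map of power values

def to_image(S, eps=1e-12):
    """Map power values to 8-bit pixels on a log scale."""
    db = [[10 * math.log10(p + eps) for p in row] for row in S]
    lo = min(min(row) for row in db)
    hi = max(max(row) for row in db)
    span = (hi - lo) or 1.0
    return [[int(round(255 * (v - lo) / span)) for v in row] for row in db]
```

A tone at a given frequency concentrates its power in the corresponding frequency bin of every frame, which is exactly the bright horizontal or vertical band that the detector learns to box.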
Alternative TFRs, such as the Wigner distribution (WD) and its variants [33], [35], offer a significantly improved time-frequency concentration. However, for multi-component signals as in (1), they introduce serious interfering cross-component terms, which make the time-frequency analysis of multi-component signals difficult. Related WD based methods, such as the Choi-Williams distribution, the Pseudo Wigner distribution, the Generalized Exponential distribution, etc., try to address the interference problem by applying a smoothing operation to suppress the interference terms [33]. However, this suppression comes at the cost of a reduced time-frequency concentration, which erodes the main advantage of this category of methods. On the other hand, the interference problem is less pronounced in a spectrogram or scalogram representation, as their cross-component terms are restricted only to those regions in the time-frequency plane where the signal terms overlap. In the public dataset used in this work, the narrowband emissions that are active in a wide frequency band do not overlap, as they are assigned either different frequency subbands or different transmission time slots within the same subband. Under these conditions, a spectrogram or scalogram representation is not disturbed by interference terms. In contrast, in the synthetic dataset generated and employed in our previous work [29], partial overlapping between narrowband signals was allowed. Even in such cases, a spectrogram representation proved to be sufficiently informative to allow the YOLO signal detector used in [29] to detect and correctly localize the partially overlapped signals.
Based on the above comparison of different TFR methods, in this work, we use the spectrogram representation to map the time-domain samples of a wideband signal to a spectrogram image. This image representation allows us to reformulate our problem in the context of object detection from the computer vision field: the task of detecting signals in a spectrogram is analogous to the task of detecting objects in an image. To this end, we propose repurposing a DL based object detection method to detect, localize, and classify signals in wideband spectrograms.
Modulation classification via the object detector network is based on the premise that signals modulated with different modulation schemes exhibit distinctive visual features in the spectrogram. For analog modulation schemes, such as Amplitude Modulation - Double Sideband/Single Sideband (AM-DSB/SSB) or Frequency Modulation (FM), the time-frequency characteristics depend on the spectral content of the information signal. For digital modulation schemes, the modulation parameters (amplitude, phase, or frequency) define the typical appearance in the spectrogram. For example, for an $M$-ary Frequency-Shift Keying (FSK) modulated signal, the power spectral density is characterized by $M$ distinctive peaks or spectral line components at the frequencies corresponding to different information symbols. Likewise, in a spectrogram, an $M$-FSK signal would exhibit bright lines or bright narrow bands at the modulated frequencies, which create a distinctive appearance for FSK modulated signals. On the other hand, signals modulated with digital amplitude/phase modulation schemes, such as $M$-PSK, $M$-QAM, and $M$-APSK, exhibit very similar appearances in the spectrogram. To see this, let us consider a transmit signal $s(t)$ modulated with any of the above modulation schemes, which can be expressed as

$$s(t) = \sum_{n} a_n\, g(t - n T_s), \qquad (3)$$

where $a_n$ denotes the $n$-th modulated symbol, $g(t)$ stands for the pulse shaping filter, and $T_s$ is the modulation interval. Depending on the employed modulation scheme, i.e., PSK, QAM, or APSK, the information is encoded into the variation of the phase and/or amplitude of the complex coefficients $a_n = |a_n| e^{j\varphi_n}$. The STFT of the signal $s(t)$ is defined as [33]

$$\mathrm{STFT}_s(t, \omega) = \int s(\tau)\, \gamma(\tau - t)\, e^{-j\omega\tau}\, d\tau. \qquad (4)$$

Here, $\gamma(\tau - t)$ denotes the analysis window centered around the time $t$, which is used to localize the time-domain signal for calculating the short-time spectrum.
In the following, we assume a rectangular pulse shaping filter with impulse response $g(\tau)$ (extending from $0$ to $T_s$) and a rectangular analysis window $\gamma(\tau)$ with duration $T \leq T_s$ (extending from $-\frac{T}{2}$ to $\frac{T}{2}$). When the analysis window $\gamma(\tau - t)$ lies within a symbol interval, the STFT in (4) can be simplified (via the substitution $\tau \to \tau + t$) to

$$\mathrm{STFT}_s(t, \omega) = \int_{-T/2}^{T/2} |a_n| e^{j\varphi_n} e^{-j\omega(\tau + t)}\, d\tau = |a_n| e^{j\varphi_n} e^{-j\omega t}\, T\, \mathrm{sinc}\!\left(\frac{\omega T}{2}\right). \qquad (5)$$

Finally, the spectrogram of the modulated signal $s(t)$ can be obtained as the squared magnitude of the corresponding STFT in (5), i.e.,

$$|\mathrm{STFT}_s(t, \omega)|^2 = |a_n|^2\, T^2\, \mathrm{sinc}^2\!\left(\frac{\omega T}{2}\right), \qquad (6)$$

which in the spectrogram image corresponds to a bright wide band with small fluctuations of brightness. For an $M$-PSK signal that comprises unit-amplitude modulation symbols, i.e., $a_n = e^{j\varphi_n}$, the spectrogram in (6) preserves the same shape regardless of the modulation order $M$. Similarly, for $M$-QAM and $M$-APSK signals, different signal constellations only impact the scaling term $|a_n|^2$ in the spectrogram expression in (6), but not its shape. Hence, a variation of the symbols' amplitude only scales the brightness of the spectrogram representation, which is not sufficient to provide distinction between different modulation classes or different modulation orders of the same class.¹ For the case when the analysis window of the STFT spans more than one symbol interval, the complete analysis in [31] reaches the same conclusion: modulation order or modulation constellation variations have only a mild effect on the spectrogram characteristic of digital amplitude/phase modulation schemes, subject to the condition that the symbols $a_n$ are independent and identically uniformly distributed, which is typically the case. Fig. 2 shows a spectrogram representation of several signals modulated with analog and digital schemes. From the figure, it can be seen that PSK, APSK, and QAM exhibit an almost identical time-frequency characteristic, unlike other modulation classes.
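The within-symbol STFT behaviour described above can be verified numerically; the following sketch (our own helper names, midpoint-rule integration, window centred at $t = 0$) confirms that symbols of equal magnitude but different phases yield identical spectrogram values:

```python
import cmath
import math

def sinc(x):
    """Unnormalised sinc, sin(x)/x."""
    return 1.0 if x == 0 else math.sin(x) / x

def stft_in_symbol(a, omega, T, steps=2000):
    """Numerically evaluate the STFT of s(t) = a over one symbol, for a
    rectangular analysis window on [-T/2, T/2] centred at t = 0 (so the
    e^{-j*omega*t} factor equals 1), using midpoint-rule integration."""
    d = T / steps
    return sum(a * cmath.exp(-1j * omega * (-T / 2 + (i + 0.5) * d)) * d
               for i in range(steps))
```

The magnitude matches $|a|\, T\, |\mathrm{sinc}(\omega T / 2)|$, and changing only the symbol phase leaves the magnitude, and hence the spectrogram pixel, unchanged.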
To cope with this limitation, in our wideband recognition framework, all PSK, QAM, and APSK signals are grouped together in a joint class that we call digital PAM. With respect to these signals, the task of the object detector is simply to detect, localize, and distinguish them from other identifiable classes, such as FSK, AM, FM, Gaussian Minimum Shift Keying (GMSK), etc. In the next step, it is the responsibility of the second DL module, hereinafter referred to as the PAM classifier, to recognize the specific modulation subclass and modulation order of the detected PAM signals. Specifically, as demonstrated in several recent works [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], the PAM classifier network can be trained solely on the I/Q samples of narrowband PAM signals to learn to distinguish the PSK, QAM, and APSK modulation classes from each other. Such I/Q samples can be easily extracted from the wideband recordings using the time/frequency localization coordinates predicted by the object detector, as will be further explained in Section II-C.

¹ In simulations, it was observed that similar spectrogram representations of $M$-PSK, $M$-QAM, and $M$-APSK signals are obtained even for different pulse shaping filters, such as square-root raised cosine (SRRC) filters, and for different analysis windows.
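The I/Q extraction step mentioned above can be sketched as follows. This is a hypothetical helper, not the paper's implementation: the detector's time/frequency box drives a mix-down to baseband, a crude moving-average low-pass, and decimation; a production system would use a properly designed filter:

```python
import cmath
import math

def extract_iq(r, fs, f_low, f_high, t_start, t_stop):
    """Hypothetical helper: extract narrowband I/Q samples from a real-valued
    wideband recording r, given a detected box (f_low, f_high, t_start, t_stop).
    The band centre is mixed down to 0 Hz, a moving-average low-pass is applied,
    and the result is decimated toward the narrowband rate."""
    n0, n1 = int(t_start * fs), min(int(t_stop * fs), len(r))
    fc = 0.5 * (f_low + f_high)          # predicted carrier frequency
    bw = f_high - f_low                  # predicted bandwidth
    mixed = [r[n] * cmath.exp(-2j * math.pi * fc * n / fs) for n in range(n0, n1)]
    L = max(1, int(fs / bw))             # averaging length ~ one band period
    avg = [sum(mixed[i - L + 1:i + 1]) / L for i in range(L - 1, len(mixed))]
    return avg[::L]                      # decimate toward the narrowband rate
```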

B. OBJECT DETECTION
To solve the wideband recognition problem, which is formulated in terms of an object detection task, an object detector network is adopted in the first stage of the framework. In the following, we first provide a brief overview of the object detection field, before presenting and describing in detail the selected YOLOv5 architecture.

1) OVERVIEW
Object detection is a fundamental problem in computer vision that deals with detecting, localizing, and classifying instances of objects of interest in digital images and videos. The field witnessed a particularly rapid development during the last decade, as the application of DL techniques led to major breakthroughs and the design of successful object detection systems, which are already enabling contemporary real-world applications [36]. DL based object detection networks are generally grouped into two categories: region based detectors and single-shot detectors. The methods from the first category employ a traditional two-stage pipeline: first, a set of regions (boxes) that might contain an object is extracted from the input image and then, in the second step, a classical CNN based classifier is applied to each candidate box to identify the potentially existing object. This idea was originally exploited in the Regions with CNN features (R-CNN) network [37], which is arguably still one of the most prominent DL based object detectors. R-CNN was succeeded by a variety of improved methods, such as SPPNet [38], Fast R-CNN [39], Faster R-CNN [40], Feature Pyramid Network [41], etc., which attempted to address the drawbacks of R-CNN. Such drawbacks include the redundant feature computation over a large number of overlapping regions and the computational complexity arising from the multi-stage training.
Single-shot detectors are based on a simpler one-stage detection workflow. The initial representative of this category is YOLO, in which object detection is reformulated as a single regression problem [42]. YOLO processes the whole input image (rather than parts/regions of the image) to simultaneously predict multiple bounding boxes and their respective classes. SSD [43] follows a similar detection pipeline while introducing the multiresolution detection technique, which allows detecting objects of different scales in different layers of the network. The simplified one-stage pipeline significantly improves the detection speed, rendering YOLO and SSD state-of-the-art real-time object detectors. However, this speed improvement comes at the cost of detection performance degradation, a problem which was identified and investigated in [44], where RetinaNet, another notable single-shot detector, was introduced.
For our problem of wideband signal recognition, we focus on single-shot detectors. Their high processing speed enables real-time analysis of wide frequency bands, which is an attractive property for RF monitoring systems. Furthermore, the task of detecting RF signals in spectrogram images should be easier than classical object detection tasks that deal with hundreds of object categories [45]. Hence, the performance gap with respect to region based detectors should not be very pronounced for our problem, as shown experimentally in [29]. In particular, motivated by its simple architecture, real-time detection speed, implementation flexibility, and its overall competitive performance in several renowned object detection challenges [45], [46], [47], we choose to adopt the YOLO network. The following section describes YOLO in greater detail, focusing on its detection mechanism and network architecture.

2) YOLO OBJECT DETECTOR
YOLO leverages supervised learning to learn how to find and classify the objects in an image. For this purpose, every image in the training set is annotated by a label, which comprises the class and location coordinates of each object. To understand its detection mechanism, we first explain how this annotation information is encoded in the YOLO framework.
YOLO divides an input image into an S × S grid and distributes the task of detecting multiple objects among the cells of the grid [42]. Specifically, a cell is responsible for detecting an object whose center falls within the grid cell. Every cell predicts a fixed number B of bounding boxes, which are defined based on four location coordinates: the (x, y) coordinates of the center of the box, and its width w and height h [42]. Furthermore, each bounding box is characterized by a confidence score that quantifies two criteria: how confident the network is that the box contains an object, and how well it believes this predicted box matches the ground truth one, if a ground truth object exists. The first criterion is formally expressed as the probability of objectness Pr(Object), and the second criterion is described by the Intersection-over-Union (IoU) of the predicted and ground truth bounding boxes. Hence, for bounding boxes predicted in cells that contain objects, the confidence score is formally defined as C = Pr(Object) · IoU [42]. In cells that do not contain ground truth objects, the confidence score of a predicted bounding box simply characterizes its estimated probability of objectness. Finally, for the classification task, every grid cell predicts N class probabilities conditioned on the cell containing an object, where N denotes the number of classes. This means that only one set of class probabilities is predicted per cell, regardless of the number of bounding boxes, thus limiting the number of objects that a cell can detect to one [42]. To achieve this, during the training process, the network assigns only the bounding box with the highest IoU with the ground truth box to be responsible for the detection of an object in each grid cell [42].
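The cell-assignment and confidence definitions above can be written down directly; a minimal sketch with our own helper names, for boxes given in normalised center-size coordinates:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x_center, y_center, w, h)."""
    ax0, ay0 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax1, ay1 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx0, by0 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx1, by1 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))   # intersection width
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))   # intersection height
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def responsible_cell(x, y, S):
    """Grid cell (col, row) whose area contains the object centre, x, y in [0, 1]."""
    return min(int(x * S), S - 1), min(int(y * S), S - 1)

def confidence(p_object, iou_val):
    """Confidence score C = Pr(Object) * IoU."""
    return p_object * iou_val
```

The same IoU function is also what the evaluation metrics in Section IV rely on when deciding whether a predicted signal box matches a ground truth one.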
In summary, every cell predicts a confidence score and 4 localization coordinates for each bounding box as well as N conditional class probabilities. By stacking the predictions from all cells together, the final prediction for an image is represented as an S × S × (5B + N ) tensor. Using this representation, YOLO can unify all three tasks of object detection, localization, and classification and reformulate them as a single regression problem. Specifically, given an input image, the task of the network is to learn how to predict the tensor of localization coordinates and class probabilities directly from the pixel values of the image.
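The S × S × (5B + N) target layout can be illustrated by encoding ground-truth labels into such a tensor; the within-cell ordering below (per box: [x, y, w, h, C], then N class probabilities) is our assumption for illustration. With the original YOLO settings S = 7, B = 2, N = 20, the depth is 30:

```python
def encode_targets(objects, S=7, B=2, N=20):
    """Build an S x S x (5B + N) target tensor. Each object is
    (x, y, w, h, class_id) with coordinates normalised to [0, 1]."""
    depth = 5 * B + N
    T = [[[0.0] * depth for _ in range(S)] for _ in range(S)]
    for x, y, w, h, cls in objects:
        col, row = min(int(x * S), S - 1), min(int(y * S), S - 1)
        cell = T[row][col]
        for b in range(B):            # every box slot carries the ground truth;
            off = 5 * b               # training picks the responsible one by IoU
            cell[off:off + 5] = [x, y, w, h, 1.0]
        cell[5 * B + cls] = 1.0       # one-hot conditional class probabilities
    return T
```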
The YOLO network consists of two parts: the backbone, whose role is to extract features from the input image, and the detection head, which predicts the bounding boxes and the class probabilities based on the feature maps [42]. In the first training step, the backbone, consisting of 20 convolutional layers, is pretrained on the ImageNet 1000-class competition dataset [45]. Then, in the second step, the backbone is concatenated with the detection head, comprising four additional convolutional layers and two fully connected layers, to perform object detection. During this stage of the training, YOLO tries to simultaneously minimize the detection, classification, and localization error by optimizing a multi-term loss function, defined as [42]

$$\begin{aligned} \mathcal{L} = \; & \lambda_{\mathrm{coord}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}^{\mathrm{obj}}_{i,j} \left[ (x_i - \hat{x}_{i,j})^2 + (y_i - \hat{y}_{i,j})^2 \right] \\ + \; & \lambda_{\mathrm{coord}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}^{\mathrm{obj}}_{i,j} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_{i,j}}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_{i,j}}\right)^2 \right] \\ + \; & \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}^{\mathrm{obj}}_{i,j} \left(C_i - \hat{C}_{i,j}\right)^2 + \lambda_{\mathrm{noobj}} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}^{\mathrm{noobj}}_{i} \left(C_i - \hat{C}_{i,j}\right)^2 \\ + \; & \sum_{i=1}^{S^2} \mathbb{1}^{\mathrm{obj}}_{i} \sum_{c=1}^{N} \left(p_i(c) - \hat{p}_i(c)\right)^2. \end{aligned}$$

Here, $\mathbb{1}^{\mathrm{obj}}_{i}$ is an indicator function that indicates whether the $i$-th cell contains an object, and $\mathbb{1}^{\mathrm{obj}}_{i,j}$ reveals whether the $j$-th bounding box predicted by the $i$-th cell is responsible for detecting that object. Furthermore, the hat symbol $(\hat{\cdot})$ denotes an estimated quantity. For cells that contain objects, the first two terms of the loss function specify the localization errors and the third term stands for the error in the confidence score prediction. In the latter term, $\hat{C}_{i,j}$ is predicted by the network, whereas $C_i$ is the true confidence, which takes a value of 1 when the $i$-th cell contains an object and 0 otherwise. For the $i$-th cell, only the bounding box responsible for detecting the object contributes to the localization and confidence score errors, while the others are not included. If a cell does not contain an object but the network predicts one, the first three terms are automatically set to zero and the wrong prediction is penalized by the fourth term. Here, $\mathbb{1}^{\mathrm{noobj}}_{i}$ takes a value of 1 when the $i$-th cell does not contain an object.
Finally, the classification error for cells that contain objects is accounted for by the last term of the loss function, where $p_i(c)$ and $\hat{p}_i(c)$ stand for the true and predicted probability of class $c$ for the $i$-th cell, respectively.² Different penalties can be assigned to different sources of errors by tuning the weighting coefficients $\lambda_{\mathrm{coord}}$ and $\lambda_{\mathrm{noobj}}$.
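The per-cell contribution of this multi-term objective can be sketched as follows (a simplified single-cell version with our own function names; box responsibility is passed in rather than computed from IoU):

```python
import math

def cell_loss(pred_boxes, pred_classes, gt_box, gt_class, resp_j,
              lam_coord=5.0, lam_noobj=0.5, n_classes=20):
    """Loss contribution of one grid cell. pred_boxes is a list of
    (x, y, w, h, C) tuples; resp_j indexes the responsible box (highest IoU
    with the ground truth). gt_box is None for a cell without an object."""
    if gt_box is None:
        # only the no-object confidence penalty applies (true C_i = 0)
        return lam_noobj * sum(C ** 2 for *_, C in pred_boxes)
    x, y, w, h = gt_box
    px, py, pw, ph, pC = pred_boxes[resp_j]
    loss = lam_coord * ((x - px) ** 2 + (y - py) ** 2)          # centre error
    loss += lam_coord * ((math.sqrt(w) - math.sqrt(pw)) ** 2    # size error on
                         + (math.sqrt(h) - math.sqrt(ph)) ** 2) # square roots
    loss += (1.0 - pC) ** 2                                     # true C_i = 1
    probs = [1.0 if c == gt_class else 0.0 for c in range(n_classes)]
    loss += sum((p - q) ** 2 for p, q in zip(probs, pred_classes))
    return loss
```

The square roots on width and height reflect that the same absolute size error matters more for small boxes, such as short, narrowband bursts, than for large ones.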
When evaluated on benchmark object detection datasets, YOLO exhibits advantageous properties. In addition to ensuring real-time detection speed, YOLO proved to possess a high generalization capability, which aids the transferability of the trained model from one domain to another [42]. This ability is of interest for our task of signal detection, considering that the lack of sufficient or high-quality real-world RF data often necessitates training on synthetic or mixed datasets, as will be discussed in Section IV. Furthermore, YOLO reasons globally about the image and encodes contextual information, which is helpful for avoiding background errors and for ensuring a low false alarm rate. On the other hand, the simplified architecture is responsible for a few drawbacks, mainly related to localization inaccuracies and limitations on the number of detectable objects in an image. These drawbacks were later addressed by the subsequent versions of YOLO, YOLOv2 [48] and YOLOv3 [49].
YOLOv2 adopts a more accurate backbone architecture (Darknet-19) and employs a set of new training and detection techniques, including batch normalization for better convergence, high-resolution classification, and multi-scale training for detection at different image resolutions. Another key element in YOLOv2 is the use of prior anchor boxes, which simplify the bounding box prediction and, consequently, improve the localization. YOLOv3 continues to build upon the previous versions while focusing on improving their accuracy. To this end, the authors of [49] propose a more powerful backbone model, referred to as Darknet-53, with residual connections added across its 53 convolutional layers. On the other hand, the detection mechanism is similar to that of YOLOv2, except for the additional capability of performing detection across different scales. Specifically, the detection head takes as input three feature maps extracted at different stages of the backbone network and outputs a prediction tensor for every feature map. Since the feature maps provide different levels of information granularity, such a multi-scale detection mechanism allows detecting objects of various dimensions and sizes.
YOLOv3 is the latest contribution of the original authors to the evolution of the YOLO framework. However, many researchers from the computer vision field carried on its development and published further YOLO versions: YOLOv4 [50], Scaled YOLOv4 [51], YOLOX [52], YOLOR [53], YOLOv5 [30], YOLOv7 [54], etc. Among these versions, in this work, we select YOLOv5, which shares a similar architecture and detection techniques with YOLOv4. However, YOLOv5 is presented as a light, open-source implementation in PyTorch, hence offering more ease of use and flexibility in repurposing and exporting it to a new domain. The YOLOv5 network architecture is shown in Fig. 3. The first part of the network, i.e., the backbone, is based on the Darknet-53 model introduced in YOLOv3, but its design is further enhanced by the Cross Stage Partial Network (CSPNet) strategy [55]. In the original Darknet-53, a residual module employs two consecutive convolutional layers, whose output is summed with the input of the residual module through a skip connection. In contrast, in CSP-Darknet-53, the residual module is replaced by a CSP module. Here, the input feature map is split channel-wise into two parts. The first part goes through a simple convolutional block, while the second part goes through the traditional Darknet-53 residual block, as shown in Fig. 3. At the output of the CSP module, the filtered parts are merged together through concatenation. This feature split-and-merge strategy is beneficial as it improves the flow of information and gradients through the layers, and it encourages feature reuse in a computationally efficient way.
A Spatial Pyramid Pooling (SPP) module [38] is placed on top of the CSP-Darknet-53 backbone. In this module, the input feature map propagates through three independent branches, where it is subject to maximum pooling operations that use sliding kernels of dimensions 5 × 5, 9 × 9, and 13 × 13. Pooling kernels of different sizes increase the receptive field from the input image at different levels. Hence, the resulting feature maps in each branch can effectively ''see'' fields of different sizes from the input image, which enables extracting and aggregating local features at multiple scales. In each branch, appropriate padding and stride of the kernel are used to preserve the spatial dimension of the feature map. This allows the output and input feature maps to be concatenated together into a global feature map. Similar to YOLOv3, YOLOv5 uses three different feature maps from the backbone network to perform detection. A Path Aggregation Network (PAN) [56] is inserted for collecting and aggregating these features from different backbone levels through a series of convolution, upsampling, and concatenation operations. While high-level features provide meaningful semantic information, early low-level features contain finer-grained information. When fused together in the PAN module, they are combined into a rich feature representation that improves the subsequent detection performance.
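The dimension-preserving pooling and concatenation in the SPP module can be illustrated with a minimal numpy sketch. This is not the YOLOv5 implementation; the function names and the loop-based pooling are our own illustrative choices:

```python
import numpy as np

def maxpool2d_same(x, k):
    """Stride-1 max pooling over a (C, H, W) float feature map with
    -inf padding, so the spatial dimensions are preserved."""
    C, H, W = x.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)), constant_values=-np.inf)
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[:, i, j] = xp[:, i:i + k, j:j + k].max(axis=(1, 2))
    return out

def spp(x, kernels=(5, 9, 13)):
    """Concatenate the input with its three pooled versions along the
    channel axis, as in the SPP module described above."""
    return np.concatenate([x] + [maxpool2d_same(x, k) for k in kernels], axis=0)
```

Because each branch preserves H × W, a C-channel input yields a 4C-channel global feature map after concatenation.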
The third and final part of the YOLOv5 network is the detection head, which realizes a similar detection mechanism as YOLOv2 and YOLOv3, with every cell of the grid predicting a set of bounding boxes (based on predefined anchors) and a set of class probabilities for every box. The training loss function is still calculated as the sum of localization, objectness, and classification error terms, but it has some changes compared to the loss function in (7). In particular, the last three terms on objectness and classification loss are now calculated using a binary cross-entropy loss function. Furthermore, instead of the usual mean-squared error, the localization loss is calculated using an IoU-based loss function, which assesses the overall accuracy of the predicted bounding box rather than the accuracy of each predicted coordinate independently.
Specifically, similar to YOLOv4, YOLOv5 employs the Complete IoU (CIoU) loss, which takes into account three geometrical quantities to describe the similarity between the predicted and ground truth bounding boxes: the overlapping area, the Euclidean distance between the central points of the two boxes, and the aspect ratio consistency [57].
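As a sketch of how these three geometrical quantities combine, the following numpy function computes the CIoU between two boxes given in (center x, center y, width, height) format; the CIoU loss is then 1 − CIoU. The formula follows the formulation in [57], but this is an illustrative reimplementation, not the YOLOv5 code:

```python
import numpy as np

def ciou(box_a, box_b, eps=1e-9):
    """Complete IoU between two boxes (cx, cy, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ax1, ay1, ax2, ay2 = ax - aw / 2, ay - ah / 2, ax + aw / 2, ay + ah / 2
    bx1, by1, bx2, by2 = bx - bw / 2, by - bh / 2, bx + bw / 2, by + bh / 2
    # 1) overlapping area -> IoU
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = aw * ah + bw * bh - inter + eps
    iou = inter / union
    # 2) squared center distance, normalized by the enclosing box diagonal
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    rho2 = (ax - bx) ** 2 + (ay - by) ** 2
    c2 = cw ** 2 + ch ** 2 + eps
    # 3) aspect ratio consistency term
    v = (4 / np.pi ** 2) * (np.arctan(aw / ah) - np.arctan(bw / bh)) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v
```

Two perfectly overlapping boxes give CIoU ≈ 1 (loss ≈ 0), while distant boxes drive CIoU below zero through the center-distance penalty.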
Further technical details on the utilization and adaptation of YOLOv5 as a wideband signal detector are given in Section IV.

C. PAM SIGNAL EXTRACTION AND PREPROCESSING
As a result of processing a wideband spectrogram image, the YOLOv5 network should predict localization information and a modulation class label for every detected narrowband signal. The localization information describes the position of the rectangular box that bounds the detected signal in terms of its (x, y) center coordinates, width, and height, which are normalized to the respective image dimensions. Knowing the image dimensions and the overall frequency band and time segment represented in the spectrogram image along its x axis and y axis, respectively, the localization coordinates in the image can be easily converted to time and frequency coordinates. For signals with distinctive time-frequency characteristics, these coordinates and the class label constitute all the information that is required from the wideband recognition framework. On the other hand, for signals assigned to the PAM class, a more specific further classification is needed. A PAM classifier network is employed for this task, which is designed to work on ECB samples of narrowband PAM signals. For this purpose, the detected PAM signal should first be extracted from the wideband recording it belongs to according to the following steps:
1) Let r_RF[n] denote the discrete-time wideband signal represented by the spectrogram image processed by YOLOv5. Furthermore, let f̂_low and f̂_high denote the predicted lowest and highest frequency coordinate, respectively, of the detected narrowband PAM signal.
To extract the latter in the frequency domain, the wideband signal is filtered according to

r_1[n] = (r_RF * h_BP)[n],

where the symbol ''*'' denotes a convolution operation, h_BP[n] stands for the impulse response of a bandpass filter with cutoff frequencies f̂_low and f̂_high, and r_1[n] stands for the resulting bandpass-filtered signal.
2) Next, a segmentation step in the time domain is performed in order to cut out the PAM signal of interest from r_1[n] using the predicted starting time t̂_start and stop time t̂_stop of the corresponding detection as follows:

r_2[n] = r_1[n + n_start], 0 ≤ n ≤ n_stop − n_start.

Here, n_start = t̂_start · F_s, n_stop = t̂_stop · F_s, and F_s is the sampling frequency.
3) Finally, based on the estimated center frequency of the PAM signal, f̂_c = (f̂_low + f̂_high)/2, the extracted signal r_2[n] is downconverted to the ECB domain. The resulting signal r_PAM[n] contains the baseband I/Q samples of the extracted PAM signal.
Depending on the time duration of the narrowband PAM transmissions, the resulting extracted signals can have varying lengths. For a PAM classifier network designed as a CNN or ResNet, this can be problematic, because such networks are usually trained on fixed-length data and have network dimensions that are fixed based on the size of the input. To tailor the network to such varying-length inputs, a practical approach is to partition the extracted signal into frames of fixed length (e.g., 1024 samples), which can all be processed and classified independently by the modulation classification network. At test time, the class label of an extracted PAM signal can be determined as the most frequently occurring label from the set of label predictions for the frames that make up the signal.
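The three extraction steps can be sketched in numpy as follows. The windowed-sinc bandpass filter and its length are illustrative assumptions (the text does not specify a filter design), and the hats on the predicted quantities are dropped in the code:

```python
import numpy as np

def extract_pam(r_rf, f_low, f_high, t_start, t_stop, fs=1.0, num_taps=513):
    """Extract a detected narrowband PAM burst from a wideband recording.
    Frequencies are normalized to fs; num_taps is an assumed filter length."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    # 1) bandpass filtering: windowed-sinc lowpass shifted to the band center
    bw = f_high - f_low
    f_c = (f_low + f_high) / 2
    h_lp = (bw / fs) * np.sinc((bw / fs) * n) * np.hanning(num_taps)
    h_bp = h_lp * np.exp(2j * np.pi * (f_c / fs) * n)
    r1 = np.convolve(r_rf, h_bp, mode="same")
    # 2) time segmentation using the predicted start/stop times
    n_start, n_stop = int(t_start * fs), int(t_stop * fs)
    r2 = r1[n_start:n_stop]
    # 3) downconversion to the ECB domain (baseband I/Q samples)
    m = np.arange(n_start, n_stop)
    return r2 * np.exp(-2j * np.pi * (f_c / fs) * m)
```

A quick check: a complex tone inside the predicted band should come out as a (nearly) constant-phase baseband signal after downconversion.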

D. PAM MODULATION CLASSIFIER NETWORK
DL based automatic modulation classification has received significant attention in recent years. In the numerous works that have investigated this topic, the signals to be classified are mostly modeled as independent, narrowband transmissions, and a neural network is employed to recognize their modulation class, which is chosen from a pool of analog and digital schemes. O'Shea et al. [6], [7], [8] first demonstrated the potential of DL for modulation classification by training CNNs and ResNets on I/Q samples of radio signals. Furthermore, they introduced three datasets containing narrowband signals captured over the air or synthetically generated under realistic assumptions. These datasets were used as benchmarks in several succeeding works, which proposed either novel DL architectures tailored to modulation classification, or new signal representations. In [9], a conventional CNN is trained on I/Q samples of narrowband signals to classify basic modulation classes, while another CNN is trained on constellation diagrams to recognize the more challenging higher-order modulations. Similarly, classification based on constellation diagrams is investigated in [10], where several image processing techniques designed for enhancing the constellation diagrams were analyzed, thus linking the task of modulation recognition with the computer vision field. Other representation formats such as the amplitude/phase representation and Fourier transform were studied in [11], using CNN architectures.
An alternative approach to automatic modulation classification treats narrowband radio signals as time series data and employs recurrent neural networks (RNNs) or related schemes such as long short-term memory (LSTM) networks to extract long-term temporal dependencies of the signals [13], [14], [15], [16]. These features proved suitable for identifying distinctive patterns of symbol-to-symbol transitions in signals modulated with different classes, yielding a high classification accuracy, though at an increased architecture and training complexity [17]. In further works, the advantages of CNN and RNN based processing are combined in hybrid architectures, such as the CLDNN architecture [18], where an LSTM layer follows a set of convolutional layers, or the HybridNet [19], consisting of sequential residual blocks and bidirectional gated recurrent units.
In our work, the PAM classification task is slightly different from the typical automatic modulation classification problem considered in related works. First, one should consider the effects that YOLO prediction and subsequent filtering steps introduce to the input signals. Specifically, localization inaccuracies of the YOLO object detector result in incorrectly estimated time/frequency coordinates. In frequency domain, errors in predicting the lowest and highest frequency of the PAM signals cause problems in the bandpass filtering used to extract the narrowband signal from the wideband observation. Predicted bounding boxes that are narrower than the true bandwidth of the signals could result in cropping important parts of the signal, while wider bounding boxes would inevitably include parts of the surrounding spectrum to the extracted signal, like noise or interference from nearby transmissions. Furthermore, errors in the frequency domain also impact the estimated center frequency which is used to downconvert the extracted signal to baseband, resulting in a residual frequency offset. Similar effects can also be expected due to errors in the estimation of the time coordinates. Such artifacts render our PAM modulation classification task more challenging than the typical automatic modulation classification problem that assumes perfectly detected and segmented narrowband signals. To guarantee robustness towards these artifacts, the classifier must learn to recognize the modulation classes of signals even when they are affected by the above-mentioned imperfections. This can be achieved by generating a training dataset where the signals are created in the same way as the test signals, i.e., through an imperfect extraction from the wideband signal they originally belong to. Section IV provides more specific details on the PAM classification training dataset used in this work. 
Finally, in addition to the localization errors, one should also account for possible classification error events, in which the YOLOv5 detector confuses one of the other classes in the dataset with PAM. If a wrong PAM detection is extracted from the wideband signal and given as input to the PAM classifier, this network should be able to correct the mistake by flagging the detection as an outlier, i.e., indicating that the signal is not PAM modulated. For this purpose, we add another class to the PAM dataset, labeled ''Others'', which contains signals modulated with any other modulation scheme except for PAM, e.g., AM-DSB, AM-SSB, FM, etc. The signals of this class should teach the network what a non-PAM signal looks like.
From the variety of learning-based modulation classification methods proposed in the literature, we have decided to investigate CNN based architectures, namely a conventional CNN model and a ResNet model. Naturally, as described above, there exist more complex and potentially better-performing alternatives, like RNNs or combinations of CNNs with LSTM units. However, since our proposed framework consists of two sequential DL modules (YOLOv5 + PAM classifier), it is important to be cautious about the complexity and the processing time of the involved networks, which motivates favoring simple CNN architectures. Fig. 4 shows the architecture of our conventional CNN used for PAM modulation classification. The input to the network is a 2-dimensional tensor of size 1024 × 2, where the first column contains the in-phase component and the second column contains the quadrature component of the complex time-domain samples of the signal. This arrangement of the data is needed to ensure that the network can operate on real-valued inputs. The input signal is first processed through six convolutional blocks, each of them consisting of a convolutional layer, a batch normalization layer, a rectified linear unit (ReLU) activation function, and an optional max pooling layer. We apply the max pooling operation starting from the third block to avoid losing important information from the early layers of the network. For the first convolutional layer, the filter size is set to 8 × 2 in order to capture the relationship between the I/Q components of sufficiently many consecutive samples. For all subsequent layers, the filter size is set to 16 × 1. Next, the convolutional layers' output is flattened and further processed through three fully connected layers followed by a Softmax activation function where the classification is performed. 
The first two fully connected layers are followed by dropout layers (dropout probability = 0.5), which serve as regularizers and prevent overfitting during training. This design of the network is in line with other CNN models proposed in the literature. However, its architecture is adjusted to optimize the performance on our PAM dataset by optimizing the combination of design hyperparameters, such as the number of convolutional blocks, number of fully connected layers, number and size of filters, etc., which are set as shown in Fig. 4. The resulting network has 7 207 607 trainable parameters.
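As a sanity check on the architecture, the temporal dimension of the feature maps can be traced through the six convolutional blocks. The helper below assumes 'same'-padded convolutions (length-preserving) and 2× max pooling from the third block onward; these are our assumptions, since Fig. 4 is not reproduced here:

```python
def cnn_output_length(n_in=1024, n_blocks=6, pool_from=3):
    """Trace the temporal dimension through the convolutional blocks,
    assuming 'same'-padded convolutions and 2x max pooling applied from
    the pool_from-th block onward."""
    n = n_in
    for b in range(1, n_blocks + 1):
        # the convolution itself preserves the length under 'same' padding
        if b >= pool_from:
            n //= 2  # max pooling halves the temporal dimension
    return n
```

Under these assumptions, the 1024-sample input is pooled in blocks 3 through 6, giving a temporal dimension of 1024 / 2^4 = 64 before flattening.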
The second network model that we consider is the ResNet shown in Fig. 5. Its design is inspired by the ResNet architecture proposed in [17], and it is based on the main conclusions from the CNN hyperparameter optimization process. Specifically, the input is first processed with a convolutional layer with filter size 8 × 2. Next, its output is forwarded through three residual stacks and three subsequent fully connected layers. As shown in the block diagrams on the right side of Fig. 5, a residual stack consists of a convolutional layer, two residual units, and a max pooling layer. The residual unit comprises two convolutional layers and utilizes a skip connection that adds the input of the residual unit to the output of the second convolutional layer. This skip connection thus creates an alternative shortcut path for the gradient flow, which is beneficial in particular when the conventional path through the convolutional layers causes a vanishing or exploding gradient problem. The filter size for all convolutional layers of the residual stacks is set to 16 × 1 and the dropout probability is 0.1. The resulting ResNet has 1 868 423 trainable parameters, i.e., around 3.8 times fewer than the CNN.

III. PUBLIC BENCHMARK DATASET
A benchmark dataset for wideband signal recognition has been introduced in [28]. The dataset consists of synthetically generated wideband signals in the ECB domain, which are modeled as a superposition of several narrowband emissions, according to the signal model in (1). Here, the narrowband signals differ from each other in terms of their modulation class, bandwidth, start time, duration, and amplitude.³ The layout of narrowband emissions in the time-frequency grid is unique for every wideband recording in the dataset and follows typical band layout profiles of the Industrial, Scientific and Medical (ISM) band, cellular bands, Public Safety band, Personal Communication Services (PCS) band, etc. [28]. As a result, the signals in the dataset resemble real-world wideband RF signals.
The recordings are saved in a SigMF format, which includes binary files containing the I/Q data samples of the wideband signals, as well as JSON files with their respective annotations [58], which provide both general information on the wideband signal they describe (sample rate, center frequency, signal duration), as well as specific information for each narrowband emission (starting sample, total number of samples, lowest and highest frequency, amplitude scaling factor, and modulation class). Because the dataset is fully annotated, supervised learning can be applied for all three tasks of signal detection, localization, and classification.
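A minimal reader for such annotations might look as follows. The core:* keys follow the SigMF core namespace; the label field (here assumed to be core:description) and the exact per-burst keys used by this particular dataset are assumptions for illustration:

```python
import json

def load_annotations(meta_path):
    """Parse a SigMF metadata (JSON) file into the sample rate and a list
    of narrowband burst annotations. Key names beyond the SigMF core
    namespace are dataset-specific assumptions."""
    with open(meta_path) as f:
        meta = json.load(f)
    fs = meta["global"]["core:sample_rate"]
    bursts = []
    for ann in meta.get("annotations", []):
        bursts.append({
            "start_sample": ann["core:sample_start"],
            "num_samples": ann["core:sample_count"],
            "f_low": ann.get("core:freq_lower_edge"),
            "f_high": ann.get("core:freq_upper_edge"),
            "label": ann.get("core:description"),  # assumed label field
        })
    return fs, bursts
```

Each returned burst record then carries exactly the fields needed for supervised detection, localization, and classification labels.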
The dataset is divided into a training set and a test set, consisting of 109 and 21 wideband recordings, respectively. Considering that only the train set annotations have been made public by the authors of [28], we only use the train set as our general dataset and split the latter further into train and test sets, as described in Section III-C. In the following, we discuss the preprocessing steps we have conducted to adapt the dataset to our wideband signal recognition framework.

A. DATASET ANALYSIS
In [28], every wideband ECB signal in the dataset is normalized to a sampling frequency of F_s = 1 sample per second (Sps), has a time duration of L = 100 million samples, and a center frequency of F_c = 0 Hz.⁴ On the other hand, the narrowband signals have highly varying properties. They are modulated using a range of analog and digital modulation classes; Fig. 6 shows the distribution of these classes in the overall dataset. Evidently, the dataset is characterized by a serious modulation class imbalance, with M-PSK and M-QAM dominating and the other classes being severely under-represented. Fig. 7 and Fig. 8 show the histograms of the narrowband signals' time durations and bandwidths, respectively. Around 99% of the narrowband signals have a time duration of less than 1 million samples, and only a few exhibit the same duration as the wideband signals. The normalized bandwidth of the narrowband signals is also characterized by a high variation, with the minimum bandwidth reaching BW_min = 0.0005 Hz and the maximum bandwidth reaching BW_max = 0.998 Hz.
³ The raw signals in the dataset are not subject to noise or channel fading. The only impairment affecting the signals is the adjacent channel interference arising due to sidelobes and filtering artifacts [28].
⁴ The wideband signals in the public dataset are in line with the signal model in (1) and additionally downconverted to a center frequency of 0 Hz.

B. TIME-FREQUENCY REPRESENTATION
In our proposed framework, the YOLO network for wideband signal recognition receives spectrogram images as input. Hence, as a first step, the wideband signals in the dataset should be represented accordingly. To avoid a high processing complexity due to the very long duration of the wideband recordings and to ensure a sufficiently good time resolution in the spectrogram even for the shortest emissions, we partition every wideband signal into 100 segments in the time domain. This step increases the number of spectrogram images available for YOLO to 10,900. Furthermore, the spectrogram images are now significantly simplified, as they show the time-frequency content of a wideband signal only for a time segment of 1 million samples; thus, they contain much fewer narrowband signal bursts. Narrowband signals whose time duration is larger than 1 million samples span several spectrograms and are counted as separate signals. As a result, the corresponding new distribution of the modulation classes is shown in Fig. 9. While the class imbalance problem remains grave, partitioning has artificially increased the number of occurrences of less-represented classes. Several techniques for further combating the class imbalance problem are mentioned in Section III-C.
For every partitioned wideband signal, the time-frequency representation is calculated via the STFT. Here, the number of frequency bins in the fast Fourier transform (FFT) calculation is set to 4096 to ensure a sufficiently high resolution in the frequency domain for recognizing signals with the narrowest occurring bandwidths. In the time domain, the signal is analyzed through a non-overlapping Hann window with a fixed length of 512 samples, resulting in a spectrogram matrix with dimensions 1953 × 4096. The final image visualizes the normalized logarithm of the calculated spectrogram, where the normalization is done by removing the mean and dividing by the standard deviation, according to the procedure in [28].
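The described spectrogram computation can be sketched in numpy as follows. This is an illustrative implementation of the stated settings (non-overlapping 512-sample Hann window, 4096 FFT bins, standardized log-magnitude); the fftshift for centering the ECB spectrum at 0 Hz is our assumption:

```python
import numpy as np

def spectrogram_image(x, win_len=512, nfft=4096):
    """Non-overlapping Hann-windowed STFT, log-magnitude, standardized
    to zero mean and unit standard deviation."""
    n_seg = len(x) // win_len
    frames = x[:n_seg * win_len].reshape(n_seg, win_len)
    win = np.hanning(win_len)
    # zero-padded FFT per frame; fftshift centers 0 Hz (assumed convention)
    spec = np.fft.fftshift(np.fft.fft(frames * win, n=nfft, axis=1), axes=1)
    log_mag = np.log10(np.abs(spec) ** 2 + 1e-12)
    return (log_mag - log_mag.mean()) / log_mag.std()
```

With a segment of 1 million samples, this yields floor(10^6 / 512) = 1953 time rows and 4096 frequency columns, matching the 1953 × 4096 matrix above.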
From the inspection of the generated spectrogram images, we observed that for three of the modulation classes, namely, 2-FSK, 4-FSK, and GMSK, not all instances have the same time-frequency characteristics in the spectrogram. Fig. 10 and Fig. 11 show two typical appearances of 2-FSK and 4-FSK modulated signals, respectively. Considering that an object detector typically categorizes in one class objects with similar visual features, it is expected that these two different shapes for 2-FSK and 4-FSK signals, respectively, would confuse the signal detector. Signals in the left images have a significantly larger bandwidth than signals in the right images and distinctive bright lines at two and four frequencies, respectively. As for the signals in the right images, they exhibit a smaller bandwidth and are characterized by a continuous transition between the frequencies present in the signal. Based on our experience with synthetic and real-world datasets, these features might indicate a continuous-phase (CP) version of the FSK modulation. Hence, to avoid confusion, we manually introduce two new classes in the dataset: 2-CPFSK and 4-CPFSK, and adjust the labels of signals corresponding to these two classes accordingly. The same procedure is applied to the GMSK class, which is divided into two subclasses: GMSK1 only includes signals with a shape as shown in the right side of Fig. 12, while GMSK2 includes the remaining GMSK signals with the left shape. These modifications help to avoid possible confusion among the mentioned classes but also extend the functionality of the proposed wideband signal recognizer to classify a larger number of signal types.

C. CLASS IMBALANCE MITIGATION TECHNIQUES
A class imbalance occurs when one or more classes of a dataset are represented with significantly more data samples than other classes. Such a skewed data distribution can arise due to a difference in the real-world occurrence frequencies of different classes, or it can be artificially introduced in the data generation/collection process [59]. In a classification problem, the class imbalance might impact the learning performance, as learning based classifiers tend to learn and favor the over-represented classes, due to their increased prior probability. The class imbalance can be quantified by an imbalance ratio metric, which is defined as [59]

ρ = max_i |S_i| / min_i |S_i|,

where S_i denotes the set of data samples of the i-th class and |·| stands for the cardinality of a set. The original wideband recognition dataset is characterized by an imbalance ratio of ρ = 339.4, whereas the dataset containing partitioned wideband signals has a reduced imbalance ratio of ρ = 17.86. However, in our signal recognition framework, all M-PSK and M-QAM signals are grouped under the same class PAM, as explained in Section II. The resulting PAM category contains a total of 15206 signal instances, which is a much higher number than the size of other classes in the dataset and increases the imbalance ratio to ρ = 84.95. The class imbalance problem has been carefully studied in the context of a pure classification task, where every data sample (e.g., an image) is only labeled by the name of the class it belongs to. However, according to the suggestion in [59], techniques for handling class imbalance in image classification can also be adapted to object detection. In [59], techniques for handling the class imbalance problem have been divided into three main categories:
1) Data-level methods attempt to reduce the imbalance by modifying the distribution of the training data.
Two important methods from this category include under-sampling, which discards data samples from the majority classes, and oversampling, which produces data samples from the minority classes. The sampling can be performed randomly or in an informed manner. Specifically, informed under-sampling techniques ensure that only noisy or redundant samples from the majority classes are discarded. Similarly, intelligent oversampling methods aim to produce synthetic minority samples that can improve discrimination and strengthen the boundaries between classes, e.g., by interpolating between existing minority samples and their nearest minority neighbors [59].
2) Algorithm-level methods modify the ML/DL classifier or its learning procedure so that it is more attentive to under-represented classes. For example, penalties or weights are assigned to each class such that the cost of misclassifying minority classes is higher, which forces the classifier to learn to distinguish them better.
3) Hybrid methods combine both data-level and algorithm-level methods.
In object detection, the most extensively studied class imbalance problem concerns the imbalance between the background and the objects of interest (foreground). Specifically, most of the bounding boxes or detection regions proposed by the object detection network lie in the background of the image and do not contain an object, and as such are labeled as ''negative'' or ''background''. Only very few other bounding boxes detect an object. Such background-foreground imbalance seems to hold for our considered dataset as well, as all spectrogram images contain fewer than ten signals, and most of the background is occupied merely by additive white Gaussian noise (AWGN). Regardless, as mentioned in Section II-B and also observed during our experiments, the YOLO object detector is robust to this imbalance.
On the other hand, the imbalance between foreground classes seems to have received less attention in object detection research. Below, we list a set of techniques used in image classification which we have applied to address the class imbalance problem in the wideband recognition dataset, whereby the first and third techniques correspond to the data-level category of methods, and the second technique corresponds to the algorithm-level category, which were described earlier.

1) RANDOM UNDER-SAMPLING
The most straightforward way to reduce the imbalance in our dataset is to decrease the number of instances of the dominant PAM class. To this end, we have discarded those partitioned wideband signals that contain only M -PSK or M -QAM narrowband signals, such that no instances of other classes are removed. After this under-sampling step, the number of PAM instances decreases to 3669, whereas for other classes the number of samples remains the same. Clearly, PAM is still the dominating class, along with OFDM, but the imbalance is now reduced and quantified by an imbalance ratio of ρ = 20.5.

2) CLASS-BALANCED LOSS
To instruct the YOLOv5 signal detector to pay more attention to the under-represented classes, we can modify the classification loss term of its loss function according to the class-balanced loss proposed in [60]:

L_cb(y, p) = (1 / E_{n_y}) · L(y, p).

Here, L(y, p) denotes the YOLOv5 classification error term (binary cross-entropy loss) introduced by predicting the class probabilities p for the ground truth object with label y. In the class-balanced loss L_cb(y, p), this error is weighted with a class-specific coefficient that is inversely proportional to the effective number of samples in class y, E_{n_y}, which is calculated as E_{n_y} = (1 − β^{n_y}) / (1 − β), where β ∈ [0, 1) is a hyperparameter and n_y denotes the total number of samples with label y in the dataset. The less represented a class is, the higher its weighting coefficient will be, and, consequently, the more it will contribute to the overall classification loss. This forces the network to focus on improving its classification performance for these under-represented classes.
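The per-class weighting coefficients can be computed as follows. This is a sketch of the weighting from [60]; normalizing the weights to sum to the number of classes is a common convention, not something mandated by the text:

```python
import numpy as np

def class_balanced_weights(counts, beta=0.999):
    """Per-class weights inversely proportional to the effective number of
    samples E_{n_y} = (1 - beta**n_y) / (1 - beta)."""
    counts = np.asarray(counts, dtype=float)
    effective_num = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    weights = 1.0 / effective_num
    # normalize so the weights sum to the number of classes (convention)
    return weights * len(counts) / weights.sum()
```

For example, with counts resembling the PAM-dominated distribution above, the minority class receives a strictly larger weight than the majority class, which is exactly the behavior the class-balanced classification loss relies on.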

3) STRATIFIED DATA SPLITTING
Before training, the dataset should be partitioned into two disjoint subsets: a training set and a test set.⁵ For a fair performance evaluation, it is desirable that each of these subsets exhibits the same data distribution as the overall dataset. In simple classification problems, this can be achieved by using stratification, a random sampling method that splits a given set such that in every subset, the proportion of instances for each class is approximately equal to its proportion in the complete set. In an object detection problem such as ours, applying stratified splitting is not straightforward. The data samples are images that contain various numbers of objects of different classes. Adding an image to or removing an image from one of the subsets in order to maintain the proportion of one class would affect the proportion of all the other classes present in that image. This makes the stratified splitting for object detection a coupled problem. In [61], the authors propose an iterative stratification technique for multi-label datasets, where data samples can simultaneously belong to more than one class, e.g., in an image classification problem, the image of a cat is simultaneously characterized by the labels ''cat'' and ''animal''. Practically, for a dataset with N nonexclusive categories, every sample x is annotated by an N-dimensional vector y, whose elements are given as

y_i = 1 if x belongs to the i-th class, and y_i = 0 otherwise, for i = 1, …, N.

In the wideband recognition dataset, the modulation classes are exclusive. However, a data sample, i.e., a spectrogram image, can contain one or more objects (potentially of different classes). In other words, the spectrograms are multi-object rather than multi-label data samples. Nevertheless, an analogy can be drawn between multi-label and multi-object data.
Specifically, a spectrogram image can be annotated by an N -dimensional vector L with elements given as L i = n i , where n i denotes the number of objects of the i-th class that are present in the given image (n i = 0 if the image does not contain an object of the i-th class). Using this slight modification, we have adapted the iterative stratification technique proposed in [61] to perform stratified splitting in our object detection dataset, as shown in Algorithm 1.
The main concept of this stratification technique is to first distribute the under-represented data samples appropriately such that the distribution of the least occurring classes can be maintained. The motivation behind this operation is that if rare classes are not examined first, then an undesirable distribution of them cannot be repaired subsequently [61]. On the other hand, the distribution of the most common classes can be corrected later on thanks to the availability of more data samples labeled with these classes. Consequently, in each iteration, the algorithm examines the class with the fewest (remaining) data samples. For each data sample of this class, the algorithm selects that subset whose current proportion of the considered class is furthest away from the desired one. Once the appropriate set has been selected, the current data sample is added to that set and removed from the complete dataset. Additionally, the current proportion of the classes in the selected subset is updated to account for the addition of all class instances that are present in the added data sample.
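A simplified sketch of this iterative procedure for two subsets is given below. It operates on the per-image label-count vectors L defined above; the greedy subset selection and tie-breaking are illustrative simplifications of Algorithm 1, not a faithful reproduction:

```python
import numpy as np

def stratified_split(label_counts, test_frac=0.2, seed=0):
    """Simplified iterative stratification for multi-object samples.
    label_counts: per-sample vectors, element i = number of objects of
    class i in that sample (the L vector from the text).
    Returns an array with 0 = train, 1 = test per sample."""
    rng = np.random.default_rng(seed)
    L = np.asarray(label_counts, dtype=float)
    n_samples, n_classes = L.shape
    # desired per-subset object counts for each class
    desired = L.sum(axis=0) * np.array([1 - test_frac, test_frac])[:, None]
    current = np.zeros((2, n_classes))
    assigned = np.full(n_samples, -1)
    remaining = set(range(n_samples))
    while remaining:
        # examine the class with the fewest remaining objects (rare first)
        rem_counts = L[list(remaining)].sum(axis=0)
        rem_counts[rem_counts == 0] = np.inf
        c = int(np.argmin(rem_counts))
        candidates = [s for s in list(remaining) if L[s, c] > 0]
        for s in candidates or list(remaining)[:1]:
            # pick the subset whose proportion of class c is furthest
            # below the desired one (tiny random tie-break)
            deficit = desired[:, c] - current[:, c]
            subset = int(np.argmax(deficit + rng.uniform(0, 1e-9, 2)))
            assigned[s] = subset
            current[subset] += L[s]   # update all class counts at once
            remaining.discard(s)
    return assigned
```

To respect the per-recording constraint discussed below Algorithm 1, the "samples" passed to such a routine would be whole wideband recordings (with their segments' label counts summed) rather than individual spectrograms.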
We note that during the data splitting, we force all partitioned segments of the same wideband recording into either the training set or the test set. Even though different segments of the same wideband recording contain different time-frequency content, they often show an identical layout of the narrowband signals they contain, which renders their spectrograms very similar. If some of these segments belonged to the test set and others to the training set, YOLO would yield an overly optimistic performance, as it would be tested on images that look almost identical to ones it has seen during training. To avoid this and to ensure a fair performance evaluation, in our data splitting algorithm, all segments extracted from the same wideband recording are considered instances of the same data sample and, hence, are assigned jointly either to the training or the test set. Table 1 shows the proportion of signal instances assigned to the training set and the test set for each of the classes after the application of stratified data splitting. Because the splitting problem is highly coupled and very few signals are available for some of the classes, we cannot guarantee a fixed train-test proportion for all the classes. For example, there are only two wideband recordings that contain GMSK2 signals. Based on our discussion above, all partitioned segments from one of the recordings must belong to the training set, and the partitioned segments from the other recording must belong to the test set. In such a case, it is impossible to assign only 20% of the GMSK2 signals to the test set. The same consideration extends to other under-represented classes.
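The grouping constraint amounts to performing the split at the level of recordings and then propagating each recording's assignment to all of its segments. A minimal helper (the function and argument names are ours, chosen for illustration) could look as follows:

```python
from collections import defaultdict

def group_segments_by_recording(segment_recording_ids, recording_assignment):
    """Propagate a per-recording train/test decision to all of its segments,
    so near-identical segments never straddle the train/test boundary.

    segment_recording_ids: parent recording id of each segment (by index)
    recording_assignment:  dict, recording id -> 'train' or 'test'
    Returns a dict mapping subset name to the list of segment indices.
    """
    splits = defaultdict(list)
    for seg_idx, rec_id in enumerate(segment_recording_ids):
        splits[recording_assignment[rec_id]].append(seg_idx)
    return dict(splits)
```

With this convention, the stratification algorithm only ever decides the fate of whole recordings, never of individual segments.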
D. DATA AUGMENTATION
To further increase the variability of the data and to avoid overfitting during training, a set of augmentation techniques is applied to the training spectrogram images, including photometric distortions (hue, saturation, and value adjustments), geometric distortions (image translation, scaling, and horizontal and vertical flipping), and image mixing techniques [50]. The latter category includes the mix-up technique, which blends two images together; the copy-paste technique, which copies segmented objects from one image and pastes them into another; and the mosaic augmentation, which randomly combines four images to synthetically create a new image with a complex context.
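As a concrete example of the image mixing category, mix-up can be sketched as below. This is a simplified, illustrative version (the YOLOv5 pipeline's internal implementation differs in detail); the blending ratio is drawn from a Beta distribution and the bounding boxes of both source images are kept.

```python
import numpy as np

def mixup(img_a, boxes_a, img_b, boxes_b, alpha=8.0, rng=None):
    """Mix-up for spectrogram images (sketch): blend two images and
    concatenate the bounding-box annotations of both sources."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)              # blending ratio in (0, 1)
    mixed = lam * img_a + (1.0 - lam) * img_b
    boxes = np.concatenate([boxes_a, boxes_b], axis=0)
    return mixed, boxes
```

Mosaic augmentation follows the same spirit but tiles four images into one canvas and remaps their boxes accordingly.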

IV. PERFORMANCE EVALUATION
In this section, we present the results of several experiments we have conducted to evaluate the performance of the proposed framework in terms of both wideband signal recognition and PAM classification. First, we describe the performance metrics used to assess the networks. For the first stage of the framework, we start by providing some technical details on the employed YOLOv5 network and its training procedure, before moving on to the results of wideband recognition experiments. Subsequently, for the PAM classification stage, we describe the implementation of the networks and investigate their achieved performance.

A. PERFORMANCE METRICS
Regarding wideband signal recognition, we analyze the performance of the proposed method based on three criteria: robustness to noise, sensitivity to signals of interest, and localization accuracy. Since we use an object detector to perform this task, we characterize each of these criteria with commonly used object detection metrics. Specifically, the robustness to noise is linked to the precision metric, which measures the ratio of correct predictions to the total number of predictions. Precision is calculated as

Precision = TP / (TP + FP),

where TP and FP denote the numbers of true positive and false positive predictions, respectively. A prediction qualifies as a true positive if it satisfies three conditions: the confidence score of the prediction is higher than a predefined threshold, the predicted bounding box has an IoU with the ground truth bounding box greater than a threshold (typically 0.5), and the predicted class label matches the ground truth label. If either of the two latter conditions is violated, the prediction is counted as a false positive. If there is indeed a ground truth object to be detected but the confidence score of the prediction is lower than the threshold, the prediction is discarded. Any ground truth object that is not covered by a valid prediction increments the number of false negatives (FN). The second aspect, i.e., the sensitivity of the object detector, can be characterized by the recall metric, which measures the percentage of ground truth objects detected correctly. Recall is calculated as

Recall = TP / (TP + FN).

Both metrics are of interest for characterizing a robust object detector; hence, joint metrics are usually used. For example, the precision-recall curve, whose points are precision-recall pairs obtained for varying confidence score thresholds, provides information on both metrics simultaneously.
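The matching rules above translate directly into code. The following illustrative sketch (our own simplified implementation, not the evaluation code used in the paper) computes precision and recall from a list of predictions and ground truths, applying the confidence, IoU, and label conditions in order:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(preds, gts, conf_thr=0.5, iou_thr=0.5):
    """preds: list of (box, label, score); gts: list of (box, label).
    A prediction is a TP iff its score exceeds conf_thr, it has IoU > iou_thr
    with an as-yet-unmatched ground truth, and the labels match."""
    preds = sorted((p for p in preds if p[2] > conf_thr), key=lambda p: -p[2])
    matched = [False] * len(gts)
    tp = fp = 0
    for box, label, _ in preds:
        hit = False
        for g, (gbox, glabel) in enumerate(gts):
            if not matched[g] and glabel == label and iou(box, gbox) > iou_thr:
                matched[g] = True
                hit = True
                break
        tp += hit
        fp += (not hit)
    fn = matched.count(False)                     # undetected ground truths
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For instance, one correct detection, one label mismatch, and one spurious box against two ground truths yield a precision of 1/3 and a recall of 1/2.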
For the same purpose, a scalar metric called Average Precision (AP) is used, which measures the area under the precision-recall curve, where the recall takes on values from the interval [0, 1]. Practically, in our simulations, we calculate the AP using the all-point interpolation method adopted in the Pascal VOC challenge [62], [63]. In the following, we use the mean AP (mAP), calculated as the mean value of the APs across all classes, to characterize the performance of the YOLOv5 signal detector. Finally, to assess the localization accuracy, we use the average IoU of the true positive bounding boxes with their respective ground truth bounding boxes.
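One common implementation of the all-point interpolated AP, sketched below for illustration, sorts the predictions by confidence, builds the running precision-recall curve, replaces each precision value by the maximum precision at any higher recall (the "envelope"), and integrates over the recall steps:

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """All-point interpolated AP (Pascal VOC style).

    scores: confidence score of each prediction
    is_tp:  1 if the corresponding prediction is a true positive, else 0
    num_gt: total number of ground truth objects of the class
    """
    order = np.argsort(-np.asarray(scores))
    flags = np.asarray(is_tp, dtype=float)[order]
    tp = np.cumsum(flags)
    fp = np.cumsum(1.0 - flags)
    recall = tp / num_gt
    precision = tp / (tp + fp)
    # envelope: precision at recall r -> max precision for any recall >= r
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    # integrate over the points where recall changes
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))
```

A detector with two correct predictions over two ground truths reaches an AP of 1.0, whereas one correct and one false prediction yield 0.5.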
As for the PAM classification, to evaluate the performance of the CNN and ResNet classifiers, we use the conventional accuracy metric for measuring the percentage of correctly predicted labels and the confusion matrix for visualizing the sources of errors and confusion among classes.

B. YOLO IMPLEMENTATION DETAILS
The YOLOv5 framework offers five models, namely YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, where the last letter of each name denotes the size of the respective architecture (nano, small, medium, large, extra large). Larger models have more learnable parameters and, as a result, greater learning capacity, but they come with higher complexity and longer training/inference times. Nevertheless, considering the expected learning challenges due to the lack of sufficient training data and the class imbalance in the dataset, we have mainly used the YOLOv5l model in our experiments. During training, the stochastic gradient descent (SGD) algorithm is used to optimize the multi-term loss function described in Section II-B. The learning algorithm uses an initial learning rate of 0.01 and a cyclical scheduler for adapting the learning rate throughout the training process. For other hyperparameters, such as weight decay, momentum, balancing weights of the loss terms, the IoU training threshold, and the image augmentation hyperparameters, we have adopted the default values offered by the YOLOv5 framework [30], unless stated otherwise. Before the start of the training process, YOLOv5 uses a genetic evolution algorithm to calculate prior anchor boxes adjusted to the sizes and aspect ratios of the objects (signals) in the wideband recognition dataset [30]. All networks are trained and tested on an Nvidia GeForce RTX 2080 graphics processing unit (GPU).

C. SINGLE-CLASS WIDEBAND DETECTION
First, to assess the suitability of YOLOv5 as a signal detector, we conducted experiments on single-class wideband signal detection, i.e., the network should detect the presence of narrowband signals in wideband spectrograms and localize them in the time-frequency plane, but it does not have to distinguish among signals of different modulation classes. YOLOv5l was trained on the training set (resulting from the data splitting described in Section III-C) for 100 epochs, which took approximately four hours on our hardware. Before generating the spectrogram images, AWGN was added to each wideband signal in both the training and the test set, resulting in a signal-to-noise ratio (SNR) level uniformly distributed in the interval [−5, 20] dB, where the SNR is defined as the ratio of the total power of the wideband signal to the total power of the noise over the whole spectrogram duration and bandwidth, similar to the SNR definition in [22]. To evaluate the performance of the trained network, we calculated the precision, recall, AP, and IoU metrics for the predictions of YOLOv5l on the test set.
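Adding AWGN at a prescribed SNR under this definition reduces to scaling the noise power relative to the total signal power. A minimal sketch (our own helper, written to match the SNR definition in the text):

```python
import numpy as np

def add_awgn(x, snr_db, rng=None):
    """Add complex AWGN to a wideband signal at a target SNR, where the SNR
    is the ratio of the total signal power to the total noise power over the
    full recording duration and bandwidth."""
    rng = rng or np.random.default_rng()
    sig_power = np.mean(np.abs(x) ** 2)
    noise_power = sig_power / (10.0 ** (snr_db / 10.0))
    noise = np.sqrt(noise_power / 2) * (rng.standard_normal(x.shape)
                                        + 1j * rng.standard_normal(x.shape))
    return x + noise
```

For a long enough recording, the empirical SNR of the output matches the requested value closely.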
We recall that, theoretically, the AP should measure the area under the precision-recall curve for recall levels in the interval [0, 1]. To obtain all points of such a precision-recall curve, one should consider all predictions with confidence scores varying from 0 to 1. However, in the evaluation of the single-class and the following multi-class experiments, we have fixed the minimum acceptable confidence score to 10%, i.e., we have discarded all network predictions with a confidence score lower than this threshold, even though they might be correct. We note that this reduces the reported AP, because the obtained precision-recall curve cannot span the complete recall interval. However, it is realistic to assume that, in practical applications, a certain reliability level is required to accept the network's predictions. We also emphasize that the confidence threshold is a variable inference hyperparameter that is defined by the application requirements. Finally, we have set both the IoU threshold for accepting a prediction as a true positive and the Non-maximum Suppression (NMS) IoU threshold for discarding duplicate predictions to 0.5, which are typical values in object detection. Based on these inference hyperparameters, the performance metrics are presented in Table 2, where it can be seen that YOLOv5l reaches an AP of 96.65% and succeeds in detecting and localizing signals with simultaneously high precision, recall, and localization accuracy.
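The NMS step referenced above can be summarized by the classic greedy procedure, sketched here for illustration (YOLOv5's internal implementation is vectorized but follows the same logic): keep the highest-scoring box, discard every remaining box that overlaps it beyond the IoU threshold, and repeat.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(boxes, scores, iou_thr=0.5):
    """Greedy non-maximum suppression: returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        # drop duplicates that overlap the kept box above the threshold
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_thr]
    return keep
```

Two heavily overlapping detections of the same signal thus collapse into the single higher-confidence one.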

D. MULTI-CLASS WIDEBAND RECOGNITION
In the second set of experiments, YOLOv5 was employed to jointly perform all three tasks of signal detection, localization, and classification. We used the same training and test sets and the same network version, i.e., YOLOv5l, as in the single-class experiments. The network was trained over 100 epochs, which took approximately four hours on our hardware. When evaluated on the test set, the YOLOv5l based signal detector reaches an mAP of 61.7%, a precision of 62.76%, a recall of 68%, and an average IoU of 90.25%. In an attempt to improve this performance, we account for the increased difficulty caused by the class imbalance and the high similarity between some of the classes by increasing the augmentation level and fine-tuning the probabilities of applying the different augmentation techniques to the training data. In particular, we focus on geometric distortions and image mixing techniques, as only these produce meaningful variations of our data. The selected augmentation parameter values are shown in Table 3.
The performance results of the best-performing network are summarized in Table 4, which shows the precision, recall, mean IoU, and AP for every class, as well as the average metrics over the whole test set, where the YOLOv5l object detector reaches an mAP of 67.8%. Comparing the individual class metrics, we notice that the best performance is achieved for the well-represented classes (PAM and OFDM) and for classes with distinctive time-frequency representations, such as 2-FSK, 4-FSK, and GMSK2, which all reach AP values higher than 97%. Classes that are under-represented or that share similar visual features with other classes exhibit lower performance, especially in terms of precision. For instance, due to the limited frequency resolution of the spectrogram, the appearances of GMSK1, 2-CPFSK, and 4-CPFSK are often hardly distinguishable from each other. Additionally, an inspection of the spectrograms reveals that several AM-DSB signals with narrow bandwidth appear similar to OOK signals, whereas wideband FM signals often resemble PAM signals. These similarities cause confusion among classes, as is also confirmed by the confusion matrix shown in Fig. 13, in which these error sources are reflected despite the approximately diagonal structure of the matrix. The main drawback of the network is its inability to correctly recognize any instance of the AM-SSB class, which is one of the least represented classes in the dataset. Upon inspection of the network's predictions on spectrogram images that contain AM-SSB signals, we have observed that a few of the corresponding detections are incorrectly labeled as AM-DSB, whereas the majority are discarded due to their low IoU with the corresponding ground truth bounding boxes, cf. Section IV-A. Hence, as shown in the confusion matrix in Fig. 13, the AM-SSB signals in the test set are either confused with AM-DSB or not detected by a valid prediction.
As for the task of localizing the other narrowband signals, YOLOv5l manages to ensure a high localization accuracy, which is indicated by a mean IoU of 88.5% between the true positive predictions and their respective ground truths.
A similar performance can also be achieved with a smaller network, i.e., YOLOv5m, by simply using more augmentation and a class-balanced classification loss, as described in Section III-C. Specifically, the network trained under these settings achieves an mAP of 67.44%, a precision of 73.25%, and a recall of 73.76%. However, similar to YOLOv5l, this network fails to recognize AM-SSB signals and cannot sufficiently distinguish similar classes, as indicated by the per-class performance metrics summarized in Table 5. In the following section, we present an alternative training strategy to tackle these limitations.

E. DATA MIXING AND TRANSFER LEARNING
The limitations regarding detection and classification encountered by the YOLOv5l network on the wideband recognition dataset could not be further alleviated by model/parameter fine-tuning or by increasing the network's learning capacity. (We note that YOLOv5l with a class-balanced loss exhibits training instabilities and overfitting; hence, its results are not included in the performance evaluation.) Thus, our alternative strategy for improving the performance is based on enriching the input training data through a combination of data mixing and transfer learning techniques. For this purpose, we have first synthetically created a new dataset that contains more than 4000 wideband signals modeled according to the signal model in (1). To mimic the public dataset as closely as possible, every generated wideband signal has a duration of 1 million samples and a normalized sampling rate of 1 Sps, and it contains an arbitrary number of narrowband emissions modulated with the same modulation classes as the signals in the public dataset. The parameters of the narrowband signals, such as start time, duration, center frequency, and modulation parameters, are chosen randomly from predefined typical ranges. We enforced a modulation class balance while making sure to include sufficient instances of the classes that are under-represented in the public dataset. The generated dataset is openly accessible in [64].
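The generation procedure can be sketched as follows. This is a heavily simplified illustration with hypothetical parameter ranges: the baseband content here is placeholder noise, whereas the actual generator modulates symbol streams with the dataset's analog and digital schemes.

```python
import numpy as np

def synth_wideband(num_emissions, total_len=1_000_000, rng=None):
    """Sketch of synthetic wideband generation: sum of narrowband emissions
    with random start time, duration, and center frequency, at a normalized
    sampling rate of 1 Sps. Parameter ranges are illustrative."""
    rng = rng or np.random.default_rng()
    x = np.zeros(total_len, dtype=complex)
    annotations = []
    for _ in range(num_emissions):
        dur = int(rng.uniform(0.05, 0.5) * total_len)
        start = int(rng.integers(0, total_len - dur))
        fc = rng.uniform(-0.45, 0.45)          # normalized center frequency
        # placeholder baseband content; a real generator would modulate
        # symbols with one of the dataset's modulation schemes here
        baseband = rng.standard_normal(dur) + 1j * rng.standard_normal(dur)
        carrier = np.exp(1j * 2 * np.pi * fc * np.arange(dur))
        x[start:start + dur] += baseband * carrier
        annotations.append((start, dur, fc))
    return x, annotations
```

The returned annotations (start, duration, center frequency) double as the ground truth bounding boxes for training the detector.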
In the first step, a YOLOv5l network with default hyperparameters was trained on the newly generated data, where it reached an mAP of 94.2%. Due to the similarity between the two datasets, we expected the features learned from the new data to be transferable to the public wideband signals, which motivated us to perform transfer learning. Hence, in the second training step, we mixed the two datasets and retrained on the resulting dataset a YOLOv5l network that uses the weights learned on our new data as a starting point. Throughout the training, YOLO fine-tunes the weights to adapt them to the public wideband signals, while still learning from the large amount of available data in the new dataset. In this step, we use the same data augmentation parameters as those defined in Table 3 and the default (not balanced) classification loss. The network was trained over 150 epochs, which took 15.5 hours due to the larger training dataset.
The trained network is tested only on the test data from the public dataset, and the resulting performance metrics are shown in Table 6 for a confidence threshold fixed to 10%. YOLOv5l with data mixing and transfer learning reaches an mAP of 76.3%, an improvement of 8.5 percentage points over the previous network trained on the smaller public dataset. Furthermore, the new network simultaneously improves the average precision and recall, which reach values of 77.1% and 81.8%, respectively, while maintaining a high localization accuracy, with a mean IoU of 86%. For a confidence threshold of 5%, the new network reaches an even higher mAP of 81.22%. In this case, the mean IoU has a value of 86.05% and the recall improves to 88.48%, whereas the precision decreases to 69.24% due to the relaxed confidence threshold, emphasizing the precision-recall trade-off discussed above. The improvement of the performance metrics is visually demonstrated in Fig. 14, which presents the interpolated precision-recall curves of the YOLOv5l networks trained on the original dataset and on the mixed dataset, respectively.
A one-by-one comparison of the APs per class from Table 4 and Table 6 indicates that the new training strategy has enhanced the performance for almost all classes. In particular, the addition of new AM-SSB instances to the training dataset and transfer learning help to mitigate the previous inability of YOLOv5 to detect AM-SSB signals, which are now recognized with a high AP of 95.73%. Similarly, more data and transferable features prove helpful in improving the recognition of 2-CPFSK, 4-CPFSK, and OOK, which were some of the most challenging classes in the previous experiments.
GMSK1 and FM, on the other hand, continue to pose a challenge to the YOLOv5l network. The confusion matrix presented in Fig. 15 gives more insight into the sources of errors related to these two classes. Specifically, GMSK1 continues to be confused with 2-CPFSK, demonstrating that the challenge arising from their highly similar spectrogram representations, combined with the limited frequency resolution, cannot be solved by more data alone. As for the FM class, most of its instances are confused with PAM signals. We note that the dataset we generated only contains narrowband FM signals that are modulated by audio source signals with varying modulation indices. Due to a potential difference in the selected simulation parameters, the FM signals in our dataset might not be sufficiently representative of the FM signals from the public dataset, which is confirmed by a manual inspection of the corresponding spectrograms of these signals, examples of which are shown in Fig. 16 and Fig. 17, respectively. We suspect that this mismatch between the data from the two datasets has impacted the transferability of the learned features, resulting in a degradation of the FM recognition performance. Nevertheless, except for the GMSK1 and FM related errors, the confusion matrix exhibits an essentially diagonal structure.
To assess the robustness to noise of YOLOv5l trained on the mixed dataset with transfer learning, Fig. 18 presents the variation of the performance metrics with the SNR value of the signals in the test set. We recall that the SNR measures the ratio of a wideband signal's total power to the power of the added noise over the whole spectrogram duration and bandwidth. From the behavior of the curves, we observe that the detection, localization, and classification performance is not severely impacted by the added noise, even in regimes with an SNR as low as −10 dB. The YOLOv5l detector recognizes wideband signals with consistent precision and recall, starting from over 65% at SNR = −10 dB and approaching 80% as the SNR increases.

F. PAM MODULATION CLASSIFICATION
In this section, we focus on the performance evaluation of PAM modulation classifiers. The results on multi-class wideband recognition presented in Section IV-E showed that the YOLOv5l network can detect PAM signals with high recall (96.35%), which ensures that almost all true PAM instances will be given as an input to the PAM classifier, and with high IoU (93.81%), which ensures a high quality extraction of the narrowband PAM signal from the wideband recording to which it belongs.
The training, validation, and test data for training and evaluating the PAM classifier networks are generated according to the following steps:
1) As described in Section III-C, the wideband signals that were discarded during the PAM under-sampling procedure contain only M-PSK and M-QAM narrowband emissions. Even though these signals were not part of YOLOv5l's training data, they can still be exploited for training the PAM classifier networks.
To this end, the YOLOv5l signal detector is applied to spectrogram images of the discarded wideband signals. The resulting YOLO multi-class predictions, i.e., predicted labels and estimated localization coordinates, are subsequently used to extract the detected PAM signals from their respective wideband recordings. The extracted signals are downconverted to ECB and further partitioned into frames with a fixed length of 1024 samples. Finally, the available ground-truth annotations are used to map the detected PAM signals to true ones. In case of a correct detection, the generated frames are labeled according to their specific modulation scheme, namely 2-PSK, 4-PSK, 8-PSK, 16-QAM, 64-QAM, or 256-QAM.
2) In the second step, to further increase the dataset size, PAM signals are extracted from the wideband recordings included in YOLOv5l's training set. Since YOLOv5l has already ''seen'' the spectrograms of these wideband recordings during its training, the accuracy of its predictions on them would naturally be higher than the expected accuracy of predictions on unseen data. Consequently, PAM signals extracted from the training wideband signals based on YOLOv5l's localization predictions (as done in step 1) would not be sufficiently representative of PAM signals extracted at test time. Hence, instead of predictions from YOLOv5l, in this step, we have used the available ground truth time/frequency coordinates, to which we have added a random offset from the interval [−5%, 5%] of the true value to model a realistic, inaccurate extraction. (Upon analyzing YOLOv5l's predictions on the test set with respect to the localization accuracy, we observed that the deviation of the estimated time/frequency coordinates from the respective ground truth ones is concentrated in the interval [−5%, 5%] of the true value.) In addition to PAM signals, the same procedure is followed to extract narrowband emissions modulated with other modulation schemes in order to construct the ''Others'' class. All extracted signals are finally downconverted to the ECB domain and partitioned into fixed-length frames.
The resulting dataset, hereinafter referred to as the PAM dataset, contains 30000 training frames per class for 2-PSK, 4-PSK, and 8-PSK. For 16-QAM, 64-QAM, and 256-QAM, the dataset contains 50000 training frames per class to account for the higher difficulty of distinguishing these classes. Furthermore, 50000 frames are generated for the class ''Others'', since this class should represent various non-PAM modulation classes. Finally, 5000 frames per modulation scheme are assigned to the validation set, and the remaining generated frames (around 6000 per class) are assigned to the test set. We note that for every class, the frames corresponding to the training, validation, and test sets are extracted from different wideband recordings. As for the wideband recognition experiments, before the narrowband extraction step, AWGN is added to the wideband signals, corresponding to an SNR level uniformly distributed in the interval [−5, 20] dB.
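The extraction and framing common to both steps can be sketched as follows. This is a simplified illustration: it shifts the detected center frequency to DC and cuts fixed-length frames, while the low-pass filtering and decimation of a full ECB downconversion chain are omitted for brevity.

```python
import numpy as np

def extract_frames(wideband, start, dur, fc, frame_len=1024):
    """Extract a detected narrowband signal from a wideband recording,
    downconvert it toward the equivalent complex baseband (ECB), and cut
    it into fixed-length frames. Filtering/decimation omitted."""
    seg = wideband[start:start + dur]
    # mix the estimated center frequency down to DC
    ecb = seg * np.exp(-1j * 2 * np.pi * fc * np.arange(dur))
    n_frames = len(ecb) // frame_len          # drop the incomplete tail
    return ecb[:n_frames * frame_len].reshape(n_frames, frame_len)
```

A pure tone at the detected center frequency, for example, collapses to a constant after the frequency shift, confirming the downconversion.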
The CNN presented in Fig. 4 and the ResNet presented in Fig. 5 are trained using an SGD with momentum optimization algorithm that minimizes a categorical cross-entropy loss function. We fine-tuned the learning hyperparameters using the Asynchronous Successive Halving Algorithm [65], which exploits parallel computing and early stopping to evaluate a large number of hyperparameter configurations. Based on the fine-tuning results, the training process employs an initial learning rate of 0.0015, batch size of 96, momentum of 0.83, and weight decay of 0.009.
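For reference, a single SGD-with-momentum parameter update using the tuned hyperparameters above can be written out explicitly. This is an illustrative sketch (not the framework's optimizer code) in which the L2 weight decay is folded into the gradient, as in classic SGD:

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity,
                      lr=0.0015, momentum=0.83, weight_decay=0.009):
    """One SGD-with-momentum update with the hyperparameters from the text.

    w:        parameter vector
    grad:     loss gradient w.r.t. w
    velocity: running momentum buffer (same shape as w)
    """
    grad = grad + weight_decay * w          # L2 regularization term
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity
```

The first step with zero velocity simply moves the weights by the learning rate times the regularized gradient; subsequent steps accumulate momentum.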
The trained CNN and ResNet PAM classifiers achieve accuracies of 86.49% and 89.07%, respectively, on the test set generated as described above. To investigate the variation of the classification accuracy with the SNR, the PAM classifier networks trained on the general PAM dataset are evaluated on test sets generated with specific SNR values from the interval [−10, 30] dB. The results are shown in Fig. 19, where it can be observed that the CNN closely approaches, and the ResNet surpasses, an accuracy of 80% already at SNR = 0 dB. While both networks exhibit similar classification capabilities in the low SNR regime, the ResNet classifier outperforms the CNN in the moderate and high SNR regimes, reaching an accuracy of 96% at SNR = 20 dB. To better understand these results, Fig. 20 and Fig. 21 show the confusion matrices of the classifications performed on the test set of the PAM dataset for the CNN and the ResNet, respectively. Both matrices are characterized by a diagonal structure, with the M-PSK and 16-QAM classes having the highest classification accuracies. However, the presented matrices show some confusion between the higher-order QAM schemes, i.e., 64-QAM and 256-QAM, due to their similar modulation constellations, a problem that is also observed in related works [10], [11], [31]. This confusion is less pronounced for the ResNet classifier. A similar classification performance can be seen in Fig. 22 and Fig. 23, which show the confusion matrices of the CNN and the ResNet when they are applied to YOLOv5l's predictions (described in Section IV-E) on the test set described in Section III-C3. Here, the CNN and ResNet classifiers reach overall classification accuracies of 89.84% and 92.7%, respectively. The high performance on this test set demonstrates the generalizability of the PAM classifier networks and their robustness to the errors and artifacts introduced by imperfect YOLOv5 detection and localization.

V. CONCLUSION
In this work, the task of wideband signal recognition is addressed using a learning based approach that combines the operations of two neural networks: a YOLOv5 based object detector for detecting, localizing, and classifying narrowband signals in wideband spectrograms, and a CNN based model for the automatic classification of modulation schemes that cannot be resolved by YOLOv5. We thoroughly evaluated the proposed framework on a public wideband recognition dataset, which was first analyzed and curated through several data processing techniques. The evaluation results on this dataset highlighted the importance of high-quality training data for a successful application of learning based methods. When trained on the small public dataset, YOLOv5 achieved a moderate mAP of 67.8% on a general test set but demonstrated limitations in detecting and classifying difficult, under-represented classes. To further improve the recognition performance, we proposed complementing the dataset with more training data and using a transfer learning based training strategy, which resulted in an increased mAP of 76.3% with simultaneously high precision, recall, and localization accuracy. The presented two-step detection framework can support a wide range of signals modulated with various analog and digital modulation schemes. Furthermore, the same detection principle can be extended to more general scenarios. As an example, the YOLOv5 based signal detector can be employed to detect and localize fully-overlapped emissions (e.g., non-orthogonal multiple access (NOMA) signals), which can subsequently be classified by a dedicated modulation classification network [66], [67].
The simulation results have indicated that there is room for further improvement of the proposed framework's performance, especially with regard to distinguishing signals with similar time-frequency characteristics in the spectrogram. Thus, additional research efforts should be devoted to addressing these limitations and enhancing the performance, either by optimizing/redesigning the signal detection networks or exploiting alternative, more informative signal representations, such as representations based on higher-order statistics (HOS) [68]. Furthermore, despite the lightweight structure of the two employed networks, a two-step framework is difficult to optimize and introduces complexity due to the intermediate steps connecting the modules. Hence, a unified framework would be an interesting direction for future work, too. To this end, a prospective strategy would involve the adaptation of a complex-valued object detection network [69] that accepts as input complex-valued TFRs of the wideband signals, e.g., their STFT coefficients, thus simultaneously leveraging both the magnitude and the phase information of the signal's transformation to effectively perform wideband signal recognition for a wide range of modulation classes, including PAM.