Self-supervised Learning for Clustering of Wireless Spectrum Activity

In recent years, much work has been done on processing of wireless spectrum data involving machine learning techniques in domain-related problems for cognitive radio networks, such as anomaly detection, modulation classification, technology classification and device fingerprinting. Most of the solutions are based on labeled data, created in a controlled manner and processed with supervised learning approaches. However, spectrum data measured in real-world environments is highly nondeterministic, making its labeling a laborious and expensive process that requires domain expertise, which is one of the main drawbacks of using supervised learning approaches in this domain. In this paper, we investigate the use of self-supervised learning (SSL) for exploring spectrum activities in real-world unlabeled data. In particular, we compare the performance of two SSL models, one based on a reference DeepCluster architecture and one adapted for spectrum activity identification and clustering, against a baseline model based on the K-means clustering algorithm. We show that the SSL models achieve superior performance regarding the quality of extracted features and clustering performance. With the SSL models, we achieve a reduction of the feature vector size by two orders of magnitude, while improving the performance by a factor of 2 to 2.5 across the evaluation metrics, supported by visual assessment. Additionally, we show that adapting the reference SSL architecture to the domain data reduces model complexity by one order of magnitude, while preserving or even improving the clustering performance.


I. INTRODUCTION
The number and types of wireless devices connected to the Internet are rapidly increasing with today's affordable personal mobile and Internet of Things (IoT) devices, requiring wireless networks to handle high traffic. As a reference, the connected-device density requirement for the fifth-generation (5G) network is one million devices per square kilometer. The existence of such a number of devices requires complex wireless resource management. Over time, several new approaches to wireless resource sharing, including dynamic spectrum access [1] and licensed shared access [2], have been proposed. However, additional technological components, such as spectrum usage databases [3] and radio environment maps [4], had to be developed before coming closer to such sophisticated and dynamic spectrum usage approaches. To correctly inform on spectrum usage, additional knowledge of the devices transmitting within the range of a wireless network is critical for future smart usage of the spectrum. In this respect, some of the recent efforts have focused on detecting the modulation used [5], the technology used [6], anomalous activities [7], etc.
As also discussed in [8] and [9], significant effort is still being invested in the field to develop accurate and scalable deep learning algorithms able to automatically manage spectrum resource usage. With respect to the learning approach, these techniques can be divided into two groups: 1) supervised, which requires labels to be present for the training data, and 2) unsupervised, which does not assume any such labels. As most applications in wireless spectrum management need to be aware of facts (i.e. type of technology, transmission parameters), developing a deep-learning based model to support such applications typically requires labelled data that is expensive to acquire, as it requires complex wireless and computing equipment [10], [11] or intense labelling efforts that do not always lead to high-quality labels [12]. Semi-supervised and active learning are alternative techniques that have the advantage of using a relatively small amount of labeled samples to achieve performance comparable to the regular supervised approach.
Given the advent of large datasets that are expensive or practically impossible to label, self-supervised learning [13], another intermediate learning approach that is particularly suitable for reducing the data labelling cost and leveraging the unlabelled data pool, is becoming an important alternative. Self-supervised learning is a representation learning method in which a supervised task is created out of the unlabelled data. Using a self-supervised approach, it is possible to create groups of very similar samples (i.e. clusters) from a large, unlabelled dataset and then label each cluster. By labelling the learnt clusters, it is possible to then use the model as a classifier by assigning new, unseen examples to those clusters and thereby label them as one would do in a typical classification task.
Developing an easy, automated and technology-agnostic way to explore spectrum activity and group similar activities, eventually enabling automatic rather than manual transmission identification and cataloguing as currently done, for example, in the Signal Identification Guide, is still an open research topic.
In this paper, we investigate the suitability of self-supervised learning to support automatic spectrum exploration of the unlicensed 868 MHz Short Range Device (SRD) band in a medium-sized European city, using 15 days of spectrum sweeps collected 5 times per second. The spectrum activity data was collected with a Software Defined Radio (SDR) in the form of Power Spectral Density (PSD) measurements that can be visualized as spectrograms. Leveraging this data, we propose a new self-supervised learning approach, based on machine vision and inspired by DeepCluster [14], in which segments of spectrograms containing signal activity, such as the ones depicted in Figure 1, are used to train the deep learning (DL) self-supervised network and enable the discovery of the types of transmissions present over the respective period of time. We experimentally prove that such an architecture is suitable for spectrogram analysis by learning spectral features and clustering spectrograms based on their content. The performance of the proposed solution is compared to a baseline approach using Principal Component Analysis (PCA) for feature extraction and K-means as the clustering algorithm.
The main contributions of this work can be summarised as follows.

The rest of the paper is structured as follows. Section II analyzes the related work, Section III introduces the proposed self-supervised architecture, Section IV elaborates on the experimental methodology and Section V presents the experimental results. Finally, Section VI concludes the paper.

II. RELATED WORK
In recent years, as in many other research areas, the development of algorithms for processing spectral data has focused on the use of deep learning models. At the most general level, based on the learning approach, they can be divided into three main groups, described in this section.

A. Unsupervised
As an example of unsupervised approach, an architecture that uses Power Spectral Density (PSD) data vectors as input for anomaly detection is presented in [15]. The basic idea is to learn the distribution of the source data and detect when the input data vector deviates from it. A complex encoder-decoder architecture is used, which includes different types of neural networks for feature extraction, reconstruction and anomaly discrimination. Feature extraction is performed as an unsupervised learning procedure, while anomaly discrimination and classification is performed in a semi-supervised manner using only a subset of the labeled data. An important aspect in this approach is the definition of anomalies as rare events, which is not always the case. An improved and generalized architecture is presented in [16], where the problem of defining anomalies as rare events is addressed.
A pipeline for automatic detection of wireless transmissions is proposed in [12], along with software for manual labeling of transmissions in a spectrogram. A generic framework is designed that includes infrastructure for data acquisition, algorithms for data processing and evaluation. Two different groups of automatic detection algorithms are used: rule-based and computer vision-based. An evaluation of the manual and automatic detection and classification processes is presented, showing the influence of the labeled data on the evaluation.

B. Supervised
In [17], a solution to a device classification task using real-world transmissions is proposed, based on a supervised learning approach. A new neural network architecture based on dilated causal convolutional layers is designed and used as a classification model. The design is motivated by an existing audio signal processing architecture. The dataset contains I/Q data of transmissions from nearly 10,000 devices using two different protocols. The authors conclude that the proposed neural network is suitable for classifying a large number of devices and that further analysis should focus on adding more protocols.
In [5], the authors present a deep learning model for automatic classification of signal modulations. The models are evaluated on synthetic and real-world labeled datasets. A special type of recurrent neural network (RNN), Long Short-Term Memory (LSTM), is used as the classification model. The results are presented using two different types of data: phase and amplitude in the first case and FFT-averaged amplitude in the second. They show that for the model used, phase and amplitude provide better signal classification accuracy. The FFT-averaged amplitude data are one-dimensional arrays of variable length.
In [18], an application of CNN supervised learning for device identification based on raw In-Phase and Quadrature (I/Q) data is proposed. Five different devices of the same type are used for data collection. Device identification is based on the CNN's ability to learn various device-specific impairments in the raw signals. SVM and logistic regression are used as reference models for performance comparison. It is found that the proposed CNN model significantly outperforms the baseline algorithms for the posed task.
In [6], the authors design several different supervised machine learning models and evaluate them on real data. The main idea is to compare the performance and generalization ability of models that use manually extracted expert features with models that use raw data. The task is technology classification and three different technologies are considered. Three different types of raw data are used: Received Signal Strength Indicator (RSSI), I/Q and spectrogram images. They prove that using CNN on I/Q raw data or spectrogram images outperforms all other models in terms of accuracy, generalization ability to unseen datasets from different environments, and robustness at different noise levels.
In [19], the authors present a comparative evaluation of two CNN architectures, ResNet and VGG, using XGBoost as a baseline model for signal classification with respect to different parameters and dataset configurations. Both CNNs have a comparable number of trainable parameters. The base model is trained on higher-order statistics extracted from the training samples, while the input to the CNN models are the normalized raw I/Q samples. Simulated datasets and real signal datasets are used to evaluate the models under different conditions, both for low- and high-order modulations. ResNet is shown to outperform the other two models. Transfer learning is also evaluated, using the simulated dataset for training and the real signal dataset for tuning and testing. It is found that the accuracy of such a model decreases by only 7%. In general, it can be concluded that the ResNet architecture is suitable for future use in RF signal classification applications.
Another supervised approach is presented in [20]. The authors propose the use of object recognition and deep learning models on spectrogram images to classify and localise transmissions. The starting point is the You Only Look Once (YOLO) model pretrained on the ImageNet dataset. Only the last layer before the model output is fine-tuned using simulated data from two different types of transmissions with corresponding bounding boxes. It is shown that the proposed model performs well in classifying interfering signals on simulated data with different signal-to-noise ratios. The accuracy of the classification is also evaluated on real data. It is shown that the model performs slightly worse than the CNN-based models trained on the same data, but provides additional information about the position of the transmission events in the spectrum.

C. Self-supervised, Semi-supervised and Active learning
An anomaly detection algorithm is also proposed in [7]. Again, anomaly detection is based on modelling the normal behaviour of spectrum activities. The novelty here is the use of a deep predictive coding network, originally developed for video prediction applications, for spectrum data analysis. The advantage of this approach is the use of a much smaller dataset for training, since only data of the normal behaviour of the network is needed, without all the different types of anomalies that can occur. Two different types of input data are used for two different models: spectrograms and spectral correlation functions (SCF). Both types are used in the form of sequential 2D images. The detector algorithms are trained and tested on simulated data. Four different types of anomalies are used for evaluation and a high detection rate is obtained. It is found that the SCF-based model has better performance. The architecture of the models limits their use to anomaly detection, without classification.
In [21], a combination of a CNN as a feature extractor and a clustering algorithm is proposed to address the transmitter classification problem. The unique feature of this approach is that it does not assume that the transmitters from the training set are also present in the test dataset. The feature extractor is trained on labeled data to learn the specific features of all transmitters present in the dataset. Raw I/Q data is used for the training. Clustering is then performed on the features extracted using the trained network. The idea is to use the knowledge of existing transmitters to detect new, unknown transmitters. An analysis of the performance for different numbers of unknown transmitters is also presented, where the performance of the model decreases as the number of unknown devices in the test set increases.
The application of transfer learning for RF signals is proposed in [22]. Data with complex-valued I/Q signal representations containing real and simulated signals are used for training and evaluation. Three different types of transfer learning are explored, each using a modification of an existing neural network architecture. The unsupervised transfer learning model is trained with simulated data and tested with real, large-scale data. This is of particular interest because labeled real-world spectral data is not widely available.
III. PROPOSED MODEL

In this section, we propose a model that relies on self-supervised learning and is able to learn to automatically cluster wireless spectral activities identified from spectrogram segments. The spectrogram segments contain transmissions of different shapes and intensity levels. The model assumes no prior knowledge about the types and number of different spectral activities present in the dataset. Additionally, no ground truth or labeling is available for the dataset used. We also elaborate on a baseline model used for benchmarking the one we propose.

A. Self-supervised model
The proposed self-supervised model is developed by using spectrogram segments to train a self-supervised deep-learning based system. Through the training procedure, the system is tuned to learn the underlying distribution of the training data, thus retaining a model of that distribution.
The proposed architecture of the deep-learning based system is depicted in Figure 2 and was inspired by and adapted from the existing self-supervised DeepCluster model, originally proposed in [14] for RGB image feature learning. As can be seen from the figure, the system design contains two branches: (1) a so-called unsupervised branch depicted with red arrows and (2) a supervised branch depicted with blue arrows. Both branches contain a CNN-based feature extractor realized by ResNet18. In the unsupervised branch, it is followed by a dimensionality reduction (PCA) and normalization (L2) preprocessing block, concluding with the K-means clustering technique. In the supervised branch, the feature extractor is followed by a fully connected layer acting as a classifier. Note that even if we refer to the second branch as supervised, it does not rely on actual labelled data. It relies on pseudo-labels that are continuously generated by K-means and refined through feedback, i.e. backpropagation. By working in tandem, the two branches realize self-supervision. Compared to DeepCluster, we propose the following adaptations:
• Rather than using VGG [19] as the deep learning architecture, we select ResNet18, as it was shown to perform better in supervised classification tasks for spectrum data [23].
• The ResNet is used in its original form with customized input and output layers, according to the shape of the images and the number of classes. In our case, the input spectrogram images have only one channel, while the ResNet was originally designed for 3-channel RGB images.
• PCA feature space reduction is added before clustering according to the methodology described in detail in Section IV-B. In DeepCluster, the full PCA-reduced feature space is used, instead of determining the number of components based on their cumulative sum of explained variance ratio (EVR).
The pseudo-code corresponding to the workflow of the proposed self-supervised system is given in Algorithm 1. For clarity and ease of mapping, the blocks of the system in Figure 2 are marked with the corresponding line numbers from the pseudo-code in Algorithm 1.
1) Workflow of the unsupervised branch: In the initial phase, the ResNet is initialized randomly and the input data, which consists of image-like spectral segments as described in Section IV-A, is unlabeled. The clustering algorithm (unsupervised branch, marked with red arrows in Figure 2) is used to provide labels (denoted by L in the figure) at the beginning of each training epoch. Clustering is performed on the features extracted by the ResNet and processed by PCA/L2. Initialization of the cluster centers is random. The features are extracted using the convolutional layers of the ResNet and the average pooling layer, which has a size of 512 x 1 (lines 2-3 in Algorithm 1). Thus, a descriptor in the form of a 512 x 1 vector is obtained for each image. These descriptors are then PCA-reduced, L2-normalized and finally clustered (lines 4-5 in Algorithm 1).
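As a minimal sketch (not the implementation used in this work), the unsupervised branch can be emulated in NumPy by replacing the ResNet18 descriptors with placeholder 512-dimensional vectors and running the PCA, L2 and K-means steps on them; the 20 retained components anticipate the choice justified later in Section V-A:

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X onto its top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def l2_normalize(X):
    """Scale each row (descriptor) to unit Euclidean norm."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's K-means; returns cluster assignments (pseudo-labels)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # skip empty clusters
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Stand-in for the ResNet18 descriptors: one 512-dim vector per segment.
rng = np.random.default_rng(42)
features = rng.normal(size=(200, 512))
reduced = l2_normalize(pca_reduce(features, 20))
pseudo_labels = kmeans(reduced, k=5)
```

In the actual system, `features` would be the output of the ResNet18 average-pooling layer, and `pseudo_labels` would feed the supervised branch as training targets for the next epoch.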
2) Workflow of the supervised branch: In this phase, the ResNet is used in training mode, as a supervised classification model. The flow of this pipeline activity is indicated by the blue arrows in Figure 2. Using the cluster assignments provided by the previous step as labels (L) for the input images, the ResNet is trained for one epoch. This completes one iteration of the entire pipeline work cycle.
The procedure stops when the predefined number of iterations (training epochs) is reached. In our experiments, we used 200 training epochs. This number was determined empirically by observing the convergence of the loss function. The flow of the architecture is given in pseudo-code, matching the visualization in Figure 2.

B. Baseline model
The baseline model is developed using the same training data as the self-supervised model, with a K-means clustering algorithm applied to the PCA-reduced, L2-normalized inputs, as depicted in Figure 3. The system consists of a flattening block, PCA dimensionality reduction and L2 normalization for input data preprocessing, followed by the K-means clustering block.

[Algorithm 1, lines 9-12: L′ ← EstimatedLabels; CrossEntropyLoss(L, L′); Backpropagation // update ResNet weights; end while]

The flattening reorders the elements of the input matrix into a single row for each data sample. On the vectors provided by the flattening, the same sequence of operations is applied as for the vectors provided by the self-supervised system: PCA is applied to the flattened data and, as a result, reduced feature vectors representing the input data are obtained. These vectors are then L2-normalized and finally used as input to the K-means clustering algorithm.

IV. METHODOLOGY
We define the following methodological approach to developing and benchmarking the proposed self-supervised model in view of enabling unsupervised spectrum activity exploration. Firstly, in Section IV-A, we describe the raw data used to train the proposed self-supervised and baseline models and its preparation. We then elaborate on the methodology for developing and evaluating high quality features, including feature extraction and processing in Section IV-B. Finally, we elaborate on developing and evaluating the models in Sections IV-C and IV-D.

A. Raw training data preparation
The dataset used for the analysis consists of fifteen days of spectrum measurements acquired at a sampling rate of 5 power spectral density measurements per second using 1024 FFT bins in the 868 MHz license-free (shared spectrum) SRD band with a 192 kHz bandwidth. The data was acquired in the LOG-a-TEC testbed (http://log-a-tec.eu/). Details of the acquisition process and a subset of the data can be found in [11]. The acquired data has a matrix form of 1024 x N, where N is the number of measurements over time. Windowing the data in time with a window size W would result in raw training images of dimension 1024 x W, which can be computationally intractable for computing platforms. Therefore, in addition to windowing in time, we also window in frequency, resulting in so-called image segmentation, where the image is the spectrogram.
The segmentation of the complete data matrix into non-overlapping square images along time and frequency (FFT bins) is realized for a window size W = 128. An example of such segmentation containing 8 square images is shown in Figure 1, corresponding to an image resolution of 25.6 seconds (128 measurements taken at 5 measurements per second) by 24 kHz. The window size is chosen to be large enough to contain any single type of activity and small enough to avoid having too many activities in a single image, while also keeping the computational cost in mind. Dividing the entire dataset of 15 days using W = 128 and no overlap produces 423,904 images of 128x128 pixels. Additionally, the pixel values are scaled to [0,1].
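The windowing described above can be sketched as follows; per-tile min-max scaling is assumed here, since the scaling granularity is not specified:

```python
import numpy as np

def segment_spectrogram(psd, w=128):
    """Cut a (freq_bins x time) PSD matrix into non-overlapping w x w tiles,
    each scaled to [0, 1] (per-tile scaling assumed here)."""
    n_freq, n_time = psd.shape
    tiles = []
    for f in range(0, n_freq - w + 1, w):        # window in frequency
        for t in range(0, n_time - w + 1, w):    # window in time
            tile = psd[f:f + w, t:t + w].astype(np.float64)
            lo, hi = tile.min(), tile.max()
            tiles.append((tile - lo) / (hi - lo) if hi > lo
                         else np.zeros_like(tile))
    return np.stack(tiles)

# e.g. 1024 FFT bins x 256 sweeps -> 8 x 2 = 16 tiles of 128 x 128
rng = np.random.default_rng(0)
segments = segment_spectrogram(rng.normal(size=(1024, 256)))
```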

B. Feature development and evaluation methodology
As discussed in Section III-A, the segmented and normalized raw data is passed through a feature extractor (ResNet18) and PCA/L2 preprocessing before being clustered. Together, these blocks learn to engineer features. The better the quality of the engineered features that train the K-means clustering algorithm, the better the resulting clustering model. Therefore, in the process of developing the proposed self-supervised model, the feature development process needs to be tuned and evaluated.
While the feature extractor is automatically trained, the PCA performing dimensionality reduction for both the proposed self-supervised and the baseline approach needs to be tuned. We first analyze how many PCA components are needed to keep the most relevant information while discarding as much noise as possible. To evaluate the quality of the features after the PCA for both models, we use the EVR [24] as an evaluation metric. The EVR is a measure of how much of the variation in the feature space is assigned to each of the principal components after performing the PCA. Next, we select the PCA representation with the best EVR for both the proposed self-supervised and baseline models and analyze the quality of the representation for clustering purposes using the Hopkins score and the Visual Assessment of Tendency (VAT). Calculations for both metrics are made on a random subset of 1000 samples. The Hopkins score [25] is a metric that estimates the probability that a randomly sampled subset of the data comes from a uniform distribution. A Hopkins score close to 1 indicates data whose distribution is statistically close to the uniform distribution and is not likely to contain clusters. The Hopkins score is calculated by comparing the statistics of the subset of the real data with the statistics of synthetically generated data of the same size with uniform distribution, according to:

H = (Σ_{i=1}^{D} x_i) / (Σ_{i=1}^{D} x_i + Σ_{i=1}^{D} y_i)

where D is the number of samples used for the calculation, and y_i and x_i are the nearest-neighbour distances for the synthetic and for the real data, respectively. The resulting values are between 0 and 1, with 0 meaning the data is not uniformly distributed. However, a non-uniform distribution does not guarantee the existence of clusters in the data: normally distributed data, for example, will show a low Hopkins score and still contain no meaningful clusters.
To prevent possible false conclusions derived based on this metric alone, we support the evaluation with the VAT metric.
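A NumPy sketch of the Hopkins score in the convention used here (values near 0 indicate clustered data, values near 1 data unlikely to contain clusters) might look as follows; the sample size `m` and the seed are illustrative choices:

```python
import numpy as np

def hopkins(X, m=50, seed=0):
    """Hopkins score: H = sum(x_i) / (sum(x_i) + sum(y_i)), where x_i are
    nearest-neighbour distances within the real data and y_i are
    nearest-neighbour distances from uniform synthetic points to the
    real data. H near 0 suggests clustered (non-uniform) data."""
    rng = np.random.default_rng(seed)
    n = len(X)
    idx = rng.choice(n, size=min(m, n), replace=False)
    sample = X[idx]
    # x_i: distance from each sampled real point to its nearest other real point
    d_real = np.linalg.norm(X[None, :, :] - sample[:, None, :], axis=-1)
    d_real[np.arange(len(sample)), idx] = np.inf  # exclude self-distance
    x = d_real.min(axis=1)
    # y_i: distance from uniform synthetic points (in the data's bounding box)
    synth = rng.uniform(X.min(axis=0), X.max(axis=0), size=sample.shape)
    y = np.linalg.norm(X[None, :, :] - synth[:, None, :], axis=-1).min(axis=1)
    return float(x.sum() / (x.sum() + y.sum()))
```

Two tight, well-separated blobs yield a score near 0, while uniformly scattered points yield a score near 0.5, matching the interpretation above.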
VAT [26] is an algorithm that produces a matrix visualisation of the dissimilarity of samples based on their pairwise Euclidean distances. The value of each element of the visualization matrix is proportional to the pairwise dissimilarity between each sample and all of the other samples. Thus, the main diagonal of the matrix contains zero values, representing the dissimilarity of each sample to itself. The samples are ordered in such a way that groups that are closely located in the feature space, according to the distance metric, appear as dark squares along the diagonal of the matrix. Besides the assessment of clustering tendency, a hierarchy of clusters can also be detected as nested squares with different shades. Implementation-wise, an improved version of VAT (iVAT) is used, which provides better visualization than the standard one.
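The core of VAT, the reordering of the dissimilarity matrix, can be sketched with a Prim-style traversal (a simplified version; the iVAT refinement mentioned above additionally applies a path-based distance transform):

```python
import numpy as np

def vat_order(D):
    """VAT reordering of a pairwise dissimilarity matrix D: points joined by
    small distances become adjacent, so clusters show up as dark blocks
    along the diagonal when D[order][:, order] is plotted."""
    n = len(D)
    # start from one endpoint of the largest pairwise distance
    order = [int(np.unravel_index(D.argmax(), D.shape)[0])]
    remaining = set(range(n)) - set(order)
    while remaining:
        rem = sorted(remaining)
        sub = D[np.ix_(order, rem)]
        nxt = rem[int(np.unravel_index(sub.argmin(), sub.shape)[1])]
        order.append(nxt)
        remaining.remove(nxt)
    return np.array(order)

# Two well-separated groups -> block structure after reordering.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(3, 0.1, (10, 2))])
pts = pts[rng.permutation(20)]
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
order = vat_order(D)
```

Rendering `D[np.ix_(order, order)]` as an image then reveals the dark diagonal squares that indicate clustering tendency.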

C. Cluster development and evaluation methodology
To develop the cluster model that best finds similar transmissions, both architectures have to be trained several times for different values of k in K-means, resulting in as many models. These models then need to be evaluated to see which one contains the cleanest clusters (clearest data separation). As all the possible transmissions that may occur are not known in advance, we choose k ∈ {2, 3, 4, ..., 60}.
The clustering is then evaluated with two standard metrics, the Silhouette score and the Davies-Bouldin index, and also manually, by cluster analysis and explanation. The Silhouette score [27] measures the quality of the clustering by evaluating how similar each sample is to its assigned cluster. The Silhouette score for sample x_i is calculated as:

s(x_i) = (b − a) / max(a, b)

where s is the Silhouette value, a is the average distance of x_i to all other elements belonging to its assigned cluster and b is the average distance of x_i to the elements of the closest neighbouring cluster. The values it takes are in the range [−1, 1], where larger values mean better clustering. The Silhouette score of a cluster is the average of the scores of its elements, while the Silhouette score of a clustering is the average of the scores of all of the clusters. The Davies-Bouldin index [28] is a clustering quality metric calculated as the average of the similarity value of each cluster to its most similar (closest) cluster. The similarity R_ij between two clusters is related to the intra-cluster dispersions S_i, S_j and the inter-cluster distance M_ij according to:

R_ij = (S_i + S_j) / M_ij

The range of values that this metric takes is bounded below by 0, and smaller values mean better clustering.
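Both metrics can be written compactly in NumPy; this sketch averages the Silhouette value over all samples (rather than per cluster first, as described above) and brute-forces all pairwise distances:

```python
import numpy as np

def silhouette_score(X, labels):
    """Mean of s(x) = (b - a) / max(a, b) over all samples."""
    scores = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        same = labels == labels[i]
        # a: mean distance to own cluster (excluding the sample itself)
        a = d[same & (np.arange(len(X)) != i)].mean()
        # b: mean distance to the closest other cluster
        b = min(d[labels == c].mean() for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

def davies_bouldin(X, labels):
    """Average over clusters of max_{j != i} (S_i + S_j) / M_ij."""
    ks = sorted(set(labels))
    cents = np.array([X[labels == c].mean(axis=0) for c in ks])
    S = np.array([np.linalg.norm(X[labels == c] - cents[i], axis=1).mean()
                  for i, c in enumerate(ks)])
    db = 0.0
    for i in range(len(ks)):
        M = np.linalg.norm(cents - cents[i], axis=1)
        db += max((S[i] + S[j]) / M[j] for j in range(len(ks)) if j != i)
    return db / len(ks)
```

On two tight, well-separated clusters the Silhouette score approaches 1 and the Davies-Bouldin index approaches 0, as expected from the definitions above.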

D. Manual evaluation
As discussed in Section III, the clustering algorithm is randomly initialized, so labels are not fixed to any specific type of content. Therefore, besides the quantitative evaluation of the quality of the clustering results, manual evaluation is also performed in order to interpret the content that is specific to each cluster. The motivation for the manual inspection is to verify that the clusters formed as output of the K-means are meaningful subsets of spectrograms containing highly correlated types of spectral patterns (transmissions). The evaluation is performed at two levels of generality:
• Macro cluster analysis: discusses the average per-cluster spectrogram and the number of instances classified in each cluster. This analysis reveals the type of transmissions and their frequency.
• Single cluster analysis: zooms in on one selected example cluster to provide insights into its contents by randomly inspecting some samples.

V. EXPERIMENTAL RESULTS
The feature space of a few experimentally trained models is analysed first. The goal is to provide insight into the content dynamics of the specific range of PCA components before investigating the number of clusters in the dataset, and to set the number of PCA components to be used in clustering.

A. Effect of the dimensionality reduction
In order to tune the PCA-based dimensionality reduction of the proposed self-supervised and baseline models discussed in Section III, we used the methodology described in Section IV-B. The cumulative sum of the EVR of the features provided by the different models is shown in Figure 4. Figure 4a plots 5 different self-supervised models corresponding to 2, 4, ..., 10 clusters. As can be seen from the figure, up to some point, the more clusters the model has, the more principal components it needs. For example, for the 2-cluster model, the first component accounts for most of the cumulative EVR (90%), and only two components are needed to reach a cumulative EVR of 95%. On the other hand, when developing a 10-cluster model, the first component has only 15% EVR and 18 components are needed to capture 95% cumulative EVR. Further experiments show that beyond 20 PCA dimensions the cumulative EVR for the CNN-extracted features does not improve much; therefore, we selected the first 20 PCA components as the size of the training data for the K-means component of the unsupervised branch in Figure 2.
In the remainder of the paper, we will refer to this as self-supervised PCA 20 (SS-PCA-20). The plot in Figure 4b represents the feature space for the baseline model. It can be seen from the figure that direct PCA on images requires 3601 components to keep 95% EVR, independent of the final number of clusters. In the remainder of the paper, we will use the 3601 PCA components that keep 95% of the EVR and refer to them as B-PCA-3k; for completeness, we will also consider the first 20 components only and refer to them as B-PCA-20.
In Figure 4a, keeping 95% EVR requires 2 PCA components for the 2-cluster case, 4 components for the 4-cluster case, 15 components for the 6- and 8-cluster cases, and 18 components for the 10-cluster case. Comparison of the results in Figures 4a and 4b thus shows that self-supervision is able to learn to encode the relevant information for cluster development in only 0.53% of the PCA components required by the baseline for the same 95% EVR. This provides a significant simplification of further processing due to the dimensionality reduction.
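Selecting the number of PCA components from the cumulative EVR, as done above for SS-PCA-20 and B-PCA-3k, amounts to the following; the synthetic low-rank data is illustrative only:

```python
import numpy as np

def n_components_for_evr(X, target=0.95):
    """Smallest number of principal components whose cumulative
    explained variance ratio (EVR) reaches the target."""
    Xc = X - X.mean(axis=0)
    # singular values -> per-component explained variance
    s = np.linalg.svd(Xc, compute_uv=False)
    evr = (s ** 2) / (s ** 2).sum()
    return int(np.searchsorted(np.cumsum(evr), target) + 1)

# Variance concentrated in a few directions -> few components suffice.
rng = np.random.default_rng(7)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 50))  # ~rank-3 signal
X += 1e-3 * rng.normal(size=X.shape)                      # small noise
```

For such low-rank data only a handful of components reach 95% cumulative EVR, whereas isotropic noise of the same shape needs most of its dimensions, mirroring the gap between the CNN features (SS-PCA-20) and direct PCA on raw images (B-PCA-3k).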

B. Suitability of the features for clustering
Using SS-PCA-20, B-PCA-3k and B-PCA-20 derived in the previous section as input features to the K-means clustering in both models, we perform a quantitative evaluation, as per Section IV-B, of how suitable the resulting features are for clustering.
The Hopkins scores evaluating the quality of the features developed using the proposed self-supervised model versus the baseline model are presented in Figure 5 and show that the scores improve until the number of clusters reaches 20, after which the values stabilize. This means that a further increase in the number of clusters does not affect the feature learning and extraction capability of the CNN. In the same figure, the Hopkins scores for the baseline models are shown as constant values (green and blue lines). The clusterability of the features for the baseline models is constant along the clusters axis because identical features, obtained by PCA transformation of the raw data, are used as inputs for the baseline models.
According to the Hopkins scores in Figure 5, the baseline features obtained through B-PCA-3k have the worst clustering tendency with a score of 0.2865, probably unable to make a clear separation in the feature space. The B-PCA-20 baseline has a much better clustering tendency than B-PCA-3k, with a score of 0.0752. The self-supervised SS-PCA-20 achieves the best score of the three. The corresponding VAT plots are shown in Figure 6. Figure 6a shows the clustering tendency with SS-PCA-20, where groups of different sizes are represented as squares of different sizes along the diagonal; clear and distinguishable squares corresponding to clusters can be seen in this figure. Figure 6b, corresponding to B-PCA-3k, contains no grouping at all, while in Figure 6c, corresponding to B-PCA-20, there is a small grouping in the upper left corner, meaning that there are two clusters, one very big and one small. The very big cluster seems to have some sub-clusters that are not clearly separated. The Hopkins scores in Figure 5 are in line with the VAT plots for the selected 20-cluster model.
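The diagonal-block structure of a VAT plot comes from reordering the pairwise dissimilarity matrix with a Prim's-style minimum-spanning-tree pass. A minimal sketch of that ordering, on synthetic two-blob data rather than the paper's features:

```python
import numpy as np

def vat_order(D):
    """VAT ordering of a dissimilarity matrix D (Prim's-style pass).
    Plotting D[order][:, order] shows dark diagonal blocks for clusters."""
    n = D.shape[0]
    # Start from one endpoint of the largest dissimilarity.
    order = [int(np.unravel_index(np.argmax(D), D.shape)[0])]
    remaining = set(range(n)) - set(order)
    while remaining:
        rem = np.fromiter(remaining, dtype=int)
        # Next object: the remaining point closest to the ordered set.
        nxt = int(rem[np.argmin(D[np.ix_(rem, order)].min(axis=1))])
        order.append(nxt)
        remaining.remove(nxt)
    return order

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (30, 2))])
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
order = vat_order(D)
# After reordering, each cluster's points occupy one contiguous block,
# so the true-cluster labels change value only once along the ordering.
labels = (np.array(order) >= 20).astype(int)
print(labels)
```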

C. Macro evaluation of cluster numbers and quality
The development and evaluation of clustering quality is performed according to the methodology described in Section IV-C. For the Davies-Bouldin scores plotted in Figure 7a, it can be seen that the self-supervised model, SS-PCA-20, has consistently better scores than the baseline model B-PCA-20, although the two curves have similar shapes. The SS-PCA-20 model follows the same pattern of performance variation as in the feature space evaluation: the performance improves until the number of clusters reaches 20 and is practically constant afterwards. B-PCA-3k, which uses the high-dimensional feature space, shows modest performance. Its low values (∼1) are a result of assigning almost all of the samples to a single big cluster and leaving only a very small number of samples, usually 1, in each of the other clusters. The large spikes in the values appear when a significant number of samples is also assigned to another cluster, apart from the main one. This observation further supports the proposal for using multiple metrics in Section IV-C.
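A sweep of the Davies-Bouldin score over candidate cluster counts, of the kind plotted in Figure 7a, can be sketched with scikit-learn. The four-blob data is an illustrative stand-in, not the paper's features; note that for this index lower is better:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
# Toy stand-in for the extracted features: 4 well-separated blobs.
centers = np.array([[0, 0], [6, 0], [0, 6], [6, 6]])
X = np.vstack([rng.normal(c, 0.4, size=(150, 2)) for c in centers])

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)  # lower is better

best_k = min(scores, key=scores.get)
print(best_k)  # the sweep bottoms out at the true number of blobs
```

On degenerate clusterings such as the B-PCA-3k case described above (one huge cluster plus singletons), the index can nevertheless report low values, which is exactly why multiple metrics are used.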

D. Macro cluster analysis
According to the results from Sections V-B and V-C, using a model with 20-25 clusters provides the best results. Given this advantage of the self-supervised architecture over the baseline clustering approach, we present two use cases where the learning capabilities of such a model are exploited. The first is spectrum availability estimation, where only two clusters are expected to exist, one with empty spectrograms and one with active spectrograms. The second use case is higher-granularity clustering, where different types of activities are separated into different clusters. Both cases will be elaborated using the 22-cluster self-supervised model, one of the top-performing models from the previous Section V-C.
1) Use case 1 – Spectrum availability: To provide a macro analysis for the 22-cluster model, we depict the distribution of samples per cluster in Figure 8 and the average spectrogram corresponding to each cluster in Figure 9 (22 images for the 22-cluster model). Averaging the samples highlights the positions of the most common types of activities that appear in each of the clusters. This visualization reveals which type of spectral activity (transmissions of existing technologies) corresponds to each of the clusters. By relating the number of samples assigned to each cluster with the cluster's content (technologies that are specific to each cluster), we estimate the spectrum occupancy with each type of discovered spectral activity. From Figure 9, it can be seen that there are two clusters (0 and 2) that are relatively free, while the rest contain at least one type of activity.
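The per-cluster sample counts (Figure 8) and per-cluster average spectrograms (Figure 9) follow directly from the cluster assignments. A sketch with hypothetical array names and shapes standing in for the sliced segments and their K-means labels:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative stand-ins: `spectrograms` for the sliced segments,
# `labels` for the 22-cluster K-means assignment.
spectrograms = rng.random((1000, 64, 64))   # (n_samples, time, frequency)
labels = rng.integers(0, 22, size=1000)

# Per-cluster sample counts (the distribution of Figure 8) ...
counts = np.bincount(labels, minlength=22)
# ... and per-cluster average spectrograms (the panels of Figure 9).
averages = np.stack([spectrograms[labels == c].mean(axis=0)
                     for c in range(22)])
print(counts.sum(), averages.shape)
```

Relating `counts` to the manually identified content of each averaged panel then yields the occupancy breakdown discussed below.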
a) Idle state – No/weak spectral activity: According to Figure 9, the clusters with labels 0 and 2 contain no or very low activity (idle). Considering the counts in Figure 8, it can be seen that these two clusters contain a significant number of samples, 31333 and 36609 for Clusters 0 and 2, respectively. These two clusters represent the parts of the spectrum where no transmissions are recorded, or where the signal sources are so distant that the signal intensity is very low. In the context of spectrum availability, they could be considered very similar, since in both cases there are no transmission patterns with significant intensity. A closer look into the shapes of the averaged weak signals reveals that there is actually a difference, which is why the self-supervised model distinguishes them. Weak vertical lines appear in both averaged spectrograms, but for Cluster 0 they are almost evenly distributed across the width, while for Cluster 2 they are slightly brighter on the right side. This means that the self-supervised model also learns the global features of the spectrograms, besides the shapes of the strong-activity transmission patterns. Another factor contributing to the creation of two clusters for no or weak activity is the different background noise that is apparent in spectrograms coming from different frequency sub-bands, even though the data comes from a relatively narrow band (192 kHz).
b) Occupied – presence of transmissions: All other 20 clusters contain some type of activity, so in the sense of spectrum availability they represent occupied spectrum segments. Considering the counts in Figure 8, these amount to 355970 spectrograms with transmissions. Therefore, Clusters 0 and 2, containing idle spectrograms, hold about 17% of the sliced data, while the other 20 clusters, containing at least one transmission, represent the remaining 83% of the data. This is valuable information about spectrum occupancy that was extracted in an automatic way, with all the manual work reduced to the inspection of the averaged cluster spectrograms.
2) Use case 2 – Distinguishing types of activities: The cluster contents and the cluster sizes allow for a better understanding of the types of activities and the frequency of their appearance across the spectrum. Considering only the low-activity clusters, we can estimate the percentage of available slots over time. Analogously, percentages can be derived for other types of activities. This means that information is provided about what portion of the spectrum is occupied by certain types of activity or by combinations of overlapping activities. This is useful when a higher-granularity separation of the spectrograms is required. The aforementioned manual evaluation of the averaged spectrograms plays a crucial role in such a case.
The clusters containing different types of combined activities can be considered as separate classes. In this way, besides the information on spectrum occupancy, additional information can be derived about the frequency of occurrence and the types of overlapping transmissions. a) Occupied – Horizontal stripes: According to the averaged-sample visualization in Figure 9, there are clusters with mostly horizontal stripes (Clusters 3, 5, 6, 8, 10, 11 and 16), which account for 36% of the dataset. These samples represent the same type of activity, namely IEEE 802.15.4 transmissions, which occupy the entire bandwidth covered by a single segment and last for a short time with regard to the time duration of a single sample. The location of the stripe varies in the vertical direction because the segmentation of the data into non-overlapping windows is fixed, as explained previously in Section IV-A, without correlation to the appearance times of any of the activities.
b) Occupied state – Horizontal stripes + other types of activity: The second type of activity is observed when the same horizontal stripes appear combined with other activities (Clusters 1, 4, 9, 13, 14, 17, 19, 20 and 21), representing 26% of the samples. Together they represent a separate group of segments where multiple activities occur in the same spectrogram segment. Some of the activities we distinguish correspond to concurrent IEEE 802.15.4 transmissions.

In summary, based on the manual evaluation of the clusters in the second use case, the number of activity types can be reduced to 5. It is important to note that varying background noise also influences the clustering. Clusters 1, 7 and 15 contain a significant amount of background noise, with weaker values on one side of the spectrogram gradually increasing in the horizontal direction (frequency axis). As a consequence, such spectrograms are clustered separately. This supports the previous conclusion, derived from the observation of the weak-activity clusters, that the self-supervised model also learns the global features of the spectrograms, besides the shapes of the strong-activity transmission patterns.
3) Single cluster inspection: Single cluster analysis refers to the simple visualization and inspection of randomly selected samples assigned to a specific cluster. This is useful when a cluster contains multiple types of coexisting activities, or activities that appear at variable positions in time, frequency, or both. The averaged images of such clusters appear to contain blurred shapes that do not correspond to any of the known signal transmission shapes. Showing random samples can reveal the types of transmissions present in the cluster of interest.
Selecting Cluster 13 to better understand the activities that were grouped together, we present its summary as the average of all instances in Figure 10b and randomly sampled examples in Figure 10a. It can be seen that, besides the stripe-shaped activities appearing at different vertical locations, there are also dotted activities in some of the samples.
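Such a single-cluster summary (average plus random members, as in Figure 10) is straightforward to produce from the assignments. The array names and shapes below are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative stand-ins for the clustered spectrogram segments.
spectrograms = rng.random((5000, 64, 64))
labels = rng.integers(0, 22, size=5000)

def inspect_cluster(c, n_examples=8, seed=0):
    """Return the cluster average (cf. Figure 10b) and a few random
    member samples (cf. Figure 10a) for manual inspection."""
    members = np.flatnonzero(labels == c)
    picks = np.random.default_rng(seed).choice(members, size=n_examples,
                                               replace=False)
    return spectrograms[members].mean(axis=0), spectrograms[picks]

avg, examples = inspect_cluster(13)
print(avg.shape, examples.shape)
```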

VI. CONCLUSIONS
In the work presented in this paper, we adapted and reused an existing self-supervised architecture for feature learning and clustering of radio frequency spectrum data. The architecture was applied to real-world radio spectrum data acquired by the LOG-a-TEC wireless testbed. The performance of the proposed self-supervised model was compared with a K-means clustering algorithm used as a baseline model in two different configurations. The comparison evaluated the clustering tendency of the input data, with appropriately proposed metrics for the different models, and the quality of the clustering results. The proposed self-supervised model enables compact feature extraction, encoded in a significantly lower number of dimensions. The dimensionality reduction is of two orders of magnitude, from 3601 to 20, while preserving the same amount of EVR of the input features. The proposed self-supervised model also outperforms the baseline model in clustering quality by 0.3 according to the Silhouette score and by 0.6 according to the Davies-Bouldin score. The results were also supported by manual evaluation based on visual inspection.

We conclude that the self-supervised architecture elaborated and experimented with in this paper is a valuable tool for spectrum analysis. The CNN proved to be a relevant feature extractor that can learn a feature representation of the spectrograms without any labels, in an automatic manner. The relevance of the extracted features was demonstrated by clustering the different types of activities in the dataset and discovering meaningful groups of activity patterns. Knowledge about spectrum occupancy and activity types can be of great importance for regulatory bodies, but also for the development of algorithms for spectrum availability prediction with minimal expert intervention.