MSMatch: Semi-Supervised Multispectral Scene Classification with Few Labels

Supervised learning techniques are at the center of many tasks in remote sensing. Unfortunately, these methods, especially recent deep learning methods, often require large amounts of labeled data for training. Even though satellites acquire large amounts of data, labeling the data is often tedious, expensive and requires expert knowledge. Hence, improved methods that require fewer labeled samples are needed. We present MSMatch, the first semi-supervised learning approach competitive with supervised methods on scene classification on the EuroSAT and UC Merced Land Use benchmark datasets. We test both RGB and multispectral images of EuroSAT and perform various ablation studies to identify the critical parts of the model. The trained neural network achieves state-of-the-art results on EuroSAT with an accuracy that is up to 19.76% better than previous methods depending on the number of labeled training examples. With just five labeled examples per class, we reach 94.53% and 95.86% accuracy on the EuroSAT RGB and multispectral datasets, respectively. On the UC Merced Land Use dataset, we outperform previous works by up to 5.59% and reach 90.71% with five labeled examples. Our results show that MSMatch is capable of greatly reducing the requirements for labeled data. It translates well to multispectral data and should enable various applications that are currently infeasible due to a lack of labeled data. We provide the source code of MSMatch online to enable easy reproduction and quick adoption.


I. INTRODUCTION
The last decade has seen a momentous increase in the availability of remote sensing data, thus enhancing the need for efficient image processing and analysis methods using deep learning [1]. The increase in data availability is driven by continuously decreasing launch costs, especially for so-called Smallsats (< 500 kg). As Wekerle et al. [2] describe, fewer than 40 Smallsats were launched per year between 2000 and 2012, but over a hundred in 2013 and almost 200 in 2014. Since then, the numbers have kept increasing, with over 300 launches in each of 2017 and 2018 [3]. Many of these are imaging satellites serving either commercial purposes [4] or earth observation programs, such as the European Space Agency's Copernicus program [2], [5]. This has led to an increase in the availability of large datasets. Concurrently, image processing and analysis have improved dramatically with the advent of deep learning methods [6]. As a consequence, there is a large corpus of research describing successful applications of deep learning methods to remote sensing data [1], [7]-[10]. However, training deep neural networks usually requires large amounts of labeled samples, where the expected solution has been manually annotated by experts [11]. This is particularly tedious for imaging modalities such as radar data or multispectral (MS) imaging data, which are not as easily labeled by humans as, e.g., RGB imaging data. One way to alleviate these issues is the application of so-called semi-supervised learning (SSL) techniques. These aim to train machine learning methods, e.g. neural networks, while providing only a small set of labeled training samples and a typically larger corpus of unlabeled training samples. This has recently garnered a lot of attention in the remote sensing community [12]-[16]. In the last two years, the state-of-the-art in SSL has advanced significantly to a point where the proposed methods are virtually competitive with fully supervised approaches [17]-[19].
The recent advances in SSL approaches bear the promise to save large amounts of time and cost that would be required for manual labeling.
In this work, we propose MSMatch, a novel approach that builds on recent advances [19] together with recent neural network architectures (so-called EfficientNets [20]) to tackle the problem of land scene classification, i.e. correctly identifying land use or land cover of satellite or airborne images. This is an active research problem with a broad range of research focusing on it [21]- [23]. We compare with previous methods on two datasets, the EuroSAT benchmark dataset [24], [25] collected by the Sentinel-2A satellite and the aerial UC Merced Land Use (UCM) dataset [26]. The EuroSAT dataset also includes MS data, which is commonly used for tasks related to vegetation mapping [8].
In summary, the main contributions of this work are MSMatch itself, state-of-the-art scene classification accuracy on EuroSAT and UCM with as few as five labels per class, the first application of this line of SSL methods to multispectral data, and an openly available implementation.

II. RELATED WORK

The need for a large amount of labeled training data is one of the most significant bottlenecks in bringing deep learning approaches to practical applications [9]. For satellite imaging data this problem is particularly aggravated, as satellites have greatly varying sensors and applications, which makes transferring a model trained on data from one application or sensor to another challenging [27]. Hence, SSL is of particular interest in remote sensing. In the following, we describe relevant works that applied SSL to scene classification problems. Further, we point out the works that led to significant improvements in SSL in the broader machine learning community in recent years.
A. Semi-supervised learning for scene classification

Several datasets have been established as benchmarks for scene classification. Some of the most commonly used are EuroSAT, UCM and the Aerial Image Dataset (AID) [28]. Both UCM and AID use aerial imaging data and provide, respectively, 2100 and 10000 images for a classification of 21 and 30 classes. EuroSAT provides 27000 images of 10 classes. Aside from the SSL works mentioned here, there is also a multitude of studies using supervised methods on these datasets (e.g. [29]-[31]). In terms of SSL, there have been several interesting approaches: Guo et al. [32] trained a generative adversarial network (GAN) that performs particularly well with few labels on UCM and EuroSAT. They achieved between 57% and 90% accuracy (4.76 to 80 labels per class) on UCM and between 77% and 94% on EuroSAT (10 to 216 labels per class). Han et al. [33] used self-labeling to achieve even better results on UCM, reaching 91% to 95% using comparatively more labels. They report similarly good results with comparatively large label counts (10% of the whole data) on AID. Dai et al. [34] used ensemble learning and residual networks with much fewer labels, reaching 85% on UCM and between 72% and 85% on AID. Similarly, Gu & Angelov proposed a deep rule-based classifier using just one to ten labels per class, achieving between 57% and 80% on UCM [35]. A self-supervised learning paradigm was suggested by Tao et al. [16] for AID and EuroSAT, obtaining between 77% and 81% on AID and between 76% and 85% on EuroSAT. Another GAN-based approach was suggested by Roy et al. [15], who applied it to EuroSAT, but other approaches have already accomplished better results, such as the work by Zhang and Yang [14], who utilized the EuroSAT MS data (97% accuracy), albeit with 300 labels per class. Yamashkin et al.
[36] also suggested an SSL approach in which they extended the dataset, but their results are not competitive. Thus, reaching high accuracy (over 90%) on any of these datasets usually still requires a larger number of labels (80 per class or more). Low-label regimes with, e.g., five labels per class typically reach only about 70% to 80% accuracy.

B. Recent advances in semi-supervised learning
Semi-supervised learning has been gaining a lot of attention in recent years. In particular, the last two years have seen several methods published that led to unprecedented results and, for the first time, achieved results competitive with supervised methods trained on significantly more data. For example, the accuracy on the popular CIFAR-10 dataset [37] for training with just 250 labels improved from 47% to 95% between 2016 and 2020. Many of these improvements rely on smart data augmentation strategies such as RandAugment [38] or AutoAugment [39]. The most significant improvements were made in 2019 in two works describing the so-called MixMatch [17] method and Unsupervised Data Augmentation [40]. The former improved the state-of-the-art by over 20% and the latter pushed the accuracy above 90%. In a series of follow-up works [18], [41], results were further improved until the current state-of-the-art method was introduced in 2020. Utilizing the ideas of pseudo-labeling and consistency regularization by Bachman et al. [42], FixMatch [19] achieved state-of-the-art results on four benchmark datasets, including almost 95% accuracy on CIFAR-10 with 250 labels. This is comparable to the performance of a supervised approach for the utilized network architecture. Furthermore, using just four labels per class, they still achieved 89% accuracy. To the authors' knowledge, none of the mentioned works have found their way into the remote sensing community yet. Hence, this work aims to build on these recent advances to achieve state-of-the-art results.

III. METHODS
This section will introduce the network architecture, SSL method and setup of the training.

A. EfficientNet
EfficientNets [20] have become the go-to neural network architecture for many applications. They achieve state-of-the-art results, especially in terms of efficiency, as EfficientNets use comparatively few parameters in relation to the achieved performance. The architecture was conceived using neural architecture search [43], a method in which the neural network architecture itself is optimized. Tan and Le propose several versions of EfficientNets, called B0 to B7, with increasing numbers of parameters and performance. Thereby, it is possible to choose a suitable trade-off that keeps memory and computational requirements manageable at a sufficient model complexity. Note, however, that EfficientNets have not yet seen broad adoption in the remote sensing community; none of the mentioned prior SSL works utilized them. We utilize them for MSMatch given their excellent performance and low memory footprint. We relied on an open-source implementation utilizing PyTorch. For MS images, the first network layer received all available bands as normalized input channels. Thus, the number of network parameters is almost identical for MS and RGB images, and performance differences ought to be largely due to the contained information.
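A quick back-of-the-envelope calculation illustrates why feeding all 13 bands into the first layer barely changes the parameter count. The sketch below assumes an EfficientNet-B0-style stem, i.e. a bias-free 3 × 3 convolution with 32 filters followed by batch normalization; the exact stem configuration is an assumption for illustration.

```python
def stem_weights(in_channels, out_channels=32, kernel=3):
    # Weight count of a bias-free 2D convolution:
    # in_channels * kernel * kernel * out_channels
    return in_channels * kernel * kernel * out_channels

rgb = stem_weights(3)     # RGB input: 3 channels
ms = stem_weights(13)     # all 13 Sentinel-2 bands as input channels
print(rgb, ms, ms - rgb)  # 864 3744 2880
```

Against the roughly 5.3 million parameters of EfficientNet-B0, the extra 2880 weights are negligible, so performance differences between RGB and MS runs should indeed stem from the information content of the inputs.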

B. FixMatch
There are two central ideas behind the effectiveness of the FixMatch approach: pseudo-labeling and consistency regularization [19], [42]. Pseudo-labeling refers to the practice of using the model (or another model) to automatically label otherwise unlabeled data. The second idea, consistency regularization, refers to the concept that the model should predict the same output for similar inputs. FixMatch training consists of a supervised and an unsupervised loss. While the supervised loss is a common cross-entropy loss, the unsupervised loss incorporates both pseudo-labeling and consistency regularization. This is achieved by creating two different augmentations of the same image, a so-called weak and a strong one. As depicted in Figure 1, the weakly augmented image is used to create a pseudo-label for the image. Consistency regularization is then employed by computing a cross-entropy loss between the pseudo-label on the weakly augmented image and the model's classification of the strongly augmented image. Thus, the supervised loss is

L_s = (1/B) Σ_{i=1}^{B} H(y_i, ŷ_i),

where H is a cross-entropy loss, B the labeled batch size, y_i the ground-truth label and ŷ_i the network's prediction on the weakly augmented image. The unsupervised loss is

L_u = (1/(μB)) Σ_{i=1}^{μB} δ_i H(ỹ_i, ŷ_i),

where μB is the unlabeled batch size, ỹ_i is the pseudo-label on an unlabeled, weakly augmented image, ŷ_i is the prediction on the corresponding strongly augmented image, and δ_i is 1 if the confidence of ỹ_i is at least 0.95 and 0 otherwise. The total loss from the supervised loss L_s and the unsupervised loss L_u is then obtained as L = L_s + L_u. For the implementation of FixMatch we adapted an open-source PyTorch implementation. To the authors' knowledge, this is also the first work applying FixMatch to MS images.
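The combined loss can be sketched numerically. The following minimal NumPy sketch (not the authors' implementation; the array shapes and helper names are illustrative) computes L_s, L_u with the 0.95 confidence mask, and their sum from raw logits:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class dimension.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(target_idx, probs):
    # Per-sample cross-entropy H against hard (index) targets.
    return -np.log(probs[np.arange(len(probs)), target_idx] + 1e-12)

def fixmatch_loss(lab_logits, labels, weak_logits, strong_logits, tau=0.95):
    # Supervised loss L_s: cross-entropy on labeled, weakly augmented images.
    ls = cross_entropy(labels, softmax(lab_logits)).mean()
    # Pseudo-labels from the weakly augmented unlabeled images.
    weak_probs = softmax(weak_logits)
    pseudo = weak_probs.argmax(axis=1)
    # Confidence mask delta_i: keep only pseudo-labels above the threshold.
    mask = weak_probs.max(axis=1) >= tau
    # Consistency term: cross-entropy between pseudo-label and the
    # prediction on the strongly augmented version of the same image.
    lu = (cross_entropy(pseudo, softmax(strong_logits)) * mask).mean()
    return ls + lu, ls, lu
```

The `.mean()` over the masked per-sample losses corresponds to the 1/(μB) normalization: low-confidence samples contribute zero but still count toward the denominator.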

C. Augmentation
Data augmentation is frequently used to help neural networks generalize better to unseen data and to increase the richness of the utilized training data [38], [39]. During the FixMatch training process, augmentation is critical to ensure that the small amount of available labeled data is exploited optimally via the weak augmentations, and to aid generalization to unseen data via the strong augmentations.
For the weak augmentation, we only utilized horizontal flips and image translations by up to 12.5%. Note that Kurakin et al. [19] describe in their work that they tried to harness stronger augmentations for the labeled data but experienced training divergence. We encountered similar issues when utilizing, e.g., image crops. For the strong augmentation of the RGB images, several methods from the Python library Pillow were applied. For the strong augmentation of the MS data, a slightly reduced set (due to a lack of implementations for more than three image channels) was applied utilizing the albumentations Python module [44]. Exemplary weak and strong augmentations of RGB EuroSAT images are depicted in Figure 2. A full overview of the applied augmentations can be seen in Table I.
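The weak augmentation is simple enough to sketch directly. The following NumPy sketch (an illustration, not the training code; zero-padding at the border is one possible choice) applies a random horizontal flip and a translation of up to 12.5% to an (H, W, C) image, so it works unchanged for 3-channel RGB and 13-channel MS arrays:

```python
import numpy as np

def weak_augment(img, rng, max_shift=0.125):
    """Random horizontal flip plus a random translation of up to
    `max_shift` (12.5%) of the image size. `img` is an (H, W, C) array."""
    if rng.random() < 0.5:
        img = img[:, ::-1, :]  # horizontal flip
    h, w = img.shape[:2]
    dy = int(rng.integers(-int(h * max_shift), int(h * max_shift) + 1))
    dx = int(rng.integers(-int(w * max_shift), int(w * max_shift) + 1))
    out = np.zeros_like(img)   # shifted image, zero-padded at the border
    src = img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    out[max(0, dy):max(0, dy) + src.shape[0],
        max(0, dx):max(0, dx) + src.shape[1]] = src
    return out
```

For a 64 × 64 EuroSAT image, the shift is thus at most ±8 pixels in each direction.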

D. Training
All models were trained for three different random seeds on NVIDIA RTX 2080 TI graphics cards using PyTorch 1.7. The training utilized a stochastic gradient descent optimizer with a Nesterov momentum of 0.9 and different weight decay amounts. A learning rate of 0.03 was used and reduced with cosine annealing. The batch size was 32 for the EuroSAT datasets and 16 for UCM; each batch additionally contained seven (EuroSAT) or four (UCM) times as many unlabeled images. The training was run for a total of 500 (EuroSAT) and 1000 (UCM) epochs with 1000 iterations each, after which all investigated models had converged. For UCM, the number of epochs was doubled to compensate for the smaller batch size. All images were normalized to the mean and standard deviation of the datasets. The train and test sets were stratified. The test sets for each seed contained 10% of the data for EuroSAT (2700 images) and 20% of the data for UCM (420 images). To speed up the training and to reduce the memory footprint, UCM images were downscaled to 224 × 224. For a supervised baseline, the unsupervised loss L_u was set to 0 to allow a fair and direct comparison. Overall, with this setup, training one model requires up to 48 hours on a single GPU for EuroSAT. A single run for UCM requires 131 hours on two GPUs. We provide the code for this work open source online.
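The learning-rate schedule is easy to reproduce. Below is a minimal sketch of cosine annealing from the base rate of 0.03 down to zero over the full run; the exact cosine variant used is an assumption here (FixMatch itself uses a slightly slower cos(7πk/16K) decay):

```python
import math

def cosine_lr(step, total_steps, base_lr=0.03):
    # Anneal from base_lr at step 0 to 0 at total_steps along a half cosine.
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

# EuroSAT training: 500 epochs x 1000 iterations per epoch
total = 500 * 1000
print(cosine_lr(0, total))           # 0.03
print(cosine_lr(total // 2, total))  # 0.015
print(cosine_lr(total, total))       # ~0.0
```

In PyTorch this corresponds to wrapping the SGD optimizer in a cosine-annealing scheduler stepped once per iteration.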

IV. RESULTS
We investigate results on two datasets. We report results for both the RGB and MS versions of EuroSAT as well as for the UCM dataset. Aside from a detailed comparison with previous research, depicted in Table II and Table III, we also investigated the impact of the weight decay strength in MSMatch as well as the number of parameters of the utilized EfficientNet.

A. Datasets
Some of the most commonly utilized benchmarks for SSL in remote sensing are UCM and EuroSAT. They allow for a detailed comparison to previous works.
1) EuroSAT: Given their comparatively large images, the computational burden of the heavy image augmentation and of training the model is larger for datasets like UCM and AID. Further, they provide fewer images and only RGB data. Hence, we tested intensively on the EuroSAT dataset, which consists of 27 000 64 × 64 pixel images in RGB and 13-band MS format. The MS bands lie between 443 nm and 2190 nm; the spatial resolution is up to 10 meters per pixel depending on the band. Note that especially the infrared bands are well-suited to vegetation identification. The data stem from the Sentinel-2A satellite and are split into ten classes, such as river, forest and permanent crop. Some example images from the EuroSAT dataset can be seen in Figure 2. The data were obtained from the authors' GitHub repository.
2) UCM: The UCM dataset is arguably the most established land scene classification dataset. It consists of 2100 images of areas in the USA classified into 21 classes, such as beach, forest or storage tanks. The original images were taken using aerial orthoimagery and processed into slices of 256 × 256 pixels. Each class is represented with 100 images. We display several example images from UCM with the associated labels and predicted classes in Figure 3. The data were obtained from the authors' website.

B. Number of Labels
The main factor for comparing SSL approaches is the number of labeled training examples. For EuroSAT, we tested a large range of amounts, from 50 (five per class) to 3000 (300 per class) labels, to ensure comparability with prior research. For UCM, training datasets with 105, 210, 441 and 1680 labels (respectively 5, 10, 21 and 80 labels per class) were investigated. The rest of the (unlabeled) data are used for the unsupervised part of the training. As seen in Table II and Table III, MSMatch outperforms all prior research by large margins for all tested amounts of available labels. The greatly enhanced accuracy per labeled training sample is especially prominent for the cases with just 50 and 100 labels (five and ten per class). In these cases, MSMatch improves on previous methods by between 16% and 20% on EuroSAT and between 2.71% and 5.59% on UCM. Using 1000 labels (100 per class), which was the most popular amount in prior research on EuroSAT, it improves the state-of-the-art by 7%. For EuroSAT, the last three rows in Table II showcase the impact of using MS data and the difference if, instead of using our SSL approach, we train an EfficientNet using only the labeled samples and no unlabeled samples. Note that the results of this supervised baseline, i.e. training without any unlabeled data, are clearly worse. This demonstrates the effect of the proposed SSL framework. Even for 3000 labels, the SSL method performs significantly better than the baseline. Notably, the MS data improves results even further. The proposed method is hence successfully adapted to MS data. Due to the high computational demands of training a model on UCM, this ablation was not performed on UCM. Additionally, Figure 4 and Figure 5 show F1 scores for all classes and amounts of training labels. Notably, some classes, such as PermanentCrop for EuroSAT or denseresidential for UCM, seem to require more samples to reach optimal results.

Fig. 4. Classification F1 scores for models trained with different amounts of labeled samples. RGB data was used. Notably, some classes, such as PermanentCrop, seem to require more samples than others, such as SeaLake, to reach good scores. Results are averaged over three seeds.

C. Hyperparameters
Two further factors were investigated in detail. One factor found to be particularly critical by Kurakin et al. [19] is the weight decay, i.e., the regularization of the weights in the neural network. They found 5.0 · 10⁻⁴ to be optimal in most cases. The other is model size, which is a decisive factor for model performance and can be varied using the different versions (B0 to B7) of the EfficientNet architecture. Note that only B0 to B3 fit into the available GPU memory with the utilized training settings; thus, no larger models were compared. These runs were only done on EuroSAT due to the large memory requirements of the larger images in UCM. Further, note that the results in Tables II and V were run on different machines, which led to slightly different random seeds and mean values.
Results for different weight decay values are displayed in Table IV. In our experiments, we found slightly larger weight decay values than originally proposed by Kurakin et al. [19] to be beneficial. The highest accuracy, 96.63%, was obtained with a weight decay of 7.5 · 10⁻⁴.
Detailed results for the model size are given in Table V. We found the EfficientNet-B2 with 9.2 million parameters to provide the best trade-off between performance and model size, with an accuracy of 96.85%. However, performance gains from a larger number of parameters are limited and smaller than the standard deviation among random seeds. Thus, choosing a smaller model when optimizing for efficiency can also be reasonable.

V. DISCUSSION
Overall, MSMatch outperforms previously published SSL works tested on EuroSAT and UCM. Even with a low number of labels (five per class), it achieves a high accuracy of 94.53% and 95.86% for EuroSAT RGB and MS data, respectively, and 90.71% for UCM. This makes the approach applicable in scenarios where only very limited labeled data are available. The superior performance using MS data highlights both the potential of utilizing such data and the suitability of the proposed method for it.

A. Comparison to FixMatch
Compared to the original FixMatch approach, we find that slightly higher weight decay values benefit the training. In terms of the comparison of model parameters in the network architecture (see Table V), it is noteworthy that improvements from a larger number of parameters are only marginal. Possibly, the reason is that the EuroSAT dataset features just ten classes. The larger models may be overparameterized for the problem. However, it is also conceivable that the proposed training procedure performs better on smaller models. This will require further investigation in the future. Another element that may warrant more detailed examination in the future is the type of augmentations utilized for the strong augmentation. Kurakin et al. [19] described some performance impact of the utilized augmentation method.

B. Class-dependent Performance
Another interesting factor is the varying performance depending on the class. This tendency is observed in both UCM and EuroSAT and persists across multiple seeds, i.e. random splits of the train and test sets. A detailed overview of the per-class performance for EuroSAT can be seen in Table VI. Clearly, some classes require more labeled training data than others, which hints at a possible improvement to the described method. In particular, the number of labeled samples could be adjusted for each class in relation to the model's performance on it. For example, after observing the worse performance on the classes PermanentCrop and HerbaceousVegetation in EuroSAT when training with 50 labels (five per class), in an operational scenario this insight could be used to selectively label more data from such underperforming classes. Note that Figure 7 highlights that the models tend to confuse these two classes with each other, and especially the recall of the model on these classes is impacted. In practice, the lower performance for these classes might also hint at an underrepresentation of some necessary features in the supplied training data. This is also evident in Figure 4, as the problem is remedied when additional labeled examples for these classes are added. Kurakin et al. [19] also observed a strong impact of the selected samples on the performance. In Figure 6, we display saliency maps (using guided backpropagation [45], [46]) for examples that a model trained on 50 labels (five per class) misclassified whereas a 3000-label model succeeded. Note that the 3000-label model clearly relies on specific contour features in the image that the 50-label model does not recognize.
For UCM, the tendency to underperform on some classes was also observed when using 105 labeled images (five per class). Indeed, as shown in Table VII, the F1-score obtained by averaging an EfficientNet-B2 model over three seeds is higher than 80% for all classes except mediumresidential (74.89%), denseresidential (44.52%), and buildings (79.43%). In particular, the confusion matrix in Figure 8 shows that misclassified denseresidential images are often misidentified as mediumresidential (28%), mobilehomepark (28%), and buildings (10%). This may indicate that these classes do not feature sufficient distinctive properties for the network to pick up. Similarly, 7% and 13% of buildings images are predicted as denseresidential and storagetanks, respectively. This leads to comparatively lower precision on, e.g., storagetanks and mobilehomepark and low recall on denseresidential and buildings. Overall, these results highlight that with five images per class the trained model is not fully capable of distinguishing among mediumresidential, denseresidential, and buildings images, whereas it attains robust performance on the other classes. As shown in Figure 5, an F1-score higher than 80% is reached for the buildings, mediumresidential, and denseresidential classes when training with, respectively, 5, 10, and 21 images per class.
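The per-class precision, recall and F1 values discussed above follow directly from the confusion matrix. A small NumPy sketch (the toy matrix below is illustrative, not the paper's actual UCM results):

```python
import numpy as np

def per_class_scores(cm):
    """Per-class precision, recall and F1 from a confusion matrix where
    cm[i, j] counts samples of true class i predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / np.maximum(cm.sum(axis=0), 1e-12)  # column sums: predicted counts
    recall = tp / np.maximum(cm.sum(axis=1), 1e-12)     # row sums: true counts
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return precision, recall, f1

# Toy 2-class example where the classes are partly confused with each other.
p, r, f = per_class_scores([[8, 2], [3, 7]])
print(f)  # F1 of class 0 is 2*8 / (2*8 + 3 + 2) = 16/21 ≈ 0.762
```

Off-diagonal mass in a column lowers that class's precision, while off-diagonal mass in a row lowers its recall, which is exactly the pattern seen for storagetanks/mobilehomepark versus denseresidential/buildings.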

VI. CONCLUSION
This work presents MSMatch, a novel SSL approach that vastly improves the state-of-the-art on the EuroSAT and UCM datasets compared to previous works. Depending on the number of labeled training samples, it improves accuracy by between 1.47% and 18.43% on EuroSAT and between 2.71% and 7.85% on UCM compared to previous works. More importantly, it showcases that accuracies of 95.86% and 90.71% on EuroSAT and UCM, respectively, are obtainable with just five labels per class. This bears the promise of making MSMatch applicable to scenarios where a lack of labeled data previously inhibited training neural networks for the task. The method also translates well to MS data, which is, however, harder to process given a lack of GPU-based data augmentation frameworks. Future research will aim to test datasets with even higher resolutions, such as AID, which are computationally more demanding, especially in terms of GPU memory. Adapting MSMatch to a segmentation problem is also conceivable given suitable augmentation methods and might be of interest to broaden the range of possible applications even further.
Gabriele Meoni, PhD, received the Laurea degree in electronic engineering from the University of Pisa in 2016 and the Ph.D. degree in information engineering in 2020. During his Ph.D., he developed skills in digital and embedded systems design, digital signal processing, and artificial intelligence. Since 2020, he has been a research fellow in the ESA Advanced Concepts Team. His research interests include machine learning, embedded systems and edge computing.