Semi-supervised methods for CNN-based classification of multispectral imagery

Deep convolutional neural networks, with their recent increase in performance, have become one of the standard techniques for RGB image classification. Due to a lack of large labeled datasets, this is not the case for multispectral image classification. To overcome this, we analyze the use of semi-supervised learning for multispectral datasets. We use parameter reduction strategies to create small and efficient multispectral CNNs and combine these computationally efficient classifiers with semi-supervised learning methods. We choose the state-of-the-art semi-supervised methods MixMatch, ReMixMatch, FixMatch, and FlexMatch to conduct experiments on the multispectral dataset EuroSAT. Additionally, we challenge this semi-supervised multispectral approach with a decreasing number of labeled images. We find that with only 15 labeled images per class, we can reach an accuracy above 80%. If more labeled images are provided, the analyzed semi-supervised methods can even surpass basic supervised learning strategies.


Introduction
The use of deep convolutional neural networks for RGB image classification has led to a series of breakthroughs [1][2][3][4]. Extending convolutional neural networks to process multispectral imagery is becoming increasingly prevalent, especially in fields such as material characterization, quality assurance in the food industry, and recycling of waste materials [5]. In these fields, it is common to use multispectral (MS) data to separate materials based on their different spectral characteristics. While AI systems like CNNs show superior performance on large RGB datasets [1,3,4], the lack of large labeled multispectral datasets makes them difficult to employ in a multispectral setting. Compared to RGB imagery, where large publicly available datasets such as CIFAR-10 [6] and ImageNet [7] exist, large labeled multispectral datasets are rare. In this work, we aim to improve the performance of CNNs on small unlabeled multispectral datasets by combining semi-supervised learning (SSL) methods with CNNs optimized for multispectral data (multispectral CNNs).
Semi-supervised learning provides a powerful tool to leverage unlabeled data and to largely alleviate the need for labeled data. This is particularly advantageous when collecting labeled data is expensive or time-consuming because expert knowledge or expensive machinery may be involved in the labeling process. This approach has shown impressive results in a wide variety of tasks, including facial expression recognition and natural language processing [8,9].
To the best of our knowledge, the combination of SSL methods and multispectral CNNs has not been discussed in previous work. We present a study of recently proposed state-of-the-art SSL methods in the context of classifying multispectral images. In this work, we show that modern SSL methods can be used very effectively to drastically reduce the need for labeled data. We also aim to make SSL methods more comprehensible for researchers outside the deep learning community. Therefore, we describe the methods in detail in the following section and then show results based on the EuroSAT dataset [10].

Semi-Supervised Methods
In image classification, semi-supervised learning (SSL) has proven to be a powerful paradigm for utilizing unlabeled data to mitigate the reliance on large labeled datasets. Compared with previous SSL algorithms (π-Model [11], Mean Teacher [12], Virtual Adversarial Training [13], and Pseudo-Label [14]), the four state-of-the-art SSL algorithms MixMatch [15], ReMixMatch [16], FixMatch [17], and FlexMatch [18] all unify current hybrid approaches to SSL. In this section, we give an overview of these four algorithms.
1. MixMatch: Unlike previous methods [11,14], MixMatch introduces a single loss term unifying the three main semi-supervised approaches: entropy minimization [14,19], consistency regularization [11,20], and generic regularization [21,22]. MixMatch utilizes a form of consistency regularization by using data augmentation for images. Two data augmentation methods are applied sequentially to both labeled and unlabeled images: first a random horizontal flip and then a random crop. Like Pseudo-Label [14], MixMatch applies multiple individual augmentations to an unlabeled image to create different instances, whose model predictions are then averaged to generate one pseudo-label for this unlabeled image. MixMatch uses a slightly modified version of the MixUp algorithm for regularization: both labeled and unlabeled images and their corresponding labels are interpolated to generate mixed inputs and mixed labels.
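The modified MixUp interpolation can be illustrated as follows. This is a minimal sketch, not the authors' implementation: the function name and the default `alpha` are illustrative, and real training applies this to shuffled batches inside the loss computation.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.75, rng=None):
    """MixMatch-style MixUp: interpolate two samples and their (soft) labels.

    MixMatch modifies vanilla MixUp by taking lam = max(lam, 1 - lam),
    so the mixed sample stays closer to the first input.
    """
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)       # mixing coefficient from a Beta prior
    lam = max(lam, 1.0 - lam)          # MixMatch modification: lam >= 0.5
    x = lam * x1 + (1.0 - lam) * x2    # mixed input
    y = lam * y1 + (1.0 - lam) * y2    # mixed label distribution
    return x, y
```

Because `lam >= 0.5`, the mixed label is always dominated by the first sample's label, which keeps pseudo-labels for unlabeled images from drifting too far.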
2. ReMixMatch: To make MixMatch more data-efficient, two new techniques are introduced and directly integrated into MixMatch's framework: distribution alignment and augmentation anchoring. Distribution alignment maximizes the mutual information between model inputs and outputs so that unlabeled data is fully utilized to improve the model's performance. It encourages the marginal distribution of the model's predictions on unlabeled data to match the marginal distribution of the ground-truth labels. Recent work found that applying stronger forms of data augmentation can significantly improve the performance of consistency regularization [23]. Augmentation anchoring is added as a replacement for the consistency regularization in MixMatch. The basic idea is to use the model's prediction for a weakly augmented unlabeled image as the pseudo-label for many strongly augmented versions of the same image.
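The distribution alignment step can be sketched as below. This is an illustrative reimplementation of the published idea, not the authors' code; in practice the running prediction marginal is an exponential moving average over recent batches.

```python
import numpy as np

def distribution_align(pred, labeled_marginal, running_pred_marginal):
    """ReMixMatch-style distribution alignment (sketch).

    Scale a model prediction on an unlabeled image by the ratio of the
    labeled-class marginal to the running average of model predictions,
    then renormalize to obtain a valid probability distribution.
    """
    aligned = pred * labeled_marginal / running_pred_marginal
    return aligned / aligned.sum()
```

If the model over-predicts one class relative to the label distribution, alignment shifts probability mass back toward the under-predicted classes before the pseudo-label is sharpened.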
3. FixMatch: FixMatch is a significant simplification compared with MixMatch and ReMixMatch. Its simplification lies in combining only two main approaches to semi-supervised learning: consistency regularization and Pseudo-Label [14]. FixMatch first generates pseudo-labels on weakly augmented unlabeled images using their model predictions. For a given image, the pseudo-label is only retained if the model produces a high-confidence prediction. In other words, when the model assigns a probability above the predefined threshold τ to any class, the prediction is accepted and the model output is converted to a one-hot pseudo-label. The model's prediction for a strongly augmented version of the same image is then trained against this pseudo-label.
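The confidence-based selection of pseudo-labels can be sketched as follows; this is a schematic of the published rule with an illustrative function name, and τ = 0.95 is only the commonly reported default, not a value taken from this paper's experiments.

```python
import numpy as np

def fixmatch_pseudo_labels(weak_probs, tau=0.95):
    """FixMatch pseudo-label selection (sketch).

    weak_probs: (N, C) softmax outputs on weakly augmented images.
    Returns a boolean mask of retained samples and their hard class
    indices; only masked samples contribute to the unlabeled loss.
    """
    conf = weak_probs.max(axis=1)       # highest class probability per image
    mask = conf >= tau                  # keep only confident predictions
    labels = weak_probs.argmax(axis=1)  # one-hot pseudo-label index
    return mask, labels
```

The strongly augmented versions of the retained images are then trained with an ordinary cross-entropy loss against these hard labels.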
4. FlexMatch: FixMatch uses a predefined constant threshold τ for all classes to select the unlabeled data that contribute to training, thus failing to account for the different learning statuses and learning difficulties of different classes. To address this issue, Curriculum Pseudo Labeling (CPL) is introduced to utilize unlabeled data according to the model's learning status. The core of CPL is to adjust the thresholds for different classes at each time step so that the model is fed the unlabeled data that fits its current learning status.
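The per-class threshold adjustment of CPL can be sketched as below. This is a simplified illustration of the core idea, assuming the simplest normalization variant; FlexMatch additionally handles the warm-up phase and non-linear mappings of the learning status.

```python
import numpy as np

def flexmatch_thresholds(class_counts, tau=0.95):
    """Curriculum Pseudo Labeling (sketch): flexible per-class thresholds.

    class_counts[c] = number of unlabeled samples confidently predicted
    as class c so far, used as a proxy for that class's learning status.
    The best-learned class keeps the full threshold tau; harder classes
    get lower thresholds so more of their samples enter training.
    """
    counts = np.asarray(class_counts, dtype=float)
    beta = counts / counts.max()    # normalized learning status per class
    return beta * tau               # flexible thresholds, one per class
```

A class that has accumulated few confident predictions is considered hard, so its threshold drops and more of its (less confident) pseudo-labels are admitted.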

Results
In this section, we discuss our three main results. First, we present our classifier with a reduced number of parameters optimized for MS data and show the classification results on RGB and MS datasets using supervised learning (SL). Secondly, we present the classification results using our classifier in combination with the SSL methods discussed above. Lastly, we show how the combination of MS data and SSL methods performs on datasets with a drastically decreased number of labeled images.
We use the datasets CIFAR-10 [24] and EuroSAT [10]. While CIFAR-10 is only used as a benchmarking dataset, EuroSAT is our main dataset for learning and testing the discussed strategies and methods. With 27,000 patches, EuroSAT is currently the largest labeled multispectral dataset for image patch classification. Additionally, it also contains the RGB bands, making it a perfect candidate for comparing RGB and MS learning strategies. Each multispectral image in the EuroSAT dataset consists of 13 channels, but only ten are relevant for identifying and monitoring land use classes and are used in our experiments. For the following experiments, we randomly sample 20% and 10% of the labeled data from this dataset as validation and test sets respectively, while the remaining 18,900 labeled images are used as training data in either semi-supervised or fully supervised learning. We make sure that there is no overlap between these datasets.
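The split sizes stated above can be checked with a few lines of arithmetic (an illustrative helper, not part of the paper's pipeline):

```python
def eurosat_split(total=27000, val_frac=0.20, test_frac=0.10):
    """Split sizes used in the experiments: 20% validation, 10% test,
    and the remaining 70% (27,000 * 0.7 = 18,900) for training."""
    val = int(total * val_frac)
    test = int(total * test_frac)
    train = total - val - test
    return train, val, test
```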

Parameter Reduction
The success of deep neural networks like ResNet [25] or Wide ResNet [26], with their hundreds of layers and millions of parameters, also rests on the availability of large datasets like CIFAR-10. In the case of multispectral imagery, where such datasets are lacking, very deep networks would easily overfit due to the extreme number of model parameters. Additionally, applying semi-supervised algorithms with deep CNNs as backbone classifiers can consume significant computational resources, making this a very costly and time-consuming combination of methods. To tackle this problem, we develop our own classifier optimized for semi-supervised learning on multispectral imagery. This classifier is based on the Wide ResNet architecture and adopts parameter-reducing strategies presented in recent work on small and efficient CNNs, such as SqueezeNet [27] and MobileNet [28].
For further modification and evaluation, we choose the following Wide ResNet structures, which maintain competitive accuracy with fewer parameters according to the results in [26]: WRN-40-04, WRN-16-08, WRN-22-08, and WRN-28-10, where the first number denotes the depth and the second the widening factor k.
The structure of each residual block in the Wide ResNet consists of two 3x3 convolutional layers and is hence named B(3,3), where B indicates the building block and (3,3) the list of kernel sizes of the two convolutional layers. To decrease the number of parameters further, we additionally apply the microstructure from SqueezeNet [27] in every building block. Specifically, we replace all 3x3 convolutional layers in each B(3,3) building block with Fire Modules from SqueezeNet. A sketch of the Fire Module is depicted in Figure 1, and a detailed description of all variables used in the following is given in its caption. In each Fire Module, we set s_1x1 = 0.125 · C_In, e_1x1 = 0.75 · C_Out, and e_3x3 = 0.25 · C_Out. The number of input and output channels of each replaced 3x3 convolutional layer in the B(3,3) block is kept the same. The macro network structure of the original Wide ResNet is also preserved. Hence, we call our network Wide ResNet with Fire Modules (WRN+FMs). It closely mimics the macro-architectural design of the Wide ResNet architecture while adapting the micro-architectural elements of SqueezeNet to reduce network parameters. We evaluate the new set of classifiers on two datasets: the RGB dataset CIFAR-10 and the multispectral dataset EuroSAT. In this section, we only use fully supervised learning in order to compare our results with other SL benchmarks. We do not use the heavy data augmentation proposed in the semi-supervised learning algorithms and apply only horizontal flips and random crops. Supervised training of Wide ResNet-28-10 (without FMs) consumes too much training time and computing resources; we therefore show results from the literature [26,29]. Our experimental results are shown in Table 1.
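The parameter saving from replacing a 3x3 convolution with a Fire Module can be worked out directly from the channel splits above (a minimal sketch counting weights only, biases ignored; the function name is illustrative):

```python
def fire_module_params(c_in, c_out):
    """Weight count of a Fire Module replacing one 3x3 conv (sketch).

    Channel splits follow the text: s_1x1 = 0.125*C_In,
    e_1x1 = 0.75*C_Out, e_3x3 = 0.25*C_Out.
    Returns (fire_params, plain_3x3_params) for comparison.
    """
    s1 = int(0.125 * c_in)    # squeeze layer, 1x1 convolutions
    e1 = int(0.75 * c_out)    # expand layer, 1x1 convolutions
    e3 = int(0.25 * c_out)    # expand layer, 3x3 convolutions
    squeeze = c_in * s1 * 1 * 1
    expand = s1 * e1 * 1 * 1 + s1 * e3 * 3 * 3
    plain_3x3 = c_in * c_out * 3 * 3  # the layer being replaced
    return squeeze + expand, plain_3x3
```

For a 128-in, 128-out layer, the Fire Module needs 8,192 weights versus 147,456 for the plain 3x3 convolution, roughly an 18x reduction at this micro level, consistent with the order-of-magnitude shrinkage reported for the full network.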
From Table 1 it can be concluded that applying Fire Modules to the Wide ResNet structure brings benefits as well as some expected downsides. With this parameter reduction strategy, the total number of network parameters can be reduced by up to about 90% of the original network size. As a result, our WRN-28-10+FMs has only 2.42 million parameters and is 15 times smaller than the original WRN-28-10. Nevertheless, it achieves a classification accuracy of 96.19% on the EuroSAT MS dataset, only 0.41% less than the benchmark network SpectrumNet. From the results on EuroSAT in Table 2, we find that WRN-28-10+FMs achieves the best validation accuracy among our four new networks.

Semi-supervised Methods on MS data
We conduct experiments for the four selected SSL methods on the EuroSAT dataset using our classifier WRN-28-10+FMs and exhibit the results in Table 2. For semi-supervised learning, the number of labels for RGB and MS imagery is limited to 165 per class, i.e., the total number of labeled images for training is 1,650. This represents 6% of the entire dataset. The number of unlabeled images is set to 4,000 for both the RGB and MS datasets to create a more realistic setting, as collecting high-dimensional MS images is more expensive and time-consuming. For comparison against supervised learning, we also conduct experiments using four different numbers of labeled images: (i) 5,650 to mimic the semi-supervised setting with the same total number of samples (4,000 unlabeled and 1,650 labeled images); (ii) 1,650 labeled images to match the number of labeled images; (iii) 850 images; and (iv) 18,900 images to test the (unfair) lower and upper limits of supervised learning. Table 2 shows that all four SSL methods help our network achieve competitive classification accuracy, even though only limited labeled data is used. As expected, the supervised approach with the full amount of labeled images performs best, with 96.56%. However, if the total number of labels is reduced to 5,650, the supervised method is outperformed by the semi-supervised method ReMixMatch by 0.69%, although only 165 labeled images are used per class. One reason for this advantage of ReMixMatch lies in the strong data augmentation applied to both labeled and unlabeled images, which improves the performance of consistency regularization and helps the network achieve better robustness to noisy data. In theory, MS images are expected to yield greater classification accuracy than RGB images, given the additional information present in the spectral bands, which increases the separation between classes. Except for MixMatch, all methods meet this expectation and perform better under MS conditions, by 1.37% on average.

Limited number of labeled images
In this section, we drastically decrease the number of labeled images to test the limits of the discussed semi-supervised methods. The number of labeled MS images is decreased to 15, 30, and 85 images per class, which represents only 0.5%, 1%, and 3% of the entire dataset, while the total number of unlabeled images is kept the same at 4,000. This procedure is similar to other benchmarks in the literature [15][16][17][18].
The results in Figure 2 show that the classification performance of the network improves with an increasing number of labeled samples used in training. Among all the SSL methods, ReMixMatch consistently outperforms the others, and FlexMatch follows as second best. The reason for this trend can be summarized as follows: on the one hand, distribution alignment in ReMixMatch not only minimizes the entropy of pseudo-labels for unlabeled data, as all the other SSL methods do, but also maximizes the mutual information between model inputs and outputs to incorporate unlabeled data for better model performance. On the other hand, a rotation loss [30] is directly included in the ReMixMatch loss term. Comparing SSL and SL for the case of 85 images per class drastically shows the power of semi-supervised learning: the SL approach with 850 images only reaches a classification accuracy of 68.65%, while the best SSL method reaches 95.07%.
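The rotation loss mentioned above is a self-supervised auxiliary task: each image is rotated by a random multiple of 90 degrees and the network must predict which rotation was applied. The data preparation for this pretext task can be sketched as follows (an illustrative helper with an assumed channels-last layout, not the authors' code):

```python
import numpy as np

def rotation_pretext_batch(images, rng=None):
    """Prepare a batch for the rotation self-supervision task (sketch).

    Each image is rotated by a random multiple of 90 degrees; the model
    is then trained to predict the rotation class (4-way classification),
    which requires no labels at all.
    images: (N, H, W, C) array, assumed channels-last.
    """
    rng = rng or np.random.default_rng(0)
    ks = rng.integers(0, 4, size=len(images))      # rotation class per image
    rotated = np.stack([np.rot90(im, k, axes=(0, 1))
                        for im, k in zip(images, ks)])
    return rotated, ks
```

Because the rotation targets are free, this auxiliary loss extracts extra training signal from every unlabeled image, which helps explain ReMixMatch's edge in the low-label regime.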

Conclusions and Outlook
By adjusting the macro size of the Wide ResNet architecture and changing the micro-structure according to the SqueezeNet architecture, we obtain a small and efficient network with up to 15 times fewer parameters. We show that this network can compete with other popular networks on RGB datasets and can also be effectively trained on much smaller multispectral datasets. Thanks to the increased computational speed, it can be combined with modern SSL methods for RGB and multispectral datasets. To the best of our knowledge, the combination of SSL methods, compressed CNNs, and multispectral datasets has not been discussed in previous work. This work shows that with 85 images per class, state-of-the-art SSL methods reach similar or even higher accuracies than supervised learning, depending on the augmentation strategies of the supervised approach. When the number of labeled images is decreased to 15 per class, the power of semi-supervised learning becomes even more apparent, with 84.78% compared to 78.33% for SL (1,650 images). Our results show that the newest SSL method in our comparison, ReMixMatch, outperforms the other methods not only on RGB but also on multispectral data. These results show that SSL can be applied to MS data and that expensive labeling can be reduced dramatically. However, more research is needed to increase the number of augmentation strategies for multispectral data. Data augmentation plays a vital role in semi-supervised learning, yet only a few specialized data augmentations are available for multispectral channels compared with RGB channels. In future work, we are interested in investigating data augmentation methods for multispectral imagery according to the characteristics of different channels. We expect that the shown methods can increase the total number of available labeled datasets, which would benefit the whole research community in the field of image classification.

Figure 1 :
Figure 1: Fire Module structure as a replacement for a 3x3 convolutional layer. C_In, C_Out: number of input or output channels of the network block. s_1x1: number of output channels of the Squeeze layer. e_1x1, e_3x3: number of output channels of the 1x1 or 3x3 convolutional layer in the Expand layer, where e_1x1 + e_3x3 = C_Out.

Figure 2 :
Figure 2: Results for the four SSL methods with a limited number of labeled images. For the SSL methods, 4,000 unlabeled images are available in addition to the depicted number of labeled images. For supervised learning, a gray solid/dashed line is shown for the case of the same total number of samples (5,650 images) and the same number of labeled images (1,650 images), respectively.

Table 1 :
Evaluation of different versions of Wide ResNet with and without Fire Modules on different datasets using fully supervised learning. The marked results are taken from the literature.

Table 2 :
Results of different semi-supervised learning methods on the EuroSAT RGB and MS datasets using our WRN-28-10+FMs as classifier. Supervised learning with 850 and 18,900 images is not directly comparable with the SSL methods; these runs show the lower and upper limits of the methods for benchmarking purposes.