Modified Convolutional Neural Networks Architecture for Hyperspectral Image Classification (Extra-Convolutional Neural Networks)

Classification of hyperspectral satellite images (HSI) is an important technology for object detection and cartography. Several problems make classification difficult: the large size of the images, overlap between classes, the small number of training samples, etc. Recently, several convolutional neural networks (CNN-HSI) have been proposed for the classification of hyperspectral images. In this article, an improvement to CNN-HSI is proposed, aiming to reduce the number of erroneous pixels produced during classification (due to the limited number of samples). To this end, an extra-convolution technique (ExCNN) is proposed, in which layers of global convolutions are added to the classified images output by a classical CNN. The addition of 1 to 10 such layers is tested on three real hyperspectral images. The results are compared with similar state-of-the-art methods and show the effectiveness of the proposed method.


INTRODUCTION
Classification of satellite images is a very useful technology for remote sensing, with applications in several areas such as cartography, security, ground monitoring, and natural disasters [1][2][3]. Deep learning is currently the method of choice for classifying many types of images (satellite [4,5], drone, medical [6], facial [7], etc.). The classification of hyperspectral images by convolutional neural networks (CNN) has attracted the attention of researchers thanks to the excellent results obtained in recent years [8][9][10]. Several factors can change the CNN-HSI results: the choice of parameters, the extraction and concatenation method, and the applied architecture [11,12].

Classification in one or more dimensions
The approach in [22] uses a band-by-band extraction methodology: each band is classified in 2D, then the results are merged. Other works [23,24] use a 3D network. The first [23] builds convolutional layers in a low-resolution space and extracts features with spatial and spectral dimensions using an extended 3D kernel; a 3D deconvolution is used in the last layer to enlarge the output to the desired size. The second [24] proposes an unsupervised learning strategy for the spatial and spectral features of the HSI using a convolutional 3D autoencoder.
Finally, different spectral reduction methods have been applied, such as smart feature extraction (SFE-CNN) [25], which performs a probabilistic reduction of the spectral values at each spatial location of the image to obtain a single spatial plane (2D image), or deep-dual extraction (DCNN) [26], which first uses a 1D CNN to extract the spectral information, reduces the resulting features, and then applies a 2D CNN to extract the spatial information before classifying the features.

CNN architecture
LeCun et al. [27] proposed the first CNN architecture, with four phases: two convolutional, two fully connected, ending with a Gaussian filter. AlexNet [28] applied eight phases: five convolutional and three fully connected, where each convolutional phase is composed of three successive parts: convolution, ReLU, and pooling. The architecture of [29] is mainly composed of four phases, the first two convolutional (convolution, ReLU, and pooling) followed by two fully connected (FC) phases, and is adapted specifically for the classification of hyperspectral images. The technical report of [30] described a MATLAB library for CNNs, applying a model consisting of several repeated convolutional phases and a final fully connected phase. The approach in [31] uses a triple architecture: a CNN is built to extract spatial and spectral characteristics in cascade at two scales, from shallow layers to deep layers; multilayer spatial and spectral data are then merged to learn the complementary information between shallow layers, which carry detailed information, and deep layers, which carry semantic information.

Adaptation of parameters
In order to automate all the parameters of the network, [32] considers the adaptive choice of the convolution filter sizes based on adaptively chosen batch sizes. The approach in [33] sets up a regularized CNN, handling optimization and regularization in order to solve the problem of overfitting on limited training samples. Although many approaches in the literature have been tested and filter the HSI effectively in the spatial-spectral context, the question of scale is not really exploited.

Using other types of information
The approach in [34] is an integrated CNN system with an additional hashing function that exploits the semantic information of the HSI. The approach in [35] installs a multi-scale CNN; however, the usual metric learning is implemented, and this similarity leads to an obvious redundancy of the model, which limits the descriptive capacity of the deep metrics.
From the results collected from these different works, we noticed that:
• the application of data reduction may hinder classification accuracy;
• most of the older architectures were designed for simple images (not hyperspectral) where the number of bands is limited to 3, so they cannot produce good results on HSI images, which have more than 100 spectral bands;
• some architectures work well, but some noisy (erroneous) pixels remain;
• the applied CNN architecture has a great influence on the results, since each layer has its role: pooling layers accelerate the convergence of the results, while convolution layers group neighboring pixels, and accuracy increases with the number of convolutional layers.
Based on these observations, we propose in this article a novel CNN classification architecture.
In this article, we present a new ExCNN-HSI architecture, an improvement of a previous CNN-HSI architecture [1]. The method adds global convolutional layers at the end of processing (on the classified image). This addresses the problem of the limited number of samples and corrects unknown or erroneous pixels. It also enables the detection of anomalies.

PROPOSED APPROACH (ExCNN)
In this section we present the steps of the proposed approach. We use the same base architecture as SFE-CNN [25] and DCNN [26]: our network is composed of three convolution + ReLU layers, two subsampling (MaxPooling) layers, two fully connected layers, and one softmax layer. The particularity of this work is the addition of several convolution layers applied to the whole image at the end of the processing chain, after the softmax layer. These additional layers remove the erroneous pixels generated by the limited number of samples (Figure 1).
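As an illustration of how the sub-block sizes evolve through such a chain, the following minimal Python sketch traces the spatial size through a hypothetical sequence of convolution and pooling layers. The filter and pooling sizes shown are illustrative only, not the actual values of Table 1 (which are set randomly in the experiments):

```python
def trace_sizes(h, w, layers):
    """Track the spatial size through the chain: a 'conv' layer with an
    (r, c) filter shrinks a block to (h - r + 1, w - c + 1); a 'pool'
    layer with an (r, c) window divides the size by (r, c).
    The layer sizes passed in are illustrative, not the Table 1 values."""
    for kind, (r, c) in layers:
        if kind == "conv":
            h, w = h - r + 1, w - c + 1
        elif kind == "pool":
            h, w = h // r, w // c
    return h, w

# Hypothetical example: a 28 x 28 sub-block through conv/pool/conv/pool/conv
print(trace_sizes(28, 28, [("conv", (3, 3)), ("pool", (2, 2)),
                           ("conv", (3, 3)), ("pool", (2, 2)),
                           ("conv", (3, 3))]))
```

This kind of trace is useful for checking that a chosen sub-block size survives all three convolutions and both poolings without collapsing to zero.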

Convolution processing
For each layer l, the output of neuron n of convolution layer l is computed by Equation (1). Here f^l denotes the activation function of the layer, y_n^l the output data, and V_n^l the list of all feature maps in layer l − 1. Finally, k_{m,n}^l is the convolution kernel connecting map m in layer l − 1 to map n in layer l, and b_n^l is the bias; biases are simplifying assumptions made by the model to facilitate learning of the target function. The size of the output sub-block y_n^l is (h^{l−1} − r^l + 1) × (w^{l−1} − c^l + 1) pixels, where (r^l × c^l) pixels is the size of the convolution filter k_{m,n}^l and (h^{l−1} × w^{l−1}) pixels is the size of the input sub-block y_m^{l−1}.
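The convolution step of Equation (1) can be sketched in NumPy as follows. This minimal version computes a single output map from a single input map, so the sum over all maps m in V_n^l and the activation f^l are omitted; the function and variable names are purely illustrative:

```python
import numpy as np

def conv_valid(x, k, b):
    """'Valid' 2D convolution of one input map x (h x w) with one kernel
    k (r x c) plus a bias b; the output is (h - r + 1) x (w - c + 1),
    matching the size formula given in the text."""
    h, w = x.shape
    r, c = k.shape
    out = np.empty((h - r + 1, w - c + 1))
    for i in range(h - r + 1):
        for j in range(w - c + 1):
            # Weighted sum of the r x c neighbourhood, plus the bias term
            out[i, j] = np.sum(x[i:i + r, j:j + c] * k) + b
    return out
```

For example, a 5 × 5 input convolved with a 3 × 3 filter yields a 3 × 3 output map, as predicted by (h − r + 1) × (w − c + 1).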

Rectified linear units (ReLU)
For every sub-block, we apply Equation (2), the rectified linear activation f(x) = max(0, x).
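A one-line NumPy sketch of this activation:

```python
import numpy as np

def relu(x):
    # Equation (2): f(x) = max(0, x), applied element-wise to each sub-block
    return np.maximum(0.0, x)
```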

MaxPooling processing
For each subset of data, we compute the maximum of its r × c pixels, multiply it by an adjustable weight, and add a bias term. The result is passed through an activation function to produce an output for the r × c subset. The output matrix z_n^{l−1} is computed by Equation (3).
Let z_n^{l−1} be the result over the r × c pixels of each subset of data. Neuron n of subsampling layer l is computed by Equation (4).
The size of the output feature map is h^l × w^l, with h^l = h^{l−1}/r and w^l = w^{l−1}/c.
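A minimal NumPy sketch of the non-overlapping pooling step; only the max operation is shown here, while the adjustable weight, bias, and activation mentioned above are omitted for brevity:

```python
import numpy as np

def maxpool(x, r, c):
    """Non-overlapping r x c max pooling; the output size is
    (h // r) x (w // c), matching h_l = h_{l-1}/r and w_l = w_{l-1}/c."""
    h, w = x.shape
    out = np.empty((h // r, w // c))
    for i in range(h // r):
        for j in range(w // c):
            # Maximum over one r x c window of the input map
            out[i, j] = x[i * r:(i + 1) * r, j * c:(j + 1) * c].max()
    return out
```

For instance, 2 × 2 pooling halves each spatial dimension, which is what accelerates the convergence of the network noted in the introduction.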

Fully connected
Let L be the output layer and N^L the number of output neurons. The output of neuron n is computed by Equation (5). The parameter b_n^L is the bias associated with neuron n of layer L, and w_{m,n}^L is the weight connecting feature map m of the last convolutional layer to neuron n in the output layer.
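A minimal NumPy sketch of Equation (5), with illustrative names; the feature maps of the last convolutional layer are flattened into a single input vector:

```python
import numpy as np

def fully_connected(y, W, b):
    """Equation (5) sketch: each output neuron n is the weighted sum over
    the flattened feature maps plus its bias, i.e. out = W @ flatten(y) + b.
    W has shape (N_out, N_in) and b has shape (N_out,)."""
    return W @ y.ravel() + b
```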

Softmax
Let y be the vector of inputs and let i index the output units; we apply Equation (6) for i = 1, 2, …, n.
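Equation (6) is the standard softmax, s_i = e^{y_i} / Σ_j e^{y_j}; a numerically stable NumPy sketch:

```python
import numpy as np

def softmax(y):
    # Equation (6): s_i = exp(y_i) / sum_j exp(y_j).
    # Shifting by max(y) avoids overflow without changing the result.
    e = np.exp(y - np.max(y))
    return e / e.sum()
```

The outputs are non-negative and sum to 1, so they can be read as class probabilities for the pixel being classified.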

Last convolution layers
We use the same Equation (1), but instead of applying the convolution to a block of pixels and reducing the size each time, we apply the convolution to the whole image and keep the same size H × W. We tested our approach with up to 10 additional convolution layers, i.e., up to convolution layer number 13 (counting the 3 layers at the beginning).
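A minimal NumPy sketch of one such extra layer. Zero padding is an assumption here (the padding scheme is not specified in the text); it is one simple way to keep the output at the original H × W size:

```python
import numpy as np

def conv_same(img, k):
    """Extra-convolution sketch: convolve the whole H x W classified image
    with an r x c kernel, zero-padding the borders so the output keeps
    the same H x W size."""
    r, c = k.shape
    pr, pc = r // 2, c // 2
    padded = np.pad(img, ((pr, pr), (pc, pc)))  # zero padding (assumed)
    h, w = img.shape
    out = np.empty_like(img, dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + r, j:j + c] * k)
    return out
```

With a smoothing kernel, an isolated erroneous pixel is pulled toward the values of its neighbours, which illustrates how these global layers can remove classification noise.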

Classification
We create a temporary vector containing the Euclidean distances between the values known in the database and the value to classify. Let x be the active pixel and y_c the values provided by the base. The simplified Euclidean distance of Equation (7) measures the distance between the two points.
We take the smallest distance obtained and assign, as shown in Equation (8), the corresponding class C_x to the pixel x.
The execution time of this step depends on the size of the recognition base. In addition, our network does not require prior knowledge of the number of classes in the image; it discovers them as processing proceeds. With this function we reach the end of our network; the resulting image is a well-classified 2D image.
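A minimal sketch of Equations (7) and (8); the `base` dictionary mapping class labels to reference vectors is an illustrative structure, not the actual recognition base format:

```python
import numpy as np

def classify_pixel(x, base):
    """Equations (7)-(8) sketch: assign pixel x the class of the nearest
    reference vector in the recognition base, using Euclidean distance.
    `base` maps each class label to its reference vector."""
    dists = {c: np.linalg.norm(x - v) for c, v in base.items()}  # Eq. (7)
    return min(dists, key=dists.get)                             # Eq. (8)
```

Since the classes are simply the keys of the base, the method discovers them during processing, as noted above; the cost of this step grows linearly with the size of the base.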

RESULTS AND DISCUSSIONS
In order to analyze water pollution in the Gabes area of Tunisia, we mainly tested our approach on hyperspectral imagery acquired by the Hyperion EO-1 sensor. We then validated our contributions on other public scenes downloaded from the website [36]: Salinas in California and Indian Pines in Indiana.

Datasets
The three datasets used for the tests are: • Indian Pines: composed of 220 spectral bands, reduced to 200 by removing the water absorption regions. It was captured by AVIRIS in northwestern Indiana, at wavelengths between 0.4 and 2.5 × 10⁻⁶ m. Its size is 145 × 145 pixels and it contains 16 classes, namely: alfalfa, two types of corn, three different grass pastures, oat hay, several types of soybeans, and wood.

Validation and discussion
Before starting, we randomly set the sub-block and filter sizes, which are given in Table 1. For the first tests, we applied a state-of-the-art method; it gave good results but left several erroneous pixels. We then added extra convolution layers at the end of the chain, applied over the entire image and not just over a block; we denote by N + X the network with X extra layers. We present the results class by class and, at the bottom of Tables 2-4, the Cohen's Kappa (K) (the effectiveness of the classification with respect to a random assignment of values), the overall accuracy (OA) (the number of correctly classified pixels divided by the total number of reference pixels), the average accuracy (AA) (the sum of the per-class accuracies divided by the number of classes) for the datasets, and the execution time (seconds).
From the experiments, we notice that the proposed method is efficient for the different numbers of extra layers. However, in some cases, fewer layers (e.g., 3 extra layers) produce better results than the addition of 10 layers.

VARIATION IN THE NUMBER OF SAMPLES
In this section, we vary the number of samples of each class for the proposed method (ExCNN, X = 10) (Tables 5-7).
From the experiments, we notice that the proposed method is effective even when the number of samples is very low, and very effective when more than 50% of the samples are used.
We also notice that the proposed method is efficient across the different test datasets and outperforms the state-of-the-art methods in accuracy, and sometimes in computation time.

CONCLUSIONS
In recent years, satellite imagery has become very important. It is used for security, environmental protection, and many object detection applications, thanks to the large amount of information it provides. However, the limited availability of samples may reduce the quality of classification in hyperspectral image processing. CNNs have recently proven to be an efficient solution for image classification. In this paper, we propose a semi-supervised classification framework that extends the CNN classification structure to correct erroneous or unknown pixels. The method consists of increasing the number of convolutional layers at the end of the network, organizing the classification over 10 levels (a layer is added and tested each time). To validate the proposed designs, we tested them on three different hyperspectral image datasets: SalinasA, Gulf of Gabes, and Indian Pines. We observed an increase in accuracy with each added layer, which validates our approach.

ACKNOWLEDGEMENT
This work was supported by the Ministry of Higher Education and Scientific Research of Tunisia.