Studying the Effect of Activation Function on Classification Accuracy Using Deep Artificial Neural Networks


Introduction
Automation in remote sensing systems is a challenging field of research because of the need to reduce cost. The artificial neural network (ANN) approach has returned to the research focus since it was extended to deep learning (DL). Although the approach is no longer novel, some of its behavior is still enigmatic, especially in remote sensing classification [1]. DL techniques have proven useful in landcover classification. To advance the state of the art, we therefore need to learn not only the weights between neurons or the network structure itself but also the activation functions. The current deep learning literature largely focuses on improving architectures and adding regularization to the training process. Remote sensing, particularly satellite imagery, is perhaps the only cost-effective technology able to provide data at a global scale. Within ten years, commercial services are expected to provide submeter-resolution images everywhere at a fraction of current costs [2]. Few studies discuss the activation function (AF), which transfers the signal from the input units to the next hidden units. Recently, with the rise of DL, researchers have started to investigate the effect of the AF on accuracy, e.g., Xie [3], who studied transfer learning from deep features for remote sensing and poverty mapping. That study did not deal with multispectral (MS) images but used only a single band, taking light as an indicator of rich or poor areas. The present research deals with low-level deep feature classification using MS images. The standard sigmoid reaches an approximation power comparable to or better than that of classes of more established functions investigated in approximation theory [4,5]. Jordan presented the logistic function, which is a natural representation of the posterior probability in a binary classification problem [5,6]. Özkan and Erbek [7] made a similar study but used shallow learning and examined only three AFs (linear, sigmoid and tanh). Their study focused on hard classification, not on fuzzy classification as in this research.
The novelty of this research is that it reaches a clear decision on the selection of the AF in remote sensing classification. It shines a spotlight inside the black box of the deep artificial neural network (DANN), because the choice of AF severely affects the performance and accuracy obtained in classification.

Abstract
Artificial neural networks (ANN) are widely used in remote sensing classification. Optimizing an ANN is still an enigmatic field of research, especially in remote sensing. This research is an attempt to find the ANN activation function best suited to classification (landcover mapping). The first step is to prepare the reference map; an activation function is then selected and the fuzzified ANN output is obtained. The last step is to compare the output with the reference for accuracy assessment. The result of the research is the identification of the activation function best suited to remote sensing classification. Real multispectral Landsat 7 satellite images were classified using ANN, and the accuracy of the classification was assessed with different activation functions. The sigmoid function was found to be the best activation function.

Deep Artificial Neural Network (DANN)
ANNs are a computational approach that simulates the microstructure of a biological nervous system, based on signal transfer. Basically, all ANNs have a similar topological structure. Some neurons interface with the real world to receive its input, and other neurons provide the world with the network's output. All the remaining neurons are hidden from view. There are therefore three types of neurons: input, hidden and output. Note that input neurons have a single input and multiple outputs, while hidden neurons have multiple inputs and multiple outputs. In contrast to input neurons, output neurons have multiple inputs and a single output. Figure 1 shows a typical multi-layer perceptron (MLP) architecture with shallow learning. In a DANN the number of hidden layers is larger, so the weight updates are very slow and move gradually and carefully toward the optimized solution, as shown in Figure 2.

The MLP is widely used in DANN, especially for supervised classification in remote sensing. Both ANN and DANN must be optimized in architecture and in performance; the first concerns the number of hidden layers and the second concerns the selection of the activation function (AF) in addition to obtaining the correct weights.
The back propagation neural network (BPNN) algorithm is a generalized least squares algorithm that adjusts the connection weights between units to minimize the mean square error between the network output and the target output; its architecture is of MLP form. The target output is known from the training data: it is the class values (see Results). First, the data entered to the input units are multiplied by the connection weights and summed to give the net input to each unit in the hidden layer, as shown in Figure 1 and given by:

$$\mathrm{net}_s = \sum_i W_{is} X_i \tag{1}$$

where $X_i$ is the pixel vector of the input image at the $i$-th input-layer unit and $W_{is}$ is the matrix of connection weights between the $i$-th input-layer unit and the $s$-th hidden-layer unit. Each unit in the $s$-th hidden layer calculates a weighted sum of its inputs and passes the sum via an activation function to the units in the $j$-th output layer through the weight vector $W_{sj}$. A range of activation functions can transfer the data from a hidden-layer unit to an output-layer unit, including the linear, hyperbolic tangent and sigmoid functions. Although the choice among these functions may lead to differences in classification accuracy, the output of a hidden unit can be defined generally as:

$$O_s = F(\lambda\,\mathrm{net}_s) \tag{2}$$

where $F$ is the activation function (explained later), $O_s$ is the output from the $s$-th hidden-layer unit and $\lambda$ is a gain parameter, which controls the connection weights between the hidden-layer unit and the output-layer unit. Outputs from the hidden units are multiplied by the connection weights and summed to produce the output of the $j$-th unit in the output layer:

$$O_j = \sum_s W_{sj} O_s \tag{3}$$

where $O_j$ is the network output for the $j$-th output unit (i.e., the land cover class) and $W_{sj}$ is the weight of the connection between the $s$-th hidden-layer unit and the $j$-th output-layer unit.
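The forward pass described above can be sketched in plain Python as follows. This is a minimal illustration for a single hidden layer, not the ADIPRS implementation: the function names, the toy weights and the use of the sigmoid as F are assumptions made for the example.

```python
import math


def sigmoid(x):
    """Logistic function, used here as one possible choice of F (an assumption)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_is, w_sj, gain=1.0, activation=sigmoid):
    """Forward pass of a one-hidden-layer MLP.

    x     : band values of one pixel, one entry per input unit i
    w_is  : w_is[i][s] is the weight from input unit i to hidden unit s
    w_sj  : w_sj[s][j] is the weight from hidden unit s to output unit j
    gain  : the gain parameter (lambda in the text)
    """
    n_hidden = len(w_sj)
    n_out = len(w_sj[0])
    # Net input to each hidden unit: weighted sum of the inputs.
    net = [sum(x[i] * w_is[i][s] for i in range(len(x))) for s in range(n_hidden)]
    # Hidden output: activation applied to the gain-scaled net input.
    o_hidden = [activation(gain * n) for n in net]
    # Network output: weighted sum of the hidden outputs.
    o_out = [sum(o_hidden[s] * w_sj[s][j] for s in range(n_hidden)) for j in range(n_out)]
    return net, o_hidden, o_out


# Toy example: 2 inputs, 3 hidden units, 2 outputs (hypothetical values).
x = [0.5, -1.0]
w_is = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.6]]
w_sj = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
net, o_hidden, o_out = forward(x, w_is, w_sj)
```

Passing a different callable as `activation` is all that is needed to swap the AF while keeping the architecture fixed, which mirrors the experimental design of this study.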
An error function (E), determined from a sample of target outputs (reference training data) and network outputs, is minimized iteratively. The process continues until E converges to the minimum allowed value and the adjusted weights are obtained. E is given by:

$$E = \frac{1}{2}\sum_{j=1}^{c} (T_j - O_j)^2 \tag{4}$$

where $T_j$ is the target output vector, $O_j$ is the network output vector and $c$ is the number of classes. The target vector is determined from the known class allocations of the training pixels, which are coded in binary form. The collection of known class allocations of all pixels forms the target vector. After the error of the network is computed, it is compared with the limiting error $E_L$ of the network. If $E < E_L$, the network training is stopped; otherwise E is back propagated to the units in the hidden and input layers. The number of iterations may vary from one dataset to another and is generally determined by trial and error. The process of back propagation and weight adjustment is as follows. First, the error vector at each unit of the output layer is computed; then the error vector $E_s$ for each unit in the hidden layer is computed. Thereafter, the net error $E_{sj}$ in the connection weights between the output layer and the hidden layer is computed, and the error in the connection weights between the hidden layer and the input layer is computed as:

$$E_{is} = F_m X_i E_s \tag{8}$$

where $F_m$ is the momentum factor, which controls the momentum of the connections between the hidden-layer unit and the input-layer unit.
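The binary coding of the targets and the stopping rule can be illustrated with a short sketch. The helper names are hypothetical, and the factor of one half in the error is the conventional squared-error form, assumed here rather than confirmed by the source.

```python
def one_hot(class_index, n_classes):
    """Binary-coded target vector: 1 for the known class, 0 elsewhere."""
    return [1.0 if j == class_index else 0.0 for j in range(n_classes)]


def squared_error(target, output):
    """Half the sum of squared differences between target and network output.

    The 1/2 factor is an assumption (conventional form).
    """
    return 0.5 * sum((t - o) ** 2 for t, o in zip(target, output))


def should_stop(error, limiting_error):
    """Training stops once the error falls below the limiting error E_L."""
    return error < limiting_error


# Hypothetical pixel of class 1 out of 3 classes, with a fuzzy network output.
target = one_hot(1, 3)
error = squared_error(target, [0.1, 0.8, 0.1])
```

In a full training loop, `should_stop` would be checked after each pass, and the error would be back propagated whenever it returns False.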
The weights between the output layer and the hidden layer are updated as:

$$(W_{sj})_{new} = (W_{sj})_{old} + E_{sj} \tag{9}$$

and the weights between the input layer and the hidden layer are updated as:

$$(W_{is})_{new} = (W_{is})_{old} + E_{is} \tag{10}$$

The gain parameter $\lambda$ is also updated, at a rate set by the learning rate $L_R$, which controls the time of the learning process.
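As a rough sketch of the update step, the code below applies Eq. (8) and the additive update of Eq. (9). It takes the hidden-layer error vector as an input, because its formula is not reproduced in this text, and it assumes that the input-to-hidden weights follow the same additive form as Eq. (9). All function names and numbers are hypothetical.

```python
def input_weight_errors(x, e_s, momentum):
    """Eq. (8): E_is = F_m * X_i * E_s for every input unit i and hidden unit s."""
    return [[momentum * xi * es for es in e_s] for xi in x]


def apply_update(weights, errors):
    """Additive update in the form of Eq. (9): W_new = W_old + E."""
    return [
        [w + e for w, e in zip(w_row, e_row)]
        for w_row, e_row in zip(weights, errors)
    ]


# Hypothetical values: 2 inputs, 2 hidden units, momentum factor 0.5.
x = [1.0, 2.0]
e_s = [0.5, -0.5]          # hidden-layer error vector, assumed already computed
w_is_old = [[1.0, 1.0], [1.0, 1.0]]
e_is = input_weight_errors(x, e_s, momentum=0.5)
w_is_new = apply_update(w_is_old, e_is)
```

The same `apply_update` helper would serve for the hidden-to-output weights once the errors E_sj are available.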

Methodology
The research focuses on optimizing the AF for remote sensing image classification, so a Landsat 8 MS image is used. The reference of the study area is prepared pixel by pixel in order to achieve the best training and testing performance. ENVI 5.3 software is used for the necessary tasks, such as spatial subsetting, preparing regions of interest (ROIs), georeferencing and classification. Another, self-developed software package, the advanced digital image processor for remote sensing (ADIPRS), is used to carry out the DANN with different AFs. This software was developed by Serwa [8] and has been modified to cover the DL task.

System overview
In this research a Landsat 8 MS image is used, with its spectral bands at 30 m resolution. The reference of the study area is prepared pixel by pixel to achieve the best training and testing performance. ENVI 5.3 software is used for the necessary tasks, such as spatial subsetting, preparing regions of interest (ROIs), georeferencing and classification. Another, self-developed software package, the advanced digital image processor for remote sensing (ADIPRS), is used to carry out the DANN with different AFs. This software was developed by Serwa [8,9] and has been modified to cover the DL task. The system overview is shown as a block diagram in Figure 3. ADIPRS is used for classification because no other remote sensing software allows the AF to be selected.

Research data
The data used in this research cover a part of Greater Cairo (Giza Governorate) in Egypt and comprise (1) eight bands of a Landsat 8 satellite image and (2) AutoCAD maps at a scale of 1:5000. The reference map was produced mainly from the AutoCAD maps, with a SPOT 5 satellite image of the study area as a secondary source, in addition to site visits with a Garmin GPSMAP 62s navigator (3 m accuracy). Figure 4 shows the study area in the ENVI environment before spatial and spectral subsetting. The study area is represented by 500 × 500 pixels (225 km²) covering the southern part of Greater Cairo (Giza Governorate). Figure 5 shows the reference, which was built pixel by pixel to obtain the best training and testing performance.

Tools
Two software packages are used: ENVI 5.3 for spatial subsetting and all other necessary tasks (reference preparation, all corrections), and ADIPRS for applying the DANN and for accuracy assessment. Figure 6 shows a flowchart of the DANN algorithm that ADIPRS applies according to the equations explained in section 2. Figure 7 shows the interface of the DANN module of the ADIPRS software, which was developed to achieve the research objective by selecting different AFs and applying the DANN algorithm. The optimal architecture is indicated: the number of input neurons equals the number of image bands, while the number of output neurons equals the number of landcover classes. The hidden structure consists of 4 hidden layers with 5, 8, 7 and 7 neurons.
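The architecture described above fixes the shape of every weight matrix. The sketch below derives those shapes; it assumes all eight Landsat 8 bands are used as inputs, and the number of classes (5) is a hypothetical placeholder, since the class count is not stated here.

```python
def layer_sizes(n_bands, n_classes, hidden=(5, 8, 7, 7)):
    """Input units = image bands, output units = landcover classes."""
    return [n_bands, *hidden, n_classes]


def weight_shapes(sizes):
    """One (rows, columns) weight matrix between each pair of adjacent layers."""
    return list(zip(sizes[:-1], sizes[1:]))


sizes = layer_sizes(n_bands=8, n_classes=5)   # 5 classes is an assumption
shapes = weight_shapes(sizes)
n_weights = sum(rows * cols for rows, cols in shapes)
```

Because only the AF changes between experiments, these shapes, and therefore the number of trainable connection weights, stay the same in every run.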
The AF must be selected before ADIPRS is run, and the network architecture is fixed in order to study the effect of varying the AF on the accuracy. After the convergence condition is reached, the classification report can be printed, as shown in Figure 8. The report includes the user's accuracy, producer's accuracy, overall accuracy and kappa value (K-hat statistic); the user's and producer's accuracies are computed for each class. A series of twelve AFs is studied: linear, step, piecewise linear, sigmoid, complementary log-log, bipolar, bipolar sigmoid, tanh, hard tanh, absolute, rectifier and smooth rectifier. Figure 9 gives the definition of each AF as an equation and a graph.
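For orientation, the twelve AFs can be written compactly in Python. The definitions below are the conventional textbook forms and are an assumption: the exact forms used in ADIPRS are those of Figure 9, which may differ in thresholds or scaling (for example in the piecewise linear breakpoints).

```python
import math


def linear(x):
    return x


def step(x):
    return 1.0 if x >= 0 else 0.0


def piecewise_linear(x):
    # Assumed breakpoints at -0.5 and +0.5.
    return min(1.0, max(0.0, x + 0.5))


def sigmoid(x):
    # Written in two branches to avoid overflow for large negative x.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def complementary_log_log(x):
    return 1.0 - math.exp(-math.exp(x))


def bipolar(x):
    return 1.0 if x >= 0 else -1.0


def bipolar_sigmoid(x):
    # Equal to (1 - exp(-x)) / (1 + exp(-x)).
    return 2.0 * sigmoid(x) - 1.0


def tanh(x):
    return math.tanh(x)


def hard_tanh(x):
    return max(-1.0, min(1.0, x))


def absolute(x):
    return abs(x)


def rectifier(x):
    return max(0.0, x)


def smooth_rectifier(x):
    # log(1 + exp(x)), rearranged for numerical stability.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


ACTIVATIONS = {
    "linear": linear, "step": step, "piecewise linear": piecewise_linear,
    "sigmoid": sigmoid, "complementary log-log": complementary_log_log,
    "bipolar": bipolar, "bipolar sigmoid": bipolar_sigmoid, "tanh": tanh,
    "hard tanh": hard_tanh, "absolute": absolute, "rectifier": rectifier,
    "smooth rectifier": smooth_rectifier,
}
```

Keeping the functions in a dictionary makes it straightforward to loop over all twelve with an otherwise identical network.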
These twelve AFs were selected because they are the ones most often mentioned in the literature. Considerable effort was made to develop and test each AF, especially in testing its effect on classification accuracy. The AF affects classification results in remote sensing, but studies rarely examine its effect on accuracy using real data. Performance accuracy (signal to noise ratio) was used to assess the accuracy.

Results
Each AF is selected in turn for classification, and the accuracy assessment is then carried out. Overall accuracy is chosen as the accuracy measure. The network architecture is fixed so that only the effect of the selected AF on classification accuracy is examined. Table 1 gives the final results numerically as overall accuracy and number of iterations. Figure 10 shows the summarized final results graphically.
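The accuracy measures named in the classification report can all be derived from a confusion matrix. The sketch below uses the standard definitions and assumes the convention that rows are classified labels and columns are reference labels; the 2 × 2 matrix is a made-up example, not data from Table 1.

```python
def accuracy_report(matrix):
    """matrix[i][j] = pixels classified as class i whose reference class is j.

    Assumes every class has at least one classified and one reference pixel.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    diagonal = [matrix[k][k] for k in range(n)]
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    # Chance agreement term used by the kappa (K-hat) statistic.
    chance = sum(r * c for r, c in zip(row_totals, col_totals))
    return {
        "overall": sum(diagonal) / total,
        "user": [diagonal[k] / row_totals[k] for k in range(n)],
        "producer": [diagonal[k] / col_totals[k] for k in range(n)],
        "kappa": (total * sum(diagonal) - chance) / (total ** 2 - chance),
    }


# Hypothetical two-class confusion matrix.
report = accuracy_report([[45, 5], [10, 40]])
```

Overall accuracy is the single number compared across AFs in this study; the per-class user's and producer's accuracies show whether a given AF favors particular classes.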
The number of iterations matters because it reflects the computational cost, even though cost is outside the scope of this research. Solution conditions such as the number of iterations cannot be neglected, but they can serve as a secondary criterion when accuracies are equal. The maximum accuracy is achieved with the bipolar sigmoid (95%) and sigmoid (94.98%) AFs. A hypothesis test found no significant difference between the two results, so they can be regarded as equal. A statistical correlation test between overall accuracy and number of iterations gave a correlation of -0.6148, and the two are considered uncorrelated. The sigmoid AF gives the best accuracy in the best number of iterations, while the absolute AF gives the worst accuracy in the maximum number of iterations. Some AFs, such as step, piecewise linear and complementary log-log, give unexpected results, since they are not familiar in remote sensing: their accuracies lie in the range 93% to 94%. The rest of the tested AFs can be considered moderate in accuracy and cost, but we cannot recommend them. The results show that the accuracies of most of the tested classes behave in the same way as the overall accuracy, which means that the AF affects the general solution rather than a specific class. The behaviour of the results can therefore be generalized.
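The two statistical checks can be illustrated as follows. The text does not state which hypothesis test was used or the test sample size, so the sketch assumes a two-proportion z-test and, hypothetically, that all 500 × 500 = 250,000 pixels were evaluated for each AF. The correlation helper is the ordinary Pearson coefficient; it is shown without the Table 1 values, which are not reproduced here.

```python
import math


def two_proportion_z(p1, p2, n1, n2):
    """z statistic for the difference between two proportions (pooled variance)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / se


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Bipolar sigmoid (95%) vs. sigmoid (94.98%); sample sizes are assumptions.
z = two_proportion_z(0.95, 0.9498, 250_000, 250_000)
not_significant = abs(z) < 1.96   # two-sided test at the 5% level
```

Under these assumptions the difference between the two sigmoid variants is far too small to be significant, which is consistent with treating them as equally accurate.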
In fact, the correlation study was not strictly necessary, because the first dimension (the AF) is a variable whose order is not fixed; it was carried out only because some researchers customarily perform it. Finally, the accuracy varied from 68.45% to 95% depending on the selected AF, which is a meaningful change.

Conclusion
Few research works have focused on optimizing the AF for DANN. Remote sensing data behave differently from a random signal. The AF must be selected carefully because it affects the classification results. One can conclude that the AF affects classification accuracy in remote sensing to the extent that an improvement of about 27 percentage points (from 68.45% to 95%) can be achieved by selecting a good AF. The classification time is also affected by the selection of the AF. Both the sigmoid and bipolar sigmoid AFs are recommended for the classification of remote sensing landcover features.