Segmentation model based on convolutional neural networks for extracting vegetation from Gaofen-2 images

Abstract. Convolutional neural network (CNN) models achieve state-of-the-art performance for natural image semantic segmentation. An approach for extracting vegetation from Gaofen-2 (GF-2) remote sensing imagery based on the CNN model is presented. We constructed a convolutional encoder neural network (CENN) consisting of two layers. The first layer has two sets of convolutional kernels for extracting the features of farmland and woodland, respectively. The second layer consists of two encoders that use nonlinear functions to encode the learned features and map the encoding results to the corresponding category number. In the training stage, samples of farmland, woodland, and other lands are categorically used to train the CENN. After training, the CENN can accurately extract farmland and woodland from GF-2 imagery. The CENN was trained on 36 GF-2 images and tested on three other GF-2 images. We compared the proposed method to a deep belief network, a fully convolutional network, and a DeepLab model using the same images. The experiments demonstrate that the proposed approach improves upon the accuracy of existing approaches. The average precision, recall, and kappa coefficient of the proposed approach were 0.91, 0.87, and 0.86, respectively. These results indicate that the proposed approach effectively extracts vegetation from GF-2 imagery.


Introduction
Image segmentation is the precondition and foundation for the extraction and target identification of high-resolution remote sensing images. 1 In a high-resolution image, spectral confusion is more serious, differentiation is substantially reduced, and the accuracy of the spectral statistics-based segmentation method is reduced. 2,3 The object-oriented image segmentation method can overcome the influence of "salt and pepper" noise and improve accuracy by using object structure and spectral signature. Because this approach must adjust the segmentation scale to obtain an acceptable image segmentation result and it is difficult to determine a suitable segmentation scale, development of the object-oriented segmentation method has been slow. 4,5 With the development of machine learning technology, researchers began to apply algorithms, such as neural networks (NNs) 6,7 and support vector machines (SVMs), 8,9 to the segmentation of high-resolution images. 3,10,11 Studies have revealed that image segmentation based on machine learning algorithms can obtain better results than traditional statistical and object-oriented methods. 12,13 Both SVMs and NNs are shallow learning algorithms, [14][15][16] owing to their limited network structures. Shallow learning algorithms have difficulty expressing complex functions effectively. So, when sample size and diversity are increased, shallow learning models cannot adapt to the increasing complexity. 17,184][25] The CNN is a deep learning network composed of multiple, nonlinear mapping layers with strong learning abilities that obtain excellent results in image segmentation. 26,27 Traditional deep learning methods include deep convolutional neural networks (DCNN) 18,28 and deep deconvolutional neural networks (DeCNN). 29 Since then, many methods of remote sensing image segmentation based on CNN have been developed. 30,31][34][35][36][37][38][39]
A fully convolutional network (FCN) is a deep learning network for image segmentation originally proposed in 2015. 391][42] Researchers have since developed a series of segmentation models based on convolution, including SegNet, 43 UNet, 44 DeepLab, 45 multiscale FCN, 46 and ReSeg. 47 Each of these segmentation models has its own strengths and works well with certain selected types of images.
Segmentation models such as FCN are effective because the multilayer structure of these models adeptly handles the rich detail features of images. However, in regions of vegetation in Gaofen-2 (GF-2) imagery, one pixel usually contains several different types of plants or crops. Thus, the information between pixels does not reflect this variety and the image texture is smoother. Although a single tree is larger than most crop plants, a typically sized tree occupies two or three pixels in the GF-2 imagery and smaller trees may only occupy one pixel or less, continuing the problem of little information difference between pixels. Therefore, because the detail features of a vegetation region may be lacking, the effect of deep layer CNN may also be weak and may even pull in greater noise, resulting in poor segmentation accuracy. To obtain accurate segmentation results from GF-2 images, the size of the coverage area of individual plants relative to the spatial resolution of the GF-2 imagery must be considered when designing a CNN.
Based on the above analysis, we constructed convolutional encoder neural networks (CENN) to accurately distinguish and extract farmland and woodland from GF-2 imagery. Because the CENN considers the previously described complexities presented by the features of vegetation regions in the GF-2 imagery, such as the smaller coverage area size of a single plant, fewer detail features, and the continuous appearance of vegetation, our approach achieves improved accuracy compared to the existing approaches.
The following summarizes the proposed method for the segmentation of vegetation in GF-2 images and its evaluation.
(1) A network structure consisting of a convolution layer and an encoder layer has been designed based on the features specific to vegetation regions in the GF-2 imagery. The convolution layer is used to extract the features of farmland and woodland, respectively, and the encoder layer is used to encode the learned features and map the encoded results to the corresponding category number.
(2) In the model training stage, a categorical training method was adopted. To train the CENN and obtain a set of convolutional kernels with which to identify farmland, we use an image sample of farmland as the positive sample and employ other images as the negative sample. Then, to train the CENN to identify woodland, we repeat the previous process, using a sample of woodland as the positive sample and others as the negative sample. After training, the CENN can accurately extract farmland and woodland from GF-2 images.
(3) Finally, the trained CENN was used to segment GF-2 images. The accuracy of the segmentation results was then evaluated in a comparative experiment against three existing segmentation models.

Methods
In accordance with established conventions for image segmentation using CNNs, we divide the work into two stages, training and classification, as depicted in Fig. 1. The upper part of Fig. 1 shows the training stage. Together, GF-2 imagery and corresponding pixel-by-pixel artificial classification labels are input to the CENN as training samples. The error between the predicted classification labels and the artificial classification labels is calculated and backpropagated through the network using the chain rule. Then the parameters of the CENN are updated using the gradient descent method. The above process iterates until the error is less than a predetermined threshold. The lower part of Fig. 1 details the classification stage, in which the trained CENN accurately extracts vegetation from input GF-2 imagery.

Network Architecture
The CENN model is divided into four functional groups of components: input, convolution layer, encoder layer, and output, as shown in Fig. 2. In the training stage, the inputs are original images and artificial classification labels. In the classification stage, the inputs are the original GF-2 images, the output is a single-band file, and the content of each pixel in the output is the category number of the corresponding original image pixel. The CENN denotes farmland by category number 100, woodland by category number 150, and other land use by category number 200.

Convolution layer
In Sec. 1, we analyzed the characteristics of vegetation areas in GF-2 imagery. Based on this analysis, we constructed a convolution structure in the convolution layer known as "width convolution." With this convolution structure, we can extract more features to improve segmentation accuracy.
The convolutional kernels of the CENN are in r × c × h form, where r denotes the width of the convolutional kernel, c denotes the height of the convolutional kernel, and h denotes the number of channels of the convolutional kernel. In this paper, h is set to 3 because only three channels of GF-2 imagery are employed. As shown in items (b) and (c) of Fig. 2, we adopted four types of convolutional kernel in the CENN, which are referred to as A-type, B-type, C-type, and D-type, respectively; r and c of A-type are set to 1, r and c of B-type are set to 3, r and c of C-type are set to 5, and r and c of D-type are set to 7. This structure earns the name "width convolution" because all the convolutional kernels use the GF-2 images as direct inputs and process the images parallel to each other.
Fig. 1 The training and classification stages of the proposed approach.
The primary function of the A-type convolutional kernels is to extract the color features of each pixel. The B-type convolutional kernels are divided into two groups. The first group of convolutional kernels must be trained. They are used to extract the texture features of the central pixels and the surrounding eight pixels. The second group consists of eight convolutional kernels. The values of these convolutional kernels are fixed, and no further training is required. These eight convolutional kernels are used to calculate the absolute value of the color difference between the central pixel and eight adjacent pixels surrounding the central pixel. The roles of the C-type and D-type convolutional kernels are similar to that of the B-type convolutional kernels, and these two types are also composed of two groups. However, the C-type and D-type convolutional kernels have a wider range with which to exploit the features between center pixels and their surrounding pixels.
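The fixed second group of B-type kernels can be pictured as follows. This is a minimal sketch in plain Python for a single band, assuming each fixed kernel holds +1 at the center and -1 at one neighbor, with the absolute value taken afterward. The function names and the restriction to interior pixels are illustrative assumptions, not details given in the paper.

```python
# Offsets of the eight neighbours of a pixel in a 3 x 3 window.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                     (0, -1),           (0, 1),
                     (1, -1),  (1, 0),  (1, 1)]


def make_difference_kernel(dr, dc):
    """Return a fixed 3 x 3 kernel: +1 at the centre, -1 at offset (dr, dc)."""
    kernel = [[0, 0, 0] for _ in range(3)]
    kernel[1][1] = 1
    kernel[1 + dr][1 + dc] = -1
    return kernel


def apply_kernel_at(image, kernel, row, col):
    """Correlate a 3 x 3 kernel with the window centred on (row, col)."""
    total = 0
    for i in range(3):
        for j in range(3):
            total += kernel[i][j] * image[row + i - 1][col + j - 1]
    return total


def neighbour_differences(image, row, col):
    """Absolute differences between an interior pixel and its 8 neighbours."""
    return [abs(apply_kernel_at(image, make_difference_kernel(dr, dc), row, col))
            for dr, dc in NEIGHBOUR_OFFSETS]


# Toy single-band patch (values are made up for illustration).
band = [[10, 12, 11],
        [13, 20, 9],
        [10, 25, 20]]
diffs = neighbour_differences(band, 1, 1)
```

Because these kernels have no free parameters, they would simply be excluded from the gradient update during training.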
As shown in Fig. 2, all the convolutional kernels were divided into two groups, a farmland group (b) and a woodland group (c). Both groups contain A-type, B-type, C-type, and D-type convolutional kernels. The purpose of this two-group design is to enable the CENN to better express the characteristics of vegetation, and thus improve their capacity for distinction of different vegetation types.
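The parallel arrangement of the four kernel sizes can be sketched as below. This is an illustration only: it works on a single band rather than the three-channel kernels described in the text, uses averaging weights as placeholders for trained values, and assumes zero padding so that every branch returns one value per pixel. All function names are hypothetical.

```python
def correlate_same(image, kernel):
    """2-D cross-correlation with zero padding; output size equals input size."""
    rows, cols = len(image), len(image[0])
    k = len(kernel)          # kernel is k x k with k odd
    half = k // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - half, c + j - half
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += kernel[i][j] * image[rr][cc]
            out[r][c] = acc
    return out


def mean_kernel(size):
    """Placeholder weights: a size x size averaging kernel (not trained values)."""
    weight = 1.0 / (size * size)
    return [[weight] * size for _ in range(size)]


def width_convolution(image, kernel_sizes=(1, 3, 5, 7)):
    """Apply every kernel directly to the same input and stack the responses."""
    branches = [correlate_same(image, mean_kernel(s)) for s in kernel_sizes]
    rows, cols = len(image), len(image[0])
    # One feature vector per pixel, one entry per branch (A, B, C, D).
    return [[[branch[r][c] for branch in branches] for c in range(cols)]
            for r in range(rows)]


# Toy 8 x 8 band with a linear ramp of values.
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
features = width_convolution(image)
```

The point of the sketch is the data flow: no branch consumes the output of another, in contrast to a deep stack in which each layer reads the previous layer's feature maps.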

Encoder layer
As shown in Fig. 2, the encoder layer contains two sublayers of encoders to better simulate the nonlinear relationship between features and outputs. There are two encoders in the first encoder sublayer, the encoder-farmland (d) and the encoder-woodland (e), as observed in Fig. 2. The role of each encoder is to regress and simulate the characteristics of its respective land use. The second encoder sublayer has a single encoder, the encoder-adjust (f), as depicted in Fig. 2. The role of the encoder-adjust (f) is to adjust the calculations of the upper sublayer so that farmland, woodland, and other lands are distinguished from each other. Then, the flag-function (g) maps the encoded result of encoder-adjust (f) to the appropriate category number.

Label sample
The dataset comprised a total of 39 GF-2 remote sensing images (size 7300 × 6900 pixels) of Shandong Province, China. Of these images, 21 were captured on February 17, 2016, and 18 were captured on May 12, 2017. The spatial resolution was 1 m for the panchromatic band and 4 m for the multispectral bands. Environment for visualizing images (ENVI) software was used for preprocessing tasks, such as fusion and color stretching. We then assigned bands 3, 2, and 1 to the R, G, and B channels to improve visual effects.
Artificial label samples are an important training foundation. Because the CENN use pixels as the primary learning object, they must be accurately labeled. We used ENVI software for labeling and designed a preprocessor to build the mask. The process of artificial labeling is as follows:
(1) Use the region-of-interest (RoI) tool in the ENVI software to select farmland regions, woodland regions, and other regions in the image. Then, the map locations of the pixels in each region are output to different files according to category.
(2) The preprocessor adds a band to the image file as the mask band. The spatial resolution, size, and other parameters of the mask band are the same as those of the original image. Then, the category number of each pixel is written to the mask band according to the map location of the pixel previously output.
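The mask-building step can be sketched as follows. This is not the authors' preprocessor: it assumes the exported RoI locations have already been converted to (row, column) indices, represents bands as nested lists, and uses a hypothetical fill value of 0 for any pixel that has not been labeled.

```python
# Category numbers used by the CENN output (from the Network Architecture section).
CATEGORY_NUMBERS = {"farmland": 100, "woodland": 150, "other": 200}


def build_mask_band(n_rows, n_cols, roi_pixels, fill_value=0):
    """Create a mask band and write each labelled pixel's category number.

    roi_pixels maps a category name to a list of (row, col) locations,
    standing in for the per-category files exported from the RoI tool.
    """
    mask = [[fill_value] * n_cols for _ in range(n_rows)]
    for category, locations in roi_pixels.items():
        number = CATEGORY_NUMBERS[category]
        for row, col in locations:
            mask[row][col] = number
    return mask


def append_mask_band(image_bands, mask):
    """Return a new band list with the mask added as the last band."""
    return image_bands + [mask]


# Toy 2 x 2 image with three bands and made-up RoI locations.
rgb = [[[0, 0], [0, 0]] for _ in range(3)]
roi = {"farmland": [(0, 0), (0, 1)], "woodland": [(1, 1)], "other": [(1, 0)]}
mask = build_mask_band(2, 2, roi)
labelled = append_mask_band(rgb, mask)
```

Keeping the mask the same size as the image preserves the row-column correspondence between each pixel and its category number.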
We manually labeled all images at the pixel level. Thus, for each image, there exists a 7300 × 6900 label map, with a row-column indexed pixel-class correspondence. We used 36 images for training and the remaining three images for testing. Figure 3 shows an example of one image-label pair.

Model training
Images from two different time periods were selected as training data. We selected images from different periods to increase the anti-interference abilities of the CENN, mitigating complications, such as the change of seasons, and thus enhancing applicability. The training stage proceeded through the following steps:
(1) Image-label pairs were input into the CENN as training samples.
(2) The cross-entropy loss was calculated and backpropagated.
Training obtained farmland group convolutional kernels and woodland group convolutional kernels. Each group enhanced the farmland or woodland features, whereas features of all other types were suppressed as much as possible. For example, in the woodland group, woodland features were enhanced while features of all other types were suppressed.
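The loss in step (2) can be illustrated per pixel as follows. The text names only the cross-entropy loss; turning class scores into probabilities with a softmax over three classes, and averaging over pixels, are assumptions made for this sketch, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pixel_cross_entropy(scores, true_class):
    """Cross-entropy for one pixel: minus the log-probability of the true class."""
    return -math.log(softmax(scores)[true_class])


def mean_cross_entropy(score_map, label_map):
    """Average the per-pixel loss over a list of pixels."""
    losses = [pixel_cross_entropy(s, y) for s, y in zip(score_map, label_map)]
    return sum(losses) / len(losses)


# Two made-up pixels; classes are indexed 0 (farmland), 1 (woodland), 2 (other).
scores = [[2.0, 0.5, 0.1], [0.2, 0.2, 0.2]]
labels = [0, 2]
loss = mean_cross_entropy(scores, labels)
```

A pixel whose scores are all equal contributes log 3 to the loss, which is the value expected from a three-class model that has learned nothing.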
In our training, the stochastic gradient descent (SGD) method with momentum 41,48 was used for parameter updates. The update is expressed as
$$W^{(n+1)} = W^{(n)} - \Delta W^{(n+1)}, \tag{1}$$
where $W^{(n)}$ denotes the old parameters, $W^{(n+1)}$ denotes the new parameters, and $\Delta W^{(n+1)}$ is the increment for the current iteration. The iteration increment, which is a combination of old parameters, gradient, and historical increment, is calculated as shown below:
$$\Delta W^{(n+1)} = \vartheta \left( d_w \, W^{(n)} + \frac{\partial J(W)}{\partial W^{(n)}} \right) + m \, \Delta W^{(n)}, \tag{2}$$
where $J(W)$ is the loss function, $\vartheta$ is the learning rate for step length control, $d_w$ denotes the weight decay, and $m$ denotes the momentum.
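A single update of this kind can be sketched in plain Python. The sketch assumes the increment takes the common form learning rate × (weight decay × old parameters + gradient) + momentum × previous increment, which is then subtracted from the old parameters. The hyperparameter defaults are placeholders, since the paper does not report the values used, and the function name is hypothetical.

```python
def sgd_momentum_step(weights, grads, prev_increment,
                      learning_rate=0.01, momentum=0.9, weight_decay=0.0005):
    """One update: increment = lr * (decay * w + grad) + momentum * previous increment."""
    new_increment = [learning_rate * (weight_decay * w + g) + momentum * d
                     for w, g, d in zip(weights, grads, prev_increment)]
    # The new parameters are the old parameters minus the increment.
    new_weights = [w - d for w, d in zip(weights, new_increment)]
    return new_weights, new_increment


# Toy problem: minimise J(w) = 0.5 * w**2, whose gradient is w itself.
w, inc = [1.0], [0.0]
for _ in range(5):
    grad = list(w)
    w, inc = sgd_momentum_step(w, grad, inc, learning_rate=0.1,
                               momentum=0.9, weight_decay=0.0)
```

The momentum term reuses the previous increment, so consecutive steps in the same direction grow larger, while the weight-decay term pulls parameters toward zero.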

Segment Using the Trained Network
After successful training, the CENN can be used to segment the input imagery pixel by pixel. According to our design, the output is written into a new band. The benefit of saving the segmentation result to a new band is that the original bands are left unmodified.
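The output convention can be sketched as below, assuming the network's per-pixel decision is available as a class index (0, 1, or 2) and that bands are held as nested lists. Both assumptions, and the function names, are illustrative only. The category numbers are those given in the Network Architecture section.

```python
CATEGORY_NUMBERS = (100, 150, 200)  # farmland, woodland, other land use


def to_category_band(class_indices):
    """Map a 2-D grid of class indices to the category numbers."""
    return [[CATEGORY_NUMBERS[i] for i in row] for row in class_indices]


def with_result_band(bands, class_indices):
    """Return a new band list; the input list and its bands are left untouched."""
    return list(bands) + [to_category_band(class_indices)]


# Toy single-band 2 x 2 image and made-up predictions.
bands = [[[1, 2], [3, 4]]]
predicted = [[0, 1], [2, 0]]
segmented = with_result_band(bands, predicted)
```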

Experiments
We designed a set of test experiments and comparative experiments to verify the feasibility of the proposed CENN method. The proposed approach was implemented using Python 2.7 on a Linux Ubuntu 16.04 operating system using an NVIDIA GeForce Titan X Graphics device with 12 GB graphic memory.
The data and classification criteria used were described in Sec. 2.2.1.

Learning Ability Indicators
The primary functions of the CENN are feature extraction and encoding. We used the degree of concentration of the feature values as an index of feature extraction capability, and the separation between farmland, woodland, and other lands in the encoded results as an index of encoding capacity.

Comparison Model
We chose the deep belief network (DBN) model, the FCN model, and the DeepLab model as the comparison models. A comparative experiment was conducted using methods established in published literature.

Deep belief network
The paper of Dawei et al. 2 presented a method of pixel-by-pixel classification for high-resolution images using DBN. Their method calculated the texture features of an image through non-subsampled contourlet transform and used the DBN to classify high-resolution remote sensing images based on spectral-texture features. The training process included two subprocesses: pretraining and fine-tuning. Pretraining was performed in an unsupervised manner. A greedy algorithm was used to perform layer-by-layer optimization during training, and the parameters of each restricted Boltzmann machine (RBM) were adjusted individually. After one layer was trained, its output was used as the input to train the RBM of the next layer. After pretraining was completed, the last-level backpropagation network was trained in a supervised learning manner. The error was propagated backward through the layers and the weights of the entire DBN were fine-tuned.

Fully convolutional network
For the FCN model, we directly employed the FCN-8s model proposed by Long et al. 40 The architecture of the model was derived from the VGG-16 network. After the upsampling operation, the final prediction was fused from the output of three branches: the primary network, the pool4 layer, and the pool3 layer. In the training phase, the input data and the training parameters for FCN-8s in the comparative experiment were the same as those used to train the proposed model. The testing stage also used the same classification parameters as applied in the proposed approach.

DeepLab
For the DeepLab model, we directly employed the DeepLab v3 model proposed by Chen et al. 45 DeepLab was also developed based on the VGG network. Unlike the FCN model, to ensure that the output size would not be too small without excessive padding, DeepLab changed the stride of the pool4 and pool5 layers of the VGG network from the original 2 to 1, plus 1 padding.
To compensate for the influence of the stride change on the receptive field, DeepLab used a convolution method called "atrous convolution" to ensure that the receptive field after pooling remains unchanged and the output is more refined. Finally, DeepLab incorporated a fully connected conditional random field (CRF) model to refine the segmentation boundary.
Fig. 4 Learning result of A-type convolutional kernels.

Learning ability of the CENN
Figure 4 graphs the distribution of the feature values learned from the farmland samples, woodland samples, and other samples using A-type convolutional kernels. Figure 4 reveals that after the convolution operation, the feature values of farmland and woodland are concentrated in two regions while those of the other land use type are scattered. This is mainly because the different seasons in which the data were collected result in substantial differences in color values. The feature values of woodland are less concentrated than those of farmland, mainly because the seasonal color change of woodland is more dramatic than that of farmland. Figure 5 shows the feature values learned from the farmland samples, woodland samples, and other samples using B-type convolutional kernels. As shown in Fig. 5, because the B-type convolutional kernels are primarily used to learn the color difference between adjacent pixels, the features of farmland samples are more concentrated, and the dispersion of the woodland samples is substantially larger. The results reflect the smooth texture of farmland and the rough texture of woodland.
Additionally, for farmland samples, the features extracted by the C-type and D-type convolutional kernels are both less concentrated than those extracted by the B-type convolutional kernels. This is because as the field of view expands, more pixels of other types are introduced at the boundary. For woodland samples, the C-type convolutional kernels achieve the best concentration of features. Figure 6 shows the learning results of C-type convolutional kernels.
Based on the learning results, we observe that although multispace convolutional kernels are more suitable for the extraction of farmland and woodland features than deep convolution, it remains necessary to combine multiple features to accurately determine the category to which each pixel belongs.
Figure 7 shows the encoding result of the first encoder sublayer, and Fig. 8 shows the encoding result adjusted by the second encoder sublayer. As demonstrated by Fig. 8, the adjusted encoding result could already be used to segment the image.

Experiment result comparison
In the comparative experiment, we applied our trained model to three GF-2 images for segmentation. All images were equally sized at 7300 × 6900 pixels. These images were only used for testing and were not involved in training. Figure 9 illustrates the results obtained from the comparison methods and the proposed method.
Table 1 lists the confusion matrix C of our classification results. From the table, we can see that our approach achieves higher classification performance. In the above example, our recall for farmland and woodland is 0.855. The average proportions of "farmland" wrongly classified as "woodland" and "others" are 0.053 and 0.033, respectively. The average proportions of "woodland" wrongly classified as "farmland" and "others" are 0.047 and 0.053, respectively.
We employ precision, recall, and the kappa coefficient as indicators to evaluate our approach. These indexes are calculated from the confusion matrix C. Precision denotes the average proportion of pixels correctly classified to one class from the total retrieved pixels. Precision is calculated as follows:
$$\mathrm{precision} = \frac{1}{n} \sum_{i=1}^{n} \frac{C_{ii}}{\sum_{j=1}^{n} C_{ij}}, \tag{3}$$
where $n$ is the number of categories, $C_{ii}$ denotes the number of pixels of category $i$ that are correctly classified, and $C_{ij}$ denotes the number of pixels of category $j$ that are misclassified into category $i$.
Recall represents the average proportion of pixels that are correctly classified in relation to the actual total pixels of a given class. Recall is computed as follows:
$$\mathrm{recall} = \frac{1}{n} \sum_{i=1}^{n} \frac{C_{ii}}{\sum_{j=1}^{n} C_{ji}}. \tag{4}$$
The kappa coefficient measures the consistency of the predicted classes with the artificial labels. The indicator values are listed in Table 2.
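The three indicators can be computed from a confusion matrix as sketched below. The sketch follows the convention stated above, in which entry (i, j) counts pixels of actual category j assigned to category i, so rows are predictions and columns are labels. The kappa expression is the standard Cohen's kappa, which is an assumption because the paper does not print its formula. The example matrix is invented for illustration and is not Table 1.

```python
def precision_recall_kappa(c):
    """c[i][j] = number of pixels of actual category j assigned to category i."""
    n = len(c)
    total = sum(sum(row) for row in c)
    row_sums = [sum(c[i]) for i in range(n)]                       # predicted as i
    col_sums = [sum(c[i][j] for i in range(n)) for j in range(n)]  # actually j
    # Class-averaged precision and recall (a class with no pixels would divide by zero).
    precision = sum(c[i][i] / row_sums[i] for i in range(n)) / n
    recall = sum(c[i][i] / col_sums[i] for i in range(n)) / n
    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = sum(c[i][i] for i in range(n)) / total
    expected = sum(row_sums[i] * col_sums[i] for i in range(n)) / (total * total)
    kappa = (observed - expected) / (1 - expected)
    return precision, recall, kappa


# Invented 3 x 3 matrix (farmland, woodland, other), for illustration only.
example = [[90, 5, 5],
           [4, 80, 6],
           [6, 15, 89]]
precision, recall, kappa = precision_recall_kappa(example)
```

Averaging per-class ratios gives every category equal weight regardless of how many pixels it covers, which matters when the "other" class dominates an image.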

Discussion
This paper presents a classification approach that extracts vegetation from GF-2 images using the CENN. Compared with the three typical deep learning-based approaches, the proposed method substantially improves classification accuracy. In the following sections, we discuss the reasons for the improvement and the benefits of using the proposed approach to classify land use.

Deep Belief Network versus the Proposed Approach
In the method of pixel-by-pixel segmentation based on DBN, the texture features of an image are first calculated. The obtained two-dimensional texture features are converted into one-dimensional vectors. Then, the three RGB channel values are added to the vectors, and they are merged into a single vector. Finally, the DBN is constructed using each component value of the vector as an independent input to classify the pixels. Although texture features are completely different from spectral values, this method uses them in combination, resulting in logical confusion. If only the value of the texture feature is used, it cannot indicate the spatial relations represented by texture, which results in information loss in the texture extraction. Therefore, like traditional spectral-based methods, this method utilizes only the spectral characteristics of the pixel itself and effectively ignores the spatial relationship between pixels, which makes it prone to incorrect classifications.
Unlike the DBN method, the CENN model makes full use of the advantages of convolution in information aggregation and uses A-type convolutional kernels to extract the common features of the original spectral value. Three kinds of convolutional kernels (B-type, C-type, and D-type) are used to extract textural features in three sizes. The CENN use the two-stage encoder to simulate nonlinear equations and encode the features. These strategies effectively improve the classification accuracy.
In GF-2 images, it is easy to confuse tall crops and small, dense trees owing to the relatively small differences in texture and spectrum. As observed in Fig. 10, the proposed method is substantially more effective than the DBN method.

Fully Convolutional Network versus the Proposed Approach
The advantage of the FCN model is that it maximizes the information available in the rich details of GF-2 imagery using deep convolution. This advantage is obvious when extracting a target object that covers many pixels. However, if the target object covers only a few pixels, or a single pixel contains several target objects, the effect of deep layer convolution drastically weakens, and even greater noise may be pulled in, resulting in poor segmentation. When FCN is used to extract farmland and woodland from GF-2 images, although the farmland or woodland may cover many pixels, the advantages of FCN cannot be exploited because of the small differences between pixels.
Unlike the FCN model, which expands the view through deep convolution, the proposed CENN expand the view by using three convolutional kernels of B-type, C-type, and D-type.
By combining three convolutional kernels, the maximum observed area size is about 49 m², which is fully capable of covering most of the canopy. The CENN not only make full use of the pixels' own features but also fully exploit the spatial features between pixels, adeptly accounting for the continuous appearances of crops and trees. Additionally, the CENN also fully consider the natural characteristics of farmland and woodland, providing further advantages in identifying corner pixels.
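As a quick check of the 49 m² figure, and assuming the fused imagery has the 1 m pixel size of the panchromatic band (the fused resolution is not stated explicitly in the text), the ground footprint of each square kernel is its side length in pixels times the pixel size, squared:

```latex
\begin{align*}
A_{\text{B-type}} &= (3 \times 1\,\mathrm{m})^2 = 9\,\mathrm{m}^2, \\
A_{\text{C-type}} &= (5 \times 1\,\mathrm{m})^2 = 25\,\mathrm{m}^2, \\
A_{\text{D-type}} &= (7 \times 1\,\mathrm{m})^2 = 49\,\mathrm{m}^2.
\end{align*}
```

The largest footprint therefore comes from the D-type kernel alone, since all kernels read the input image directly rather than being stacked.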
Figure 11 demonstrates that the FCN model and the CENN have the same segmentation accuracy when identifying the interior regions of farmland and woodland. However, when identifying the pixels in the corner region, the FCN model makes many errors, whereas the CENN makes almost none.

DeepLab versus the Proposed Approach
Compared to the FCN model, there are two important improvements in the DeepLab model: (1) the deconvolution part is improved and (2) the network uses the fully connected CRF model to refine the segmentation boundary. According to existing literature, when identifying large objects such as buildings, the segmentation accuracy of DeepLab at the boundary is better than that of the FCN. This is because DeepLab makes better use of the details of the image and the spatial correlation of pixels. However, when DeepLab was used to identify woodland and farmland, because the information between pixels does not noticeably change and the image texture is smoother, the advantages of DeepLab are lost.
As explained in Sec. 4.2, the CENN not only make full use of the pixels' own features but also fully exploit the spatial features between adjacent pixels and account for the continuous appearance of crops and trees. Therefore, the CENN can effectively avoid the defects of the DeepLab approach and ensure the segmentation accuracy. As shown in Fig. 12, the segmentation accuracy of the CENN at the boundary and corners is much better than that of the DeepLab approach.

Benefits of Using the Proposed Approach to Classify Land Use
Accurate land use classification results play an important role in scientific research and agricultural production. The use of remote sensing data to produce land use classification results is becoming more common. In GF-2 images, it is difficult to accurately distinguish between woodland and farmland using traditional methods. Because our approach handles this problem better, it has played an important role in agricultural surveys and improved their efficiency. Our approach has been applied in the Meteorology Bureau of Shandong Province, China.

Conclusions
This paper presents a CENN model that extracts vegetation in farmlands and woodlands from GF-2 images. Compared to the traditional DBN model, FCN model, and DeepLab model, the proposed CENN fully consider the characteristics of farmland and woodland in the GF-2 images. According to the characteristics of the model, categorical training was implemented to enable the model to effectively discriminate farmland and woodland from other land types and extract vegetation from GF-2 images with high accuracy. The paper also provides a software-based method of using RoI for sample annotation, which can reduce the manual workload and enhance the efficiency of marking.
The main limitation of our approach is that the accuracy of the extraction results is greatly reduced when the CENN is applied to submeter level images, which limits the application scope of the model. In future work, we will try to use multilayer convolutions to further enhance its applicability.
Shujing Wan is currently an engineer at the College of Network Information Center of Qufu Normal University. Her main research areas are remote sensing and digital image processing. Currently, she is mainly engaged in the research of information construction and remote sensing technology in agriculture and the environment.
Yingjuan Han is a senior engineer at the Ningxia Institute of Meteorology. Her main research area is remote sensing of ecology and agriculture, with a focus on applications in arid and semiarid areas. Currently, she is carrying out remote sensing extraction of crop planting information and monitoring of ecological environment elements by remote sensing.
Jing Wang is a senior engineer at the Ningxia Meteorological Science Institute. Her main research area is remote sensing in agriculture. Currently, she is mainly engaged in the research of fruit tree classification and planting area extraction by remote sensing technology.
Gang Wang is currently a junior studying at the College of Information Science and Engineering of Shandong Agricultural University. His main research areas are remote sensing and geographic information systems. Currently, he is mainly engaged in the research of remote sensing technology in classification and LUCC.

Fig. 9 Segmentation results on GF-2 images: (a) original images, (b) ground truth, (c) our results corresponding to the images in (a), (d) results of DBN, (e) results of FCN, and (f) results of DeepLab.

Table 1 Confusion matrix C of our approach for Fig. 9.

Table 2 Comparison between approaches using DBN, FCN, DeepLab, and CENN.
Zhang et al.: Segmentation model based on convolutional neural networks for extracting vegetation. . .