Road extraction from high resolution remote sensing image via a deep residual and pyramid pooling network

Road extraction from high resolution remote sensing images is of great importance in a variety of applications. Recently, many deep convolutional neural networks have been proposed for the road extraction task. However, the existing approaches lack a suitable strategy for utilizing multiple views of road features, and therefore fail to extract roads with smooth appearance and accurate boundaries under complex scenes. To address this problem, the authors propose a novel deep residual and pyramid pooling network (DRPPNet) for extracting road regions from high resolution remote sensing images. DRPPNet consists of three parts: a deep residual network (DResNet), a pyramid pooling module (PPM) and a deep decoder (DD). Specifically, DResNet uses several residual blocks to extract deep road features from the input images, which enhances the learning ability of DRPPNet and avoids gradient vanishing. Then, the PPM fuses road features from multiple views, addressing the disadvantage of single-view features. Finally, the DD recovers the feature maps to the input size. Extensive experiments on two challenging road datasets demonstrate that the proposed method greatly outperforms the state-of-the-art methods on the road extraction task.

image. Their method achieved great road extraction performance. Following this work, [21,22] assumed that the relationships between multiple ground objects can improve road extraction performance, and thus proposed a three-channel CNN that considers the relationships among roads, buildings and background. Furthermore, [23] proposed a junction-aware water flow model to extract road regions from high resolution remote sensing images, and [24] introduced road skeletons as powerful information to enhance the linear structure of road segments. These methods achieve satisfactory performance and demonstrate the effectiveness of CNNs in road extraction. However, they are limited to shallow networks, and there is still large room for improving road extraction performance.
In recent works, deep convolutional neural networks (DCNNs) have been proposed to further improve road extraction performance. An early deep convolutional neural network is Vgg19 [25], a larger network with the ability to learn the appearance of more numerous and complicated objects. The philosophy behind deep convolutional neural networks is that deeper architectures promote hierarchical feature representations of visual data [26], which in turn improves extraction performance. Furthermore, [27] combined objects obtained from segmentation of the source high resolution remote sensing image for automatic road extraction. [28] combined a conditional random field (CRF) and a deep convolutional neural network to extract road regions at the object level. [29] used adaptive graph cuts with multiple features, such as spectral, spatial, and gradient features, for road extraction. These methods have favorable anti-noise properties and are widely applicable in road extraction. However, they can easily confuse road features with adjacent objects that have road-like shapes; therefore, the road extraction results are unsatisfactory when the remote sensing image is extremely complex. More recently, the residual network has become one of the most important networks in deep learning and has been applied to the road extraction task. The core idea of the residual network is to add shortcut connections that bypass two or more stacked convolution layers by performing identity mapping. [30] proposed the deep residual learning framework to facilitate training. [31] proposed Unet, which replaces the summation-based skip connections of fully convolutional networks with concatenation, fusing feature maps of low-level detail information and high-level semantic information for semantic segmentation. Following this work, [32] combined the strengths of residual learning and Unet for the road extraction task.
It is well known that accurate road extraction relies on enriched and abstract road features. The authors find that the major issue of current DCNN-based road extraction methods is the lack of a suitable strategy for utilizing multiple views of road features. In reality, roads are usually covered by many kinds of ground objects, such as vehicles, trees, and shadows, which causes huge difficulties in extracting road features such as shape and texture. In this situation, existing DCNN-based road extraction methods fail to obtain satisfactory performance. In contrast, multiple views of road features provide road structure information in more detail, which helps the model extract the road network from high resolution remote sensing images, especially for roads covered by other objects.
To this end, the authors propose a novel end-to-end deep residual and pyramid pooling network (DRPPNet) in this paper, which extracts deep road features from multiple views for the road extraction task. Specifically, DRPPNet is composed of a deep residual network (DResNet), a pyramid pooling module (PPM) and a deep decoder (DD). The DResNet is built from several residual blocks and aims to extract deep road features from high resolution remote sensing images. Then, the PPM receives the feature maps of DResNet and fuses road features from different views using four maxpooling operations. Finally, the DD recovers the feature maps to the same size as the input image. By embedding the PPM, DRPPNet gains a strong ability to obtain multiple views of road features, which is important for achieving smooth and coherent road extraction results. Moreover, since residual units are used to extract road features, DRPPNet effectively avoids gradient vanishing and obtains strong learning ability. The experiments verifying the performance of the proposed method are conducted on two challenging road datasets, Cheng-Roads and Mnih-Roads. The contribution lies in the following three aspects: 1. By combining residual units and pyramid pooling, the authors propose a deep residual and pyramid pooling network for the road extraction task, which achieves more consistent road extraction results under complex scenes. 2. Benefiting from the PPM, DRPPNet obtains multiple views of road features, which help DRPPNet extract road networks covered by other objects. 3. DRPPNet is evaluated on two challenging road datasets and obtains better performance than other state-of-the-art road extraction methods.
The rest of this paper is organized as follows. Section 2 introduces the background and related works. Section 3 presents the technical details. Section 4 presents the experiments and evaluation. Section 5 concludes the paper.

BACKGROUND AND RELATED WORKS
Road extraction from remote sensing images has been a hot research topic in recent years. In this section, the authors discuss influential works related to their method.
Early road extraction methods mainly depend on pixel-level or superpixel-level classification. [33] performed road extraction in two steps: first, a support vector machine (SVM) was employed to classify the image into different categories; then, the road region was segmented into homogeneous objects using a region growing technique. [34] first segmented the input image using traditional k-means clustering, and then used a fuzzy logic classifier to automatically extract the road region; to some extent, this method can distinguish road areas from parking areas. [35] presented an automatic road extraction method from remote sensing images using a locally excitatory globally inhibitory oscillator network. Then, [36] proposed a multistage framework for road extraction from remote sensing images based on a probabilistic SVM and salient features. [37] proposed a convolutional neural network-based algorithm to learn features for road extraction.
Many researchers have provided various strategies to describe spatial features [38][39][40][41]. [42] proposed a recurrent neural network (RNN) based on a conditional random field and a convolutional neural network. The RNN is an end-to-end network and achieves great success in learning high-level features; however, its fixed-size receptive field limits the performance in obtaining fine segmentation maps. [43] first proposed a deconvolution network that was demonstrated to be effective in recovering detailed structures; its core is to restore the feature maps to the same resolution as the input. After that, [44] proposed SegNet, an encoder-decoder architecture in which the decoder enlarges feature maps according to pooling indices from the corresponding encoder layer. Recently, a patch-based deep neural network method was proposed by [19] to extract road networks from high resolution remote sensing images. [45] proposed a high order CRF model to extract road networks, in which the road prior is obtained by high order cliques connecting sets of superpixels along straight line segments.

FIGURE 1  Overview of the proposed DRPPNet. Given a remote sensing image, the authors first use DResNet to extract abstract features from the input image; then the PPM is applied to further extract road features from different views. Finally, the representation is fed into the DD to obtain the final pixelwise prediction

METHODOLOGY
In this study, the authors design a deep residual and pyramid pooling network (DRPPNet) to better model road features from different views. DRPPNet explicitly explores road features from different views via the pyramid pooling module. In the following, the authors first describe the basic structure of the proposed DRPPNet; then, the learning algorithm is introduced to show how to train DRPPNet in an end-to-end manner.

Deep residual and pyramid pooling network
DRPPNet is an encoder-decoder network, and Figure 1 shows its architecture. As seen from Figure 1, the input to DRPPNet is the RGB channels of a high resolution remote sensing image, and the output is the road map of the corresponding image. DRPPNet is composed of three modules: the deep residual network (DResNet), the pyramid pooling module (PPM) and the deep decoder (DD). In the following, the authors describe the design in detail.

Deep residual module
The DResNet module aims to obtain general features about roads in the remote sensing image. The core of DResNet is the residual unit, which assumes that multiple nonlinear layers can asymptotically approximate a residual function. The residual unit can be written as

y_l = h(x_l) + F(x_l, W_l),    x_{l+1} = f(y_l),

where x_l and x_{l+1} are the input and output of the l-th residual unit, F(x_l, W_l) is the residual function, f(y_l) is the activation function, and h(x_l) is an identity mapping function, a typical one being h(x_l) = x_l. The addition layers of the residual unit are constructed as identity mappings. In this way, DResNet should have a training error no greater than its shallower counterpart. In addition, with the help of residual units, when the identity mappings are optimal, the solvers can simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings. Table 1 shows the structure of DResNet, which is inspired by the Vgg networks [25]. Specifically, for Conv1 the kernel size is 7 × 7, the stride is 1 and the padding size is 4. To keep the output features the same size as the input, the convolutional result is cropped to 128 × 128; for example, letting m be the feature map after the convolutional operation, the output feature map is obtained by m = m(0 : 128, 0 : 128). There are four levels of residual units in the proposed DResNet, and the convolutional layers mostly have 3 × 3 filters. In each residual unit, instead of using a pooling operation to downsample the feature maps, a stride of 2 is applied to the first convolutional layer. The structure of DResNet follows two simple design rules: (1) for the same output feature size, the layers have the same number of filters; and (2) if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer. Benefiting from DResNet, the proposed DRPPNet has two great advantages in extracting roads from high resolution remote sensing images. First, since residual units are introduced, DRPPNet is easy to train.
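The residual computation described above can be illustrated with a minimal sketch. Plain matrix multiplications stand in for the stacked 3 × 3 convolutions of the residual branch; the function and variable names are illustrative and not taken from the paper's implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_unit(x, weights):
    """One residual unit: y_l = h(x_l) + F(x_l, W_l), x_{l+1} = f(y_l).

    h is the identity mapping; F is a small stack of nonlinear layers
    (here dense layers standing in for the 3x3 convolutions)."""
    F = x
    for W in weights:
        F = relu(F @ W)          # residual branch F(x_l, W_l)
    y = x + F                    # identity shortcut h(x_l) = x_l
    return relu(y)               # activation f(y_l)

# With zero weights the residual branch vanishes, so the unit reduces
# to an identity mapping for non-negative inputs -- the property that
# lets solvers drive weights toward zero to approach identity.
x = np.array([[1.0, 2.0, 3.0]])
W_zero = [np.zeros((3, 3)), np.zeros((3, 3))]
out = residual_unit(x, W_zero)
```

This also shows why the training error cannot exceed that of a shallower counterpart: the extra unit can always learn to do nothing.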
Second, the skip connections within a residual unit and between the low and high levels of the network facilitate information propagation without degradation, which makes it possible to design a neural network with far fewer parameters that nevertheless achieves comparable or even better performance on road extraction from remote sensing images. The experimental results show that satisfactory performance can be obtained by the proposed method.

Pyramid pooling module
Designing an effective method for extracting roads is difficult due to the complex morphology of road features in terms of shape and scale in remote sensing images. Therefore, a pyramid pooling module (PPM) is introduced, which encodes spatial context at multiple views, thereby making full use of feature extraction in complex backgrounds. The overall architecture of the PPM is shown in Figure 2.
The PPM fuses road features from different views using four maxpooling operations. The sizes of the four maxpooling operations range from 1 × 1 to 6 × 6, which aims to obtain road features from different views. The large maxpooling operation acts as global pooling, extracting road features at a coarse level; by contrast, the small maxpooling operations are used to obtain fine road features. The four maxpooling operations are arranged in a pyramid and separate the input feature maps into different regions; as a result, the outputs of the different maxpooling operations in the PPM have different sizes. To keep the weight of the general features, a 1 × 1 convolution layer is added after each maxpooling operation to reduce the dimension of the feature map: the channel dimension is reduced to 1/N of the input, where N is the number of pyramid levels. Next, an upsampling technique, such as bilinear interpolation, is applied to enlarge the feature maps to the same size as the original input features. Finally, the authors use a concatenation operation to fuse all feature maps from the different maxpooling operations. In this way, the output of the PPM collects road features at a high semantic level from different views.
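The data flow of the PPM can be sketched as follows. This is a minimal NumPy illustration, not the paper's TensorFlow implementation: channel slicing stands in for the learned 1 × 1 convolution, and nearest-neighbour upsampling stands in for bilinear interpolation.

```python
import numpy as np

def adaptive_max_pool(fmap, bins):
    """Max-pool a (H, W, C) feature map into a (bins, bins, C) grid."""
    H, W, C = fmap.shape
    out = np.empty((bins, bins, C))
    for i in range(bins):
        for j in range(bins):
            h0, h1 = i * H // bins, (i + 1) * H // bins
            w0, w1 = j * W // bins, (j + 1) * W // bins
            out[i, j] = fmap[h0:h1, w0:w1].max(axis=(0, 1))
    return out

def upsample_nearest(fmap, H, W):
    """Nearest-neighbour upsampling back to (H, W)."""
    h, w, _ = fmap.shape
    rows = np.arange(H) * h // H
    cols = np.arange(W) * w // W
    return fmap[rows][:, cols]

def pyramid_pooling(fmap, levels=(1, 2, 3, 6)):
    """Pool at each pyramid level, reduce channels to C // len(levels)
    (standing in for the 1x1 convolution), upsample to the input size,
    and concatenate every branch with the input feature map."""
    H, W, C = fmap.shape
    k = C // len(levels)
    branches = [fmap]
    for b in levels:
        pooled = adaptive_max_pool(fmap, b)[..., :k]  # channel reduction
        branches.append(upsample_nearest(pooled, H, W))
    return np.concatenate(branches, axis=-1)

feat = np.random.rand(12, 12, 8)
out = pyramid_pooling(feat)   # four extra branches of 2 channels each
```

The concatenated output keeps the original features alongside one coarse-to-fine context branch per pyramid level.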
It is noted that the structure of the PPM is not fixed; it can change with the size of the feature maps produced by DResNet. In addition, the strides of the maxpooling operations are also set according to the road width in the remote sensing image. Here, the authors adopt four levels with sizes of 1 × 1, 2 × 2, 3 × 3 and 6 × 6 for the PPM. The experiments show that this PPM structure is reasonable for extracting roads from remote sensing images.

Deep-decoder module
In the proposed DRPPNet, the pooling layers reduce the size of the input image. To upsample back to the image resolution, the authors introduce a deep decoder (DD) as the third part of DRPPNet. The DD consists of three upsampling layers and five convolutional layers; its role is to recover the feature maps to the same size as the input image. The details of the decoder network are described in Table 2.
The DD module receives the feature maps produced by the PPM as input, and one convolutional layer is first applied. Following that, an upsampling operation is performed, enlarging the resulting output by a factor of 2. Similar operations are performed twice more to enlarge the feature maps in the decoder network. The last two convolutional layers adjust the size of the feature maps to that of the input map. The output of the DD is fed to a softmax layer to produce the probability of each pixel belonging to the road class:

p(y = k | x) = exp(h_k(x)) / Σ_j exp(h_j(x)),

where h(x) is the output of the last convolutional layer at pixel x, y is the label of x, and p(y = k | x) is the probability of pixel x belonging to the k-th class. There are two channels in the output images: the first channel contains the probability that the road class is assigned to the pixel at the same location, and the second channel contains the probability that the pixel belongs to the background class.
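The pixelwise softmax over the two output channels can be sketched as follows; this is an illustrative NumPy version, not the authors' implementation.

```python
import numpy as np

def softmax_pixelwise(h):
    """Turn (H, W, 2) logits h(x) from the last convolutional layer
    into per-pixel class probabilities:
        p(y = k | x) = exp(h_k(x)) / sum_j exp(h_j(x))
    Channel 0: road probability; channel 1: background probability."""
    z = h - h.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Equal logits in both channels give a 0.5 / 0.5 split at every pixel.
logits = np.zeros((4, 4, 2))
probs = softmax_pixelwise(logits)
```

Note that the two channels sum to 1 at every pixel, so thresholding channel 0 at 0.5 yields the binary road map.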

Training
Given a set of training images and corresponding ground-truth segmentations {S_i, m_i}, the goal is to estimate the parameters of DRPPNet such that it produces accurate and robust road maps on the testing dataset. This is achieved by minimizing the loss between the segmentations generated by DRPPNet and the ground truth m_i. The authors use the cross-entropy as the loss function to train DRPPNet:

L_ce = - Σ_i [ m_i log p(y_i = 1 | x_i) + (1 - m_i) log(1 - p(y_i = 1 | x_i)) ],    (3)

where m_i = 1 denotes that the i-th pixel belongs to the road class, and m_i = 0 the background class. p(y_i = 1 | x_i) is the probability that pixel x_i is classified as road and lies in [0, 1].
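The cross-entropy loss can be computed directly from the predicted road probabilities and the binary mask; a minimal sketch (illustrative, with a small epsilon clamp added for numerical safety):

```python
import numpy as np

def cross_entropy_loss(p_road, m, eps=1e-12):
    """Binary cross-entropy between the predicted road probability
    p(y_i = 1 | x_i) and the ground-truth mask m_i in {0, 1}:
        L_ce = -sum_i [ m_i log p_i + (1 - m_i) log(1 - p_i) ]
    """
    p = np.clip(p_road, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(m * np.log(p) + (1 - m) * np.log(1 - p))

m = np.array([1.0, 0.0, 1.0])   # two road pixels, one background pixel
p = np.array([0.9, 0.1, 0.8])   # predicted road probabilities
loss = cross_entropy_loss(p, m)
```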
To train DRPPNet in an end-to-end manner, L_ce needs to be minimized with respect to the DRPPNet parameters. The authors obtain the derivatives of the loss in Equation (3) with respect to the different layers using the chain rule, and then update layer by layer with the backpropagation strategy. For clarity, only the derivative of the loss with respect to the output of the last convolutional layer in the DD is shown:

∂L_ce / ∂h_k(x) = p(y = k | x) - 1[y = k],

where 1[y = k] equals 1 when the label of pixel x is class k, and 0 otherwise.
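The standard softmax-plus-cross-entropy gradient, p_k - 1[y = k], can be verified numerically with a finite-difference check; this sketch is illustrative and independent of the paper's code.

```python
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

def ce_grad_wrt_logits(h, k):
    """Analytic gradient of L = -log p_k with respect to the logits h:
        dL/dh_j = p_j - 1[j == k]."""
    p = softmax(h)
    g = p.copy()
    g[k] -= 1.0
    return g

# Central finite differences should reproduce the analytic gradient.
h = np.array([0.3, -1.2])   # two-channel logits at one pixel
k = 0                       # true class: road
eps = 1e-5
num = np.empty_like(h)
for j in range(len(h)):
    hp, hm = h.copy(), h.copy()
    hp[j] += eps
    hm[j] -= eps
    num[j] = (-np.log(softmax(hp)[k]) + np.log(softmax(hm)[k])) / (2 * eps)
```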

EXPERIMENT
To assess the effectiveness of the proposed method, the authors conduct experiments on two remote sensing road datasets, namely the Cheng-Roads dataset and the Mnih-Roads dataset. First, a short description of both datasets and the details of the experimental setup are provided. Then, the proposed network is compared with other state-of-the-art methods both visually and quantitatively.

Datasets
The authors employ two public datasets to evaluate the proposed method. The Cheng-Roads dataset, published by Cheng et al. [46], consists of 224 images with hand-labeled road maps. Of these, 180 images are used for training, 14 for validation and the remaining 30 for testing. Each image has three spectral bands (red, green and blue) and a spatial size ranging from 600 × 600 to 1200 × 1500 pixels; all images share the same spatial resolution. Figure 3 shows some examples from the Cheng-Roads dataset, which includes street roads, rural roads in complex residential areas and highways in city areas.
The Mnih-Roads dataset was published by Mnih [19]; it consists of 1157 high resolution aerial images and covers an area of 2603 km². Each image is downloaded from Google Maps with a size of 1500 × 1500 pixels, covering an area of 2.25 km². The corresponding road maps are collected from OpenStreetMap. The spatial resolution is 1.0 m² per pixel. To train and evaluate the network, the authors use 1108 images for training and build the testing dataset with the remaining 49 images. Figure 4 shows some examples from the Mnih-Roads dataset. The dataset covers a wide variety of urban, suburban, and rural regions, making it a challenging road extraction dataset.

Experimental setup
The authors implement the proposed method using TensorFlow 1.13 and Python 3.6. In principle, the proposed DRPPNet can take an image of arbitrary size as input; however, large images require a large amount of GPU memory to store the feature maps. In this paper, the image size is fixed to 128 × 128 for training. The mini-batch size is set to 8, and the model is trained on a GeForce GTX 980 Ti GPU with 6 GB of memory. SGD with an initial learning rate of 0.00001 is used to optimize the model, and the learning rate drops by a factor of 0.1 every 100,000 iterations.
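The learning-rate schedule described above amounts to a simple step decay; the function name below is illustrative, not from the authors' code.

```python
def learning_rate(iteration, base_lr=1e-5, drop=0.1, step=100_000):
    """Step decay matching the paper's schedule: start at 1e-5 and
    multiply by 0.1 every 100,000 iterations."""
    return base_lr * (drop ** (iteration // step))

lr0 = learning_rate(0)          # 1e-5 for the first 100,000 iterations
lr1 = learning_rate(100_000)    # dropped by a factor of 0.1
```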
The network is trained until no further performance increase is observed.

Evaluation metrics
The road extraction task can be considered as binary classification, where road pixels are positives and the remaining non-road pixels are negatives. Let TP, FP, and FN represent the numbers of true positives, false positives, and false negatives, respectively. The authors use four benchmark metrics, namely Precision (Pre), Recall (Rec), F1 and IoU, to evaluate the quantitative road extraction performance. Pre measures the fraction of extracted road pixels that match the reference map, while Rec is the fraction of reference road pixels recovered in the segmentation map. F1 is an overall metric computed from Pre and Rec, and IoU is the ratio of the overlapping area to the union area between the label map and the predicted map. All metrics are calculated with the following formulas:

Pre = TP / (TP + FP),    Rec = TP / (TP + FN),    F1 = 2 · Pre · Rec / (Pre + Rec),    IoU = TP / (TP + FP + FN).
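The four metrics can be computed directly from binary masks; a minimal sketch (illustrative, not the authors' evaluation code):

```python
import numpy as np

def road_metrics(pred, label):
    """Pre, Rec, F1 and IoU for binary road masks (1 = road)."""
    pred = pred.astype(bool)
    label = label.astype(bool)
    tp = np.sum(pred & label)      # road predicted as road
    fp = np.sum(pred & ~label)     # background predicted as road
    fn = np.sum(~pred & label)     # road predicted as background
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * pre * rec / (pre + rec)
    iou = tp / (tp + fp + fn)
    return pre, rec, f1, iou

pred = np.array([[1, 1, 0], [0, 1, 0]])   # toy predicted road map
label = np.array([[1, 0, 0], [0, 1, 1]])  # toy ground-truth map
pre, rec, f1, iou = road_metrics(pred, label)
```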

FIGURE 4
Examples from the Mnih-Roads dataset; each column corresponds to one image. The first row shows the original images and the second row the label maps

Parameters analysis
In this section, the authors analyze the effect of different scales in the PPM. The performance of the PPM with different maxpooling sizes on Cheng-Roads and Mnih-Roads is shown in Table 3, with the best value in each metric shown in bold. In Table 3, the first column gives the kernel-size setup of the PPM; for example, in the first row the four maxpooling kernels are set to 1 × 1, 2 × 2, 3 × 3 and 6 × 6, respectively. It can be found that when the maxpooling kernels are set to small sizes, DRPPNet shows high accuracy on Mnih-Roads. Specifically, when a 1 × 1 kernel is used in the PPM, DRPPNet obtains 0.7432 and 0.7160 in F1 on Mnih-Roads, which is at least 1% higher than without the 1 × 1 maxpooling kernel. However, when the maxpooling kernels are set to slightly larger sizes, as in the third row, DRPPNet obtains 0.8810 in F1 on Cheng-Roads, which is about 0.04% to 3% higher than the other rows. The main reason for these results is that the average road width in Cheng-Roads is larger than that in Mnih-Roads; in this case, slightly larger kernels are better suited to extracting roads in Cheng-Roads, while small kernels fit Mnih-Roads better. Considering the overall road extraction accuracy, the authors set 1 × 1, 2 × 2, 3 × 3 and 6 × 6 as the default parameter setting of the PPM. The following experimental results show that DRPPNet achieves satisfactory performance on both Cheng-Roads and Mnih-Roads.

Experimental results on Cheng-Roads
This section details the experiments on the Cheng-Roads dataset. To evaluate the effectiveness of the proposed DRPPNet, the authors compare it with other state-of-the-art methods, including SegNet [44], Vgg19 [25] and Unet [31]. All experiments adopt the same postprocessing to obtain the final road maps. Table 4 lists the quantitative road extraction performance of the different methods; F1 and IoU are more comprehensive metrics than Pre and Rec. In Table 4, the first three rows give the performance on three sampled images, and the last two rows give the average performance of all methods on the testing dataset, with the best value in each metric shown in bold. It can be seen from Table 4 that the proposed DRPPNet achieves the best performance among all compared methods. In particular, DRPPNet significantly increases Rec, with at least a 4.5% improvement. Meanwhile, the average F1 of DRPPNet is 0.8772, 4.6% better than that of SegNet, and the IoU obtained by DRPPNet is 5.1% higher than SegNet's. Beyond SegNet, the average F1 of DRPPNet is nearly 8% higher than that of Vgg19. These experimental results demonstrate the validity of DRPPNet on the road extraction task. Figure 5 shows qualitative results compared with the state-of-the-art methods. The first column shows the original images, and the remaining columns show the road extraction results of the different methods. White denotes regions that are true road in the label map, yellow represents false positives, and blue represents false negatives. It should be noted that all compared methods are DCNN-based and perform well in road extraction. From Figure 5, it can be seen that Unet is sensitive to tree occlusions, since its predicted maps contain large blue regions. The maps of Vgg19 and SegNet show more coherent results than Unet, alleviating the effect of occlusions to some extent.
However, Vgg19 and SegNet tend to introduce too many false positives. To alleviate this shortcoming, the authors propose DRPPNet with a pyramid pooling module that extracts road features from multiple views. As a result, DRPPNet shows more satisfactory and coherent road detection results and introduces fewer false positives than the compared methods; in addition, it is more robust against occlusions. Finally, the authors list the training and inference times of SegNet, Vgg19, Unet and DRPPNet on Cheng-Roads in Table 5. The inference time is the time the network takes to extract all road regions in the testing dataset, and the training time is the time the network needs to reach convergence. From Table 5, it can be seen that DRPPNet costs only 0.8 h to converge, at least 0.2 h less than the other road extraction networks. Although adding the PPM brings a little extra complexity into the network, DRPPNet retains an acceptable inference time on the testing dataset of about 0.9 min.

Experimental results on Mnih-Roads
In this section, the authors conduct experiments on Mnih-Roads to further verify the effectiveness of DRPPNet. Mnih-Roads is a more complex road dataset than Cheng-Roads since it covers a wide variety of regions: urban, suburban, and rural. The compared methods are FCN [47], CIS-CNN [21] and Vgg19. The road extraction results are presented in Table 6 and Figure 6. The quantitative results of all compared methods are listed in Table 6, with the best results marked in bold. Although the metric values on Mnih-Roads are not comparable with those on Cheng-Roads, DRPPNet achieves the best performance on Mnih-Roads in every metric. Specifically, the average scores of DRPPNet reach 0.7446, 0.7419, 0.7432 and 0.5852 in terms of Pre, Rec, F1 and IoU, respectively, which are 0.9%, 1.2%, 0.6% and 1.5% higher than Vgg19. Compared with CIS-CNN, whose performance only reaches 0.7109, 0.5157, 0.5978 and 0.405 in Pre, Rec, F1 and IoU, DRPPNet improves by 3.4%, 22.6%, 14.5% and 15.5%, respectively. Moreover, the road extraction results of FCN on Mnih-Roads cannot compete with DRPPNet, as the average F1 score of FCN is only 0.5215. These experimental results confirm that DRPPNet indeed improves road extraction from high resolution remote sensing images, owing to its ability to integrate multiple views of road features through the PPM.
Furthermore, to compare the road extraction performance of the different methods directly, Figure 6 presents a visual comparison of the road extraction results on the Mnih-Roads dataset. The columns show, in order, the results of FCN, CIS-CNN and Vgg19, and the last column shows the proposed DRPPNet. It is clear that the proposed DRPPNet retains the most detail and achieves the best road extraction results. The road extraction results of FCN exhibit the most disconnections among the compared methods, mainly because the network structure of FCN is too simple: it has only five layers and therefore a weak ability to extract road regions against complex backgrounds. The other two methods, CIS-CNN and Vgg19, capture more detail and achieve better performance. Comparing the visual results of Figures 6b, 6c and 6d, the road maps obtained by CIS-CNN and Vgg19 present fewer false positives and false negatives than FCN, showing the clear superiority of CIS-CNN and Vgg19 over FCN. Among all methods, the proposed DRPPNet shows the best performance: from Figure 6e, it can be seen that DRPPNet obtains complete road regions with little discontinuity, and its road maps are closest to the real situation in the images. All these observations further validate the great advantage of the proposed approach in capturing road details.
Similarly, the authors list the training and inference times of FCN, CIS-CNN, Vgg19 and DRPPNet on Mnih-Roads in Table 7. From Table 7, it can be seen that the training time of DRPPNet is about 4 h and the inference time on the testing dataset is acceptable, about 1.6 min. Although DRPPNet costs more training time than FCN and CIS-CNN, its road extraction accuracy is higher.

Effectiveness of PPM
To better explore the effectiveness of the PPM across different road extraction methods based on the same feature extraction network, the authors embed the PPM into SegNet, Unet, FCN and Vgg19 and report the results in Tables 8 and 9 and Figures 7 and 8.
It is noted that SegNet and Unet are encoder-decoder networks; therefore, the authors insert the PPM directly between the encoder and the decoder, aiming to capture multiple views of road features. For FCN and Vgg19, which are not encoder-decoder networks, the authors treat them as feature extractors and append the PPM and the DD at the end. Comparing the average performances in Tables 8 and 9, all modified methods (i.e. SegNet+PPM, Unet+PPM, FCN+PPM and Vgg19+PPM) achieve better quantitative performance than their original counterparts (i.e. SegNet, Unet, FCN and Vgg19) to some degree; in particular, SegNet+PPM achieves an improvement of about 1.95%. To better illustrate this improvement visually, Figures 7 and 8 show the road extraction results. It is obvious that the maps obtained by the modified methods contain fewer false positives and false negatives. These visual results show that the modified methods achieve more satisfactory and coherent road extraction results than the original networks; moreover, the modified methods are more robust against tree occlusions.
Finally, the authors compare DRPPNet with a state-of-the-art road extraction method, ResUnet [32]. ResUnet is built with residual units and has an architecture similar to Unet, achieving great performance on the road extraction task. From Table 10, it can be seen that the average F1 of ResUnet reaches 0.8592 on Cheng-Roads, demonstrating that ResUnet can obtain relatively coherent road areas. This is because the residual units of ResUnet use identity addition to fuse deep and shallow features, giving ResUnet the ability to extract the road network more effectively. The proposed DRPPNet is also based on residual units, but it adds the PPM to further obtain road features from multiple views. Therefore, DRPPNet achieves higher road extraction performance than ResUnet; for example, its average F1 and IoU are about 4.6% and 4.2% higher than ResUnet's on Mnih-Roads.

CONCLUSION
In this paper, a novel end-to-end deep residual and pyramid pooling network is proposed to improve the performance of road extraction from high resolution remote sensing images. The authors first utilize a deep residual network (DResNet) to extract deep road features from the input image, which effectively avoids gradient vanishing and enhances the learning ability of DRPPNet. Furthermore, a pyramid pooling module (PPM) is proposed to fuse road features from multiple views using four maxpooling operations, helping DRPPNet obtain smooth and coherent road extraction results. Finally, a deep decoder (DD) is used to recover the feature maps to the same size as the input image. The experimental results on two challenging road datasets, Cheng-Roads and Mnih-Roads, verify the superiority of the proposed method. In addition, the proposed method can be widely applied in server computing and edge computing because of its generalization ability and efficiency.