Road Extraction from Very High Resolution Images Using Weakly Labeled OpenStreetMap Centerline

Road networks play a significant role in modern city management. Because road structure changes rapidly as a city develops, it must be extracted continually. Owing to the success of deep learning based semantic segmentation in computer vision, extracting road networks from VHR (Very High Resolution) imagery has become a method of updating geographic databases. The major shortcoming of deep learning methods for road network extraction is that they need massive amounts of high quality pixel-wise training data, which are hard to obtain. Meanwhile, a large amount of VGI (volunteered geographic information) data of different types, including road centerlines, has been accumulated in the past few decades. However, most road centerlines in VGI data lack precise width information and, therefore, cannot be directly applied to conventional supervised deep learning models. In this paper, we propose a novel weakly supervised method to extract road networks from VHR images using only the OSM (OpenStreetMap) road centerline as training data instead of high quality pixel-wise road width labels. Large amounts of paired Google Earth images and OSM data are used to validate the approach. The results show that the proposed method can extract road networks from VHR images both accurately and effectively without using pixel-wise road training data.


Introduction
With the rapid development of remote sensing technology, images obtained from the remote sensors (installed on drones or satellites) have made a considerable contribution to disaster/emergency management, urban planning, and object detection [1][2][3]. The road networks are also an indispensable part of our daily life in city planning, traffic management, GPS navigation, and road condition monitoring [4][5][6][7]. In a fast-growing area, road networks evolve frequently. Therefore, it is necessary to extract up-to-date road networks to effectively support spatial applications.
Recently, deep learning techniques have been widely used in many kinds of applications, and many road extraction methods based on deep learning have been proposed. These methods can achieve better performance than traditional road extraction methods using hand-crafted features [8]. However, most deep learning based end-to-end road extraction approaches need a large amount of high quality pixel-wise annotated data. Human annotation is usually labor-intensive and time-consuming, which makes large-scale annotated datasets very expensive. At the same time, the rapid development of OSM (OpenStreetMap, a collaborative project to create a free editable map of the world) [9][10][11][12] makes it easier to acquire road centerlines. Figure 1 shows high quality pixel-wise annotation and the …
The paper is organized as follows. In Section 2, the related work on road extraction and weakly supervised learning is introduced. The proposed method for weakly supervised road extraction is detailed in Section 3. Experiments and discussion are presented in Section 4. Finally, the conclusions are drawn in Section 5.

Road Extraction
Existing approaches to road network extraction from high resolution images can be classified into two main categories: unsupervised and supervised.
Unsupervised road extraction does not need training samples; instead, it usually uses clustering algorithms to extract the road networks. The clustering algorithms used in road extraction include K-means, spectral clustering, mean shift [21,22], and graph theory [23,24]. Miao et al. proposed a semi-automatic approach using mean shift to detect roads [22]. Unsalan used probabilistic theory to extract road centerlines and graph theory to infer the formation of the road network [23]. However, compared with the supervised extraction methods, the accuracy of these unsupervised methods is generally lower [8].
Unlike unsupervised road extraction, supervised road extraction methods need many labeled images to train the model. The accuracy of supervised extraction methods relies on the features used and the labeled samples. Supervised classification methods mainly include ANN (Artificial Neural Network) [25,26], SVM (Support Vector Machine) [27], MRF (Markov Random Field) [28], and ML (Machine Learning) [29]. Many ANN models, such as BP neural network [25], fuzzy neural network, spiking neural network, and hybrid neural network [26,30], have been used for road extraction from remote sensing images.
Recently, as deep convolutional neural networks (DCNN) [31,32] have shown dominance in many visual recognition and image segmentation tasks, several road extraction methods based on deep learning have been proposed. These deep learning methods have greatly improved the performance of road extraction.
Dragos et al. [33] proposed a dual-hop GAN (DH-GAN) network for extracting road topologies. Combining DH-GAN with SBO (Smoothing-Based Optimization) brings significant improvements in topology and accuracy. Wei et al. [34] proposed a road structure refined convolutional neural network (RSRCNN) approach to obtain structured output for road extraction in aerial images. Xu et al. [35] used a global and local attention model based on U-Net and DenseNet [36] (GL-Dense-U-Net). Wu et al. [37] proposed an FCN-based model to implement pixel-wise classification for remote sensing images in an end-to-end way, and an adaptive threshold algorithm to adjust the threshold of the Jaccard index in each class. Zhang et al. proposed a semantic segmentation neural network which combined the strengths of residual learning and U-Net for road area extraction [4]. Moreover, during CVPR 2018 (Conference on Computer Vision and Pattern Recognition), a road extraction competition [38] rapidly promoted the development of road extraction, where many methods were proposed to extract the road networks effectively [39,40]. To support accurate road network extraction, the organizing committee provided a large amount of annotated data for the participants. Zhou et al. proposed a semantic segmentation neural network named D-LinkNet, which won the competition [41]. D-LinkNet consists of an encoder-decoder structure, dilated convolution, and a pre-trained encoder. Sun et al. proposed a road extraction method using crowd-sourced GPS data to improve and support road extraction from aerial imagery [42].
These end-to-end deep learning algorithms have improved road extraction performance considerably, but they have significant limitations because they demand massive amounts of high quality pixel-wise annotated training data, which are expensive to obtain.

Weakly Supervised Learning
Instead of using fully annotated datasets as supervision, weakly supervised learning methods [13][14][15][16] have been widely used for image segmentation. Lin et al. [17] developed an alternating training scheme. By alternating between updating the unary terms with a graph-based method and training the FCN, the FCN was gradually fed with more reliable annotation and thus propagated more accurate labels. Instead of alternating between FCN and graphical methods, Tang et al. [43] proposed a method to train a single FCN via a joint loss function with two terms. One is a partial cross-entropy loss for scribbles only and the other is a relaxed normalized-cut regularizer that implicitly propagates accurate labels to unknown pixels during training.
These scribble supervised image segmentation methods need scribbles for each category, including the background. For the OSM data, it is difficult to obtain scribbles for each category and the background from the OSM centerline alone. Therefore, the existing scribble supervised image segmentation methods cannot be directly applied to the road extraction problem.

Methodology for Weakly Supervised Road Extraction
In this paper, a weakly supervised method, MD-ResUNet, is proposed to extract roads from VHR images. Using the OSM centerline as annotation, our method presents a weakly supervised road extraction scheme that combines graph cut theory and deep learning techniques.
The road extraction algorithm consists of three main parts. First, the initial road annotation is generated from the OSM centerline using prior knowledge of the road width. Then the regularized semi-supervised loss used for weakly supervised road extraction is presented. Finally, the MD-ResUNet used for road extraction is described in detail.

Initial Road Annotation Inference
As the OSM data contain incorrect and incomplete road centerlines (which are not always in the center of the road), it is difficult to obtain correct annotation of the VHR images using only the OSM road centerline. Thus fully supervised learning cannot be used directly to extract roads at the pixel level.
The images and OSM data were projected to the same geographic coordinate system. First, the VHR images were projected to the same coordinate map as the OSM data to keep the images geographically consistent. Then the corresponding VHR images and OSM centerline annotations were extracted from the same geographic coordinates.
The initial road annotations are inferred from the centerline using prior knowledge of the image resolution. The schematic diagram is shown in Figure 2. Road and background pixels are each inferred from their distance to the road centerline, and the other pixels, which cannot be determined by the distance to the road centerline, are labeled unknown.
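The distance-based inference described above can be illustrated with a minimal sketch. The 7 m and 50 m thresholds are the values reported later in the Data Processing section; the function name `infer_labels`, the label codes, and the boundary conventions (`<=` for road, `>` for non-road) are assumptions made for illustration only. The brute-force nearest-distance search is meant for toy inputs; a distance transform would be the usual choice for real images.

```python
import math

# Label codes are an illustrative choice, not taken from the paper.
ROAD, NON_ROAD, UNKNOWN = 1, 0, 255

def infer_labels(centerline, height, width, resolution_m, road_m, background_m):
    """Label each pixel by its distance to the nearest centerline pixel.

    centerline: list of (row, col) pixels on the OSM centerline.
    resolution_m: ground size of one pixel in meters.
    road_m: pixels at most this far from the centerline become ROAD.
    background_m: pixels farther than this become NON_ROAD.
    Everything in between stays UNKNOWN.
    """
    road_px = road_m / resolution_m
    background_px = background_m / resolution_m
    labels = [[UNKNOWN] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # Brute-force distance to the closest centerline pixel.
            dist = min(math.hypot(r - cr, c - cc) for cr, cc in centerline)
            if dist <= road_px:
                labels[r][c] = ROAD
            elif dist > background_px:
                labels[r][c] = NON_ROAD
    return labels

# Toy example: a vertical centerline in column 10 of a 5 x 80 image at 1 m per pixel.
centerline = [(r, 10) for r in range(5)]
labels = infer_labels(centerline, 5, 80, resolution_m=1.0, road_m=7.0, background_m=50.0)
```

In this toy image, pixels up to 7 columns from the centerline are labeled road, pixels more than 50 columns away are labeled non-road, and the band in between carries no supervision.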

Regularized Semi-Supervised Loss
To extract pixel-wise roads from VHR images, we use deep learning methods, which have proved to be effective in these applications. The inferred annotation we use for supervision has two categories of labels (known and unknown). The loss function is an important factor in the quality of extraction results in deep learning methods; we use two separate loss functions to reflect the features of roads.
Normalized cut [44] is widely used in unsupervised image clustering problems to reflect the similarity between pixels. In this paper, we use a high order regularized loss (normalized cut loss [43]) to reflect the similarity between these pixels, which can reflect the feature of the pixels labeled unknown in the road extraction methods. The partial loss is used to reflect the feature of the pixels labeled road or non-road. The integral loss function is described in Equation (1).
The loss function consists of two separate parts. The first part of Equation (1) is called the partial loss, which can be separated into two parts: the BCE (Binary Cross Entropy) loss and the Dice coefficient loss. In the BCE loss, s_p represents the network's output for pixel p ∈ Ω_l, where Ω_l represents the set of pixels labeled l (road or non-road) as inferred from the centerline, and y_p denotes the ground truth of the labeled pixel p. This loss function describes the cross entropy of the pixels labeled known (road and non-road). In the Dice coefficient loss, Pred represents the prediction output of the network, and GT_known is the annotation of the labeled pixels inferred from the centerline annotation, as shown in Figure 2. Pred ∩ GT_known represents the intersection of Pred and GT_known, and | · | is the L1 norm. The Dice loss involves only the pixels labeled known. It is clear that the pixels labeled unknown do not affect the partial loss function.
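As a rough illustration of the partial loss described above, the sketch below computes a BCE term and a Dice term over known pixels only. The exact normalization of Equation (1) is not available in this text, so the mean-reduced BCE, the Dice form 1 − 2|Pred ∩ GT_known| / (|Pred| + |GT_known|), the `UNKNOWN` code, and the function name `partial_loss` are all assumptions.

```python
import math

UNKNOWN = 255  # illustrative code for pixels without supervision

def partial_loss(pred, labels, eps=1e-7):
    """BCE + Dice over pixels labeled road (1) or non-road (0).

    pred: flat list of predicted road probabilities in [0, 1].
    labels: flat list with values 1, 0, or UNKNOWN.
    Unknown pixels are dropped before either term is computed.
    Assumes at least one known pixel is present.
    """
    # Keep known pixels only and clamp probabilities away from 0 and 1.
    known = [(min(max(s, eps), 1 - eps), y)
             for s, y in zip(pred, labels) if y != UNKNOWN]
    # Mean binary cross entropy over known pixels (assumed reduction).
    bce = -sum(y * math.log(s) + (1 - y) * math.log(1 - s)
               for s, y in known) / len(known)
    # Soft Dice loss restricted to known pixels.
    intersection = sum(s * y for s, y in known)
    total = sum(s for s, _ in known) + sum(y for _, y in known)
    dice = 1 - 2 * intersection / (total + eps)
    return bce + dice

# The prediction at the unknown pixel (third entry) has no effect on the loss.
loss_a = partial_loss([0.9, 0.2, 0.5], [1, 0, UNKNOWN])
loss_b = partial_loss([0.9, 0.2, 0.99], [1, 0, UNKNOWN])
```

Masking before reduction, rather than multiplying by a zero weight afterwards, keeps the Dice denominator free of contributions from unknown pixels.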
The second part of the loss function comes from the normalized cut [43,45,46]. The normalized cut is a typical spectral clustering and embedding algorithm for image segmentation [47][48][49]. Its energy function is defined by the ratio of the cut to the volume, as given in Equation (2), where S_k ∈ [0, 1], k ∈ {0, 1}, represents the network's output, W is an affinity matrix which represents the similarity between pixels, and d = W1 is the degree vector. In normalized cut clustering, the lower the energy of the normalized cut, the better the clustering performance. For this reason, we combine the partial loss with the normalized cut loss in Equation (1). To take the spatial information into account, the affinity matrix is defined by the Gaussian kernel W_i,j, which combines color (RGB) and spatial (XY) information in a five-dimensional feature space [50], so that both the color and the spatial relationship between pixels are considered. W_i,j is defined in Equation (3).
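To make the cut-to-volume ratio concrete, here is a small sketch that builds a dense Gaussian affinity over (x, y, r, g, b) features and evaluates the normalized cut energy for a soft segmentation. The kernel form exp(−|Δxy|²/(2σ_xy²) − |Δrgb|²/(2σ_rgb²)) and the sum over both classes follow the usual formulation in [43]; they are assumptions here, since Equations (2) and (3) are not reproduced in this text. The function names are hypothetical, and a dense matrix is only feasible for toy inputs.

```python
import math

def gaussian_affinity(pixels, sigma_rgb, sigma_xy):
    """Dense affinity W[i][j] from (x, y, r, g, b) tuples (toy sizes only)."""
    n = len(pixels)
    W = [[0.0] * n for _ in range(n)]
    for i, (xi, yi, *ci) in enumerate(pixels):
        for j, (xj, yj, *cj) in enumerate(pixels):
            d_xy = (xi - xj) ** 2 + (yi - yj) ** 2
            d_rgb = sum((a - b) ** 2 for a, b in zip(ci, cj))
            W[i][j] = math.exp(-d_xy / (2 * sigma_xy ** 2)
                               - d_rgb / (2 * sigma_rgb ** 2))
    return W

def normalized_cut_energy(S, W):
    """Sum over classes k of S_k^T W (1 - S_k) / (d^T S_k), with d = W 1.

    S: list of per-class soft assignments, each a list of values in [0, 1].
    Assumes every class has nonzero volume d^T S_k.
    """
    n = len(W)
    d = [sum(row) for row in W]
    energy = 0.0
    for S_k in S:
        cut = sum(S_k[i] * W[i][j] * (1 - S_k[j])
                  for i in range(n) for j in range(n))
        volume = sum(d_i * s for d_i, s in zip(d, S_k))
        energy += cut / volume
    return energy

# Four pixels in a row: two dark, two bright.
pixels = [(0, 0, 30, 30, 30), (1, 0, 30, 30, 30),
          (2, 0, 200, 200, 200), (3, 0, 200, 200, 200)]
W = gaussian_affinity(pixels, sigma_rgb=15, sigma_xy=100)
good = [[1, 1, 0, 0], [0, 0, 1, 1]]  # groups pixels by color
bad = [[1, 0, 1, 0], [0, 1, 0, 1]]   # splits each color group
energy_good = normalized_cut_energy(good, W)
energy_bad = normalized_cut_energy(bad, W)
```

The segmentation that follows the color boundary cuts almost no affinity and therefore has much lower energy than the one that splits similar pixels, which is the behavior the loss rewards.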
In Equation (3), (i − j) represents the spatial distance and | f(i) − f(j)| represents the feature distance between two pixels i and j, and σ_rgb and σ_xy are constants in the color and spatial domains. Equation (3) thus consists of two independent parts, one in the spatial domain and one in the color domain. When the normalized cut loss is used in deep learning methods, its gradient must be computed, because gradient descent is widely used to solve deep learning problems. The normalized cut loss can be written as in Equation (4) [43], and its gradient with respect to S_k is given in Equation (5). As the computation of the normalized cut loss is time-consuming, we use the permutohedral lattice [51] to reduce the computational complexity to linear time. Accordingly, each forward evaluation and back-propagation through the normalized cut loss is efficient.
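For reference, the normalized cut loss and its gradient can be written out as follows. This is a reconstruction from the definitions of S_k, W, and d given above, following the form used in [43]; it assumes a symmetric W, and the numbering may not match the original Equations (4) and (5).

```latex
% Assumes W is symmetric, so that d = W\mathbf{1} and S_k^\top W \mathbf{1} = d^\top S_k.
\begin{align}
E_{NC}(S) &= \sum_{k} \frac{S_k^\top W (\mathbf{1} - S_k)}{d^\top S_k}
           = \sum_{k} \left( 1 - \frac{S_k^\top W S_k}{d^\top S_k} \right), \\
\frac{\partial E_{NC}}{\partial S_k}
          &= \frac{S_k^\top W S_k}{(d^\top S_k)^2}\, d \;-\; \frac{2\, W S_k}{d^\top S_k}.
\end{align}
```

The second form of the energy follows from S_k^T W 1 = d^T S_k, and the gradient is obtained by applying the quotient rule to S_k^T W S_k / (d^T S_k). Both terms require only products of W with a vector, which is what the permutohedral lattice filtering approximates in linear time.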

Road Extraction Using Multi-Dilated ResUNET
To achieve better performance for road extraction, we propose a novel deep neural network named MD-ResUNet, which is shown in Figure 3. The roads in most images stretch across the entire image, and roads have natural properties such as connectivity and complexity. Moreover, pixel-wise road extraction can be regarded as an image segmentation problem. U-Net performs well in pixel-wise segmentation by representing multi-level features of the images. In this paper, the proposed MD-ResUNet is based on ResUNet [4]. The MD-ResUNet is symmetrical and consists of three main parts (Figure 3). The left part is the encoder, which is used to extract the multi-layer feature maps of the VHR images. The middle bridge consists of multiple dilated convolution layers. The right part is the decoder, which is used to restore the original resolution.
Considering the model size and the computing complexity, the encoder is based on ResNet34 [52,53] pretrained on ImageNet [54]. ResNet34 has proved to be effective for image recognition and feature extraction. Due to the connectivity, complexity, and long extent of roads, it is important to increase the receptive field while keeping the details of the images. Pooling layers can increase the receptive field but may reduce the resolution of the feature maps and drop spatial details. As described in state-of-the-art deep learning methods [41,55,56], a dilated convolution layer is effective in expanding the receptive field while keeping the spatial details. So, in this paper, the MD-ResUNet applies multi-dilated convolution at different layers to expand the receptive field in the feature maps.
Dilated convolution can be constructed in a parallel style (Figure 4). The receptive field differs with the dilation rate. If the dilation rates of the dilated convolution layers are 1, 2, 4, and 8, the receptive fields of the layers are 3, 5, 9, and 17, respectively. In the MD-ResUNet, the encoder (ResNet34) has 5 downsampling layers, which produce the 5-layer feature maps shown in Figure 3. If a 1024 × 1024 image goes through the encoder, the output feature maps of the layers are sized 512 × 512, 256 × 256, ..., 32 × 32. Unlike D-LinkNet, MD-ResUNet uses a multi-dilated convolution network which employs non-linear dilated convolution that can exchange information with the corresponding layers of ResUNet and expand the receptive field of the convolution operations in ResUNet. Therefore, MD-ResUNet works better with the partial loss and normalized cut loss in describing the local or global information of the images. Figure 4 shows the dilated layer in the bridge. MD-ResUNet also takes advantage of multi-resolution feature maps, and the bridge part of MD-ResUNet can expand the perception of the feature maps. The decoder of MD-ResUNet remains the same as in the original ResUNet. The decoder uses transposed convolution [57] layers for upsampling, restoring the feature map from 32 × 32 to 1024 × 1024.
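The receptive field and feature map sizes quoted above can be checked with a few lines of arithmetic. The sketch assumes 3 × 3 kernels (which is what the values 3, 5, 9, 17 imply) and a halving of resolution at each downsampling layer; the function names are hypothetical.

```python
def dilated_kernel_extent(kernel_size, dilation):
    """Spatial extent covered by a single dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def encoder_feature_sizes(input_size, num_downsamplings):
    """Side length of the feature map after each stride-2 downsampling."""
    return [input_size // 2 ** i for i in range(1, num_downsamplings + 1)]

# Parallel branches with dilation rates 1, 2, 4, 8 and 3 x 3 kernels.
extents = [dilated_kernel_extent(3, d) for d in (1, 2, 4, 8)]

# A 1024 x 1024 input passing through 5 downsampling layers.
sizes = encoder_feature_sizes(1024, 5)
```

These extents apply to each branch on its own, as in the parallel arrangement of Figure 4; if the same layers were stacked in series, the cumulative receptive field would be larger.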

Training Algorithm
In this part, we introduce the training algorithm of the proposed MD-ResUNet. In Algorithm 1, L_partial-loss is defined in the first part of Equation (1), E_NC is defined in Equation (4), and the corresponding gradient ∇_ω E_NC is presented in Equation (5). The algorithm is divided into two parts. As the computation of the normalized cut loss is slow, we first train the model using only the partial loss function (lines 4 to 6 of the algorithm). Then we add the normalized cut loss with weight λ to the training to improve the extraction performance (lines 7 to 9).

Algorithm 1: Training the MD-ResUNet
Input:
  Input: the input VHR images;
  Sup: the annotation inferred from the OSM centerlines;
  λ: the weight of the normalized cut loss;
  α: the learning rate;
  partialiteration: the maximum number of iterations for partial supervised learning;
  wholeiteration: the maximum number of iterations for the whole learning;
Output:
  ω: the model parameters;
1 randomly initialize the model parameters ω;
2 iternum = 0
3 for iternum < wholeiteration:
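The listing of Algorithm 1 stops at line 3 in this text. The two-phase schedule described in the Training Algorithm paragraph (partial loss only at first, then partial loss plus λ times the normalized cut loss) might be sketched as below. This is an assumption-based illustration, not the original algorithm body: `loss_weights`, `train`, and the `step_fn` callback, which stands in for one forward pass, backward pass, and parameter update, are hypothetical names.

```python
def loss_weights(iternum, partial_iteration, lam):
    """Return (partial_weight, nc_weight) for the given iteration.

    Before partial_iteration only the partial loss is active; afterwards
    the normalized cut loss is added with weight lam.
    """
    if iternum < partial_iteration:
        return 1.0, 0.0
    return 1.0, lam

def train(step_fn, whole_iteration, partial_iteration, lam):
    """Run the two-phase schedule, delegating each update to step_fn.

    step_fn(iternum, partial_weight, nc_weight) is a hypothetical callback
    that would compute partial_weight * L_partial + nc_weight * E_NC on a
    batch and update the model parameters with learning rate alpha.
    """
    history = []
    for iternum in range(whole_iteration):
        w_partial, w_nc = loss_weights(iternum, partial_iteration, lam)
        history.append(step_fn(iternum, w_partial, w_nc))
    return history

# Dry run that only records which weights each iteration would use.
schedule = train(lambda i, wp, wn: (wp, wn),
                 whole_iteration=5, partial_iteration=3, lam=0.01)
```

Skipping the normalized cut term in the first phase avoids paying for its slower computation while the network is still far from a reasonable segmentation.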

Dataset Description
To verify the performance of the proposed MD-ResUNet, several experiments for road extraction from VHR images were carried out on two different datasets. Dataset 1 was collected from Google Earth by Cheng et al. [58]. The pixel-wise ground truth was manually annotated from the reference map, and the corresponding centerline annotation was derived from the pixel-wise ground truth. The dataset consists of 224 images for training and 30 images for testing. The spatial resolution of the remote sensing imagery is 1 m. The corresponding description of the dataset is shown in Table 1.
Dataset 2 was collected to prove the effectiveness of the proposed road extraction method on real-world OSM data. It contains 315 aerial images of the Seat from Google Earth. The corresponding centerline annotation was obtained from OSM. The pixel-wise annotation was manually labeled. The dataset covers urban, suburban, and rural regions; 285 images were used as training data, and 30 images were used for testing. The spatial resolution of the remote sensing imagery is 1.2 m. In this dataset, most VHR images have a complex terrestrial environment, with features such as rivers and buildings that could be perceived as roads. Moreover, occlusions and the shadows of buildings or trees make it difficult to separate the roads from the background.

Data Processing
In this paper, we collected different supervision datasets for road extraction. The specific annotation datasets are described in Table 2. We used the weakly supervised OSM vector data to generate the initial annotation, inferred directly from the OSM vector and the resolution of the aerial images. In this experiment, we set the road width between 7 m and 50 m, so all pixels more than 50 m away from the centerline are annotated as non-road pixels, and pixels within 7 m of the centerline are regarded as road pixels. The expand mask is the mask inferred directly from the centerline with a constant road width: pixels within that distance of the road centerline are regarded as road, and the others are regarded as non-road. The full mask dataset is the manually labeled pixel-wise annotation. Deep learning requires a large amount of training data. Since our datasets were small, we generated synthetic data by altering the original images through horizontal flip, diagonal flip, image shifting, and scaling. After augmentation, the training data are about 4-8 times larger. This also helps prevent the road extraction method from overfitting the training data.
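A minimal sketch of the augmentation step is given below, covering horizontal flip, diagonal flip, and shifting (scaling is omitted). It assumes that "diagonal flip" means mirroring across the main diagonal (a transpose) and that pixels shifted into the mask are marked unknown; both are illustrative choices, as are the function names. The key point is that the image and its annotation receive the same transform.

```python
def horizontal_flip(grid):
    """Mirror each row left to right."""
    return [row[::-1] for row in grid]

def diagonal_flip(grid):
    """Mirror across the main diagonal (one reading of 'diagonal flip')."""
    return [list(col) for col in zip(*grid)]

def shift_right(grid, offset, fill):
    """Shift every row right by offset pixels, padding on the left with fill."""
    width = len(grid[0])
    return [[fill] * min(offset, width) + row[:max(width - offset, 0)]
            for row in grid]

def augment(image, mask, unknown=255):
    """Return (image, mask) pairs with identical transforms applied to both."""
    return [
        (image, mask),
        (horizontal_flip(image), horizontal_flip(mask)),
        (diagonal_flip(image), diagonal_flip(mask)),
        # Newly exposed mask pixels carry no supervision, so mark them unknown.
        (shift_right(image, 1, 0), shift_right(mask, 1, unknown)),
    ]

image = [[1, 2], [3, 4]]
mask = [[1, 0], [0, 1]]
pairs = augment(image, mask)
```

With these three transforms each original sample yields four training samples, which is at the lower end of the 4-8 times range mentioned above.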

Results Comparison
Our proposed MD-ResUNet with partial loss combined with the normalized cut loss is implemented using the PyTorch framework [59]; the code was executed on two GTX 1080 Ti GPUs.
In this paper, we selected the state-of-the-art approaches ResUNet [4] and D-LinkNet [41] as our baselines. ResUNet was first proposed in [4], and D-LinkNet was the winner of the CVPR 2018 DeepGlobe road extraction challenge.
All experiments were evaluated based on precision, F1 score [60], and mIoU. To train the MD-ResUNet, we used Adam [20] as the optimizer. We initially set the learning rate to 0.0002 and reduced it by a factor of 5 when the training loss had failed to decrease three times. The batch size during the training phase was set to 2, according to the number of GPUs we used. We set σ_rgb = 15 and σ_xy = 100 for computing the normalized cut loss. The hyper-parameter λ was set to 0.01 according to the experiment in Section 4.4.
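The three evaluation metrics named above can be computed from the pixel-level confusion counts, as in the sketch below. It assumes binary predictions and ground truth (1 for road, 0 for background) and takes mIoU as the mean of the road and background IoU, which is a common definition but is not spelled out in this text. The function names are hypothetical, and zero denominators are not handled.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN for flat binary lists (1 = road, 0 = background)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    return tp, fp, fn, tn

def road_metrics(pred, truth):
    """Precision, F1 score, and mIoU over the road and background classes."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou_road = tp / (tp + fp + fn)
    iou_background = tn / (tn + fp + fn)
    return {"precision": precision, "f1": f1,
            "miou": (iou_road + iou_background) / 2}

# Eight pixels: two correct road pixels, one false alarm, one miss.
pred = [1, 1, 1, 0, 0, 0, 0, 0]
truth = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = road_metrics(pred, truth)
```

Because roads cover a small fraction of most images, the background IoU is usually high, so mIoU tends to sit above the road-class IoU.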
For the experiments, we first evaluated the weakly supervised methods using only the partial loss; we then added the normalized cut loss to verify its effectiveness.

Evaluation of the Partial Loss Performance for Road Extraction
In order to verify the effectiveness of road extraction using the partial loss, we compare the road extraction method with the approach directly inferred from the centerline with a certain road width. The certain width used in the road extraction is the mean width of the roads in the training data.
The precision (P), F1 score, and mIoU results with different deep learning methods and different annotation data are presented in Tables 3 and 4. It can be seen that when the supervised annotation remains constant, the proposed MD-ResUNet achieves improved performance on the test images. For different supervised annotations, the proposed partial supervision inferred from the centerline performs better than the expand supervision using a certain width inferred from the centerline.
Table 3. Performance of road extraction using different annotations with different loss functions on dataset 1.
Table 4. Performance of road extraction using different annotations with different loss functions on dataset 2.
The outputs for the test images are shown in Figures 5-7. The pixels labeled red represent FN (false negative); the pixels labeled blue represent TN (true negative); the pixels labeled white represent TP (true positive); and the black pixels represent FP (false positive). It is clear from outputs (e) and (f) that the proposed partial loss adapts to different road widths. When the test road width is close to the certain road width of the training data, the road extraction results with partial loss and with certain width supervision are similar, as shown in Figure 5e,f. When the test road width differs from the supervision road width, the partial loss supervision achieves improved performance in road extraction, as shown in Figure 6e,f. The results indicate that both methods supervised by the centerline have difficulty extracting the pixel-wise road effectively.

Evaluation of Road Extraction Using Normalized Cut Loss Combined with Partial Loss
We examined the overall road extraction approach based on the weakly supervised centerline annotation. To evaluate the performance of the proposed methods, we first compare our approach with ResUNet and D-LinkNet. The F1 score, precision, and mIoU on the corresponding test images are presented in Tables 5 and 6. To evaluate the effectiveness of the proposed weakly supervised method, we compared the approach with the model trained with the full mask. The results are shown in Table 5.
From Tables 5 and 6, we see that when the supervision data and loss function are held constant, the proposed MD-ResUNet achieves better performance in both F1 score and mIoU than the state-of-the-art methods. This indicates that the proposed MD-ResUNet works well in road extraction. When using partial supervision and the normalized cut loss function, the MD-ResUNet can achieve improved performance compared with the fully supervised baseline methods, which indicates that the MD-ResUNet performs better when used with the normalized cut loss. Across different supervision data, the method with full annotation achieves the best performance, because the full annotation carries the largest amount of accurate supervision information. Across different loss functions, centerline supervision with partial loss and normalized cut loss achieves better performance than centerline supervision with partial loss alone, because the combined loss takes both the color information and the neighboring relationship into consideration. Centerline supervision with partial loss and normalized cut loss comes close to full mask supervision; the remaining gap arises because the centerline supervision is weaker than the full supervision data. The results in Tables 5 and 6 show that the MD-ResUNet supervised by the centerline with the partial loss and normalized cut loss achieves better or similar performance compared with the ResUNet supervised with full annotation. This indicates that adding the normalized cut loss to the partial loss can improve the performance of road extraction. In general, the results imply that the weakly supervised MD-ResUNet achieves performance better than or similar to that of the other full mask supervised methods.
The proposed MD-ResUNet with partial loss and normalized cut loss is just 1% lower in F1 score than the full mask supervision method. From the results, we can conclude that the weakly supervised method using just the centerline achieves results close to those of the full mask supervision methods. Figure 6 shows the outputs with different loss functions on dataset 1, and Figure 7 shows the results on dataset 2. It is easy to see that the proposed partial loss adapts to different road widths. When the test road width is similar to the training width, the results with partial loss and with certain width supervision are close. When the width of the test road differs from the supervision road width, the partial loss supervision yields better performance. Figures 6d and 7d show the results of road extraction using partial loss combined with the normalized cut loss function and centerline partial supervision, while Figures 6c and 7c show the results supervised by the pixel-wise annotation. It can be seen that the centerline partial supervision method obtains close results in general. However, there are still some differences in the details. The centerline partial supervision method is less accurate when the boundary of the road is not obvious (Figure 6d). This is because the normalized cut loss can decrease when non-road pixels are identified as road pixels. Figures 6e and 7e show the results using only partial loss with centerline partial supervision. Compared with the method supervised only with partial loss shown in Figure 6e, the method with normalized cut shows greater detail. It can extract the clutter of the road and obtains a more accurate road width (Figure 6d). This implies that the normalized cut loss plays a significant role in the weakly supervised road extraction.
From the above analysis, we can conclude that:
1. The proposed MD-ResUNet achieves better performance than the state-of-the-art methods ResUNet and D-LinkNet in road extraction from VHR images, especially for the partially supervised dataset.
2. Compared with methods supervised by full annotation, our proposed method supervised by partial centerline annotation achieves close performance.
3. The normalized cut loss improves the road extraction performance because it can extract more details of the VHR images.

The Influence of the Parameter on the Weakly Supervised Road Extraction
To find the proper parameter λ in Equation (1) for the weakly supervised method, we evaluate the influence of the parameter by combining the normalized cut loss, at different weights, with the partial loss. To accelerate the convergence of the training process, we start from the model pre-trained with partial loss supervision. The different normalized cut weights and corresponding results for the two datasets are shown in Tables 7 and 8. The trend of the extraction performance for different weights λ is shown in Figure 8.
The results show that road extraction achieves the best performance when the weight λ is set to 0.01. When λ increases or decreases, the performance gets worse. This is because the result of the training process is a trade-off between the partial loss and the normalized cut loss. When λ is too large, the road extraction becomes closer to unsupervised image clustering using normalized cut. When λ is too small, the results are closer to those of the partial loss supervised road extraction method.
Table 7. Performance for different values of the parameter λ of the normalized cut loss combined with the partial loss function on dataset 1.

Conclusions
In this paper, a novel model called MD-ResUNet with partial loss and normalized cut loss was proposed to extract roads from VHR images. It achieves performance close to that of fully supervised methods while using only linear OSM centerline data as supervision. Moreover, the proposed method could be used with any other linear data sources.
Our proposed method can preserve more details in road extraction, such as the road width and the clutter of the road, using just the centerline supervision. This is attributed to the use of the normalized cut loss, which can describe the high order information of the VHR images. The experiments show that the proposed MD-ResUNet can extract roads effectively when supervised only by the scribble OSM centerline.
For future work, we will pay more attention to road topology extraction using weakly labeled centerline supervision.