A framework for large-scale mapping of human settlement extent from Sentinel-2 images via fully convolutional neural networks

Human settlement extent (HSE) information is a valuable indicator of world-wide urbanization as well as the resulting human pressure on the natural environment. Therefore, mapping HSE is critical for various environmental issues at local, regional, and even global scales. This paper presents a deep-learning-based framework to automatically map HSE from multi-spectral Sentinel-2 data using regionally available geo-products as training labels. A straightforward, simple, yet effective fully convolutional network-based architecture, Sen2HSE, is implemented as an example for semantic segmentation within the framework. The framework is validated against both manually labelled checking points distributed evenly over the test areas, and the OpenStreetMap building layer. The HSE mapping results were extensively compared to several baseline products in order to thoroughly evaluate the effectiveness of the proposed HSE mapping framework. The HSE mapping power is consistently demonstrated over 10 representative areas across the world. We also present one regional-scale and one country-wide HSE mapping example from our framework to show the potential for upscaling. The results of this study contribute to the generalization of the applicability of CNN-based approaches for large-scale urban mapping to cases where no up-to-date and accurate ground truth is available, as well as to the subsequent monitoring of global urbanization.

lated to HSE have become available. One, the Global Urban Footprint (GUF), was derived using TerraSAR-X as well as TanDEM-X Synthetic Aperture Radar (SAR) images [2,3]. Another, the Global Human Settlement (GHS) built-up grid, was derived from the Landsat as well as the Sentinel-1 image collections. GHS built-up grid is a product derived within the GHSL image analytics framework, which also utilizes remote sensing images from other missions such as SPOT-5 and 6 [4,5].
Still others, the GlobeLand30 land cover map and the Global Human Built-up And Settlement Extent (HBASE), were derived from the 30 m resolution Landsat data [6,7]. There are several other global land cover maps, such as finer resolution observation and monitoring of global land cover with 30 m (FROM-GLC30) and 10 m (FROM-GLC10) resolution, Global Land Cover 2000 (GLC2000) with 1 km resolution, and those derived from Moderate Resolution Imaging Spectroradiometer (MODIS) data with 500 m resolution, which are also produced using remote sensing image analysis [8,9,10,11]. It is difficult to compare these products directly as they each have slightly different foci. Generally, among these products, GUF outperforms the others [12], especially in rural areas where most of the products fail to detect impervious surfaces. GUF, however, is not feasible for frequent updates, as it was derived from the relatively expensive high resolution TerraSAR-X and TanDEM-X SAR images.
Novel approaches for urban mapping explore cloud computing services like Google Earth Engine and the large amount of remote sensing data it offers [13,14,15]. In these examples, it is expected that the globally available multi-spectral Sentinel-2 data, with a 5-day temporal resolution and 10-meter spatial resolution, are going to play a key role in more accurate HSE mapping at a large or even global scale, with the potential for frequent monitoring of global urbanization. This is already being shown by some regional-scale studies, with similar applications on urban impervious surface mapping [16] and land cover mapping [9,17].
In the past, urban mapping approaches typically started by extracting handcrafted features such as the normalized difference spectral vector (NDSV) and the gray-level co-occurrence matrix (GLCM), followed by feeding the extracted features into a traditional classifier such as Random Forests [13,18,19], and ending with postprocessing to remove potential mis-classifications. However, as a form of semantic segmentation task (or pixel level labeling), HSE mapping can theoretically be carried out through deep learning-based approaches, because plenty of neural network architectures have been proposed and shown to be powerful for semantic segmentation tasks. For example, SegNet, U-Net, the deconvolution network, as well as other improved variants based on multi-scale context fusion, attention mechanisms, and recurrent neural networks, were all proposed after fully convolutional networks (FCNs) were introduced in 2015 [20,21,22,23,22]. The fundamental advantage of all these deep neural networks is their ability for enhanced feature representation and pixel-level recognition. Examples where convolutional neural networks (CNN) and, in particular, FCNs are used for remote sensing image classification or segmentation include [24,25,26,27,28,29,30,31,32,33,34,35]. Apart from the works focusing on very high resolution satellite or aerial imagery (i.e., with a ground sampling distance equal to or even less than 1 m), data of lower spatial resolution is also being studied, since the images of lower resolutions such as globally openly available Sentinel-2 imagery remain the key candidates for large-scale mapping [36,37].
Good performance, however, is not guaranteed when directly employing these existing approaches for large-scale HSE mapping from Sentinel-2 images. There are three reasons for this, each with possible solutions. First, getting sufficient reliable pixel-wise ground truth data, a major prerequisite for deep learning-based approaches, is more challenging than labelling the standard photos that are the main subject of computer vision research. Therefore, we suggest creating annotations by exploiting geo-referenced map products such as the CORINE Land Cover data [37] and the MOD500 data [38,35], as well as governmental data [30], which contain information relevant to the task one seeks to achieve. Second, remote sensing images differ significantly in appearance from the close-range images used in the standard literature on scene segmentation [39]. As mentioned before, remote sensing images usually do not have the same high resolution, and multi-spectral remote sensing images come with more bands than conventional photographs. Furthermore, they usually capture large geographical areas with different kinds of land cover, with occlusions, and with illumination changing over time and space. Taking these characteristics into account, downsampling should be avoided to fully exploit the rich information within the remote sensing data. Finally, the specific application scenario, which in this study is large-scale or even global HSE mapping, should always be taken into account in the whole framework. This means that a spatial split of training and test data should be well designed [40], and it is not enough to train a model with high test accuracy on a single experimental test set. Instead, the framework should include further applying the trained model to images acquired over all potential regions of interest, for which reasonable accuracy should also be achieved.
This requires a robust model in the face of spectral signature changes resulting from social and cultural differences and changing acquisition conditions. Therefore, an independent accuracy assessment should be carried out in order to comprehensively assess the mapping results. In this way, a reliable interpretation and understanding of the performance of the framework will be gained. This paper will present a framework that takes into account the three problems described above, by fully exploiting state-of-the-art algorithms and techniques, as well as the freely available global satellite images of the Sentinel-2 mission for large-scale HSE mapping. We propose a framework for large-scale HSE mapping from Sentinel-2 imagery using deep learning-based approaches with three major parts: 1) preparation of labels and image data, 2) training a well-generalizing semantic segmentation network to learn to map HSE from Sentinel-2 images (Sen2HSE-Net), and 3) a statistically sound accuracy assessment of the HSE results. This study is intended to provide answers to the following questions: How can large-scale HSE mapping benefit from CNNs and remote sensing images of medium resolution, in a situation where potentially noisy ground truth data is only available at a regional scale? How will the network architecture and experimental setup affect the mapping results? How good are the resulting HSE maps, compared to the existing state-of-the-art products derived at a similar scale?
The remainder of this paper proceeds as follows: Section 2 elaborates the proposed HSE mapping approach. Section 3 describes the study area and the experimental setup. Section 4 evaluates the HSE mapping accuracy and visualizes and compares the produced HSE maps to GUF, the GHS built-up grid, and other datasets from recent studies such as FROM-GLC10, for several sample test scenes. Section 5 provides answers to the questions raised above, based on the interpretation and analysis of the achieved results, and discusses the remaining challenges and possible solutions for future work. Finally, Section 6 summarizes and concludes the work.

HSE mapping with Sen2HSE-Net
Considering the spatial resolution of available reference data (20 m), the sub-pixel geolocation accuracy of Sentinel-2 data [41], as well as the resolution of existing related products (mostly lower than 20 m), the specific goal of HSE mapping in this study is to detect whether buildings, roads, or other man-made structures are present, that is, whether they cover more than 0% of a 20 m × 20 m cell. Using this definition, the resulting HSE output from Sentinel-2 imagery will be a binary layer in the Universal Transverse Mercator (UTM) coordinate system, with a ground sampling distance (GSD) of 20 m. This definition is also consistent with the 30 m Global Human Built-up and Settlement Extent (HBASE) dataset derived from Landsat, which consists of human settlement, built-up areas, and roads [7].
The procedure used in the proposed HSE mapping framework is illustrated in Fig. 1, which consists of image and reference data preparation, deep neural segmentation network training, and HSE mapping and assessment. Each step will be detailed in the following subsections.

Sentinel-2 image pre-processing and reference ground truth preparation
For each of the cities under study, one (mostly) cloud-free Sentinel-2 image is prepared with Google Earth Engine (GEE) [42], by exploring a cloud-based engineering approach. The processing approach, described in detail in [43], relies on pixel-wise cloud detection and the combination of multi-temporal images within short time periods. For each study area, we used three Sentinel-2 images compiled from all data acquired for spring, summer, and autumn 2017. The image data contains 13 spectral bands representing Top of Atmosphere Reflectance scaled by a factor of 10000.
These images are orthoimages in UTM projection. We used ten of the bands: specifically, the channels with a GSD of 10 m, B2 (blue), B3 (green), B4 (red), and B8 (Near-infrared), as well as the 20 m GSD bands, B5 (red edge 1), B6 (red edge 2), B7 (red edge 3), B8a (red edge 4), B11 (short-wavelength infrared 1), and B12 (short-wavelength infrared 2). In order to create composites with a consistent image size, we up-sampled the second group of bands to a GSD of 10 m using cubic resampling.
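The band selection and up-sampling step described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the (bands, rows, cols) array layout, the helper names, and the choice of the Keys cubic convolution kernel (a = -0.5) as the "cubic resampling" are all assumptions.

```python
import numpy as np

# Band names as listed in the text; the array layout is an assumption.
BANDS_10M = ["B2", "B3", "B4", "B8"]
BANDS_20M = ["B5", "B6", "B7", "B8a", "B11", "B12"]

def keys_kernel(x, a=-0.5):
    """Keys cubic convolution kernel, one common form of cubic resampling."""
    x = np.abs(x)
    inner = (a + 2) * x**3 - (a + 3) * x**2 + 1           # used where |x| <= 1
    outer = a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a   # used where 1 < |x| < 2
    return np.where(x <= 1, inner, np.where(x < 2, outer, 0.0))

def upsample_axis(arr, factor, axis):
    """Cubic up-sampling along one axis by an integer factor (edges clamped)."""
    n = arr.shape[axis]
    # Centre of each output pixel, expressed in input-pixel coordinates.
    src = (np.arange(n * factor) + 0.5) / factor - 0.5
    base = np.floor(src).astype(int)
    frac = src - base
    weight_shape = [1] * arr.ndim
    weight_shape[axis] = -1
    out = np.zeros(arr.shape[:axis] + (n * factor,) + arr.shape[axis + 1:])
    for k in (-1, 0, 1, 2):  # the four neighbouring input samples
        idx = np.clip(base + k, 0, n - 1)
        weights = keys_kernel(frac - k).reshape(weight_shape)
        out += np.take(arr, idx, axis=axis) * weights
    return out

def stack_bands(bands_10m, bands_20m):
    """Bring the 20 m bands to the 10 m grid and stack all ten bands."""
    upsampled = upsample_axis(upsample_axis(bands_20m, 2, axis=1), 2, axis=2)
    return np.concatenate([bands_10m, upsampled], axis=0)
```

For a scene with 10 m bands of shape (4, H, W) and 20 m bands of shape (6, H/2, W/2), `stack_bands` returns a (10, H, W) composite.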
The employed reference data is "High Resolution Layer Imperviousness 2015," an operational product, released as part of the Copernicus Land Monitoring Service's product portfolio [44]. "High Resolution Layer Imperviousness 2015" is a raster layer indicating built-up areas with a spatial resolution of 20 m, created from Copernicus high resolution remote sensing images (mainly the Indian Remote Sensing Satellite and SPOT 5). It is produced using supervised classification, NDVI-based calibration, and subsequent visual improvement. The producer and user accuracies are supposed to be about 90%. For registration of reference data and Sentinel-2 images, the reference data is re-projected to the UTM coordinate system and resampled to the extent of the corresponding images.
As an example, Fig. 2 illustrates the processed Sentinel-2 image of central Munich, Germany, and the reference data.
There are two approaches for remote sensing image classification via deep learning: working with either patch-based CNNs designed for image classification [24,26,30,31,32,49,50,51] or encoder-decoder-like neural networks designed for semantic segmentation [25,27,28,29]. The former works under the assumption of just a single label for each image patch, and applies the trained model to the image of a study area via a sliding window approach, with the target GSD as the stride of the sliding window. In contrast, the FCNs used in the latter approach are designed to predict pixel-level labels, and after training, they can accept inputs of arbitrary size. Their advantages are a potentially higher accuracy resulting from the inter-patch context information (only the intra-patch context is considered in patch-based CNN approaches), and less expensive computation, since the overlapping patches required by the sliding window method for dense prediction are avoided.
Given both the goal of our task (to assign a label, HSE or non-HSE, to each 20 × 20 meter patch) and the advantages of pixel-level recognition, we decided to combine the patch-based CNN approach and the pixel-level recognition approach. Instead of inputting a 20 × 20 meter patch into the network and outputting one label for the patch, we feed larger patches to the network and predict labels for each 2 × 2 pixels by including one pooling (downsampling) layer in the network.

Architecture and training of Sen2HSE-Net
Considering that the network should be kept as simple as possible to make it feasible for reproduction and upscaling, we implemented a simple FCN, the architecture of which is illustrated in Fig. 3. It consists of four convolutional layers in the beginning to extract low-level features from the input Sentinel-2 images, two pooling layers (maximum and average pooling) in the middle to abstract the learned features to a higher level, then four convolutional layers to extract high-level features, and one convolutional layer in the end for predictions. The kernel sizes for the two sets of four convolutional layers are 3 × 3; the last convolutional layer has a kernel size of 1 × 1. Additionally, there are two drop-out layers to avoid model overfitting to the training data, given that the goal is to map HSE globally. No additional pooling layers are used to avoid the information loss during downsampling process, which is also the design idea in [34] and [52]. As defined, the output prediction is with a 20-meter GSD, while the input data is with a 10-meter GSD; thus no upsampling layers are used.
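To make the resolution change concrete, the following NumPy sketch applies 2 × 2 maximum and average pooling with stride 2 to a feature map, so that each output cell corresponds to one 20 m cell of the 10 m input grid. How the two pooled maps are merged is not specified in this section; the channel-wise concatenation below is an assumption, as are the function names and the (rows, cols, channels) layout.

```python
import numpy as np

def pool2x2(features, mode):
    """2 x 2 pooling with stride 2 over a (rows, cols, channels) feature map."""
    h, w, c = features.shape
    # Group pixels into non-overlapping 2 x 2 blocks (odd trailing rows/cols dropped).
    blocks = features[: h - h % 2, : w - w % 2].reshape(h // 2, 2, w // 2, 2, c)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "avg":
        return blocks.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'avg'")

def pooling_stage(features):
    """Assumed merge: concatenate max- and average-pooled maps along channels."""
    return np.concatenate([pool2x2(features, "max"), pool2x2(features, "avg")], axis=-1)
```

With a 128 × 128 input patch at 10 m GSD, this stage yields 64 × 64 cells at 20 m GSD, which is why no upsampling layer is needed before the final 1 × 1 convolution.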
Figure 3: Architecture and details of Sen2HSE-Net. The terms "h", "w", and "f" denote height, width, and the channel number of the first feature maps, respectively. The size of the final prediction differs from that of the input image because the HSE prediction has a 20-meter GSD, whereas the input image has a 10-meter GSD.
Filter weights are initialized using the algorithm proposed by [53]. The number of output filters of the first convolutional layer, f, is set to 16 in the experiments and adjusted for the investigations in Sec. 5. The input images and their corresponding reference labels are used to train the network with the Nesterov Adam optimizer implementation of Keras [54]. We used a minibatch size of 8 images and a fixed learning rate of 2 × 10^-4. To control the training time and avoid overfitting, early stopping was used; the monitored metric is the validation loss with a patience of 10 epochs, which means that training stops if the validation loss does not decrease for 10 epochs. All the experiments were carried out using the same setup described above, in order to allow meaningful comparisons.
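The early-stopping rule with a patience of 10 epochs can be illustrated with a small, framework-independent sketch. The function name and the returned tuple are hypothetical; in practice the Keras callback mechanism would apply the same rule during training.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the validation loss has not decreased for
    `patience` consecutive epochs; epochs are counted from 0.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # The loss history ended before the patience was exhausted.
    return len(val_losses) - 1, best_epoch
```

For example, if the loss improves until epoch 2 and then stagnates, training stops at epoch 12 and the model from epoch 2 is the best one seen.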

Study area and training data preparation
The training areas are five cities in Central Europe, as shown in Fig. 4. These cities are chosen for training because the reference ground truth data is only available in Europe. The test areas are ten cities across the world, as shown in Fig. 4.
In addition to these ten test scenes distributed across the world, three test scenes in Europe are also chosen to provide a basis for evaluating the regional-to-global generalization capability of the proposed framework. Table 1 describes the main characteristics of the selected test cities, which differ in urban area, topography, and land-cover features in the surrounding countryside. After coregistration, HSE reference data and Sentinel-2 images were cropped into patches of 128 × 128 px with a stride of 96 px. The final patches were spatially split into a training and a validation subset. The exact number of patches from each training scene is presented in Fig. 5. The number of HSE and non-HSE pixels in the training, validation, and test datasets in Europe is presented in Fig. 6.
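The cropping into 128 × 128 px patches with a stride of 96 px can be sketched as below. The helper names are hypothetical, and dropping trailing pixels that do not fill a complete patch is an assumption; the text does not say how scene borders were handled.

```python
import numpy as np

def patch_origins(length, patch=128, stride=96):
    """Top-left offsets of all complete patches along one image dimension."""
    if length < patch:
        return []
    return list(range(0, length - patch + 1, stride))

def crop_patches(array, patch=128, stride=96):
    """Return ((row, col), patch) pairs for a (rows, cols, ...) array."""
    rows = patch_origins(array.shape[0], patch, stride)
    cols = patch_origins(array.shape[1], patch, stride)
    return [((r, c), array[r:r + patch, c:c + patch]) for r in rows for c in cols]
```

With these settings, neighbouring patches overlap by 32 px. Keeping the (row, col) origin with each patch makes a later spatial split of the patches straightforward.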

Accuracy assessment strategy
Manually labeled ground truth is employed for a quantitative assessment. In order to avoid human-induced bias, an equally distributed grid is generated for each      to these datasets with respect to independent references.
In addition to quantitative and visual comparisons with similar products, a quantitative assessment is also performed with respect to the OpenStreetMap (OSM) building layer used as the ground truth reference. Because the mapped HSE includes not only buildings but also other man-made structures, such as roads, we only employed recall as the indicator. That is, recall = N_1 / N_0, where N_0 is the number of all building pixels based on OSM, and N_1 is the number of those pixels (in N_0) that are also mapped as HSE.
This way, we are aiming at the detection rate of buildings in the mapping results. A good HSE map should include all buildings provided in the OSM building layer. It should be mentioned that the quality of the crowdsourced OSM reference data is not homogeneous over the cities and the suburban areas, as well as over developing and developed countries, in terms of completeness and thematic accuracy [56,57,58,59].
Therefore, in our study, the OSM-based evaluation results are only provided as an additional rough accuracy estimate of the HSE mapping results and should be primarily used for a relative comparison of the results. Additionally, buildings are also included in both the GHS built-up grid and GUF datasets, according to their definitions. Therefore, the detection power of these two layers is also presented by the above defined recall metric for comparison, in order to gain an intuitive estimation of the quality of the mapped HSE.
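The OSM-based recall defined above, recall = N_1 / N_0, can be computed from two co-registered binary rasters as in the following sketch. The function name is hypothetical, and it is assumed that the OSM building layer has already been rasterized to the same 20 m grid as the HSE map.

```python
import numpy as np

def building_recall(hse_map, osm_buildings):
    """Share of OSM building pixels (N_0) that are also mapped as HSE (N_1)."""
    hse = np.asarray(hse_map, dtype=bool)
    osm = np.asarray(osm_buildings, dtype=bool)
    n0 = np.count_nonzero(osm)        # all building pixels according to OSM
    n1 = np.count_nonzero(hse & osm)  # building pixels also mapped as HSE
    return n1 / n0 if n0 else float("nan")
```

The same function can be applied to the GUF and GHS built-up layers to obtain the comparison values.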

HSE mapping results
The results of the experimental assessment of the proposed HSE mapping framework are illustrated in this section. First, accuracy assessments with respect to different reference data are shown. We then compile the comparison between the mapped HSE and the state-of-the-art products for several cities across the world.
For better evaluation, we visualize the comparison at both the city scale and building block scale. Finally, case studies for large-scale HSE mapping are provided to demonstrate the upscaling potential of the proposed framework.

Quantitative assessment of HSE mapping results
For the ten globally distributed cities, accuracy assessments are carried out with two kinds of reference data, MLGCPs and OSM. The kappa coefficient, average accuracy (AA) of the two classes (HSE and non-HSE), commission error, recall, and F-Score of HSE are shown in Tab. 3. To provide a sense of the quality of the achieved results, we also list the corresponding assessment results for the state-of-the-art products, GUF and GHS.
Table 3 indicates that the achieved HSE mapping results are promising, as they provide the highest kappa, AA, recall, and F-Score on average over ten test scenes, when compared to both of the baseline products. In particular, we achieve the highest F-Score (with respect to the MLGCPs) for all ten distinct test areas across the world. In addition, more buildings (from the OSM layer) are included in the mapping results, compared to both GUF and the GHS built-up grid. This can be seen from the improved mean recall, from 86.3% and 88.9% to 96.7%, compared to GUF and the GHS built-up grid, respectively. This improvement is apparent for eight of the ten cities.
The commission error from our mapping results, however, is relatively high, especially when compared to GUF, which means that the HSE is overestimated in our results. On the one hand, this shows that GUF is strong at excluding non-HSE from HSE. On the other hand, it is also due to the different mapping focus (vertical artificial structures) of GUF. Still, even considering commission error, our results are generally better than the GHS built-up grid, which is closer to our mapping focus.
The GHS built-up grid provides the highest recall in three test scenes with respect to the MLGCPs and two test scenes with respect to OSM. The differences among these three results will be further analyzed in the discussion section. Considering the varying characteristics of the three layers, it should be mentioned that the comparison presented in Tab. 3 is not intended to rank their quality, but rather to provide a validation reference for our mapping results through comparisons.
The presence of fewer outliers in the representative test scenes shows the good generalization ability and the robustness of the trained model. However, the achieved results do reveal differences among different test scenes. For instance, the result in Nairobi is worse than the average for all three datasets. This is probably due to different urban structures and surrounding terrains, and is indicative of the challenges for large-scale mapping.
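For reference, the metrics reported in Tab. 3 can be derived from a binary confusion matrix as sketched below. The definitions follow common remote sensing usage and are assumptions here: AA is taken as the mean of the per-class producer's accuracies, and the commission error of HSE as the share of pixels mapped as HSE that are non-HSE in the reference.

```python
def hse_metrics(tp, fp, fn, tn):
    """Binary accuracy metrics with HSE as the positive class.

    tp: HSE mapped as HSE, fp: non-HSE mapped as HSE,
    fn: HSE mapped as non-HSE, tn: non-HSE mapped as non-HSE.
    """
    n = tp + fp + fn + tn
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    commission_error = 1.0 - precision
    average_accuracy = 0.5 * (recall + tn / (tn + fp))
    f_score = 2 * precision * recall / (precision + recall)
    observed = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (observed - expected) / (1.0 - expected)
    return {
        "kappa": kappa,
        "AA": average_accuracy,
        "commission_error": commission_error,
        "recall": recall,
        "f_score": f_score,
    }
```

A high recall combined with a high commission error, as discussed above, corresponds to a small `fn` but a comparatively large `fp`.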

Qualitative assessment of HSE mapping results
The comparison of the produced HSE maps to the state-of-the-art products can be found in Fig. 9 for the three subset test areas in Munich, Nairobi, and Tehran.
Overall, the mapped HSE results are in agreement with the GHS built-up grid, GUF, and FROM-GLC10, while the HMGUL has a relatively coarse resolution. From the comparison, it can also be seen that the mapped HSE does include roads and streets, in addition to buildings, as expected. Some roads are also included in the GHS layer, FROM-GLC10, and HMGUL.
Figure 9: Comparison of the produced HSE results to four state-of-the-art products for three predefined subsets in three test scenes.
Some advantages of the mapping results can be observed from Fig. 9. For instance, the mapped HSE is able to exclude the park area within the city, as illustrated by the second Munich subset. Also, it is able to include small buildings surrounded   Fig. 10, it can be seen that the main part of a city can be detected by all three datasets, with the urban morphology being shown clearly. The test in Nairobi shows obvious disagreement among these three datasets, which is also noticeable in the other test scenes and will be further analyzed in the discussion section. Figure 10 qualitatively shows the general feasibility of the proposed HSE mapping framework, which can be further confirmed by the closer view in Fig. 11.
By comparing the high resolution images in Fig. 11, we can see that in general our results are able to provide a compact boundary between HSE and non-HSE under a variety of environments in cities across the world. A detailed analysis of this visualization will be presented in the discussion section, providing more evidence of the outstanding performance of the proposed framework.

Examples of regional-scale and country-wide HSE mapping
In order to validate the stability of the proposed framework, we tested the workflow on a regional-scale and country-wide HSE mapping task, in Henan province,

Discussion
In this section, we provide some empirical evidence for the framework setup and network design, and address the problems and questions posed in Sec. 1, using insights gained from the extensive experimental results presented in Sec. 4 and some additional investigations. This section will also discuss some lessons learned that are relevant to similar topics and further possible improvements toward more accurate and operational HSE mapping.

Choice of the proposed framework
To demonstrate the rationale behind our design choices, this section provides some sensitivity analyses. The achieved results are first compared to those from several state-of-the-art baseline methods in Sec. 5.1.1. In addition, two different ways of splitting the training and validation data are compared to justify our experimental setup. Last, the effects of network depth and width are investigated for the employed architecture, to provide more insights into our approach.

Comparison with baseline methods
The achieved HSE mapping results from the proposed Sen2HSE-Net are compared to those from baseline networks in Tab. 5 for tests both beyond and within Europe. Table 5 shows that the proposed shallow network with 9 layers is able to provide even better mapping accuracy than the much deeper and relatively complicated U-Net [23], with more trainable parameters. In addition, the achieved results from Sen2HSE-Net are much more accurate than those from ResNet-PSPNet [60], ResNet-FCN-8 [20], and attention-based FCN [61], which have been shown to be more powerful for detailed semantic segmentation. One possible reason is the information loss from the pooling layers in the encoding process (by ResNet), which is not suitable for our HSE mapping task and Sentinel-2 data. Furthermore, this loss cannot be compensated for, even with the sophisticated design of the decoding part, either with pyramid scene parsing by ResNet-PSPNet, or upsampling with low-level features considered by ResNet-FCN-8, or the attention modules proposed in [61]. These observations confirm the assumptions that motivate our framework design: good performance is not guaranteed when simply and directly using the state-of-the-art networks for remote sensing tasks. Instead, characteristics of both the task and data need to be integrated into the network design. Additionally, Tab. 5 shows that it is possible to use a simple FCN to achieve promising HSE mapping results, instead of relying on the existing rather sophisticated networks. Even though comparable results can be achieved by directly employing U-Net, the proposed Sen2HSE-Net is much lighter, which is significant for large-scale mapping.
training. From the illustration in Tab. 6, we can see that a spatial split is better than a random split, since almost all metrics are better with a spatial split, which is true for both networks.
This may be because the distribution of training and validation data is more similar in random split than spatial split, which leads to a validation accuracy that is closer to the training accuracy. As a result, the chosen model is optimal for the training areas, rather than the unseen test areas.
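The difference between the two splitting strategies can be sketched as follows, operating on the (row, col) origins of the patches of one scene. The exact splitting rule used in the experiments is not given here, so both functions are illustrative assumptions: the random split shuffles all patches, while the spatial split reserves everything east of a boundary column for validation and drops patches that straddle the boundary.

```python
import random

def random_split(origins, val_fraction=0.25, seed=0):
    """Shuffle the patches and hold out a fraction for validation."""
    shuffled = list(origins)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

def spatial_split(origins, boundary_col, patch=128):
    """Train on patches entirely west of boundary_col, validate east of it."""
    train = [(r, c) for r, c in origins if c + patch <= boundary_col]
    val = [(r, c) for r, c in origins if c >= boundary_col]
    return train, val  # patches crossing the boundary are discarded
```

Because patches are cropped with overlap, a random split places overlapping neighbours in both subsets, which is one way the validation data can end up closely resembling the training data.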

Effect of network depth and width
It is important to know whether better HSE mapping results can be achieved from a deeper and wider version of Sen2HSE-Net using the setup of this study. Table 7 sheds light on this potential improvement, by comparing the results of one wider and three deeper versions, as well as the number of trainable parameters in each network. From these comparisons, we observe no gain from a wider network and a slight improvement from a deeper network. Interestingly, the improvement is not present when the depth increases further, from depth 13 to 17 and 21. This might result from the characteristics of the task, the use of Sentinel-2 images, which are not high resolution, as well as the testing choice (in unseen areas).

Analysis of the HSE mapping framework
While the quantitative and qualitative results presented in Sec. 4 have shown the promising performance of our framework, there are some details requiring analysis for a better understanding of both the method and the produced results. These details will be addressed in this subsection.

Mapping power of the proposed framework
The goal of this study is to explore a better solution for mapping HSE with the potential of upscaling. Figure 14 illustrates products. This can also be observed in the first subset in Beijing, the first subset in Nairobi, and the first subset in Santiago, as shown in Fig. 11. In the second subset of This can also be seen in the second subset of Beijing and the third subset of Nairobi, as shown in Fig. 11. The red and cyan color outlines indicate areas that are not mapped by GUF and GHSL, respectively. Since these areas are mapped not only by our results but also by one of the baseline datasets, they are very likely HSE, and correctly detected by our approach. This can be seen from the third subset in Fig. 14, as well as the first subset in Nairobi, the first subset in Rome, and the second subset in Sydney in Fig. 11.
More evidence of the mapping power of the proposed framework can be seen in Fig. 15, a close-up of Fig. 12, where we are able to detect buildings in small villages as well as GUF, which is derived from very high resolution SAR images. The other products unfortunately fail to map these areas. This also shows the improvement of

Differences between HSE mapping results and baseline products
Comparisons in Sec. 4 also reveal some notable differences among our HSE mapping results, GHS built-up grid, GUF, FROM-GLC10, and HMGUL. These differences are further visualized in Fig. 17 for four distinct areas around the world. Similar to the HSE mapped by our approach, both GHSL and HMGUL include not only buildings but also impervious surfaces such as roads and parking lots, even though they are not focused on impervious surfaces. This is because the medium-resolution data employed is not enough to exclude small gaps among buildings, especially when the gaps are covered by the same materials as buildings. It is thus challenging to distinguish these areas that are highly related to HSE and bear a similar spectral signature as buildings when using the spectral information from optical satellite images.
In contrast, GUF does not contain such impervious surfaces, as can be seen from the red regions in Figs. 10 and 11. This is because GUF focuses more on vertical building structures, removing roads and paved surfaces during the post-editing period [62]. It is also due to the peculiarities of the SAR images used for the production of GUF. The local speckle information and the texture information in the SAR images make it possible to specifically detect vertical structures such as buildings [63,64].
Specifically, buildings are characterized by stronger back-scattering signals than airport roads, even though they are made of the same materials. However, when using optical satellite images, it is challenging to distinguish different land covers within the super-class of impervious surfaces, as they share similar spectral signatures. An illustrative example is the Sydney Airport (the red cross-shape in the lower right corner of Fig. 10), where the aircraft runways are mapped as built-up areas in both GHSL and the result of this study.
Also due to the peculiarities of the SAR images used for the production of GUF, some sparse trees can be mistaken for buildings, as shown in the first and third subsets of Nairobi in Fig. 11. For GHSL, error-prone areas are forests and bodies of water, as shown in the second subset of Rome, the first subset of San Francisco, and the third subset of Sydney. As a result of the two phenomena discussed above, sparsely built-up areas surrounded by sparse forest can be challenging, as can be seen from the "noisy" visualizations in the suburban areas in Fig. 10.
In Fig. 17, it can be seen that FROM-GLC10 and HMGUL are subject to obvious omission errors in Mumbai and Tokyo, respectively, providing one more piece of evidence for the proposed approach's improved performance over state-of-the-art layers. A further comparison between our results and GHSL shows that more roads are mapped by our approach, as shown by the purple lines in the Nairobi and Beijing test scenes in Fig. 10, demonstrating the powerful mapping capability of our framework.
The characteristics of each product discussed above relate to the differing definitions of "urban," "human settlement," and "built-up," as well as the mapping approaches employed and the datasets used. End users of these products in particular should take note of these differences. On the other hand, understanding these differing characteristics also makes it possible to extract complementary information from different products for various applications.
It should be mentioned that all comparisons in this study are intended merely to provide an assessment with reference to the state-of-the-art products. The occasional inferior performance of the GHS built-up grid and GUF is certainly partially due to temporal gaps in data collection: the ongoing urbanization of the world has changed many originally suburban areas to newly built-up areas after the GHS built-up grid and GUF were released. This cannot be easily ignored, especially for cities in developing countries, such as Beijing. This issue highlights the necessity for up-to-date worldwide HSE information, in addition to the existing products: GUF, with its unprecedented spatial resolution, the GHS built-up grid, with its multi-temporal resolution, and FROM-GLC10, with its detailed land cover information.

Further improvements toward operational mapping
We are able to achieve state-of-the-art HSE results for several representative scenes across the world. Furthermore, comparable accuracy is achieved for both regional mapping (three test scenes in Europe) and large-scale mapping (the ten worldwide distributed test cities), as shown in Tab. 5. However, there is still much room for improvement toward an operational large-scale, or even global, process.
The improvements can mainly be achieved with respect to three aspects: the input satellite images, the deep neural network architectures, and the post-processing of the mapped HSE results. First, Level-2A Sentinel-2 images (bottom-of-atmosphere reflectance) and spectral ratios could improve accuracy. In order to produce HSE maps at a regular frequency, using Sentinel-2 images alone is not enough, especially in regions with heavy cloud cover throughout the year, such as Southeast Asia [65]. One solution is to employ multi-sensor, multi-temporal, and multi-modal data fusion, thus improving accuracy and enhancing temporal and spatial sampling [66,67,68,69,70,71,72]. Considering the scale and the target applications, Landsat-8 and Sentinel-1 images could also be exploited for HSE mapping.
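As an illustration of the spectral ratios mentioned above, the following minimal sketch computes two widely used normalized-difference indices for a single pixel. The choice of NDVI and NDBI is an assumption made here for illustration; the text does not specify which ratios would be used. The band keys follow the Sentinel-2 naming (B04 red, B08 near-infrared, B11 short-wave infrared), and the function names and reflectance values are hypothetical.

```python
def normalized_difference(band_a, band_b, eps=1e-6):
    """Return (a - b) / (a + b), or 0.0 when the denominator is near zero."""
    denom = band_a + band_b
    if abs(denom) < eps:
        return 0.0
    return (band_a - band_b) / denom


def spectral_ratios(pixel):
    """Compute NDVI and NDBI for one pixel given as a dict of reflectances.

    Assumed keys (Sentinel-2 band names): B04 (red), B08 (NIR), B11 (SWIR).
    The selection of these two indices is illustrative only.
    """
    return {
        "NDVI": normalized_difference(pixel["B08"], pixel["B04"]),
        "NDBI": normalized_difference(pixel["B11"], pixel["B08"]),
    }


# Hypothetical reflectance values, chosen only to show the sign behaviour.
vegetated = {"B04": 0.05, "B08": 0.45, "B11": 0.20}
built_up = {"B04": 0.20, "B08": 0.25, "B11": 0.30}

print(spectral_ratios(vegetated))
print(spectral_ratios(built_up))
```

In such a setup, the resulting index values could be stacked with the original bands as additional input channels; whether this actually improves accuracy would need to be verified experimentally.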
It should be mentioned that the proposed framework can be easily adapted for these two datasets after proper preprocessing, like filtering for SAR images and cloud removal for optical images. In addition to the input images, improvement can also be realized via an ensemble with other deep CNNs in order to take advantage of their complementary characteristics and heterogeneous properties, as demonstrated by [21]. Furthermore, the performance of the proposed framework should be further investigated and evaluated in rural areas, where built-up areas tend to be sparse and can be easily omitted. Finally, once the HSE results are acquired, further post-processing could be carried out independently for each city. For instance, a conditional random field could be applied to the output mapping results, in order to homogenize the segmentation [73]. Furthermore, in this process, any locally available datasets such as census data, as well as prior knowledge, could be exploited. Other directions worth exploring include adapting the trained model with semi-supervised learning-based strategies and transfer learning, including multitask learning, domain generalization, and domain adaptation, for the purpose of better generalization [74].
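The ensemble idea mentioned above can be sketched as a simple average of the per-pixel HSE probabilities predicted by several networks, followed by thresholding. This is only one possible fusion rule and is an assumption made here, not necessarily the scheme used in [21]; the function names, the 0.5 threshold, and the toy probability maps are hypothetical.

```python
def ensemble_average(prob_maps):
    """Average per-pixel HSE probabilities from several models.

    prob_maps: list of 2-D lists (rows x cols), one per model, same shape.
    """
    n_models = len(prob_maps)
    rows, cols = len(prob_maps[0]), len(prob_maps[0][0])
    return [
        [sum(m[r][c] for m in prob_maps) / n_models for c in range(cols)]
        for r in range(rows)
    ]


def to_binary_mask(prob_map, threshold=0.5):
    """Label a pixel as HSE (1) when its probability reaches the threshold."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]


# Toy 2 x 2 probability maps from two hypothetical models.
model_a = [[0.9, 0.4], [0.2, 0.6]]
model_b = [[0.7, 0.7], [0.1, 0.3]]

fused = ensemble_average([model_a, model_b])
mask = to_binary_mask(fused)
print(mask)
```

Averaging probabilities rather than hard labels lets a confident model outweigh an uncertain one at a given pixel, which is one way complementary characteristics of different networks could be exploited.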

Conclusions and outlook
Detailed and up-to-date HSE maps provide essential information about the human footprint on the earth, thus making sustainable development possible via proactive conservation. This paper presents a framework for large-scale HSE mapping from Sentinel-2 images, by exploiting a shallow yet effective FCN for semantic segmentation. In particular, the newly proposed framework takes advantage of globally available images from the Sentinel-2 mission, featuring medium spatial resolution, short revisit time, and multi-spectral imaging. As demonstrated in this paper, higher accuracy than state-of-the-art products can be achieved with the proposed approach.
Our main conclusions and contributions can be summarized as follows:
• We propose a deep learning-based framework for large-scale HSE mapping from medium resolution Sentinel-2 images (10 m and 20 m GSD) with a small amount of reference data (with a temporal gap) from Europe. No manually labeled data is needed in the framework. This framework is potentially applicable for images from other satellites, such as Landsat and Sentinel-1, and the specific network architecture used in this study can be replaced by other state-of-the-art architectures or improved versions.
• We propose the use of a simple FCN instead of the sophisticated ones originally proposed for high resolution images, to avoid overhead and facilitate upscaling. The design choice of the framework is supported by comparisons with several baselines and investigations on the depth and width of the network as well as the experimental setup.
• We achieve HSE mapping results that are better than the state-of-the-art products, for several representative cities from six continents across the world. In order to carry out a fair comparison among different products and avoid human behavior-induced bias, two approaches for quantitative assessments, in addition to city-scale and building block-scale visualizations, are performed. Differences among HSE-related datasets are analyzed. HSE mapping examples at regional and country scale demonstrate the general performance of the framework.
We hope that our work encourages the exploration of deep-learning-based approaches along with the rich array of geo-coded products for large-scale urban mapping. To this end, we will publish the trained models so that researchers can extract the HSE information of a specific region of interest via the proposed framework. Trained models and sample data are available at https://github.com/ChunpingQiu/Human-settlement-extent-detection-from-Sentinel-2-imagesvia-fully-convolutional-neural-networks-. Our future work includes further improving the mapping results of a specific region of interest. Additionally, the newly acquired Sentinel-2 images will allow for more timely and frequent HSE mapping, and the 10- and 20-meter pixel spacing of Sentinel-2 images will allow for more detailed and accurate HSE mapping than approaches employing multi-spectral Landsat images with 30-meter pixel spacing. The promising results also motivate us to map more detailed multi-temporal HSE information from Sentinel-2 images in the future.