Enabling Country-Scale Land Cover Mapping with Meter-Resolution Satellite Imagery

High-resolution satellite images can provide abundant, detailed spatial information for land cover classification, which is particularly important for studying the complicated built environment. However, due to complex land cover patterns, costly training sample collection, and severe distribution shifts in satellite imagery, few studies have applied high-resolution images to land cover mapping in detailed categories at large scale. To fill this gap, we present a large-scale land cover dataset, Five-Billion-Pixels. It contains more than 5 billion labeled pixels from 150 high-resolution Gaofen-2 (4 m) satellite images, annotated in a 24-category system covering artificial-constructed, agricultural, and natural classes. In addition, we propose a deep-learning-based unsupervised domain adaptation approach that can transfer classification models trained on a labeled dataset (referred to as the source domain) to unlabeled data (referred to as the target domain) for large-scale land cover mapping. Specifically, we introduce an end-to-end Siamese network employing dynamic pseudo-label assignment and a class balancing strategy to perform adaptive domain joint learning. To validate the generalizability of our dataset and the proposed approach across different sensors and geographical regions, we carry out land cover mapping on five megacities in China and six cities in five other Asian countries using PlanetScope (3 m), Gaofen-1 (8 m), and Sentinel-2 (10 m) satellite images. Over a total study area of 60,000 square kilometers, the experiments show promising results even though the input images are entirely unlabeled. The proposed approach, trained with the Five-Billion-Pixels dataset, enables high-quality and detailed land cover mapping across the whole of China and some other Asian countries at meter resolution.


Introduction

Motivation
Land cover information is crucial for various research fields, including environmental science, climate monitoring, food security, urban planning, disaster management, and ecosystem protection [1]. With the continuous development of technology and the economy, human activities have an increasing impact on both urban and natural environments [2,3]. There is, therefore, an urgent need for timely and reliable large-scale land

Related work
In recent years, tremendous effort has been dedicated to this challenging task. In the early stage, spectral and spectral-spatial features were widely utilized to identify land cover categories based on pixel- or object-spatial units [20][21][22][23]. Nevertheless, restricted, hand-crafted rules cannot fully define and represent the complicated land structures or patterns in high-resolution images [16]. To address this problem, deep learning has attracted broad attention in the remote sensing community. Deep Convolutional Neural Networks (DCNNs) are able to adaptively approximate the relationship between image information and land information through multi-layer transformations [24]. Thus, compared with conventional land cover classification methods, deep models can more accurately characterize the complex contextual information contained in high-resolution images [1,19,25,26,27]. Although deep models have shown clear advantages in many remote sensing problems [24,28,29], their performance strongly relies on the quality and quantity of training data [30][31][32], resulting in two main problems in applying them to real-world land cover mapping:
- The application gap caused by limited representativity of land cover datasets: Deep learning is a data-driven approach, and its potential for practical land cover mapping depends heavily on whether the training data fully reflects the distribution of real-world ground objects. An insufficient amount of data may lead to overfitting, insufficient data diversity may lead to low generalization capability, and an incomplete category system will leave the model unable to meet actual mapping requirements [33,34].
- The inadequate generalizability of deep models over different data domains: Even if a practicable deep model has already been trained on a well-annotated dataset, it may not be valid for other geographical areas or sensors because of the feature distribution shifts between the source and target domains [18,19,35].
To adapt such a deep model to large-scale land cover mapping, an intuitive way is to annotate sufficient samples for the target domain and retrain the model. However, dense annotation of each newly captured image is not realistic.
To alleviate the first problem, a number of densely labeled land cover datasets with sub-meter to meter spatial resolutions (0.05-10 m) have been released and have contributed substantially to land cover classification research. But most of them, such as ISPRS Potsdam [36], ISPRS Vaihingen [36], Zurich Summer [37], RIT-18 [38], and Zeebruges [39], have geographical coverage areas below 10 km² and are located in concentrated regions. Existing large-scale datasets with coverage areas of more than 1000 km² and wide geographical distributions, including SpaceNet [40], DeepGlobe [41], MiniFrance [42], Gaofen Image Dataset (GID) [19,43], and LandCoverNet [44], are typically annotated with about 10 classes and do not contain detailed urban functional categories. Although these large-scale datasets possess adequate data amount and data diversity, their incomplete land cover category systems prevent them from fully bridging the gap between algorithmic research and real-world applications.
To solve the second problem, unsupervised domain adaptation (UDA) has been commonly considered in the recent remote sensing literature [19,27,45,46,47,48]. UDA aims to adapt models trained on the source domain to the target domain without supervised information [18]. Two major types of deep-learning-based UDA have been studied: discrepancy-based and adversarial-based methods. Discrepancy-based methods minimize discrepancy criteria between the source and target domains to reduce the distance between their distributions [27,45]. The discrepancy criteria are implemented in the form of manually designed loss functions, such as Correlation Alignment (CORAL) [49] and Maximum Mean Discrepancy (MMD) [50]. In contrast, adversarial-based methods, such as the domain-adversarial neural network (DANN) [51] and Adversarial Discriminative Domain Adaptation (ADDA) [52], do not require manually designed criteria for domain matching. They instead learn the criteria by simultaneously training a feature generator, which attempts to extract features that are indistinguishable between the two domains, and a domain discriminator, which attempts to tell the features of the two domains apart [46,47].
The essential idea of these two types of UDA methods is to align the feature distributions of the source and target domains [49,51,53]. However, this idea rests on the key assumption that an appropriate match between the two distributions can be found, while real-world situations are often not that ideal. First, for large-scale land cover mapping, both domains may contain images from diverse imaging conditions, resulting in a widely dispersed feature space within each domain. In this context, a rigid alignment of two dispersed domains may further accumulate intra-domain variance [54,55]. Second, class imbalance is prevalent in actual land cover patterns. The most common category may cover an area hundreds of times larger than some other categories. During a global alignment, a few common categories may thus pull the entire domain toward their optimal distribution, causing negative adaptation for other categories.
To improve the performance of UDA, recent works [17,56,57] have combined distribution alignment with pseudo-labeling. The main idea of pseudo-labeling is to select valuable samples from the target domain for model training according to the predicted classification confidence. However, the quality of pseudo-labels depends on the way they are selected. Current approaches either empirically choose a confidence threshold [17,58] or set a fixed proportion for sample collection [19,55,56,57], neither of which can guarantee the accuracy of the pseudo-labels that the prediction model assigns in a domain unknown to it. Furthermore, the model always tends to select more easy samples, which may exacerbate the category imbalance. These issues make it difficult for existing UDA methods to satisfy the demands of large-scale land cover mapping applications.

Contribution of this paper
To address the above-mentioned problems, in this paper, we propose the Five-Billion-Pixels dataset, which extends the land cover dataset GID [19,43]. Instead of the 5 or 15 basic categories of GID, it contains more than 5 billion labeled pixels from 150 high-resolution Gaofen-2 (GF-2) satellite images annotated in a more complete category system consisting of 24 land use and land cover classes. Then, we propose a UDA approach for practical large-scale land cover mapping. Instead of the domain alignment strategy, our approach lets the deep model already trained on the source domain gradually and adaptively learn the distribution of the target domain. Concretely, we introduce a Siamese network [59] with two branches that separately generate feature maps for images from the source and target domains. The branches share the same fully convolutional architecture and the same parameters pre-trained on Five-Billion-Pixels. In the target domain branch, the information entropy of the feature maps is treated as the indicator for selecting image pixels with high confidence, and the category predictions on the selected pixels are taken as pseudo-labels. These pseudo-labels are then used to construct a joint classification loss with the source domain branch. To trade off the adaptation over the two domains, the number of pixels assigned pseudo-labels changes dynamically with the training iterations. To prevent over-adaptation to a few common categories, the joint classification loss is weighted according to the class distribution in the source domain.
The main contributions of this paper are as follows:
- We present a large-scale land cover classification dataset, Five-Billion-Pixels. It has a spatial resolution of 4 m, covers areas of over 50,000 km² in China, and contains more than 5 billion labeled pixels. Its category system covers artificial-constructed, agricultural, and natural classes, which reflects the distribution of real-world ground objects well and can widely benefit land-cover-related studies.
- We propose a deep-learning-based UDA approach for large-scale land cover mapping. It avoids changing the domain distributions in a rigid way and instead softly corrects the domain shifts according to the knowledge learned in the source domain. The negative adaptation caused by intra-domain diversity and class imbalance can, therefore, be mitigated by our approach even under very complicated practical situations.
- We carry out land cover mapping on five megacities in China and six cities in five other Asian countries using unlabeled PlanetScope (3 m), Gaofen-1 (8 m), and Sentinel-2 (10 m) satellite images. Encouraging experimental results are achieved over a total area of 60,000 km², demonstrating the potential of the proposed dataset and approach for high-quality, detailed land cover mapping across the whole of China and some other Asian countries at meter resolution.

Study data
To reduce the gap between high-resolution land cover datasets and real-world application requirements, we reorganize and augment the category system of the land cover dataset GID. GID is available in versions with 5 and 15 classes; interested readers can refer to [19] and [43]. Our new dataset, named Five-Billion-Pixels, consists of 150 GF-2 satellite images annotated in a more complete category system (see Fig. 1). It has the advantages of rich categories, large coverage, wide distribution, and a high spatial resolution of 4 m.
For the case study of large-scale land cover mapping, we perform land cover classification for five megacities in China and six cities in five other Asian countries using three data sources with diverse spatial resolutions. Concretely, for the Chinese megacities, PlanetScope (PS) satellite images are used for Chengdu and Shanghai, Gaofen-1 (GF-1) satellite images are used for Wuhan, and Sentinel-2 (ST-2) satellite images are used for Beijing and Guangzhou. For the other Asian cities (Bangkok, Thailand; Delhi, India; Naypyidaw, Myanmar; Seoul, South Korea; Tokyo, Japan; and Yangon, Myanmar), ST-2 satellite images are used. The Chinese megacities cover a total geographical area of 53,088 km² and are spread across the eastern, western, northern, southern, and central regions of China. The six other Asian cities are located in South, Southeast, and East Asia.
The Five-Billion-Pixels dataset is introduced in Section 2.1, and the study areas with their data sources are introduced in Section 2.2.

Creation of Five-Billion-Pixels
The creation of Five-Billion-Pixels relied entirely on manual annotation. To maximize label consistency and minimize human error, the annotation process contains four phases: coarse labeling, fine labeling, fine checking, and spot checking. First, the category system of Five-Billion-Pixels is determined with reference to the Chinese Land Use Classification Criteria (GB/T 21010-2017), and the classes are adjusted based on what is recognizable in optical remote sensing images at 4 m resolution. During the coarse labeling process, the interpretation experts roughly delineate regions belonging to different classes on each GF-2 image according to the category system. For uncertain areas, Google Earth and Google Maps at the corresponding geographic coordinates are used as references. These rough annotations are then passed to the labeling crew for fine labeling. The labeling crew uses the lasso tool in Adobe Photoshop to outline the ground objects so that the edges of the label map coincide strictly with the edges of the ground objects; some details of the annotations are presented in Fig. 2. Fine checking consists of two rounds, a check of categories and a check of edges. The interpretation experts carefully review each area of each label map and mark the inaccurate categories or edges, which are then passed to the labeling crew for correction. In the final spot checking, GF-2 images and the corresponding label maps are sliced into 500 × 500-pixel patch pairs, a scale at which errors are easier to find, and the patch pairs are presented randomly to the interpretation experts for inspection. The inspection results are then given to the labeling crew for modification, and the interpretation experts conduct the next round of spot checking on the revised results. In the final round of spot checking, 10% of the samples of Five-Billion-Pixels are examined and no obvious errors are observed.

Properties of Five-Billion-Pixels
Rich Categories: The category system of Five-Billion-Pixels includes: industrial area, urban residential, rural residential, stadium, square, road, overpass, railway station, airport, paddy field, irrigated field, dry cropland, garden land, arbor forest, shrub forest, park, natural meadow, artificial meadow, river, lake, pond, fish pond, snow, and bare land. Miscellaneous or unclear areas that are extremely difficult to annotate are treated as unlabeled. This category system covers artificial-constructed, agricultural, and natural classes, more closely resembling the distribution of ground objects in the real world. Notably, the category system contains a number of land use classes subdivided from land cover classes in GB/T 21010-2017, including: stadium and square from public service land; road, overpass, railway station, and airport from transportation land; park and artificial meadow from artificial non-agricultural vegetated areas. This is designed to make full use of the spatial information of high-resolution images and to enrich the application scenarios of urban environmental analysis. As Five-Billion-Pixels is mainly collected from areas of human activity (cities, villages, cultivated lands, and mountainous areas around cities), the category system covers all land categories except mangroves, tundra, and permanent ice. The percentage of pixels belonging to each category among all labeled pixels is listed in Table 1.
Table 1: The percentage of the number of pixels belonging to each category. The abbreviations for categories are defined as: Indu - industrial area, Urba - urban residential, Rura - rural residential, Stad - stadium, Squa - square, Over - overpass, Rail - railway station, Airp - airport, Padd - paddy field, Irri - irrigated field, Dryc - dry cropland, Gard - garden land, Arbo - arbor forest, Shru - shrub forest, Natu - natural meadow, Arti - artificial meadow, Rive - river, Fish - fish pond, Bare - bare land.
The category proportion is not deliberately controlled; the images are labeled according to the distribution of real-world ground objects. As can be seen, the category distribution of Five-Billion-Pixels is quite imbalanced.
Large Coverage: The 150 GF-2 satellite images contained in Five-Billion-Pixels have a total geographical coverage of over 50,000 km². On this basis, more than 5 billion pixels are carefully annotated, which can provide abundant samples for advancing research in data-driven methodologies.
Wide Distribution: The images of Five-Billion-Pixels are collected from more than 60 dispersed administrative districts in China, as Fig. 1 shows. Due to this wide geographical distribution, Five-Billion-Pixels can reflect the variation of landscapes across different climates, altitudes, and geology.
We select five Chinese megacities with diverse geographical environments, degrees of development, and city structures as our study areas: Beijing, Chengdu, Guangzhou, Shanghai, and Wuhan.

Study areas and data sources
Real-time, large-scale land cover mapping may require the joint use of images captured by multiple sensors; hence, the adaptation of the classification approach to diverse sensors is of great significance. Based on this consideration, we construct our study areas with imagery from three different sensors, as shown in Fig. 3.
Specifically, the study data of Beijing are mosaicked from 9 ST-2 images acquired between November 8, 2020
There is great heterogeneity among images acquired by different sensors. Moreover, due to the impact of cloud obscuration, swath width, and revisit period, it is necessary to use images captured in different seasons and lighting conditions to mosaic the complete image map of each city. As a result, not only are there significant differences between data sources, but there are also distribution shifts between images within each city, as displayed in Fig. 4.

Figure 4: Images of Chengdu (PS), Shanghai (PS), Wuhan (GF-1), Guangzhou (ST-2), and Beijing (ST-2). Each column indicates images from the same city. There are obvious spectral shifts even between images within the same city due to differences in imaging illumination and season. In addition, the spatial resolutions of the data sources differ markedly. These heterogeneities pose huge challenges to land cover mapping.

Additional Asian cities
To verify the applicability of our approach to different regions in the world, we select six cities in five Asian countries as study areas: Bangkok, Thailand; Delhi, India; Naypyidaw, Myanmar; Seoul, South Korea; Tokyo, Japan; and Yangon, Myanmar, as shown in Fig. 5.

Figure 5 (panels): Bangkok, Thailand; Delhi, India; Naypyidaw, Myanmar; Seoul, South Korea; Tokyo, Japan; Yangon, Myanmar.
As can be seen, although we treat the "city" as the study subject, our experimental areas actually contain other types of landscapes besides built-up areas. For the Chinese megacities, the administrative regions cover large areas of agricultural land and forest, and for the additional Asian cities, we classify the entire images, i.e., including the surrounding areas outside the administrative regions. Therefore, these study areas can test the performance of classification approaches on urban, countryside, agricultural, and mountainous scenes.
Gaofen-1: GF-1 is the first satellite of HDEOS proposed by China. It is configured with two PMS, providing a spatial resolution of 2 m pan/8 m MS and a combined swath of over 60 km. The MS images used in our study cover the spectral range of blue (0.45-0.52 µm), green (0.52-0.59 µm), red (0.63-0.69 µm), and near-infrared (0.77-0.89 µm).
Sentinel-2: ST-2 is an Earth observation mission from the European Union's Copernicus Programme. It currently comprises a constellation of two satellites, Sentinel-2A and Sentinel-2B, offering 13 spectral bands and a field of view of 290 km. Blue (central wavelength 0.49 µm), green (central wavelength 0.56 µm), red (central wavelength 0.66 µm), and near-infrared (central wavelength 0.83 µm) bands with a resolution of 10 m are used in our study. Because of the free, open data policy and the advantages in spatial and spectral resolution, ST-2 is one of the most commonly used data sources in recent land cover mapping studies [60][61][62].

Test areas
Because the study areas are very large, it is impossible to densely annotate each test image for quantitative evaluation. We therefore adopt two annotation strategies, sparse labeling and dense labeling. Concretely, sparse labeling evenly annotates small polygons on each image of each city, and dense labeling densely annotates sub-regions of each city. For the dense labeling strategy, each Chinese megacity is labeled with two sub-regions of 1000 × 1000 pixels, and each additional Asian city is labeled with one sub-region of 500 × 500 pixels since their administrative districts are smaller, as illustrated in Fig. 6. In total, the five Chinese megacities are sparsely labeled with 5.21 × 10^7 pixels and densely labeled with 7.77 × 10^6 pixels, and the percentage of each category in the test areas is displayed in Table 2.
The additional Asian cities are sparsely labeled with a total of 2.40 × 10^6 pixels and densely labeled with a total of 1.29 × 10^6 pixels, and the percentage of each category in the test areas is listed in Table 3. The sparse labels cover all categories and can be used to evaluate classification performance on the entire images, while the dense labels can be used to verify the fineness of the classification results in local areas. Note that these annotations are only used for accuracy assessment and not for model training.

Methodology
To adapt the knowledge learned from the labeled dataset to land cover mapping of large-scale areas, we propose a UDA approach that can softly correct the domain shifts by adaptively learning the distribution of unlabeled data. We refer to two domains, the source domain D_S and the target domain D_T, representing the Five-Billion-Pixels dataset and the unannotated images used for land cover mapping, respectively.
First, we utilize D_S to pre-train a semantic segmentation model, which is presented in Section 3.1. Subsequently, we take the pre-trained semantic segmentation model as the backbone to construct a Siamese network, whose two identical branches separately process images from D_S and D_T. In the D_T branch, a fraction of image pixels with high confidence is selected and then used to perform domain joint learning with the D_S branch, which is described in Section 3.2.

Semantic segmentation model for land cover classification
There are two general strategies for deep-learning-based land cover classification: object-oriented approaches based on DCNNs and semantic segmentation approaches based on end-to-end DCNNs. The former use DCNNs to process images in the form of fixed-sized patches, and then either distinguish the deep features with shallow classifiers or directly employ the category predictions of the DCNNs to perform classification based on object-spatial units [1,19,25,26]. In contrast, semantic segmentation models can predict dense classification maps for arbitrary-sized images in an end-to-end, pixels-to-pixels manner [63][64][65].
The latest semantic segmentation models typically consist of two main paths: the encoder path, which gradually reduces the feature map size and captures higher-level information, and the decoder path, which gradually recovers spatial resolution as well as clear object boundaries [66]. In our work, we adopt U-Net [67] as the backbone for land cover classification and domain adaptation. Notably, U-Net concatenates the feature maps of each pair of corresponding encoding and decoding stages through skip connections, as illustrated in Fig. 7, so that more raw information of the input image is retained and passed to the decoding path. This design helps to compensate for the information loss in the encoding path, which is significant for the classification of satellite images that rely heavily on spectral information.
To enable U-Net to process MS remote sensing images, we adjust the channel number of its input to 4, i.e., we change the kernel size of its first convolutional layer from 3 × 3 × 3 to 3 × 3 × 4. In addition, we adjust the channel number of its output feature map according to our category system, i.e., we set the kernel number of its last convolutional layer to 24.

Domain joint learning for unsupervised domain adaptation
To adapt DCNNs to a new domain, there is no better way than having examples of its feature distribution [18]. Faced with D_T without annotation information, we are inspired by pseudo-labeling [19,68] and propose a UDA approach that collects reliable pixel-wise examples from D_T for model adaptation. Compared to discrepancy-based and adversarial-based UDA methods, which force the two distributions to be aligned in feature space, pseudo-labeling is more flexible and potentially more dependable for complicated real-world situations.
To prevent DCNNs from biasing toward incorrect pseudo-labels or categories of easy samples, our approach introduces a Siamese network (Section 3.2.1) to collect pseudo-labels whose number is dynamically increased with the training iterations (Section 3.2.2). These pseudo-labels are used to train jointly with the true labels from D_S, and the joint classification loss is weighted according to the category distribution of D_S (Section 3.2.3).

Siamese network
To avoid introducing incorrect category information into the training, only a very small number of pseudo-labels are used in the initial iterations of domain joint learning, which leads to two problems: (1) D_T can only provide very few training samples at the beginning; (2) the samples selected from D_T may be extremely homogeneous. Therefore, to ensure that the parameters of the DCNN are effectively updated at each training iteration, we pre-train U-Net on D_S and use it as the backbone to construct a Siamese network. A Siamese network has two branches, each of which has an input and an output [59]. The two branches have an identical architecture and share the same parameters during both initialization and training, which allows the Siamese network to learn information from two distributions simultaneously, as presented in Fig. 7.
Formally, we are given D_S ⊂ R^{H×W×4} along with the associated labels L_S ⊂ [1, K]^{H×W}, and unlabeled D_T ⊂ R^{H×W×4}, where H × W is the size of the images and label maps and K is the total number of classes. The two branches of the Siamese network separately take images x_S ∈ D_S and x_T ∈ D_T and predict K-dimensional feature maps F_{x_S} ∈ R^{H×W×K} and F_{x_T} ∈ R^{H×W×K}.
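The parameter sharing described above can be illustrated with a minimal sketch. It is not the network used in this work: the hypothetical `PixelClassifier` is a toy per-pixel linear map that merely stands in for the pre-trained U-Net, and the class and method names are assumptions made for illustration. The point it shows is that both branches call one and the same parameter set, so any update to it affects the source and target predictions alike.

```python
import random


class PixelClassifier:
    """Toy stand-in for the U-Net backbone (assumption): a per-pixel
    linear map from 4 spectral bands to K class scores."""

    def __init__(self, num_bands=4, num_classes=24, seed=0):
        rng = random.Random(seed)
        self.weights = [
            [rng.uniform(-1.0, 1.0) for _ in range(num_bands)]
            for _ in range(num_classes)
        ]

    def forward(self, image):
        # image: H x W x bands nested lists -> H x W x K feature map
        return [
            [
                [sum(w * b for w, b in zip(class_w, pixel))
                 for class_w in self.weights]
                for pixel in row
            ]
            for row in image
        ]


class SiameseNetwork:
    """Two branches that reuse a single backbone object, i.e. one set
    of parameters serves both the source and the target input."""

    def __init__(self, backbone):
        self.backbone = backbone

    def forward(self, x_s, x_t):
        f_s = self.backbone.forward(x_s)  # source-domain branch
        f_t = self.backbone.forward(x_t)  # target-domain branch
        return f_s, f_t
```

Because the two branches are identical, inference only needs a single call to `backbone.forward`.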

Dynamic pseudo-label assignment
Unlike existing pseudo-labeling methods that empirically choose a threshold [17,58] or set a fixed proportion for sample collection [19,55,56], we assign pseudo-labels to a dynamic number of samples at different training epochs, as demonstrated in Fig. 8. Shannon entropy [17] is employed as the indicator of the confidence of each image pixel: lower information entropy represents higher classification confidence. In the D_T branch, an entropy map E_{x_T} ∈ R^{H×W} is calculated, where E^{(h,w)}_{x_T} denotes its value at pixel (h, w). We arrange all pixels of E_{x_T} in ascending order and select the first N pixels. N is determined by n_e, the index of the current training epoch, N_e, the total number of training epochs, and λ, which controls the overall number of pseudo-labels. At the end of training, the proportion of selected pixels among all pixels of x_T is λ.
An intuitive interpretation of this design is that the network may give reliable predictions for only a small number of pixels when it is not yet adapted to D_T, and as the network gradually learns the distribution of D_T, it can make reliable predictions for an increasing number of pixels.
For the selected pixel located at (h, w), the softmax function is used to obtain its category probability vector:
P^{(h,w)}_{x_T} = softmax(F^{(h,w)}_{x_T}),
where F^{(h,w)}_{x_T} ∈ R^K is the feature vector of the pixel located at (h, w), and the kth element of P^{(h,w)}_{x_T} ∈ R^K represents the probability that this pixel belongs to class k.
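The selection procedure can be sketched as follows. This is an illustration under stated assumptions rather than the exact formulation: the entropy is taken as the plain Shannon entropy of the softmax probabilities, and N is assumed to grow linearly with the epoch, N = λ · (n_e / N_e) · H · W, which is one schedule consistent with the statement that the selected proportion reaches λ at the end of training. The function names are hypothetical, and class indices are 0-based here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one pixel's K class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs):
    """Shannon entropy of a probability vector (natural logarithm)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def num_selected(epoch, total_epochs, lam, num_pixels):
    """ASSUMPTION: linear growth, reaching lam * num_pixels at the end."""
    return int(lam * epoch / total_epochs * num_pixels)


def assign_pseudo_labels(feature_map, epoch, total_epochs, lam):
    """Return {(h, w): class index} for the N lowest-entropy pixels.

    feature_map is an H x W x K nested list of class scores.
    """
    scored = []
    for h, row in enumerate(feature_map):
        for w, scores in enumerate(row):
            probs = softmax(scores)
            label = max(range(len(probs)), key=probs.__getitem__)
            scored.append((shannon_entropy(probs), h, w, label))
    # Low entropy means high confidence, so sort in ascending order.
    scored.sort(key=lambda item: item[0])
    n = num_selected(epoch, total_epochs, lam, len(scored))
    return {(h, w): label for _, h, w, label in scored[:n]}
```

With this schedule, early epochs keep only the most confident target pixels, and the pool widens as training proceeds.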

Class-balanced domain joint training
Class balancing is a common strategy for the training of semantic segmentation models [69], but it is rarely used in UDA approaches because the category information in the target domain is unknown. Since we assign pseudo-labels to D_T, it is possible to reduce the distribution bias caused by unbalanced categories through this strategy.
For D_S, we count the ratio of the number of pixels in each category to the number of all labeled pixels. Denoting the ratio of class k by µ_k, each class is assigned a weight determined by µ_k. The loss function of the D_T branch is then the class-weighted loss built on F_CE(·), the cross-entropy loss function, evaluated between l_n and P^n_{x_T}, which denote the pseudo-label and the category probability vector of the nth pixel selected from x_T, respectively.
If there are errors in the pseudo-labels, a small number of mistakes may eventually lead to a relatively large bias during the iterative training. To maintain the discrimination of the network for the true labels while it gradually learns the distribution of D_T, we adopt joint learning of both the D_S branch and the D_T branch. The overall loss function of the Siamese network combines the loss of the D_T branch with Loss_{D_S}, where Loss_{D_S} is calculated over all pixels of x_S and is also subject to class-balanced weighting.
When the training of the Siamese network is completed, forward propagation is performed on only one of the branches during the inference phase.
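A class-balanced joint loss of this kind can be sketched as below. Two choices in the sketch are assumptions and not necessarily those of this work: the class weight is taken as the inverse of the source-domain ratio µ_k (normalized to mean 1), and the overall loss is the plain sum of the two branch losses. All function names are hypothetical, and class indices are 0-based.

```python
import math


def class_weights(ratios):
    """ASSUMPTION: inverse-frequency weights, normalized to mean 1.

    ratios[k] is the share of labeled source pixels in class k.
    """
    inverse = [1.0 / r for r in ratios]
    mean = sum(inverse) / len(inverse)
    return [v / mean for v in inverse]


def weighted_cross_entropy(labels, probs, weights):
    """Mean of w[label] * (-log p[label]) over the given pixels."""
    total = 0.0
    for label, p in zip(labels, probs):
        total += weights[label] * -math.log(p[label])
    return total / len(labels)


def joint_loss(src_labels, src_probs, tgt_pseudo, tgt_probs, weights):
    """ASSUMPTION: overall loss = source loss + target loss.

    The target term uses only the pixels that received pseudo-labels
    and is skipped when none were selected.
    """
    loss_s = weighted_cross_entropy(src_labels, src_probs, weights)
    loss_t = 0.0
    if tgt_pseudo:
        loss_t = weighted_cross_entropy(tgt_pseudo, tgt_probs, weights)
    return loss_s + loss_t
```

Both branches use the weights derived from the source domain, since the true class distribution of the target domain is unknown.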

Experiments
Our experiments comprise two parts: (1) to explore the performance of different land cover classification approaches, we provide a benchmark on Five-Billion-Pixels of three types of representative algorithms, including object-oriented classification based on spectral-spatial features, object-oriented classification based on deep learning, and semantic segmentation based on deep learning; (2) to validate the effectiveness of the proposed UDA approach, we perform practical land cover mapping on 11 cities using images from three different sensors. The implementation details, comparison approaches, and evaluation metrics are introduced in Section 4.1. Section 4.2 presents the benchmark on Five-Billion-Pixels. Section 4.3 presents the results of land cover mapping.

Experimental setup
Setup for benchmark on Five-Billion-Pixels

Data Processing: Object-oriented approaches and semantic segmentation approaches have different requirements for training data: object-oriented approaches allow only one label per input sample, while semantic segmentation approaches require a label for each pixel of the input sample. We therefore prepare different training data for them. The Five-Billion-Pixels dataset is randomly divided into a training set of 120 images and a test set of 30 images. For the two types of object-oriented approaches, we train the models using image patches with multiple scales [19]. The patch scales are set to the optimal values. Specifically, patches of 64 × 64 and 128 × 128 pixels are randomly sampled from the images of the training set. If more than 80% of the pixels in a patch are covered by the same category, this patch is taken as a training sample. The ratio of the number of 64 × 64-pixel patches to the number of 128 × 128-pixel patches is 3 : 1. For road, in particular, all patches are 32 × 32 pixels. To balance the categories, we control the proportion of patches belonging to each class. A total of 130,000 multi-scale patches are randomly selected for model training. For semantic segmentation approaches, the original large images are cropped into image tiles of 512 × 512 pixels for model training. To improve the training efficiency, we only use tiles that are more than 50% annotated and contain two or more categories. A total of 40,000 tiles are randomly selected from the 120 training images.
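The two sample-selection rules above (a patch needs more than 80% of its pixels in one category; a tile needs more than 50% annotated pixels and at least two categories) can be written down as a small sketch. The function names and the use of 0 as the code for unlabeled pixels are assumptions made for illustration.

```python
from collections import Counter

UNLABELED = 0  # ASSUMPTION: label code used for unannotated pixels


def patch_label(label_patch, min_share=0.8):
    """Return the dominant class of a patch if it covers more than
    min_share of all patch pixels, otherwise None (patch rejected)."""
    pixels = [v for row in label_patch for v in row]
    counts = Counter(v for v in pixels if v != UNLABELED)
    if not counts:
        return None
    cls, n = counts.most_common(1)[0]
    return cls if n > min_share * len(pixels) else None


def keep_tile(label_tile, min_annotated=0.5, min_classes=2):
    """Keep a tile only if more than min_annotated of its pixels are
    labeled and it contains at least min_classes categories."""
    pixels = [v for row in label_tile for v in row]
    labeled = [v for v in pixels if v != UNLABELED]
    enough_labels = len(labeled) > min_annotated * len(pixels)
    enough_classes = len(set(labeled)) >= min_classes
    return enough_labels and enough_classes
```

In practice the same checks would run on 64 × 64 or 128 × 128 patches and 512 × 512 tiles; small grids are enough to exercise the logic.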
Baseline Methods: For object-oriented classification based on spectral-spatial features, we employ a multi-feature fusion strategy that aggregates spectral features and gray-level co-occurrence matrix (GLCM) [70] features by normalization and vector concatenation. Multi-layer perceptron (MLP) and random forest (RF) are utilized as classifiers. Selective search [71] is adopted for object-spatial unit segmentation. The parameters of these methods are set to the optimal values. The window size of GLCM is 7 × 7 pixels. MLP has 4 hidden layers with 20 nodes per layer. The number of trees for RF is 500. The initial segmentation size is 400 pixels for selective search. Classifiers are trained with image patches and are used to classify test images in units of objects.
For object-oriented classification based on deep learning, we employ two representative DCNNs: GoogLeNet [72] and ResNet-101 [73]. Both models are trained with the same hyper-parameters. The epoch number is 120, the batch size is 256, the momentum value is 0.9, and the weight decay is 10⁻⁴. The initial learning rate is 0.1 and is divided by 10 after every 30 epochs. In training, image patches are uniformly resized to 224 × 224 pixels before being input to the models, and 20% of the patches are used for model validation. Image augmentation strategies are adopted. In the testing phase, selective search is used for object segmentation with an initial segmentation size of 400 pixels. The test images are classified in units of 64 × 64-pixel patches, and the patch-level classification map and the object-level segmentation map are then combined via a voting strategy [19].
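To illustrate how a patch-level classification map and an object-level segmentation map can be combined by voting, the sketch below assigns each segmented object the class that occurs most often among its pixels. The exact voting rule of [19] is not given in this section, so the plain majority vote and the function name `vote_object_labels` are assumptions.

```python
from collections import Counter, defaultdict


def vote_object_labels(patch_map, object_map):
    """Relabel every pixel with the majority class of its object.

    patch_map:  2-D list of class ids from the patch classifier.
    object_map: 2-D list of object ids from the segmentation step.
    """
    votes = defaultdict(Counter)
    for class_row, object_row in zip(patch_map, object_map):
        for cls, obj in zip(class_row, object_row):
            votes[obj][cls] += 1
    # Majority class per object (ties resolved by first occurrence).
    winner = {obj: c.most_common(1)[0][0] for obj, c in votes.items()}
    return [[winner[obj] for obj in object_row] for object_row in object_map]
```

Because every pixel of an object receives the same label, the output follows the object boundaries rather than the blocky patch grid.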
For semantic segmentation based on deep learning, we utilize U-Net [67] and DeepLabv3+ [66] as baseline models. The backbone chosen for DeepLabv3+ is ResNet-101 pre-trained on Five-Billion-Pixels, and the output stride of DeepLabv3+ is set to 16. The weights of U-Net are randomly initialized. Both models are trained under the same conditions. The epoch number is 120, the batch size is 32, the momentum value is 0.9, and the weight decay is 10⁻⁵. The initial learning rate is 0.05, and the poly learning policy [74] is used to adjust the learning rate during epochs. In the loss function, unlabeled regions are ignored, and class-balanced weighting is implemented according to Table 1. During training, 20% of the tiles are randomly selected for model validation. Image augmentation strategies are adopted. In the testing phase, models directly segment the test images in units of 512 × 512-pixel tiles, and the overlap-tile strategy [67] is used to prevent missing context in the border regions of tiles, with the overlap ratio set to 50%.
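The overlap-tile inference described above can be sketched along one image axis as follows. With 512-pixel tiles and a 50% overlap the stride is 256 pixels. How the overlapping predictions are merged is not stated here; keeping only the central part of each tile, as in `center_bounds`, is an assumption, as are both function names.

```python
def tile_starts(length, tile=512, overlap=0.5):
    """Start offsets of overlapping tiles covering an axis of `length` pixels."""
    stride = int(tile * (1 - overlap))
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile + 1, stride))
    # Add a final tile flush with the border if the stride does not reach it.
    if starts[-1] != length - tile:
        starts.append(length - tile)
    return starts


def center_bounds(tile=512, overlap=0.5):
    """Assumed merge rule: keep only the central region of each tile,
    discarding a margin of half the overlap on every side."""
    margin = int(tile * overlap / 2)
    return margin, tile - margin
```

For a 6800-pixel axis, which matches one dimension of a GF-2 image, this yields 26 tile positions; the last one is shifted to end exactly at the image border.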
Evaluation Metrics: We assess the experimental results with overall accuracy (OA), mean F1-score (mF1), mean intersection over union (mIOU), and user's accuracy (UA). mF1 is the category mean of the F1-score. mIOU is the category mean of the intersection over union (IOU), and IOU is obtained by dividing the intersection of prediction and truth by their union [69]. mF1 and mIOU describe the ability of the model to minimize both overestimation and underestimation for each category. UA indicates the performance of the model in reducing overestimation [75].
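As a concrete reference for these definitions, the sketch below computes OA, per-class UA, mF1, and mIOU from a confusion matrix using their standard formulas (UA is the per-class precision). The function name and the convention that rows are ground truth and columns are predictions are choices made for this illustration.

```python
def evaluate(conf):
    """conf[i][j] = number of pixels of true class i predicted as class j."""
    n = len(conf)
    total = sum(sum(row) for row in conf)
    oa = sum(conf[k][k] for k in range(n)) / total
    ua, f1, iou = [], [], []
    for k in range(n):
        tp = conf[k][k]
        fp = sum(conf[i][k] for i in range(n)) - tp  # overestimation
        fn = sum(conf[k]) - tp                       # underestimation
        ua.append(tp / (tp + fp) if tp + fp else 0.0)
        f1.append(2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0)
        iou.append(tp / (tp + fp + fn) if tp + fp + fn else 0.0)
    return {"OA": oa, "UA": ua, "mF1": sum(f1) / n, "mIOU": sum(iou) / n}
```

Both F1 and IOU penalize false positives and false negatives, whereas UA only penalizes false positives, which is why it reflects overestimation alone.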

Setup for land cover mapping
Data Processing: To adapt deep models to satellite images with different resolutions, we construct a multi-scale source domain using Five-Billion-Pixels. Image tiles of different sizes are randomly cropped from GF-2 images according to the spatial resolution of each data source, namely 512 × 512 (for PS), 1024 × 1024 (for GF-1), and 1280 × 1280 (for ST-2) pixels, and are then uniformly resized to 512 × 512 pixels. Since the 3 m resolution of PS is obtained by resampling the raw data and its effective spatial resolution is 3.7-4.1 m, we use the original image resolution of GF-2 to match it. The total number of tiles in the source domain is 12,800, and the ratio of the three sizes is 2 : 1 : 1.
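The crop sizes above follow from the ratio between the target sensor's resolution and the 4 m resolution of GF-2: a crop covering the same ground extent as a 512 × 512 target tile is resized down to 512 × 512. The sketch below reproduces that arithmetic; the function names are hypothetical, and assigning the 2 : 1 : 1 ratio to the sizes in the order listed is an assumption.

```python
SOURCE_RES_M = 4.0  # GF-2 multispectral resolution
TILE = 512          # tile size fed to the network


def source_crop_size(target_res_m):
    """Side length of the GF-2 crop that, once resized to TILE pixels,
    simulates imagery at target_res_m metres per pixel."""
    return round(TILE * target_res_m / SOURCE_RES_M)


def split_by_ratio(total, ratio):
    """Split a tile budget according to an integer ratio."""
    s = sum(ratio)
    return [total * r // s for r in ratio]


crop_sizes = {name: source_crop_size(res)
              for name, res in [("PS", 4.0), ("GF-1", 8.0), ("ST-2", 10.0)]}
tile_counts = split_by_ratio(12800, (2, 1, 1))
```

PS is entered at 4 m rather than 3 m because, as noted above, its effective resolution is close to that of GF-2.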
We prepare a data domain for each target city. The raw satellite images are cropped into non-overlapping tiles with a size of 512 × 512 pixels. PS images are first resized to 3/4 of their original image size before cropping, which is equivalent to restoring their spatial resolution to 4 m. The target domains of Beijing, Chengdu, Guangzhou, Shanghai, and Wuhan consist of 4126 ST-2 image tiles, 4144 PS image tiles, 1398 ST-2 image tiles, 3117 PS image tiles, and 1764 GF-1 image tiles, respectively. The target domains of Bangkok, Delhi, Naypyidaw, Seoul, Tokyo, and Yangon each contain 441 ST-2 image tiles.
Comparison Methods: We compare our approach with recent leading and representative UDA methods: AdaptSeg [54], AdvEnt [76], CLAN [77], and FADA [78], where AdaptSeg and CLAN are adversarial-based domain alignment methods, while AdvEnt and FADA combine adversarial domain alignment and pseudo-label learning. Specifically, AdaptSeg incorporates adversarial learning at different feature levels of the segmentation model; CLAN aligns each class with an adaptive adversarial loss to enforce local semantic consistency; AdvEnt minimizes the prediction entropy of the target domain using adversarial loss and pseudo-label loss; FADA implements fine-grained class-level feature alignment based on the class information of pseudo-labels. U-Net is used as the generator for these comparison methods. We also test the combination of these methods and our dynamic pseudo-label assignment approach by adding their loss functions. In addition, the baseline method is U-Net trained only with the source domain.
U-Net is initialized using the network parameters trained on Five-Billion-Pixels (see Section 4.1.1). For our approach, the batch size is 16 for both the source and target branches (32 in total). For the comparison approaches, the batch size is 32. This is because our approach inputs the source and target data simultaneously, while the comparison methods input the source and target data alternately. The initial learning rate for our approach is 0.001. For the comparison approaches, the initial learning rates are 0.001 and 0.0001 for the generator and discriminator, respectively. For all methods, the epoch number is 100, the momentum is 0.9, the weight decay is 10⁻⁵, and the poly learning policy is used to adjust the learning rate during epochs. Image augmentation strategies are adopted, and class-balanced weighting is implemented according to Table 1. For our approach, λ (see Section 3.2.2) is empirically set to 0.5.
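For reference, the poly learning policy decays the learning rate as base_lr × (1 − t / T)^power, where t is the current step and T the total number of steps. The sketch below applies it per epoch with the settings above (initial rate 0.001, 100 epochs). The exponent 0.9 is a commonly used value and an assumption here, since the text does not state it; per-epoch rather than per-iteration updating is likewise assumed.

```python
def poly_lr(base_lr, epoch, total_epochs, power=0.9):
    """Poly learning-rate policy: base_lr * (1 - epoch / total_epochs) ** power.

    power=0.9 is an assumed, commonly used default.
    """
    return base_lr * (1 - epoch / total_epochs) ** power


# Learning rate at the start of each of the 100 epochs (epoch index from 0).
schedule = [poly_lr(0.001, e, 100) for e in range(100)]
```

The schedule starts at the initial rate and decreases smoothly towards zero, without the abrupt drops of a step schedule.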
To prevent different target domains from interfering with each other, we separately train a model for each city. Since the tile number of the source domain is much larger than those of the target domains, at each epoch in training, tiles of equal number to the target domain are randomly selected from the source domain. This results in a different sub-source domain at each epoch, allowing the model to select pixels with diversity from the target domain.
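The per-epoch sub-source sampling can be sketched as drawing, without replacement, as many source tile indices as there are target tiles. The function name, the seeding scheme, and sampling without replacement are assumptions made for this illustration; the sizes in the example are the 12,800 source tiles and the 1764 GF-1 tiles of Wuhan.

```python
import random


def sub_source_indices(n_source, n_target, epoch, seed=0):
    """Indices of the source tiles used in a given epoch.

    A fresh random subset of size n_target is drawn each epoch, so the
    model sees a different sub-source domain every time.
    """
    rng = random.Random(seed + epoch)  # reproducible, but epoch-dependent
    return rng.sample(range(n_source), n_target)


epoch_0 = sub_source_indices(12800, 1764, epoch=0)
epoch_1 = sub_source_indices(12800, 1764, epoch=1)
```

Over 100 epochs such a scheme exposes the model to most of the source domain while keeping every epoch balanced between source and target tiles.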

Benchmark on Five-Billion-Pixels
The baseline results for Five-Billion-Pixels are listed in Table 4. It can be seen that deep-learning-based methods bring huge performance margins compared with methods based on spectral-spatial features and shallow classifiers. This shows that conventional methods lack discriminative ability for high-resolution images with complicated spatial information. Within deep-learning-based methods, semantic segmentation models (U-Net and DeepLabv3+) significantly outperform object-oriented methods (ResNet-101 and GoogLeNet). This is because semantic segmentation models can capture contextual information over larger areas and simultaneously maintain more accurate edges for ground objects by assigning labels to each pixel. The performance advantages of deep learning, especially of semantic segmentation models, demonstrate the importance of large-scale, pixel-wise annotated datasets for advancing land cover classification research.

Table 4: Benchmark on Five-Billion-Pixels. The abbreviations for categories are defined as: Indu - industrial area, Urba - urban residential, Rura - rural residential, Stad - stadium, Squa - square, Over - overpass, Rail - railway station, Airp - airport, Padd - paddy field, Irri - irrigated field, Dryc - dry cropland, Gard - garden land, Arbo - arbor forest, Shru - shrub forest, Natu - natural meadow, Arti - artificial meadow, Rive - river, Fish - fish pond, Bare - bare land. Accuracy results are expressed as percentage values (%).

ResNet-101 performs better than GoogLeNet overall and has significantly superior results on railway station, airport, shrub forest, natural meadow, and artificial meadow. Owing to the residual connection structure [73] that enables the combination of different levels of features, ResNet-101 can learn low-level features to distinguish natural classes as well as high-level features to identify artificial buildings with complex structures.
DeepLabv3+ achieves the best results for urban functional areas, including stadium, square, overpass, railway station, park, and artificial meadow, which contain complicated spatial structures. Due to the atrous convolution [79] and spatial pyramid pooling [66] adopted in DeepLabv3+, it can capture multi-level contextual information for these categories. U-Net performs best on industrial area, urban residential, rural residential, and the various agricultural and forest classes. The recognition of these categories relies heavily on textural and spectral information. U-Net has stronger discriminative ability for them because it retains more raw image information through the concatenation structure [67].
Another issue worth noting is that, regardless of the method type, performance varies widely across categories. For instance, all methods perform poorly on square, overpass, railway station, airport, garden land, park, and pond. This is due to two factors: first, these categories represent small percentages of the Five-Billion-Pixels dataset, and second, they are inherently easier to confuse with other categories. These classes cover a much smaller area in the cities compared to residential and agricultural categories, and the models will be biased towards common and simple classes in training. In addition, the distinctive characteristic of these categories is that they are composed of multiple basic ground cover types; for example, railway station contains multiple tracks and stadium-like building roofs, park includes grass and woods, and airport contains roads and lawns, which causes them to be easily misclassified into other categories. This is why their accuracy cannot be improved even if we keep the proportions of different categories consistent in the object-oriented approaches (see Section 4.1.1).

To demonstrate the results more intuitively, a set of land cover classification maps is displayed in Fig. 9. MLP+Fusion can identify some water areas, and RF+Fusion can identify some water, built-up, and paddy field areas, but the rest of the map is heavily confused. GoogLeNet and ResNet-101 fail to extract road and misclassify paddy field as fish pond. In contrast, DeepLabv3+ and U-Net can segment clear road networks and different built-up areas. DeepLabv3+ performs better on river than U-Net, while U-Net recognizes lake and irrigated field more accurately.

Experimental results of Chinese megacities
The land cover maps of five Chinese megacities are shown in Fig. 10. Although images from different sensors are utilized, and there is no annotated information on them, our approach is able to distinguish industrial area, urban residential in city center, rural residential dispersed in suburb, transportation networks, and river systems for each city. Fish pond along the coast of Guangzhou and paddy field in the suburbs of Wuhan are correctly identified. Obvious errors occur in the forested mountains and at the mosaic borders. Small areas of mountains around Beijing are misclassified as water bodies, and some areas of mountains around Chengdu are misclassified as irrigated field. This is caused by the severe spectral shifts due to different image sources and different imaging conditions. Unlike artificial-constructed categories, the classification of natural classes relies more on spectral information. When the spectral shifts are particularly significant, pseudo-labels inevitably contain errors, which will continuously accumulate in iterative domain adaptation learning.
Table 5 displays the quantitative evaluation results based on different test strategies. It can be seen that the results on dense label are generally better than those on sparse label in OA, while sparse label outperforms dense label in mF1 and mIOU. This is because mF1 and mIOU are more sensitive to overestimation and underestimation, in other words, to the edges of the ground objects. Sparse label marks only portions of the ground objects, while dense label strictly outlines their edges, which leads to poorer mF1 and mIOU results on dense label. OA, in contrast, is the accuracy over the entire test area. Sparse label is distributed evenly over the entire images and therefore contains more areas that are difficult to identify, e.g., complex urban functional areas, so it has lower OA values. More classification details are displayed in Fig. 11.
Our approach achieves promising performance on different built-up, traffic, and agricultural classes, as well as river and bare land. Table 6 shows the quantitative evaluation of different UDA approaches, where all accuracy results are averaged across the five megacities. It can be seen that D S constructed from multi-scale GF-2 image tiles brings a significant improvement compared to U-Net pre-trained with single-scale tiles. This indicates that it is feasible to adapt models to images with different spatial resolutions by using multi-scale source domain data. The performance of AdaptSeg, AdvEnt, and CLAN declines compared to the baseline (D S -only). This is due to the negative adaptation caused by the complex feature distributions of both the source and target domains. In addition, the size of D S is much larger than that of D T , and each training epoch uses a different sub-source domain (see Section 4.1.2), bringing more confusion into UDA approaches based on domain distribution alignment. FADA performs better than the other comparison methods because FADA aligns intermediate level features rather than deep pixel-level features, which avoids rigid global matching. In addition, FADA achieves fine-grained class-level feature alignment according to the category information of pseudo-labels, thus alleviating the negative adaptation caused by category imbalance. Combining these methods with pseudo-label assignment improves their accuracy, but the combinations remain inferior to our method in all evaluation results, indicating that pseudo-labeling is not enough to compensate for the negative adaptation caused by domain distribution alignment.
Table 6: Comparison with the recent leading UDA approaches on Chinese megacities. "Pre-trained" means U-Net pre-trained with single-scale GF-2 image tiles, and PS images keep the original scale for this strategy. "D S -only" is the baseline, meaning only using the multi-scale D S to train the Siamese network. "DPA" indicates dynamic pseudo-label assignment. Accuracy results are averaged over the five megacities and expressed as percentage values (%).

Fig. 12 illustrates details of the results obtained by different UDA methods. Here we show the results of AdaptSeg, AdvEnt, CLAN, and FADA combined with pseudo-labeling, which perform better than these methods alone. In Beijing and Chengdu, the comparison methods lose fine roads and rivers in built-up areas. In Guangzhou, the comparison methods misclassify fish pond as river, pond, and irrigated field. This is because these misclassified categories are much less common, and the adversarial-based methods match domain distributions in a global manner, causing the aligned feature space to prioritize the categories with larger sample amounts. In Shanghai and Wuhan, the results for bare land, road, garden land, and paddy field are significantly improved by our approach compared with the baseline. This indicates that our approach can learn information of the target domain while maintaining the ability to identify the distribution of both domains. In Chengdu, our approach misclassifies park as pond and garden land, which are ground objects contained in the park. This is probably because pixel-level pseudo-labels have difficulty capturing contextual information, and the adapted model is more biased towards the categories in local regions. In addition, it can be seen that the improvements of our approach over the baseline differ between cities, which is caused by the differences in feature distributions and category distributions. When the features in the target and source domains are quite distinct, the pseudo-labels are more "valuable" in domain joint training and can improve the transferability of the model to the target domain more significantly. And when the category distribution of the test area is very unbalanced, pseudo-labels of the hard samples can also lead to greater performance improvements.

Experimental results of additional Asian cities
Fig. 13 shows the results of land cover mapping for the additional six Asian cities. In particular, "rural residential" here refers to low-rise residential buildings, and "urban residential" refers to high-rise residential buildings. Although the target and source domains of this set of experiments are located in different countries with different geographical environments and urban landscapes, promising results are achieved. Paddy field and garden land located in the suburbs of Bangkok and Yangon are correctly identified, rural residential areas spread around Delhi are accurately extracted, and the dense built-up areas of Seoul and Tokyo are well distinguished. Obvious errors occur in water bodies, where lake and river are heavily confused. Part of the arbor forest in Naypyidaw is misclassified as irrigated field. In addition, the extraction results of road are much less refined than those in Chinese cities.

The quantitative evaluation of land cover mapping is presented in Table 7, where all accuracy results are averaged across the six cities. Compared to the baseline, our approach clearly boosts the performance, which indicates that our approach can mine reliable pseudo-labels even for very different geographical regions. The performance gap between the comparison methods and our approach is greater in these six cities than in Chinese megacities. This is because in this scenario the feature distributions of the source and target domains are more disparate, so it is harder for the adversarial-based approaches to find a suitable match between the two distributions, and more intra-domain variance is introduced into the model by the rigid alignment. Another phenomenon is that the results on sparse label and dense label are generally comparable. There are two reasons for this. On the one hand, the spatial resolution of ST-2 is lower than those of PS and GF-1, and it cannot present very fine edges in either the ground truth or the results. On the other hand, agricultural and natural categories that are easier to classify occupy a larger total area in these six cities, and there are fewer complex urban functional classes located in the test areas. Therefore, neither sparse label nor dense label can pull apart the gap between overall accuracy and edge accuracy.
More details can be seen in Fig. 14. Our approach improves the results of paddy field in Bangkok, railway station in Delhi, irrigated field in Naypyidaw, arbor forest in Seoul, industrial area in Tokyo, and airport in Yangon. However, even though the overall accuracy is encouraging, road lines in this set of results are discontinuous. This is due to the feature distribution differences caused by the diverse architectural styles and city landscapes in different countries. The adversarial-based UDA methods demonstrate severe negative adaptation; they tend to classify complex areas into common categories when the distributions of the source and target domains are very different. For instance, in Delhi, Naypyidaw, and Seoul, the comparison methods classify built-up areas as irrigated field, and in Tokyo, high-rise residential and industrial area are misclassified as low-rise residential.

To study the influence of the epoch number, we test three different values, 50, 100, and 150, with the value of λ fixed to 0.5. The relationship between mF1 and the epoch number is presented in Fig. 15 (a). It can be seen that there is an obvious performance improvement when the epoch number is raised from 50 to 100. However, from 100 to 150, the change in performance flattens out. There are two reasons for the improvement: on the one hand, a larger epoch number means fewer pixels are selected at the beginning, which better ensures the quality of the pseudo-labels. On the other hand, a larger number of training iterations enables the model to fully adapt to the target domain. Since 150 epochs would take more computation resources, and the precision improvements are not that significant, 100 epochs are more appropriate in practice.
To investigate how λ affects our approach, we test a set of continuously varying values for it, with the epoch number set to 100. The mF1 obtained with each λ value is shown in Fig. 15 (b); it can be seen that the accuracy of each city first rises as λ increases and then falls as it becomes larger. The highest accuracy for Wuhan is reached when λ is 0.7, while for the other cities it is reached at 0.5. This may be because the sensors of GF-1 and GF-2 have more similar imaging processes, resulting in closer distributions of the source and target domains, so the quality of the pseudo-labels can be guaranteed to some extent even if 70% of the pixels are selected at the end of training, whereas using more pseudo-labels in other cities would introduce more errors. When λ is set to 0.9, 90% of the pixels in the target domain are involved in the last training epoch, and the resulting errors in the pseudo-labels lead to an obvious decrease in the accuracy of each city.
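The interplay between the epoch number and λ can be made concrete with a small sketch of dynamic pseudo-label assignment. The exact schedule is defined in Section 3.2.2 and is not reproduced here; the linear ramp below, which reaches the fraction λ at the last epoch, is an assumption chosen to be consistent with the observations above (fewer pixels selected early when the epoch number is larger; a fraction λ of the pixels selected at the end of training). The function names and the ignore value `-1` are also hypothetical.

```python
def selection_ratio(epoch, total_epochs, lam=0.5):
    """Assumed linear ramp: fraction of target pixels given pseudo-labels
    at a 1-indexed epoch, reaching lam at the final epoch."""
    return lam * epoch / total_epochs


def assign_pseudo_labels(probs, ratio, ignore=-1):
    """Give pseudo-labels to the most confident `ratio` of pixels.

    probs: list of per-pixel class-probability lists.
    Pixels that are not selected receive `ignore` and would be skipped
    by the classification loss.
    """
    confidence = [max(p) for p in probs]
    k = int(ratio * len(probs))
    ranked = sorted(range(len(probs)), key=lambda i: confidence[i], reverse=True)
    keep = set(ranked[:k])
    return [p.index(max(p)) if i in keep else ignore
            for i, p in enumerate(probs)]
```

Under this assumed ramp, the first epoch selects a fraction λ/50 of the pixels when training for 50 epochs but only λ/150 when training for 150 epochs, which matches the explanation that longer schedules start from a smaller, cleaner set of pseudo-labels.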

What information is important for land cover classification?
The built environment and the quality of people's lives are under the combined effects of various land categories [8], so it is important to analyze land cover information in a more complete category system. However, the recognition of heterogeneous ground objects in high-resolution satellite images is quite difficult. In addition, identifying different categories depends on different information. For example, classifying independent urban functional buildings relies more on structure and shape features, distinguishing dense built-up areas requires spatial relationships, and the identification of agricultural and natural categories cannot be done without texture and spectral information.
There is a noteworthy phenomenon in Table 4 and Fig. 9. The highest accuracy values for different categories are achieved by different models, mainly DeepLabv3+ and U-Net. We show detailed classification results for Five-Billion-Pixels obtained by these two models in Fig. 16. In columns 1 to 3, DeepLabv3+ segments more complete and smooth airport, bare land, and stadium, respectively, whereas the results of U-Net contain a lot of noise. Visually, DeepLabv3+ tends to identify ground objects as independent "instances", whereas it seems difficult for U-Net to aggregate land information into homogeneous segmented regions. This is because the deep features learned by DeepLabv3+ are better at describing contextual and spatial relationships. However, U-Net shows superior performance in columns 4 to 5, where it correctly classifies irrigated field, which DeepLabv3+ partially misclassifies as garden land, and road, which DeepLabv3+ cannot clearly extract from dense built-up areas. This shows that U-Net can more accurately represent the spectra, textures, and sharp boundaries of ground objects.
One explanation for these results is that the "low-level features" (those closest to the input image) used in DeepLabv3+'s decoder path are the feature maps that have been forward-propagated through 101 layers and are 16 times smaller compared to the input image. In contrast, U-Net uses "low-level features" that go through two layers and have the same scale as the input image, since it adopts the concatenation of the encoder and decoder paths. Therefore, U-Net can maintain more raw spectral and edge information to generate dense classification maps.
The above analysis suggests that, in future research, the combination of "low-level" spectral and textural information with "high-level" spatial contextual information is likely to facilitate land cover classification in complex category systems.

How can the performance of land cover mapping be further improved?
As can be seen in Fig. 10, while our approach performs well in most areas of each Chinese megacity, the results for mountainous forest surrounding the cities, especially Beijing and Chengdu, are not that satisfactory. To discuss this phenomenon, we visualize the feature spaces of different data domains with t-SNE [80] in Fig. 17, where the combination of spectral features and texture features (GLCM) is employed, and the coordinate systems of the feature spaces are aligned. Three characteristics of feature distributions can be observed from Fig. 17: (1) distribution shifts occur between different domains; (2) within each domain, the distribution of almost every category is dispersed; (3) within each domain, the distributions of different categories may be partially mixed. The last case is evident in the forest and cropland categories of D S . This is because their features are largely influenced by geographical location and seasonal changes. For example, the Five-Billion-Pixels dataset covers a large amount of irrigated field reclaimed in the mountains of northwestern China, and their spectral and texture features may be similar to those of mountainous arbor forest in winter. Then, in the process of domain joint learning, the DCNN model is likely to assign irrigated field pseudo-labels to a small number of arbor forest samples in D T . These errors will further accumulate in continuous iterative training, eventually leading to misclassification in the land cover mapping results.
Since this problem is caused by the inherent constraints of UDA and the intrinsic properties of vegetation categories, in future studies, the integration of other sources of information may lead to improvement. For instance, the all-season sample dataset [81] provides vegetation samples from multiple seasons at 30 m resolution, and multi-temporal analysis methods [82,83] can better distinguish the coverage of different

rapid urban development in China, new high-rise residential buildings are mixed with old ones, which have the same appearance as rural settlements, creating a special urban landscape. The richer categories of our results have the potential to contribute to studies on urban planning, urban heat islands, urban quality of life, and so on.
Although we studied only 11 cities in this paper, our approach can be easily generalized to other cities, towns, and villages throughout China and even other Asian countries. In addition, the discrete pseudo-labels used in our UDA approach are shown to improve the classification results. This suggests that the annotation of newly acquired images does not need to be dense and pixel-wise when a large-scale, well-annotated dataset is already available. Therefore, for other countries and regions with land distributions and land category systems very different from those of China, it may be possible to perform land cover mapping based on Five-Billion-Pixels and, for example, semi-supervised domain adaptation with sparse annotations in the form of patches or polygons. This is a direction we intend to explore in the future.

Conclusion
The increasing volume of high-resolution satellite data is a "gold mine" waiting to be explored and mined. Yet land cover mapping at large scale and high resolution remains a challenging task. In this paper, we present a large-scale land cover dataset, Five-Billion-Pixels, which can provide the remote sensing community with a high-quality benchmark to advance land cover classification algorithms. At the same time, we propose an unsupervised domain adaptation approach that can deal with complicated real-world distribution shifts. The land cover mapping results for five megacities in China and six cities in other Asian countries show the generalizability of our approach across different sensors and geographical regions. In general, our work has the potential to be extended to land cover mapping at country scale and to contribute to various applications involving land cover information.

Figure 1: Left: Distribution of 150 images of Five-Billion-Pixels. Right: Examples of GF-2 images and their corresponding label maps, where black indicates unlabeled areas. The category system covers artificial-constructed, agricultural, and natural classes.

2.1.1 Gaofen-2 imagery
GF-2 is the second satellite of the High-Definition Earth Observation System (HDEOS) promoted by China National Space Administration (CNSA). It is equipped with two panchromatic and multispectral (PMS) sensors, providing a combined swath of 45 km. The effective spatial resolution of the sensors is 1 m panchromatic (pan)/4 m multispectral (MS). The MS images we used to construct Five-Billion-Pixels possess a spectral range of blue (0.45-0.52 µm), green (0.52-0.59 µm), red (0.63-0.69 µm), and near-infrared (0.77-0.89 µm), with an image size of 6800 × 7200 pixels. Owing to the combination of high resolution and wide swath, GF-2 allows the observation of detailed land information over large geographical areas.

Figure 2: Examples of annotation details. Miscellaneous or unclear areas that are extremely difficult to annotate are considered as unlabeled. The categories of labeled pixels are double-checked and ensured to be correct.
, and October 21, 2021. The data of Chengdu are mosaicked from 205 PS images captured between January 13, 2019, and December 31, 2019. The data of Guangzhou are mosaicked from 3 ST-2 images collected between February 18, 2021, and October 26, 2021. The data of Shanghai are mosaicked from 149 PS images obtained between April 1, 2019, and December 13, 2019. The data of Wuhan are mosaicked from 22 GF-1 images taken between March 28, 2016, and July 25, 2016.

Figure 5: Administrative areas of six Asian cities on ST-2 images.

Figure 6: Two annotation strategies for quantitative evaluation. Sparse label: small polygons are evenly labeled throughout the entire image. Dense label: sub-regions with sizes of 1000 × 1000 and 500 × 500 pixels are labeled for Chinese megacities and the additional Asian cities, respectively.

Figure 7: Approach overview. We construct a Siamese network with two branches separately generating feature maps for images from D S and D T . In the target domain branch, image pixels with high confidence are assigned pseudo-labels. These pseudo-labels are then used to construct a joint classification loss with the source domain branch. U-Net is used as the backbone of the Siamese network.

Figure 8: (a) Pseudo-label assignment. (b) Dynamic labeling and class balancing. The number of pixels assigned pseudo-labels is dynamically changed with training iterations, and the joint classification loss is weighted according to the class distribution in D S .

Figure 11: Details of mapping results of different Chinese megacities. The first column shows partial regions of the input satellite images; the second column shows the corresponding classification results. In the third column, the classification results are overlaid on the input images, which demonstrates the performance more visually.

Figure 12: Details of performance of different UDA approaches on Chinese megacities. "Baseline" means only using D_S to train the Siamese network. "DPA" indicates dynamic pseudo-label assignment.

Figure 13: Land cover mapping results for the additional six Asian cities.

Figure 14: Details of results of different UDA approaches on the additional Asian cities. "Baseline" means only using D_S to train the Siamese network. "DPA" indicates dynamic pseudo-label assignment.

Figure 15: Sensitivity analysis for epoch number and parameter λ on Chinese megacities.

Figure 16: Detailed classification results of DeepLabv3+ and U-Net on Five-Billion-Pixels. From top to bottom of each column: input satellite image, ground truth, classification result of DeepLabv3+, and classification result of U-Net.

Figure 18: Comparison with Google's Dynamic World and ESA's World Cover in Beijing (ST-2) and Guangzhou (ST-2). (a-b) Land cover maps of the central area of Beijing from Dynamic World and World Cover. (c) Our result for Beijing. (d-e) Land cover maps of the central area of Guangzhou from Dynamic World and World Cover. (f) Our result for Guangzhou.

Table 2: Percentage of pixels belonging to each category in the Chinese megacity test areas. The sparse label contains 5.21 × 10^7 pixels and the dense label covers 7.77 × 10^6 pixels.

Table 3: Percentage of pixels belonging to each category in the additional Asian city test areas. The sparse label includes 2.40 × 10^6 pixels and the dense label covers 1.29 × 10^6 pixels.

Table 5: Quantitative evaluation of land cover mapping for five Chinese megacities based on sparse label and dense label. Accuracy results are expressed as percentage values (%).

Table 7: Comparison with the recent leading UDA approaches on the additional six Asian cities. "D_S-only" is the baseline, meaning only using the multi-scale D_S to train the Siamese network. "DPA" indicates dynamic pseudo-label assignment. Accuracy results are averaged over the six cities and expressed as percentage values (%).