Generating annual high resolution land cover products for 28 metropolises in China based on a deep super-resolution mapping network using Landsat imagery

ABSTRACT High resolution of global land cover dynamic is indicative for understanding the influence of anthropogenic activity on environmental change. However, most of the land cover products are based on Landsat image that only has 30 m resolution, which is insufficient to distinguish the heterogenous urban structure; while very high spatial resolution image usually has low temporal resolution, which is difficult to monitor the urban dynamic. Deep-learning-driven super-resolution mapping is a prevailing way of achieving very-high-resolution land cover dynamic products in aspect of alleviating the mixed pixel problem of Landsat image. However, two limitations are obvious: 1) the fixed grid of kernel during the upsampling process favors spatial homogeneity and suppresses the learning of spatial heterogeneity of urban composition and 2) geometric or radiation variation over large spatial and long temporal extent in remote sensing images makes the super-resolution mapping approach difficult to transfer for application. Here, we attempt to solve these two limitations: 1) a progressive edge-guided super-resolution architecture is designed to allow nonuniformed kernel specific at the low-confidence edge region and intensify the learning of heterogenous compositions’ patterns and 2) an alternating optimization strategy is designed to minimize the resultant entropy and modulate the classification hyperplane to accommodate to the manifold of the discrepant region. Validation experiments are investigated based on a fine-grained and large-extent super-resolution (FLAS) dataset constructed in this study, and it is found that our approach remarkably enhances rich detailed patterns of heterogenous region and outperforms other state-of-the-art algorithms. Besides, we applied DETNet to the large spatial extent of 28 metropolises in China (>40,000 km2) and the large temporal extent of continuous 21-year (2000–2020) in Wuhan city to examine transferability. From the land cover areas variation, we find that the expansion rate of cropland is faster than the urban expansion over the past 10 years, which are gradually becoming the principal source for the encroachment of forest and lakes. From detailed urban dynamic reflected by the 21-year products, we find that urban-villages between the old city zone and the outer high-tech development zone are gradually disappeared. The captured dynamic is consistence with the urban-village renovation policy during this period, which is meant to redistribute the spatial configuration of the city for a more sustainable urban structure. We believe that the proposed method can facilitate a seamless and fine-grained observation system that can fill the weakness of the existing land cover activities and provide a brand-new insight into the urban dynamic and its underlying mechanism.


Introduction
Land cover information retrieval through remote sensing imagery is a promising way for understanding the socioeconomic properties and interactions between human activities and environmental change (Liu et al. 2019(Liu et al. , 2020. Benefit to its long time-series and global coverage, remote sensing imagery can provide an unprecedented insight into the progress of sustainable development (Burke et al. 2021), afforestation (Wernick et al. 2021), and urbanization (Zhang, Chen, and Lu 2015;Deng and Zhu 2020;Aljaddani, Song, and Zhu 2022;Zhu et al. 2022) and help municipality make informed decisions on urban green space management , coastal ecosystem monitoring , water/wetland resource management (Pekel et al. 2016;Hu et al. 2021), and environment protection (Aguirre-Gutiérrez et al. 2020).
Land cover information retrieval usually undertaken by land cover classification and has been exhausted by pioneer, achieving many land cover products, such as the 1-km Global Land Cover Characterization (GLCC) product in 1992 (Loveland et al. 2000), the 1 -km Global Land Cover (GLC2000) product in 2000 (Bartholome and Belward 2005), the 1 -km Global Land Cover dataset provided by National Mapping Organizations (GLCNMO) product in 2003 (Tateishi et al. 2011), the 300 -m Global Land Cover product (GlobCover) in 2005, and 2009(Bicheron et al. 2011, the 500-m annual MODIS Global Land Cover type (MCD12Q1) product (Friedl et al. 2002), the 30 -m Finer Resolution Observation and Monitoring-Global Land Cover (FROMGLC30) product in 2010 and 2015 (Gong et al. 2013), the 30 -m GlobeLand30 product in 2000, 2010, and 2020 (Chen, Ban, and Li 2014), the 30 -m National Land Cover Database (NLCD) product at USA every 5 years since 2006 (Fry et al. 2011), the 30 -m GLC_FCS30 product in 2020 , and the 10 -m Esri Land Cover product in 2020 produced by European Space Agency (ESA) (Karra et al. 2021).
Although the spatial resolution is constantly increased, the existing land cover products are still not enough to delineating the spatial heterogeneity of the highly dynamic urban compositions (Deng and Zhu 2020). For example, the diameter of the urban tree crown is usually less than 10 m, the width of the non-primary road is usually shorter than 30 m. Even the length of the small residential buildings is larger than 30 m, they may also be merged with the surroundings, leading to the destruction of their morphological structure. In this situation, we find that one Landsat pixel (spatial resolution of 30 m) may capture a mixture of the reflectance signal of multiple land cover types, resulting in the mixed pixel problem (Fisher 1997). Conventional pixel-level hard classification approach adopted by existing products usually simplified each pixel grid as one single land cover class (Adams et al. 1995), which ignored the spatial structure of the mixed land covers within 30 × 30 m ground area, leading to a large amount of latent land cover information loss and fine-grained structure loss (Wang et al. 2020).
In the past decades, the mixed pixel problem of Landsat image in urban scene has been thoroughly studied using spectral unmixing (SU) technique, which establishes physical mixture model to simulate the reflected signal mixing process, and try to decompose it to several pure signals (called endmember) and inversing their proportions (called abundance) using least square optimization. The output of SU is a soft classification map (called abundance map), i.e. each pixel is proportional belonging to each endmember class, instead of hard classification that assigning one mixed pixel to a single class. SU has been widely used to explore the proportion of impervious surface or forest inside each mixed pixel (Xiao and Moody 2005;Deng et al. 2017). However, the abundance maps are still retained at 30 -m resolution, which only provide fraction indications without any spatial distribution of each land cover class inside a mixed pixel.
In order to achieve high-resolution (HR) land cover dynamic products, in this study, we attempt to use the super-resolution mapping (SRM, also known as subpixel mapping) approach (Atkinson 1997) for lowresolution (LR) Landsat image. Specifically, SRM split each mixed pixel (30 � 30 m ground area) into subpixel locations (circa 4 � 4 m ground area in our study) and assign the land cover classes on each subpixel location to restore a sub-pixel-scale land cover distribution and realize spatial resolution enhancement (Atkinson 1997). The assignment principal follows the proportion indication derived from SU (Keshava and Mustard 2002;Bioucas-Dias et al. 2012), and the location indication derived from the prior knowledge of the spatial distribution (Wang et al. 2021b). SRM has been widely applied in practical application that requires meticulous measures. For example, Ardila et al. (2011) applied SRM approach on 2.4m QuickBird MS image at Bothoven, Enschede, to reconstruct a 0.6 -m urban tree cover maps with individual plants distinguished; Muad and Foody (2012) used SRM approach to super-resolution the 240m remote sensing image at Quebec province, Canada to 30 m for mapping the lake area; Poudyal (2013) used SRM approach to super-resolve the 2 -m WorldView-2 image to 0.5 -m resolution for mapping the individual plants in the potato farm field and monitoring their growing status; Ling et al. (2019a) applied SRM on 463m MODIS image at Bay of Bengal to reconstruct a 30m resolution map to measure the narrow river wetted width; Zhang et al. (2019) applied SRM approach on 250 -m MODIS images at forest areas in the USA, Paraguay and Russia, to reconstruct time-series 25m resolution forest maps, and discovered tiny changes that was caused by the forest fire and deforestation; He et al. (2022) developed a novel SRM approach on 10m Sentinel-2 images to derive time-series 2 -m urban tree covers that are usually omitted in traditional land cover/forest cover product, etc.
Generally, SRM relies on spatial prior knowledge to regularize the ill-posed problem of reconstructing LR to HR image (Atkinson 2009), which provides the indications of how each class located within a mixed pixel. We summarized four types of spatial prior: 1) Spatial dependence prior assumes that spatial proximate pixels more likely belong to the same class than the distant one, which includes spatial attraction model (Mertens et al. 2006), pixel swapping model (Atkinson 2005), genetic algorithm SRM (He et al. 2016a), Hopfield neural network SRM (Su 2019), spectral-spatial fusion-based SRM (Xu et al. 2018), and random field-based SRM (Kasetkasem, Arora, and Varshney 2005); 2) Geostatistical prior assumes that the sub-pixel distribution is subject to empirical geostatistical models that can be regressed with given samples (Boucher and Kyriakidis 2006), which includes multi-point simulation (Ge 2013), area-topoint kriging (Wang et al. 2015), and spatial distribution pattern-based SRM (Ge et al. 2016); 3) Regularization prior assumed that HR image follows the specific manifold like Laplacian model, total variation model, non-local prior or sparse prior, which includes maximum a posteriori SRM (Zhong et al. 2015), and sparse representation SRM ; 4) Spatiotemporal prior is aimed to provide complementary information from external dataset for constraint, such as time-series temporal images , shifted images (Zhong et al. 2014), panchromatic images (Nguyen, Atkinson, and Lewis 2011), and Point of Interest (POI) (Chen et al. 2018b). However, the spatial priors aforementioned are usually model-driven or expertly handcrafted, which is too inflexible for reconstructing the irregular heterogenous distribution (Ling et al. 2019b). Recently, deep-learning-based super-resolution mapping network (DLSMNet) is developed, which is aimed to learn a more realistic sub-pixel-scale spatial priors from exemplar pairs of LR image and HR annotations based on the end-to-end mapping and representation learning ability of the network (Yu et al. 2017). Many prevailing network architectures were taken as backbone to formulate advanced variants of DLSMNet, such as super-resolution reconstruction network (Ling et al. 2019a;Ling and Foody 2019b;Ma et al. 2019;He et al. 2021a;Shang et al. 2020), semantic segmentation network (Arun, Buddhiraju, and Porwal 2018;Jia et al. 2019), graph convolution network , and spatiotemporal fusion network (Chen, Shi, and Ge 2021b). The efficient network modules such as attention mechanism (He et al. 2021b and multi-scale pyramid  were also fully explored, and the interpretable network modeling was also studied by embedding physical model of SU and SRM into the network (He et al. 2021c;Yin et al. 2021).
Although the DLSMNet is a prevailing trend in SRM community, it leaves several open questions to be further studied: 1) The network used stacked convolutional layers to enlarge the receptive field for learning high-level feature, which is equivalent to downsampling and sacrifices the low-level spatial details (Huan et al. 2022). Traditional DLSMNet usually adopted the fixed grid sampling of 3 � 3 (or other size) kernel in the deconvolution or sub-pixel convolution layer to restore the downsampled feature map; however, fixed grid sampling favors spatial homogeneity and makes the spatial details merged with the neighboring dominant class and being ruined (Marin et al. 2019); 2) Landsat images in different regions may have geometrics or radiance variations due to the different acquisition conditions (Tuia, Persello, and Bruzzone 2016), which result in a feature distribution gap between training data (source domain) and test data (target domain) (Kouw and Loog 2018). The network only trained on the training set may confront overfitting problem and suffer a performance drop on the large-extent Landsat images. Although a large amount of training data can strengthen the generalization performance (Kolesnikov et al. 2020), the meter-level high resolution annotation data are insufficient and not adequate in quality . Therefore, generalizing the network to the diverse large extent images remains a major challenge.
In this study, we developed a novel deep edgeguided and transferable super-resolution mapping network (DETNet) to alleviate the weakness of the fixed grid sampling in reconstructing spatial heterogeneity and improve the generalization ability of the network model, achieving HR land cover products over largeextent and over time. The main contributions are as follows: 1) we designed an architecture in DETNet to adaptively search the error-prone pixel for locating the potential edge region using 1 � 1 multiple layer perceptual (MLP), which enables a nonuniformed sampling of convolution that adhered to the edge to intensify the learning of the edge region; 2) we designed a strategy to alternatingly minimize the target domain's entropy and the source domain's loss function during model fine-tuning, to gradually project the classification hyperplane to the manifold of the target domain, making the DETNet model valid on Landsat image in large-extent and different time; 3) in order to train a generalized DETNet, a fine-grained and largeextent sub-pixel-scale annotation (FLAS) dataset is constructed by surveying and mapping experts, with pairs of 30 -m Landsat image and very-HR annotation, which have a total coverage of more than 6362 km 2 . After that, we used the trained DETNet to reconstruct 30m Landsat image to meter-level HR land cover maps with rich geographical-realistic details, over large extent of the 28 metropolises (>40,000 km 2 ) in China, and over long time series of continuous 21 years  in Wuhan city, to examine its feasibility in understanding how cities change in space and over time.

Training dataset and pre-processing
Up to now, meter-level well-annotated land cover dataset is significantly deficient. Although several HR annotated datasets exist in the community (Gerke et al. 2014;Maggiori et al. 2017), the coverage of which are limited within 10 km 2 and the geographic patterns are similar, which is insufficient for training a generalized model suitable for large extent mapping. Although the Gaofen-2 Image Dataset (GID) annotation has both high spatial resolution and the large spatial extent, i.e. 4 -m resolution and 50,000 km 2 , the annotation is not sufficiently finegrained, as shown in Figure 1. To this end, this research firstly constructed a Fine-grained and Largeextent SRM HR annotation (FLAS) dataset.
Specifically, 211 annotation locations are randomly sampled in Guangdong province, China, and the Google Earth images with size of nominal 4500 × 6700 pixels at 1 -m resolution acquired in these sampling locations in summer 2019 are used ( Figure 1). The sampling region covers more than 6362 km 2 , which enables a diversity of spatial distribution patterns to ensure the generalization of the trained model. The pixel-wise annotation work was carefully done on the HR Google Earth images with the help of experts major in survey and mapping. Since they usually take ground survey within Guangdong province and are familiar with the in-situ ground truth, they have prior knowledge of which parcel on the image should be annotated as forest, shrub, grass, etc. As for annotation implementation, we manually delineate the boundary vector of each land cover parcel on the HR Google Earth images with the aid of ArcGIS. For the land cover parcel that is hard to determine, the NDVI, NDWI, NDBI, and DEM are also taken as the assistant data for a reliable determination. After that, the delineations are double checked by another group without participated in the previous work, to further screen and correct the poor-quality annotation. Finally, the boundary vectors are converted to raster Figure 1. The sampling location of the FLAS annotation dataset. One scene of the FLAS annotation is enlarged for visual display, and the corresponding GID annotation at the same field of view is also displayed for comparison. image to formulate the FLAS annotation label. Considering the balance between super-resolution scale and reliability, and in order to stay in step with the GID annotation dataset, the FLAS annotation dataset is resampled to 4 -m resolution to release the reconstruction difficulty and ensure the reliability. The seven discriminant categories are chosen based on the Chinese Land Use Classification Criteria (GB/ T210102017), i.e. cropland, forest, grass, shrub, water, impervious surface area (abbreviated as impervious in the following), and bare land. In addition, this annotated dataset was examined by third party that is not involved in annotation work, for a further refinement. The number of the annotated pixels and the ratio for each category is provided in Figure 2.
The training of the SRM network requires pairs of LR image and HR annotation dataset, thus the Landsat-8 Operational Land Imager (OLI) image with 30 -m resolution is selected. Specifically, Landsat-8 OLI top-of-atmosphere (TOA) images deposited in Google Earth Engine (GEE) (https://developers.goo gle.com/earth-engine/datasets/catalog/LANDSAT_ LC08_C02_ T1_TOA#description) is used in our study to avoid atmospheric effect and ensure consistent observation in large area, as was done by Xiong et al. (2017). In order to eliminate the cloud contamination and the influence of phenology factor, the Landsat images within the growing period of April-September 2019 are composited by median-mosaic operation (https://developers.google.com/earthengine/apidocs/ee-imagecollection-median), and totally seven bands are used. The Google Earth images and the corresponding images are projected to WGS-84 coordinate and were rigorously registered using ENVI, with registration error less than a Google Earth image pixel.
We split the 211 FLAS annotated scene images into training set, validation set, and test set based on the proportions of 70%, 10%, and 20%, i.e. 147, 22, and 42 scene images, respectively, which are denoted as red, blue, and green block in Figure 1 and non-overlapped with each other. The patches of size 512 × 512 on the FLAS annotations and the corresponding Landsat patches of size 64 × 64 on the Landsat images were cropped to form pairs of HR and LR samples, which can be fed to the network for training a SRM task model. Totally, the 1536 patches (512 × 512 pixels) are cropped from the 147 scenes images (circa 1125 � 1675 pixels), 233 patches are cropped from the 22 scene images, and 485 patches are cropped from the 42 scene images. The specific cropping procedure can be referred to the supplementary file. The training set is used to train the network model, the validation set is used to evaluate the model at each training epoch to select the optimal network model, and the test set is withheld until the final test, which provide the final accuracy of the network.

Study area
In order to examine the large-extent generalization ability, 28 provincial capital metropolises in China are selected as the study areas, which are located at the temperate monsoon climate and the subtropical monsoon climate zone, ranging from 19 to 46° N, 102 to 130° E, as shown in Figure 3. The residual 6 provincial capital cities are located at the continental climate and mountain climate zone, which have large variation in land cover types and statistic distribution characteristic.
In consideration of the importance of these provincial capital cities, a regular observation and land cover mapping is necessitated for an in-time monitoring of the land cover changes and analyzing their internal driving force and influence to the eco-systems. It helps understand the interactions between human activities and environmental change, which is important for realizing the goal of sustainable development.

Sub-pixel mapping (SRM)
SRM is a widely used technique for mitigating the mixed pixel problem by reconstructing the geographical locations of each land covers at sub-pixel-scale inside the mixed pixel. Classical SRM usually relies on the abundance images (soft classification, i.e. one pixel proportionally belongs to multiple class) derived from SU technique (Bioucas-Dias et al. 2012) and further specifies the sub-pixel-scale locations of each land cover. Given a soft classification of 3 � 3 mixed pixels with three classes (Figure 4), SRM grid divides the mixed pixels into several sub-pixel sites by interpolation according to the scale factor S (in this case S=5, i.e. one mixed pixel is split into 25 sub-pixels) and allocates the land cover labels to each sub-pixel site, in accordance with the proportion indication. This realizes a hard classification (i.e. one sub-pixel belongs to only one class) at sub-pixel-scale with spatial resolution improved by S. Compared to the traditional pixel-scale hard classification assigning one pixel to single label (Figure 4), more morphological structure and detailed edges can be reflected by SRM result.
h; w; c represents the height, width, and band of the image). y i represents the ith mixed pixel covering 30 � 30 m ground footprint, and its reflectance signal is assumed to be a combination of the signals of multiple pure land cover types. This mixed reflectance signal can be represented by the sum of the endmember's spectral signal M 2 R c�NC weighted by their abundance α 2 R NC�N based on the SU technique (NC represent the number of land cover types), given as: where α i;k represent the proportion of the kth class in the ith mixed pixel, which should follow the abundance nonnegative constraint (ANC) and abundance sum-to-one constraint (ASC) to restrain the proportion value to conform to the physical meaning, i.e. proportion value should be no less than 0 and total proportion values in one mixed pixel should equal to 1. However, SU only provides abundance information, without any indication of where each endmember spatially located in the mixed pixel.
SRM is then adopted to specify the spatial location of each endmember at sub-pixel scale based on its spatial dependence with the surrounding pixels (e.g. eight-neighborhood) and finally extend the LR abundance image α 2 R NC�N to the HR classification map X 2 R 1�ðN�S 2 Þ . SRM is formulated based on the observation model to bridge the LR and HR image (Zhong et al. 2015). Considering that the metric criterion of X � 1; 2; 3 . . . ; NC f g and the α � 0; 1 ½ � is not compatible, one-hot operation is used to uniformed two by transforming the X to the probability map Z 2 R NC�ðN�S 2 Þ . In this way, the observation model of SRM problem can be given as: where D 2 R ðN�S 2 Þ�N is the downsampling matrix, which represent the degradation process from Z to α. Substitute the Eq.
(2) into Eq. (1) can formulate the super-resolution task of super-resolution LR Landsat image to HR classification map: Considering that the traditional model-driven or expertly handcrafted priors is difficult to estimate D due to the irregular spatial patterns, Ling et al. (2019b) argued to learn geographical-realistic priors by datadriven deep learning method and developed deeplearning-based super-resolution mapping network (DLSMNet). Given training pairs of {Y, X}, the M and D can be implicitly estimated as network parameter θ, by minimizing the distance L between the network predicted result b X and the ground truth X, give as:

Deep edge-guided and transferable Sub-pixel Mapping Network (DETNet)
Considering the limitations of the current DLSMNet in spatial detail reconstruction and transferability, this research developed DETNet model. Generally, DETNet consists of three main blocks according to its purpose ( Figure 5), i.e. feature representation block is for extracting discriminative feature for the mixed land covers, upsampling block is for mapping the features from LR to HR feature space with the proposed PESR architecture, and loss function block is to measure the difference between the estimated HR classification and the HR ground truth. At the training stage, the gradient error from loss function is back-propagated to update the network parameter, and after sufficiently trained, the model is applied to the unseen data at the test stage, in which the proposed AOE strategy is used to enhance the generalization ability.

Feature representation with ResNet101
Firstly, given coarse input Landsat patch Y 2 R h�w�c , to set up sub-pixel locations, by bilinear interpolation layer. After that, one of the most commonly used CNN architecture, Residual Network with 101 layers (ResNet101) is adopted as the feature representation block (He et al. 2016b), to extract discriminative features of the mixed land cover for further modeling. The specific structure of ResNet101 is provided in Table 1.
The convolutional transformation of ResNet101 on Y 0 is given as: where W and b are weights and biases, which are the network parameters to be solved.
The output feature maps of ResNet101 (Layer 4, denoted as F l4 2 R H=16�W=16�1024 ) is taken as the highlevel semantic features, which is fed to Atrous Spatial Pyramid Pooling (ASPP) layer and SoftMax classifier layer to derive the preliminary class probability map F l4p 2 R H=8�W=8�NC , as was done by Chen et al. (2018a).

Progressively Edge-guided Super-resolution Reconstruction (PESR)
In order to reconstruct the geographical-realistic detail of the land covers, the PESR architecture is designed in upsampling block ( Figure 6). The underlying mechanism of PESR is to adaptively localize the fine-grained edges at multiple scale and refine the prediction of these detailed edges based on the additionally trained nonuniform point-wise classifier. Noteworthy is that the edge enhancement has been thoroughly studied in deep learning community; however, most of which are based on redundant dual branch structure and rely on extra prepared edge annotations for supervision (Yang et al. 2017).
On the contrary, our method is free from extra edge annotations, since it can adaptively search the errorprone pixel for locating the potential edge region.
Firstly, we design a progressive super-resolution architecture to mitigate the difficulty of filling a large scale gap at one time (Pan et al. 2020). Based on this architecture, the input F l4p is progressively upsampled by 2-fold once a time with sub-pixel convolution (SPC) layer (Shi et al. 2016 1 Hard point sampling (HPS). HPS strategy is to precisely localize the fine-grained edges of the land covers on F 2 l4p . Generally, the edges are the most ambiguous locations in a scene, with low confidence points that usually have indistinguishable probabilities value among each class (Dai et al. 2007). Based on this assumption, we measure the hard point by the difference between the largest and the second largest probability value point-by-point across the F 2 l4p . The measurement can be given as: means the second largest probability value. If the difference is less than threshold ε, the point i; j ð Þ is selected. Totally, the N most ambiguous hard point are localized, forming a location set P, where Figure 6 to display the localized hard points (red point).
2 Spatial-semantic synergized features extraction. The spatial-semantic synergized features of the fine-grained edges were extracted for the subsequent reclassification. Specifically, the "Layer 3" of ResNet101, denoted as F l3 2 R H=8�W=8�512 , is adopted as the complementary spatial features, which is  merged with the semantic probability map F l4p , to generate spatial-semantic synergized features: where up �2 ðÞ denotes upsampling by 2-times, Concate½� denotes concatenating two tensors along the channel dimension, and thus the channel dimen-

Nonuniformed point-wise refinement (NPR).
The NPR is designed to reclassify the fine-grained edges by the additionally trained classifier, based on the extracted spatial-semantic synergized features P F 2 c � � . Notably, instead of 3 � 3 convolution, NPR classify the fine-grained edge points with a MLP classifier point-by-point, which can avoid the interference of the neighboring pixel introduced from the fixed grid sampling of the convolutional kernel (Kirillov et al. 2020). Specifically, MLP classifies each point

Replacing the point with the refined prediction.
The F S l4p is refined by replacing the probability vector of the corresponding points on F S l4p with the newly predicted P n F S l4p h i : where NPRðÞ means replace the edge's probability vector on up �2 F l4p À � with the reclassified one. The residual upsampling stages are the same way.
Finally, the outputs of PESR are the refined F 8 l4p and F 8 c , which are denoted as Ẑ 2 R H�W�NC and F out 2 R H�W� 512þNC ð Þ for subsequent implementation, respectively. Ẑ represents the super-resolved HR probability maps, which can be hardened to obtain the sub-pixelscale classification result X .

Loss function
For training the network model, the estimated probablity Ẑ is rectified by the ground truth Z 2 R H�W�NC with cross entropy loss function, to supervise the learning of sub-pixel-scale classification: where Ẑ i; j; k ð Þ denotes the predicted probability value at i; j ð Þ location in k channel and Z i; j; k ð Þ denotes the real probability value at i; j ð Þ location in k channel.
Besides, in order to train the MLP classifier, the feature vectors at hard points (P F out ½ �) are preliminarily classified to generate probability vectors, which are then rectified by the corresponding point in ground truth Z (denoted as P Z ½ �) with cross entropy loss function, given as: where P n;k Z ½ � denotes the real probability value of the nth hard point in k channel and MLP P n;k F out ½ � À � denotes the estimated probability values of the the nth hard point in k channel. Notably, the weights are shared across all MLP, as was done by Kipf and Welling (2017), which is simple and computational efficiency. Thus, the total loss function can be obtained by the weighted sum of SRM_loss and Point_loss, given as: Given g, X m 2 R H�W is firstly one-hot encoded to generate ground truth probability map Z m 2 R H�W�NC , and the network parameters θ can be obtained by iteratively minimize the difference between the estimated Ẑ m and ground truth Z m , with loss function L seg as criteria, given as: After converging, the trained model with parameter θ can be used to reconstruct the external Landsat images to generate sub-pixel-scale HR classification image.

Alternating optimization with entropy-minimization strategy (AOE)
Although well-trained on the training sample (source domain image), the network model may also be confronted with a performance drop on the Landsat image at unknown areas (target domain image) due to the feature distribution gap between source and target domain, which impedes the usability of DETNet in large extent mapping. In order to reduce the crossdomain discrepancy between source and target results, a criterion that measures this discrepancy is considered. Inspired by Grandvalet and Bengio (2005), the prediction from the source domain usually has low entropy, indicating a highly confident result, while the prediction from the target domain has high entropy, indicating an unreliable result (Vu et al. 2019). Thus, entropy of the resultant map is adopted as the criterion (Shannon 1948). Supposing the Landsat images from source domain (with label) is denoted as Y s m ; m ¼ 1 . . . M, and the images from target domain (without label) are denoted as Y t l ; l ¼ 1 . . . L. Given an already-trained model, the output probability map is Z t l , the entropy of Z t l can be given as: Generally, enforcing a minimal entropy on the target result Z t l can modulate the already-trained network and gradually bridge the domain gap. However, modulating the network parameter takes the risk of negative transferring, which may destroy the performance on the source domain. In order to preserve the subpixel-scale classification accuracy on the source domain, we design an alternating optimization entropy minimization (AOE) strategy (Figure 8)

Experiment purpose
We firstly validate the proposed DETNet model on the FLAS dataset, together with classical SRM and the state-of-the-art DLSMNet model for comparison, and conduct an ablation analysis to validate the effectiveness of the PESR and AOE. After then, we validate its transferability over large extent of 28 cities in China. Finally, we deploy the DETNet to continuous 21-year Landsat images in Wuhan during 2000-2020, to validate its performance in observing urban dynamic and spatiotemporal evolution patterns. The framework of the experiment is shown in Figure 9.

Evaluation metric
For quantitative validation on FLAS dataset, the persubpixel classification metrics like overall accuracy (OA), the mean producers' accuracy (mPA), users' accuracy (mUA), and mF1-score of all the classes, are investigated. It worth noting that Kappa is recently proved to be improper for land cover accuracy assessment (Foody 2020), and the intersection over union (IoU) essentially measures the same thing as F1-score (Maxwell, Warner, and Andrés Guillén 2021), thus, these two metrics are not included. Meanwhile, in order to examine the performance on each land cover class, class-wise F1-score is also considered. They can be derived from the confusion matrix (Table 2), and the calculation is provided in Eq.
(20)- (22). where n ij means the number of pixels that belong to class j (according to the ground truth) is predicted as class i by the classification model, n þj represents the total number of pixels that belong to class j, and n iþ represents the total number of pixels that are predicted as class i. K is the total number of classes.
Furthermore, considering that per-subpixel classification metric only concerns the universally pixel-wise accuracy and cannot reflect the detail reconstruction ability, a novel semivariogram metric is adopted to further inspect the performance on detail structure, which can quantitatively characterize the spatial variation or spatial pattern in land cover map (Curran 1988;Balaguer-Beser et al. 2013). Specifically, after deriving the estimated HR probability map Ẑ 2 R H�W�NC , the semivariagram at lag h is: where v i and v i þ h denote the spatial locations on Ẑ k that have a distance of h. Ẑ k v i ð Þ and Ẑ k v i þ h ð Þ denote the probability value of these two locations. N h ð Þ denotes total number of pairs of v i and v i þ h ð Þ on Ẑ k . Specifically, four directions (0°, 45°, 90°, 135°) are considered for lag h, and h is set to [1,8] with step size 1 in 0° and 90° direction, and with step size 1.414 in 45° and 135° direction (Wang et al. 2020). With the lag varying, the γ k h ð Þ can form a semivariogram curve in each direction. The more similar the semivariogram curve between the predicted Ẑ and the ground truth Z, the better the performance on the detailed  structure. Furthermore, in order to quantify this similarity, the mean square difference (MSD) of the semivariogram curve for each class (MSD k ) is investigated. For quantitative validation on large extent 28 cities in China, three verification strategies are employed: 1) In order to quantitatively validate our products, a consistence assessment with contemporary land cover products is developed. It uses a consistence map derived from other products, including 10m Esri Land Cover in 2020, 10-m FROMGLC10 in 2017, and 30-m GlobeLand30 in 2020. They are initially resampled to 4-m resolution. If one pixel is recognized as the same class by three products, it is assumed to be a convincing annotation and can be used to validate our product. Although the consistence map is much coarser than our products, it can be still used to evaluate the homogenous area of our products; 2) Considering that the consistency assessment could only validate homogenous area, we further used the publicly certified 4-m GID annotation dataset for further validation. Other prevalent products, including GlobeLand30, FROMGLC10, and Esri Land Cover, are also participated for comparison. Each GID scene has size of 6800 × 7200 pixels (~786.36 km 2 ). Noteworthy is that GID annotation was essentially random-sampled, which is not always located within the provincial capital city, thus we only use the overlaid GID annotation; 3) Considering that GID annotation dataset is not fine-grained enough to examine the spatial detail reconstruction performance, additional 12 validation regions with size of 2345 � 2345 (~88 km 2 ) are further sampled in three typical cities, i.e. Beijing, Guangzhou, and Wuhan, at the heterogeneous locations, which are annotated by experts without participating in the experiment. Also, GlobeLand30, FROMGLC10, and Esri Land Cover products are selected for comparison.

Comparison algorithms
In order to provide benchmarks for the validation of the proposed method, the classical SRM and the stateof-the-art DLSMNet are considered for comparison. Noteworthy is that since our goal is to super-resolve the coarse Landsat image to fine land cover product, the semantic segmentation network which only deal with the same resolution between input and output is not included. Classical SRM algorithms include spatial attraction SRM model (SASM) (Mertens et al. 2006) and Adaptive MAP model and a winner-takes-all Class Determination strategy for Sub-pixel Mapping (AMCDSM) (Zhong et al. 2015). DLSMNet include LinkNet ( (He et al. 2021b). In addition, traditional pixel-scale hard classification (HC) is also taken for comparison, which adopted typical SVM method that assigns a single label to each mixed pixel.

Experiment environment and parameter settings
All the DLSMNet models are trained on a server with NVIDIA RTX 2080Ti graphics processing unit (GPU) accelerators (11-GB memory) and were implemented in TensorFlow 1.15.0, Pytorch 1.2.0, and Python 3.7. For the state-of-the-art DLSMNet, the epoch is uniformly set to 1000, the learning rate is 0.0001, the Adam optimizer is used, and the batch size is 16, as suggested previously (Arun, Buddhiraju, and Porwal 2018;Jia et al. 2019;He et al. 2021a). As for DETNet, the learning rate is set to 0.01, the batch size is 4, and the SGD optimizer is used with momentum of 0.9 and weight decay of 0.0001. The number of the sampled hard points N is empirically set to 32,768 (1/8 of the 512 � 512 patch) after parameter tuning. Figure 10 demonstrates the visual inspection of the results (five representative resultant patches are selected. More resultant patches can be referred to Appendix S1 in the supplementary file), and Figure 11 displays the zoomed-in region for detail reconstruction examination.

Evaluation in terms of per-subpixel classification
It is obvious that 30 m/pixel Landsat images are affected by the mixed pixel problem, with blurry edges and vague structures, compared to the 1-m Google Earth images. Due to this reason, the result of traditional pixel-scale HC is unpleasant with spackle noise and jagged boundaries. By contrast, sub-pixel-scale classification can generate smooth edges with small-scale land cover objects ( Figure 11).
Among compared algorithms, DETNet achieves the best result, with the spatial structure of the building and small-sized grass clearly characterized in 3 rd patch, and the elongated road in 2 nd patch is also well restored (Figure 10). By contrast, the results of SASM and AMCDSM are compromised with noise, which is attributed to the uncertainty of the input abundance image due to the spectral variability phenomenon. As for DLSMNet, the LinkNet result is oversmoothed without details, since the excessive downsampling in the encoder stage ruins the lowlevel features. The SRMCNN makes a simplification on LinkNet architecture to reduce pooling layer for preserving spatial details, and the results become better. SRMCNN_ESPCN and SRMCNN_SRGAN eliminate the pooling layer to further keep the low-level feature and use SPC layer to advance the upsampling performance. The SIMNet resembles the structure of SRGAN, further takes the abundance images as semantic constraint for spatial detail enhancement, and achieves a more pleasant result.
The confusion matrixes of each result are displayed in Figure 12, with the PA and UA of each class presented. The rows represent the prediction and the columns represent the ground truth. The cell entry is the population proportion of area normalized by dividing the total area, and the values of each table as a whole sum to 1.0 (Stehman and Foody 2019). For DETNet, it is obvious that most predictions are located at the diagonal, and most of the PA and UA of each class are better than other algorithms, indicating that each land cover class is properly predicted. For other algorithms, the shrub and forest are usually confused, as well as the bare land and impervious.
For quantitative assessment in terms of persubpixel classification metrics (Table 3), the OA of DETNet achieves 79.28%, and outperforms the second best SIMNet by 6.70%. The mPA, mUA, and mF1-score of DETNet also makes an improvement by 0.0913, 0.1502, and 0.1157 compared to SIMNet, respectively. In addition, the statistical results of SRMCNN, SRMCNN_ESPCN, and SRMCNN_SRGAN are very similar, with average OA of 68.81% and small standard deviation of 0.53%, while LinkNet is slightly worse. Not surprising is that HC ranks the lowest in all accuracy metrics, which indicates an unsatisfactory performance.
For quantitative assessment of each class in terms of F1-score (Table 4), the most increment occurred in grass, shrub, and bare land, which are increased by 18.90%, 17.36%, and 23.13%, respectively. The reason is that these classes are sparsely distributed and difficult to be identified for traditional algorithms, while being beneficial to the detail edge localization ability of DETNet, they can be well retrieved. Meanwhile, for cropland, forest, water, and impervious class, DETNet also makes an improvement by 8.43%, 5.21%, 4.36%, and 3.40%, respectively.

Evaluation in terms of semivariogram
The semivariogram curves of the ground truth and that of the predicted results by each algorithm are shown in Figure 13. It is obvious that the curves are similar in the four directions, which indicates a similar spatial variation along these directions. Meanwhile, the curves of each class are varying, which indicates a largely distinct distribution pattern of each class. For forest, grass, impervious, and water classes, large semivariogram values are observed, indicating that they have distinct distribution patterns with large spatial variation, which is consistent with the visual inspection in Figure 13. For these classes, the DETNet can realize a more similar semivariogram curves with the ground truth, suggesting that DETNet has a superior sub-pixel-scale reconstruction ability. However, the DETNet has a slightly performance drop for the shrub and bare land, the reason is that the shrub and bare land is rare, and the parcel of these classes are widely spaced and far away from each other compared to the dominant classes, resulting in a low value of the semivariance of the rare classes. Besides, the DETNet is also inferior to SIMNet in restoring the spatial structure of cropland, since the cropland is usually homogenously distributed, and the structure derived by DETNet tends to be fragmented.
In order to quantify the spatial structure restoration performance, the MSD between the semivariogram curves of each algorithm and that of the ground truth for all the classes are recorded (Table 5). It is obvious that DETNet achieves the best MSD for the forest, grass, water, and impervious classes, which are reduced by 0.0034, 0.0021, 0.0003, and 0.0039, compared to the second best algorithm, respectively. As for cropland, shrub, and bare land, however, the DETNet undergoes accuracy drops of 0.0005, 0.0005, and 0.000068.

Ablation analysis of PESR and AOE
In order to validate the effectiveness of the PESR and AOE strategy, ablation analysis is conducted. The baseline for comparison is constructed by removing both parts. Then, the "Baseline + PESR" is involved by adding PESR, to independently inspect its effectiveness. Finally, the "Baseline + PESR + AOE" (i.e. DETNet) is involved to examine the enhancement achieved by AOE strategy. The statistical result is given in Table 6. Besides, a visual inspection of ablation analysis is presented in Figure 14.
Specifically, PESR makes a remarkable improvement by 8.60% and 0.1458 in OA and mIoU when comparing the "Baseline" with the "Baseline + PESR." From the visual inspection in Figure 14(a), the accuracy improvement achieved by PESR is mainly located at the boundaries that constitute the detail spatial structure of the scene, which verifies the spatial detail reconstruction ability of PESR.
As for the effectiveness of AOE strategy, the accuracy gain is not obvious, only improved by 0.51% and 0.0068 in terms of OA and mIoU, since     Figure 13. The semivariogram curves of each algorithm to examine the reconstruction ability toward the spatial detail structure. the statistical distribution of the training set and test set of FLAS dataset is similar. Despite this, from the visual inspection in Figure 14(b), the cropland and grass land, which is easily confused in "Baseline + PESR," is properly distinguished in "Baseline + PESR + AOE," owing to the contribution of AOE strategy.

Results on large-extent study area of the 28 cities in China
The large-scale mapping products of the 28 cities in China in 2020 derived by DETNet are summarized in Figure 15. A detailed visual inspection of them can be found in Appendix S2 in the supplementary file. Besides, the mapping results on the residual six provincial capital cities located at the continental climate and mountain climate zone are separately provided in Appendix S3, to provide the limitation of our model.

Evaluation by contemporary land cover products
From consistent assessment (provided in Appendix S4 in the supplementary file), it is obvious that the accuracy of the cities in China's north-eastern region (e.g. Shenyang, Harbin, Changchun, Beijing, and Hefei) is usually higher than that in China's central plain and south region (e.g. Wuhan, Changsha, Chongqing, and Haikou), since the land covers in the north cities are more homogenous cropland or forest gathered to a cluster, while they are scatteredly distributed in the south cities with large spatial heterogeneity. The highest consistency of 95.06% OA is achieved in Shenyang, which is due to its flat terrain with clear boundaries between urban impervious area and cropland. Besides, the cropland is homogenously distributed with little forest or water disturbance. While the lowest consistency of 74.68% OA is achieved in Chongqing, which due to its mountainous topographic makes the urban and cropland highly confused with the mountain forest. Overall, our products achieve well consistency with the prevailing contemporary products, achieving OA of 84.11%, which preliminarily substantiate its reliability. Noteworthy is that the consistence map only has values at the homogenous area, thus the accuracy presented can only reflect the rough performance at the homogenous area, further examination at heterogenous area should be considered.

Evaluation by GID annotation
The visual and quantitative examination validated by the GID annotation are provided in Appendix S2 and S5, respectively, which has totally 14 cities  Figure 14. The visual inspection of the improvement achieved by the PESR and the AOE strategy. (a) Comparison between "Baseline" and "Baseline + PESR"; (b) comparison between "Baseline + PESR" and "Baseline + PESR + AOE." that have overlaid region with our products. From visual inspection, it can be observed that the GlobeLand30 product is relatively coarse, the urban area is usually clustered to one polygon without inner structure, which is not suitable for observing urban dynamic. Although at 10m resolution, the Esri Land Cover product is also coarse, and the cropland area is usually misclassified to impervious surface. As for FROMGLC10 products, the spatial heterogeneity is well delineated; however, the cropland at outskirt is usually misclassified to grass land. On the contrary, the DETNet-derived products restore an urban structure with rich details and fine-grained boundaries. For quantitative assessment, our products achieve the best accuracy for most of the cities. It worth noting that the GID is annotated on Gaofen-2 images in 2015-2016, which is not exactly matched the date of our products in 2020, resulting in an accuracy fallen in Hangzhou and Chengdu, which experienced a rapid urbanization. Averagely, our products achieve OA of 83.11% and surpass the state-of-the-art Esri Land Cover products by 8.515%.

Evaluation on three typical cities
From visual examination in Wuhan area (Figure 16), the detailed spatial structure is well restored in the DETNet result. For example, in T1 region located at the optical industrial park center, the detailed patterns with intersection of the cropland, grass, as well as the district blocks and the road nets, are distinct in DETNet result. As in T2 region located at the Huazhong Agricultural University in downtown area, the structural buildings, urban forest, and cropland are clearly distinguished in DETNet result, but are aggregated in buildings without any structure patterns in other compared products. The visual examination and analysis in Beijing and Guangzhou are provided in Appendix S6.
For quantitative assessment (Table 7), the result achieved by DETNet is superior to Esri Land Cover, FROMGLC10, and GlobeLand30 product in the three cities. Averagely, DETNet achieves OA of 70.70% and outperforms the second-best product by 1.28%. Combined with visual analysis, the increments are mainly located at the heterogenous region, in which the other products only characterize a homogenous pattern without detailed structure, causing an accuracy drop. Although FROMGLC10 reveals considerable spatial details, the recognition is not accurate with large amount of grass area.

Ablation analysis of AOE
Due to the various imaging conditions, the statistical distribution of the training set has considerable distinction in the Landsat images over large extent. In order to validate the effectiveness of the AOE strategy in generalizing the model to the large extent and quantify its contribution, an ablation analysis is conducted, by comparing the result of "Baseline + PESR" and "Baseline + PESR + AOE," at Beijing, Guangzhou, and Wuhan areas. Figure 17 demonstrates a visual perceptual of the contribution of the AOE strategy, and Table 8 provides an average quantitative assessment on the validation regions used in section 4.2.3.
From visual inspection, the results achieved by "Baseline + PESR" are affected by noise, which is due to the fact that the trained model cannot properly recognize the unknown scene that has different statistical distribution from the training set and only gives ambiguous probability maps with a variety of uncertainty points. By contrary, the result with AOE strategy is more continuous, since the noise is appropriately surpassed by reducing the entropy of the target result to transform it to the manifold of the source domain, which can alleviate the uncertainty in probability maps of the target domain.  Figure 17. Visual comparison of "Baseline + PESR" and "Baseline + PESR + AOE" at Wuhan. From quantitative assessment, the AOE strategy significantly enhances the classification accuracy, by 10.97%, 8.00%, and 4.82% in terms of OA in Beijing, Guangzhou, and Wuhan, respectively. This improvement indicates that a considerable crossdomain discrepancy was filled up by the AOE strategy and confirms the transferability of our method for a large spatial extent.

Results on long-term and fine-scale land cover mapping of the 21 years in Wuhan
In addition to spatial extent, the DETNet is also deployed to long temporal extent of continuous 21 years  at Wuhan city that experienced a rapid and constant urbanization, to examine its transferability in capturing fine-grained urban dynamic. Specifically, Landsat-5 within 2000, Landsat-7 in 2012, and Landsat-8 within 2013-2020 are used to form a series of 21-year observations. A median mosaic of the growing period ranging from April to September in each year is taken to eliminate the cloud contamination and phenology variation. Besides, in order to ensure consistency over time, a histogram matching method (Wegener 1990) is initially employed to adjust Landsat-5, 7, and 8 to have the same mean and variance, and 2020 is selected as the reference for the matching. Furthermore, a continuous change detection (CCD) method (Zhu and Woodcock 2014) is performed on annual time-series products predicted by DETNet, to relieve the fluctuation of possible uncertainty. Specifically, we only use the break map (indicating which pixel is varied in which year) in CCD as the indicator mask for the refinement of our time-series mapping results, which can eliminate noise and false changes. The annual time-series results are provided in Figure 18, and a zoomed-in area at Tianhe Airport is provided for demonstrating the performance on detail urban composition. Noteworthy is that other land cover products are limited to the production period and only have single or several years' results, which is not suitable for studying the spatiotemporal change of the surface process.
From visual inspection of the fine-grained urban composition, the morphological structure of Airport Terminal-1 is clearly reconstructed in 2000, such as the lawn and track in the airport. Besides, the construction of Airport Terminal-2 was begun in 2004 and completed in 2008, Airport Terminal-3 was begun in 2012 and completed in 2016, the spatiotemporal dynamic of which is consistent with our time-series products, proving the ability of our products in observing fine-grained urban dynamic.
From the quantitative statistic of the land cover area variation (annual variation of each land cover displayed in Figure 19(a) and the growing increment every 5 years displayed in Figure 19(b)), the impervious area is stably growing with average speed of 16.87 km 2 , except for 2015-2020, which experienced a prosperity of the real estate market that makes the housing prices in Wuhan almost doubled. From the dynamic maps in Figure 20, it is obvious that the impervious area gradually encroaches the cropland along the outskirt. Although being encroached, the cropland area is still constantly grown, since it instead encroaches the lake and forest area, as can be seen in Figure 20. The growing rate of the cropland area fluctuated with an average speed of 29.71 km 2 . Especially in 2015-2020, cropland experiences a drastic increase, the reason behind it is partially attributed to the new first-tier city of Wuhan and the increasing population that put considerable pressure on the agriculture yield. Besides, Wuhan, Hubei is also the main producing location for paddy rice in central plain of China.
From urban structure dynamic reflected in three main districts, i.e. Wuchang, Hanyang, and Hankou (zoomed-in every 5 years in Figure 21). We find that the impervious areas were separately distributed in the old city zone and outer city zone in early year, with the village region abruptly located between them. This is the urban-village phenomenon in  China after "Reform and Open-up" policy, which was caused by rapid urbanization without rational urban planning that gradually encircled the village with high buildings. As time goes on, the impervious areas were then gradually dispersed from the old city zone to the outer city zone and finally connected, forming the Lingkonggang, Zhuankou, and East Lake High-tech Development Zone, respectively, during which process the middle village regions were disappeared. This is mainly due to the urban-village renovation policy that is aimed to redistribute the spatial configuration of the city to renovate the village and construct a healthy and sustainable urban structure.

Computational efficiency
In practical land cover mapping activity, the time consumption is one of the critical indicators when  applying DETNet to large spatial extent. In order to examine the computational efficiency of the proposed DETNet, the time consumption during the training phase, the fine-tuning phase, and the inference phase in Wuhan area are recorded, respectively (Table 9). In practical situation, only the fine-tuning phase and the inference phase are needed when new remote sensing image comes up. After statistic, the DETNet model only takes 1788.13 s to infer the Wuhan area with coverage of 8,569 km 2 , which has great potential for achieving frequency updating of the land cover product, and it is of great importance for meeting the requirement of frequency monitoring applications.

Discussion
These results show that the nonuniformed sampling that clustered at the low confidence edge position when upsampling feature maps can substantially restore more details than the classical deconvolution or sub-pixel convolution in previous DLSMNet that adopted fixed grid sampling of 3 � 3 (or other size) kernel. This finding is consistence with our hypothesis that the learning of the spatial heterogenous distribution is usually weakened by the learning of the homogenous distribution, which is partly attributed to fact that the fixed grid sampling of convolution operation favors spatial homogeneity. Besides, the results also show that synergistically considering the entropy of the target domain during model fine-tuning can considerably reduce the influence of the geometric or radiance variations between source and target domain, which make it possible to transfer to a large extent application. Based on the above two advanced properties, the proposed DETNet model is able to break through the spatial and temporal limitations of the current land cover mapping activities, by super-resolution 30m Landsat images to meter-level fine-grained land cover distribution. Once trained, the DETNet model can be directly deployed on large-extent and multitemporal Landsat images without human intervention, which costs less and has computational efficiency. The high spatial and temporal resolution land cover product is of great value and indicative two aspects, such as the case in Section 4.3: 1) From the area statistic and spatial variation, the impact of urbanization to the ecosystem is indirect, i.e. impervious surface encroached cropland, while cropland encroached forest. Besides, the growth rate of cropland experienced a large increase during 2015-2020 in Wuhan (Province of a Thousand Lakes), which largely encroached the lake area. 2) From micro view of urban structure and its evolution patterns, the dynamic of the impervious surface in the old city zone and the outer city zone was well captured, which reflected an urban-village phenomenon and urban-village renovation process reforming the urban structure. Therefore, the derived products can be served as a data support for municipal decisionmaking.
Although the effect of the modification of the network model is obvious, the limitations of the remote sensing dataset and annotations are also nonnegligible. Specifically, the FLAS dataset is only sampled in temperate and subtropical monsoon climate zone and is mainly located around the urban area, which lacks the samples of wetland, tundra, and ice/snow. In this situation, the trained model may be unable to identify these land covers in the mountainous zone of northwest China. Besides, due to the class imbalance issue (i.e. small amount of grass, shrub, and bareland samples), the transferability of the trained model may be limited. Specifically, it is more suitable in temperate monsoon climate zone, the subtropical monsoon climate zone, or urban agglomeration and its surrounding area, than in other area like desert, mountain, or prairie. But noteworthy is that the advantage of our model is the detail reconstruction ability of heterogenous region with significant spatial variation, such as urban area. The class imbalance issue can be also alleviated by reducing the weight of large ratio class and enlarging the weight of small ratio class in the loss function calculation.
Furthermore, satellite image quality is often degraded by image striping due to sensor calibration problems, as well as by thin clouds and fractus clouds, especially in tropical climates with year-round rainfall. These and other issues reduce the usefulness of spectral data for image classification. In this situation, a data fusion method needs to be further explored to restore the missing/contaminated observations.
In the future development of global land cover mapping community, the DETNet model has great potential to realize seamless and fine-grained HR land cover products, which is indicative of the precise spatial variation of each land covers, can facilitate greater understanding of the impacts of urbanization on biodiversity and human settlements, and help to achieve sustainable development goals.

Conclusion
HR global land cover product is a fundamental source for retrieving Earth surface processes and understanding the interaction of urban dynamic and environmental variation. However, current products are usually restrained by the spatial resolution (up to 10 m) and temporal resolution (2000( , 2010( , 202020102020 for Esri Land Cover), and the urban composition is highly heterogenous, which make the current products insufficient to understand how cities change in space and over time at HR.
This study attempts to alleviate this major problem in the aspect of mixed pixel problem and use advanced super-resolution mapping (SRM) method to decompose the mixed pixel (30 � 30 m) and locate each land cover at sub-pixel-scale (circa 4 � 4 m), which consequently improves the spatial resolution of the land cover map. Considering two specific limitations in deep-learning-based SRM community, i.e. 1) the fixed grid sampling of 3 � 3 kernel in the deconvolution or sub-pixel convolution in previous DLSMNet favors spatial homogeneity and suppresses the learning of spatial heterogeneity of urban composition; 2) the geometric or radiation variation in different region and time in Landsat images makes it difficult for simple SRM to transfer, we develop the DETNet model. Two innovations are obvious: 1) PESR architecture is designed to allow nonuniformed sampling specific at the low-confidence edge position, which can intensify the learning of heterogenous compositions' patterns that are surpassed in the fixed grid sampling of classical deconvolution; 2) AOE minimization strategy is designed to alternatingly minimize the entropy of the target domain and the SRM loss of the source domain during model finetuning, to gradually project the classification hyperplane to the manifold of the target domain, making the DETNet model valid on Landsat image in largeextent and different time.
From experimental results on FLAS dataset, we find that DETNet achieves outstanding performance on detail reconstruction in heterogenous region, the OA of which reaches 79.44%, and makes 6.86% accuracy improvement. From ablation analysis, we find that PESR makes an accuracy improvement of 8.60% in OA, while AOE makes an averagely 7.93% promotion when transferring DETNet in Beijing, Guangzhou, and Wuhan. The semivariogram curve criterion also shows that DETNet achieves the most similar spatial variation with the ground truth.
To examine its transferability in large spatial extent, we deploy DETNet to the 28 cities in China and validate the derived products by three quantitative verification strategies. We find that the derived products restore considerable heterogenous distribution in delineating the urban composition and structure and achieve better accuracy than the contemporary Esri Land Cover, FROM-GLC10, and GlobeLand30 products. To examine its transferability in different times, we further deploy it on continuous 21-year Landsat images during 2000-2020 in Wuhan city and find how city changes in space and over time at HR in two aspects: 1) From area statistic of each land cover, it is found that even encroached by impervious surface, the expansion rate of cropland is much faster than the impervious area over the past 10 years, which are the principal source of the encroachment of forest, grass, and lakes; 2) From the spatiotemporal evolution patterns, it is found that anisotropic rapid urbanization makes the old village encircled with new buildings in early years, while these urban-villages are then reconstructed in later years to redistribute the spatial arrangement for a more rational urban structure, after the execution of urban-village renovation policy. Therefore, the derived products are able to provide data support for assisting the municipal decision-making.
Some key issues can be pursued in future to further promote this research and facilitate a seamless nationalscale/global-scale high resolution product. Firstly, samples in the continental climate and mountain climate zone can be increased, to enhance the generalization of the network model. Secondly, the bias in land cover types in FLAS annotations may also result in uncertainty in predicting minority class, such as grass and bare land. Thirdly, considering the single source of spatial prior only from training dataset may not be sufficient to constrain the ill-posed SRM problem, the idea of spatiotemporal super-resolution mapping method can be adopted, which borrows the complementary HR spatial patterns from other temporal images. In summary, this work provides a novel SRM method for generating HR land cover product, and it is the first attempt to transfer the SRM model to the largest extent that has ever been done before, which make it possible for mass production of land cover maps. We hope that the DETNet as well as the derived products could be used as a support for further geographical-related research.