Spatial Population Distribution Data Disaggregation Based on SDGSAT-1 Nighttime Light and Land Use Data Using Guilin, China, as an Example

Abstract: A high-resolution population distribution map is crucial for numerous applications such as urban planning, disaster management, public health, and resource allocation, and it plays a pivotal role in evaluating and making decisions to achieve the UN Sustainable Development Goals (SDGs). Although there are many population products derived from remote sensing nighttime light (NTL) and other auxiliary data, they are limited by the coarse spatial resolution of NTL data. As a result, the spatial resolution of these products is restricted and cannot meet the requirements of some applications. To address this limitation, this study employs the nighttime light data provided by the SDGSAT-1 satellite, which has a spatial resolution of 10 m, and land use data as auxiliary data to disaggregate the population distribution data from WorldPop data (100 m resolution) to a high resolution of 10 m. The case study conducted in Guilin, China, using the multi-class weighted dasymetric mapping method shows that the total error during the disaggregation is 0.63%, and the accuracy over 146 towns in the study area is represented by an R² of 0.99. In comparison to the WorldPop data, the result's information entropy and spatial frequency increase by 345% and 1142%, respectively, which demonstrates the effectiveness of this approach in studying population distributions with high spatial resolution.


Introduction
The 2030 Agenda for Sustainable Development, which was adopted by all Member States of the United Nations in 2015, provides a comprehensive framework towards peace and prosperity for all humankind through the accomplishment of 17 Sustainable Development Goals (SDGs) [1]. High-resolution population distribution information is of critical importance for evaluation and decision-making in relation to the achievement of several of these SDG indicators at a fine resolution, such as those pertaining to traffic planning referring to SDG 11.2.1 [2,3], public health facilities construction referring to SDG 3.8.1 [4,5], and disaster prevention and response planning referring to SDG 11.5.1/11.5.2 [6][7][8][9].
Currently, there are two types of widely used population distribution data. The first is based on census statistics that are aggregated over administrative units, e.g., provinces, counties, townships, census tracts, or block groups [10]. However, these population data do not accurately represent the true spatial distribution of the population because the census results treat the population as spatially homogeneous within each administrative unit [11].
The remainder of this paper is structured as follows: Section 2 introduces the study area of Guilin, China, and elaborates on the datasets utilized, the methodology employed, and the evaluation indicators. Section 3 presents the analysis and discussion of the experimental results. Finally, Section 4 summarizes the findings of this study.

Data Sources
NTL and land use data are introduced as auxiliary data to disaggregate the existing coarse population product in this study. The basic information of these data is shown in Table 1.

Population data suitable for fine-scale applications are developed by WorldPop using a large number of ancillary data layers. The dataset is global in scope and covers the years 2010 to 2020, making it highly accessible for subsequent studies. The WorldPop dataset provides products with resolutions of 1 km and 100 m, as depicted in Figure 2 for Guilin, China.
The Chinese Academy of Sciences (CAS) launched the Sustainable Development Goals Satellite-1 (SDGSAT-1) into orbit on 5 November 2021. It is the first satellite designed specifically to implement the United Nations 2030 Agenda for Sustainable Development and the first earth science satellite developed by the CAS. Table 2 displays the main parameters of the satellite. The satellite's NTL data comprise four bands, including three visible light bands and one panchromatic band with a maximum spatial resolution of 10 m. SDGSAT-1 data products consist of Level 1, Level 2, and Level 4 data. Level 1 data products are generated from Level 0 products through relative radiometric correction, band registration, HDR fusion, and RPC processing, resulting in standard products. Level 2 data products are the geometrically corrected versions of the Level 1 standard products. Level 4 data products result from orthorectifying the Level 1 standard products using ground control points and digital elevation models, in accordance with format specifications.
We used only Level 4 products in this study since they are currently the only products available to users. Figure 3 shows that a true-color synthesis image allows us to clearly distinguish the contour of roads and buildings as well as the color of ground neon lights.

Land Use Data
In this study, we used three auxiliary land use datasets: EULUC-China data, FROM-GCL10 data, and road network data. The EULUC-China dataset, generated by Tsinghua University, utilizes 10-m resolution satellite imagery (Sentinel-2A/B) from 2018, OpenStreetMap, night lights (Luojia1-01), POI (Amap, POI category and quantity), and Tencent mobile-phone locating-request (MLP) data (i.e., 8-h mean trajectories of the active population during weekdays and weekends) to produce a dataset containing 440,798 plots labeled with five primary and twelve subcategory feature labels in major Chinese cities [30]. It is not feasible to use the same classification scheme for both urban and rural areas due to their different environments. Hence, we employed FROM-GCL10, developed by Tsinghua University, which is the world's first 10-m resolution global surface coverage product with 72.76% overall accuracy [31]. This product uses a random forest classifier on the Google Earth Engine platform to map global land cover at a 10-m resolution by transferring the 30-m resolution sample set from 2015 to the Sentinel-2 imagery acquired in 2017. The surface features include cropland, forest, grassland, shrubland, wetland, water, tundra, impervious, barren, and snow/ice. Notably, in our study, impervious ground in rural areas is considered as villages.
The road network data of the study area were obtained from OpenStreetMap (www.openstreetmap.org, OSM) (accessed on 6 March 2018), an open-source map which includes road layers such as highways, urban expressways, main roads, secondary roads, branch roads, country roads, bicycle roads, pedestrian roads, and internal roads.

The EULUC-China dataset provides a sound classification of functional areas in urban areas but does not include road types. Therefore, the OSM road data were buffered by 20 m and superimposed on the EULUC-China data to augment the latter. The EULUC-China and road network data (both vector data) were converted into 10 m raster data to meet the experimental requirements. As the road network is not a main area of population distribution, we did not distinguish between road types. As Figure 4 illustrates, the EULUC-China, FROM-GCL10, and road network data layers were stacked using the ArcGIS 10.3 software to form a comprehensive land use dataset for the study area. Three typical areas characterized as urban, rural, and urban-rural interface were chosen for comparison with SDGSAT-1 multispectral data to confirm the accuracy of the land use data. Figure 5 reveals a high level of agreement between each area type and the remote sensing data.
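The text does not state the order in which the layers were stacked. As an illustration only, the following Python sketch assumes a simple priority rule (road buffer over urban functional class over global land cover) on three co-registered 10 m rasters held as plain lists of rows. The class codes NODATA and ROAD and the function name stack_land_use are hypothetical and are not the legends or tools used in the study, which relied on ArcGIS 10.3.

```python
# Hypothetical class codes; the real EULUC-China and FROM-GCL10 legends differ.
NODATA = 0
ROAD = 99


def stack_land_use(base, urban, road_mask):
    """Overlay three co-registered 10 m rasters given as lists of rows.

    base      : global land cover code for every cell (e.g. FROM-GCL10)
    urban     : urban functional class, or NODATA outside mapped parcels
    road_mask : True where the cell lies inside the 20 m road buffer
    Assumed priority: road buffer > urban parcel > base land cover.
    """
    stacked = []
    for base_row, urban_row, road_row in zip(base, urban, road_mask):
        row = []
        for b, u, r in zip(base_row, urban_row, road_row):
            if r:
                row.append(ROAD)      # road buffer overrides everything
            elif u != NODATA:
                row.append(u)         # urban functional class where mapped
            else:
                row.append(b)         # fall back to global land cover
        stacked.append(row)
    return stacked
```

With this rule, a rural cell with no urban parcel keeps its global land cover class, while any cell inside the road buffer is labeled as road regardless of the underlying class.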
In the first round of disaggregation, land use data are incorporated.

Multi-Class Weighted Dasymetric Mapping
In this study, we employed a multi-class weighted dasymetric mapping method for population disaggregation. This method was first named by Semenov-Tian-Shansky in 1928 [32] and developed by many scholars [25,29]; it subdivides populated areas into subcategories based on factors such as land use and infrastructure density, reflecting different population densities. By applying different weighting factors to each category, we obtained a more realistic population distribution [25]. This method is widely utilized in population disaggregation and often regarded as the most effective approach [33]. The flow diagram of this method is shown in Figure 6. Our initial premise was that a square area contains 144 individuals and that the optimal representation of their distribution is individual points, as shown in Figure 6a. Nonetheless, it is arduous and expensive to obtain data on individuals, so we divided the study area into grids of uniform size to approximate the practical situation as closely as possible. In the absence of additional auxiliary data, we assumed that the population in an area is evenly distributed. However, as evidenced in Figure 6(1), this approach led to a considerable deviation from the actual population distribution. Additionally, the grids varied in their level of deviation from one another, indicating the presence of spatial heterogeneity in population distribution.

Figure 6. Flow diagram of the multi-class weighted dasymetric mapping method ((a) is the real population distribution as individual points, (1) is the uniform population spatial distribution grids, (b) is the real population spatial distribution in various land use types, (2) is the population spatial distribution grids after using land use data, (c) is the NTL data, (3) is the population spatial distribution grids after using NTL data).
To account for regional disparities, we integrated land use data, as illustrated in Figure 6b. The quantity of individuals in distinct land use categories varied, and we allocated each land use type a corresponding distribution coefficient. We then employed these coefficients to ascertain the population distribution in each land use type (Formula (1)). Consequently, the total number of individuals in each grid was determined by proportioning the individuals across land use types calculated to be in each grid (Formula (2)). As Figure 6(2) demonstrates, incorporating land use data considerably reduced the number of grids that deviated from the actual population distribution, demonstrating the effectiveness of our approach in reflecting regional differences in population distribution.
where W_j is the population distribution coefficient of the jth land use type, D_j is the population density of the jth land use type, and D is the total population density.
where P_i is the population of the ith disaggregation unit, P_ij is the population of the jth land use type in the ith disaggregation unit, and W_j is the population distribution coefficient of the jth land use type.
Despite considering land use type, varying degrees of deviation still existed, indicating persistent spatial heterogeneity. To address this issue, we introduced NTL data, as shown in Figure 6c. NTL data can sensitively capture and record human activities [30], and a significant positive correlation between nighttime lights and population has been demonstrated in numerous countries and regions [19]. In this paper, we leveraged NTL data to redistribute the population of the same land use type within each disaggregation unit (Formula (3)), replacing the previous uniform distribution. Nevertheless, it is important to note that nighttime lights can be influenced by numerous factors, including the economy, culture, climate, season, and government management system. For instance, in some less developed areas, low nighttime light intensity does not necessarily indicate a small population. To limit deviations between the population disaggregation results and reality caused by such fluctuations, we kept the disaggregation units as small as possible. Figure 6(3) displays the final outcome, demonstrating a further reduction in the number of grids that deviated from the actual population distribution compared to Figure 6(2).
where P_i is the population of the ith land use type in each disaggregation unit, P_ij is the population of the jth pixel in the ith land use type in each disaggregation unit, and L_j is the brightness value of the jth pixel in the ith land use type in each disaggregation unit.
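The two rounds of disaggregation described above can be sketched for a single disaggregation unit as follows. This is a minimal illustration under stated assumptions rather than the exact Formulas (1)-(3), which are not reproduced in this text: the share of each land use type is taken to be proportional to its coefficient W_j multiplied by its pixel count (a common form of weighted dasymetric mapping), and within each type the population follows NTL brightness, falling back to a uniform split when every pixel of that type is dark. The function name disaggregate_unit and its arguments are hypothetical.

```python
from collections import defaultdict


def disaggregate_unit(population, classes, brightness, weights):
    """Split one coarse cell's population over its fine pixels.

    population : people in the coarse cell (the disaggregation unit)
    classes    : land use class of each fine pixel
    brightness : NTL brightness of each fine pixel, in the same order
    weights    : distribution coefficient W_j for each land use class
    """
    # Group pixel indices by land use class.
    pixels_by_class = defaultdict(list)
    for idx, cls in enumerate(classes):
        pixels_by_class[cls].append(idx)

    # Round 1 (assumed form): class share proportional to W_j times pixel count.
    shares = {cls: weights.get(cls, 0.0) * len(px)
              for cls, px in pixels_by_class.items()}
    total_share = sum(shares.values())
    result = [0.0] * len(classes)
    if total_share == 0:
        # Assumption: a unit with only zero-weight classes receives nobody.
        return result

    for cls, px in pixels_by_class.items():
        class_pop = population * shares[cls] / total_share
        # Round 2: within a class, follow NTL brightness; uniform if all dark.
        light = sum(brightness[i] for i in px)
        for i in px:
            frac = brightness[i] / light if light > 0 else 1.0 / len(px)
            result[i] = class_pop * frac
    return result
```

Because both rounds only redistribute the unit's population, the pixel values sum back to the input population whenever at least one class in the unit has a non-zero coefficient. Setting the coefficient of water to 0, for example, removes all population from water pixels even when they appear bright in the NTL data.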

Evaluation Indicators
Accuracy verification has always been a challenging task in population distribution studies. Currently, three primary methods can ascertain model accuracy. The first involves comparing disaggregation results with census data [18]. The second method utilizes geospatial measures such as relative error and root mean square error (RMSE) to assess the differences in spatial structure and the correlation between the population disaggregation results and existing population products [34]. In our study, the WorldPop products are used as the input data for population disaggregation to generate population data with a higher spatial resolution than the original product. The accuracy of the disaggregation results relies heavily on the quality of the population products. Therefore, to a certain extent, accuracy can be guaranteed. The third method involves field sampling surveys, which are only applicable to small-scale research. In conclusion, the first and third methods are not essential in this study, and the relative error is used as the evaluation indicator (Formula (4)) to assess the model's accuracy.
where p̂op is the population of the disaggregation result, and pop is the population of the existing population products.
Additionally, two objective indicators were introduced to evaluate the fineness of the disaggregation result in our study, namely, information entropy (IE) and spatial frequency (SF). The IE of an image measures its statistical characteristics, indicating the average amount of information present in the image and representing the aggregation feature of the image gray distribution (Formula (5)). IE is used to verify the improvement of the disaggregation result in the amount of information it contains. SF reflects the rate of change in raster data values (Formulas (6)-(8)) and can be used to evaluate the spatial resolution of the data. At the same scale, a higher IE represents a greater amount of information, and a higher SF indicates higher spatial resolution and a clearer image.
where H(X) is the information entropy, and P_i(X) is the probability of occurrence of the ith gray level.
where M and N are the width and height of the image, respectively, and H(i, j) is the pixel value at coordinates (i, j).
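As an illustration of the two indicators, the sketch below uses definitions that are common in image quality assessment; they are assumptions here, since Formulas (5)-(8) are not reproduced in this text. IE is computed as the Shannon entropy (base 2) of the histogram of raster values, treating each distinct value as a gray level, so a continuous population raster would need to be binned first. SF is computed as the square root of the sum of the squared row frequency and squared column frequency, each built from squared differences between adjacent cells and normalized by M × N. Plain lists of rows are used to keep the example dependency-free, and the function names are hypothetical.

```python
import math
from collections import Counter


def information_entropy(grid):
    """Shannon entropy (in bits) of the value histogram of a 2-D grid."""
    values = [v for row in grid for v in row]
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def spatial_frequency(grid):
    """SF = sqrt(RF^2 + CF^2), from squared differences of adjacent cells."""
    m, n = len(grid), len(grid[0])
    # Row frequency term: differences between horizontal neighbours.
    row_sq = sum((grid[i][j] - grid[i][j - 1]) ** 2
                 for i in range(m) for j in range(1, n))
    # Column frequency term: differences between vertical neighbours.
    col_sq = sum((grid[i][j] - grid[i - 1][j]) ** 2
                 for i in range(1, m) for j in range(n))
    return math.sqrt((row_sq + col_sq) / (m * n))
```

Under these definitions, a grid with a single constant value has IE and SF of zero, and both indicators grow as the raster takes on more distinct values and changes more sharply between neighbouring cells.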

Result of the Disaggregation
The coefficient of population distribution for each land use type was calculated using the WorldPop 100 m population data and Formula (1). Subsequently, the coefficient values were adjusted based on actual conditions. Table 3 reveals that the coefficients in urban areas are higher than those in rural areas. Because specular reflection from bodies of water produces falsely high brightness values during the acquisition of satellite NTL data, we manually set the weights of water and wetland to 0.
Each 100 m grid cell serves as an independent disaggregation unit and is divided among multiple land use types. Using the distribution coefficients, the unit's population is redistributed among the different land use types within the unit. Figure 7 displays the outcome of the initial disaggregation. By incorporating land use data, the way in which the population distribution within a disaggregation unit varies with socioeconomic activities has been accurately captured. The population is predominantly concentrated in urban areas. In addition, the population density varies within the different functional areas of urban areas. For example, residential areas, commercial areas, and hospitals have high population density, whereas industrial areas have comparatively low population density. Generally, the population density in rural areas is lower than that in urban areas, and the rural population is primarily concentrated in villages. Lake and river areas have no population owing to the manual correction.
Following the initial disaggregation, the population within areas of the same land use type is still uniformly distributed, which causes significant deviation. To resolve this uniformity within areas of the same land use type in each disaggregation unit, NTL data are incorporated. SDGSAT-1 NTL data have a 10 m resolution and can sensitively capture and record human activities. Nighttime light brightness reflects the degree of population concentration, and the brighter areas of the same land use type indicate denser population distribution. Figure 7 depicts the outcome of the disaggregation. In comparison to Figure 2, the population distribution pattern shown in Figure 7 is a closer approximation to the real population distribution, providing more detailed insight. Regardless of the land use type, the population is typically clustered, particularly in urban areas. However, variations exist within different blocks due to the restrictions imposed by social, economic, and cultural factors.

Accuracy Evaluation
To validate the effectiveness of the proposed method, it is crucial to assess the accuracy of the disaggregation results. The 17 districts and counties within the study area were taken as units for analysis. The population of each district or county was counted in both the WorldPop data and the disaggregation result, and the relative error was then calculated using Formula (4). Table 4 shows that, except for a few individual districts and counties, the relative error is less than 2%, with a total relative error of only 0.63%. To assess the reliability further, 146 towns in the study area were analyzed, and the population of each town was counted in both the WorldPop data and the disaggregation result. As shown in Figure 8, all points are in close proximity to the trend line with minimal error and an R² value of 0.99. The high degree of consistency between the disaggregation results and the input data validates the proposed method, confirming that the disaggregation does not undermine the accuracy of the input.
Following the confirmation of the accuracy of the disaggregation result, we introduced objective indicators to evaluate the refinement of the result. We selected four urban areas in Guilin with a significant population distribution, which were then adjusted to the same size for evaluation purposes. Histogram statistics were then performed on the selected areas, and the values of IE and SF were calculated for each area (see Figure 9 and Table 5). As a result, the information entropy increased by 345%, and the spatial frequency rose by 1142%, as depicted in Figure 10.
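The zonal accuracy checks described above can be sketched as follows. This is a minimal illustration that assumes the relative error is the absolute difference between the two population totals divided by the reference total, expressed in percent, and that R² is the coefficient of determination of a least-squares trend line, which equals the squared Pearson correlation. Neither form is confirmed by this text, and the function names are hypothetical.

```python
def relative_error(estimated, reference):
    """Relative error in percent between two population totals."""
    return abs(estimated - reference) / reference * 100.0


def r_squared(xs, ys):
    """R^2 of the least-squares trend line (squared Pearson correlation)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)
```

In this setting, xs would hold the WorldPop population summed over each town and ys the disaggregated population summed over the same town, while relative_error would be applied to each district or county total and to the overall total.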

A subjective evaluation was carried out by scaling up the four types of population grid data to 1:10,000 in four different areas. At this scale, a strong mosaic effect was observed in the WorldPop population grid data at both 1 km and 100 m spatial resolution. The two disaggregation results generated in this study substantially alleviated this issue. Notably, the first-round result, which used land use data, clearly identified the distinct distribution of the population across the diverse functional areas of the city. Furthermore, utilizing SDGSAT-1 NTL data to disaggregate the population grid data led to the emergence of numerous bright spots in multiple functional areas, signifying dense concentrations of population in those areas.
Figure 9 depicts the population grid data for the four areas at 1 km, at 100 m, after the first round of disaggregation, and after the final disaggregation, respectively. The information entropy (IE) and spatial frequency (SF) values both increased monotonically, which is consistent with our subjective visual evaluation. The magnitude of the IE and SF values provides further objective verification that this approach improves the spatial resolution of population grid data.

Conclusions and Discussion
In this study, we focused on Guilin, China, as the study area and used the WorldPop population data as the input data, supplemented by SDGSAT-1 NTL data and land use data to generate a population distribution grid with a spatial resolution of 10 m using the multi-class dasymetric mapping method. SDGSAT-1 NTL data were introduced for the first time in the context of population disaggregation. Based on the results of disaggregation and accuracy verification, we found that the spatial resolution of the output was significantly improved while maintaining accuracy, and the output had better performance in detail. This demonstrates the effectiveness of this approach in studying population distributions with high spatial resolution.
However, the ground accuracy of the disaggregation results heavily relies on the accuracy of the input data (WorldPop data in this study). In general, the accuracy and spatial resolution of the disaggregation results increase with higher accuracy of the input data and auxiliary data. Although WorldPop has been widely used, its total population statistics are not always consistent with the census data, particularly at the small local administrative scale. We utilized the WorldPop 100 m data because they are the highest spatial resolution population distribution data available in the study area. To build a population disaggregation model at a higher spatial resolution (10 m), a large number of samples at small spatial units would be required. Unfortunately, the census data in the study area did not provide sufficient support for this purpose. As this paper mainly focuses on the disaggregation method for high spatial resolution population distribution, the WorldPop population grid data product was selected as the input source data for our model. Future studies may explore building high-resolution population disaggregation models based on ground truth data. In addition, WorldPop's grid values do not change continuously; the varying degrees of abruptness between these discrete values result in obvious boundaries in the disaggregation results. Furthermore, only NTL data and land use data were utilized as auxiliary data in this study, while other data such as building footprints [15], building volume [35], points of interest (POI) [14,36], and GPS tracking data [37] could reflect the spatial heterogeneity of population distribution and contribute to fine population disaggregation. In the study of population distribution with high spatial resolution, improving the accuracy of the model remains a challenge due to the lack of more precise population samples.