Development of a global urban greenness indicator dataset for 1,000+ cities

Global climate change has sparked efforts to adapt to increasing temperatures, especially in urban areas that experience increased day and nighttime temperatures due to the urban heat island effect. The addition of greenspace has been suggested as a possible means for urban centers to respond to increasing urban temperatures. Thus, it is important for urban planning and policymakers to have access to data on greenspace specific at a fine spatial resolution. This dataset consists of information on peak and annual average 1 × 1 km Normalized Difference Vegetation Index (NDVI) for over 1,000 global urban centers, which is an objective satellite-based measure of vegetation. Population-weighted values for both peak and annual average NDVI and include an indicator of greenness, with seven levels ranging from extremely low to extremely high are provided. Additional information regarding the climate zone (using the Köppen-Geiger climate classification) and level of development (using the Human Development Index or HDI) for each city is included. Analyses were repeated in 2010, 2015, and 2020 to provide the ability to track urban greenness over time. Data are provided in tabular format with summaries presented in both tables and graphics. These data can be used to inform policy and planning and can be used as an indicator for a variety of climate and health investigations.


a b s t r a c t
Global climate change has sparked effort s to adapt to increasing temperatures, especially in urban areas that experience increased day and nighttime temperatures due to the urban heat island effect. The addition of greenspace has been suggested as a possible means for urban centers to respond to increasing urban temperatures. Thus, it is important for urban planning and policymakers to have access to data on greenspace specific at a fine spatial resolution. This dataset consists of information on peak and annual average 1 × 1 km Normalized Difference Vegetation Index (NDVI) for over 1,0 0 0 global urban centers, which is an objective satellite-based measure of vegetation. Population-weighted values for both peak and annual average NDVI and include an indicator of greenness, with seven levels ranging from extremely low to extremely high are provided. Additional information regarding the climate zone (using the Köppen-Geiger climate classification) and level of development (using the Human Development Index or HDI) for each city is included. Analyses were repeated in 2010, 2015, and 2020 to provide the ability to track urban greenness over time. Data are provided in tabular format with summaries presented in both tables and graphics. These data can be used to inform policy and planning and can be used as an indicator for a variety of climate and health investigations.
© 2023 The Author(s Value of the Data • These data are a useful tool in determining urban centers that suffer from lack of green space • Policymakers and city planners may find this dataset useful in aiding decision-making and climate-related analyses of urban centers (i.e., developing plans to mitigate urban heat islands). • Population health researchers can apply this dataset to multiple investigations, including reducing temperatures in urban centers, increasing physical activity in urban centers, and the use of greenspace to improve health outcomes • This dataset is made readily available in a universal format that is both streamlined and accessible to individuals regardless of expertise or experience in the underlying data or platforms. • Data may be used as inputs for urban research models to study a variety of health-and non-health-related subjects.

Objective
While the Lancet Reports provide a summary of our key findings and methods, they lack details on processes used to generate the raw data [1][2][3] . This paper elaborates on the application of cloud computing technology to conduct large-scale analyses of remote sensing data, with details on the data sources, functions and parameters, and analytical metrics used. This report includes additional analyses on urban greenness stratified by climate region and human development index. This paper gives details regarding the data, methods, and analysis for creating an urban greenness indicator.

Data Description
The "Global Greenspace Indicator Dataset" consists of three types of data, including processed data, tabular data, and graphical data [4] . The raw data was processed from each data source (see below) and compiled the results in the processed data file. The processed data are provided in '.csv' format and include information on urban areas, multiple measures of the Normalized Vegetation Index (NDVI), levels of development utilizing the Human Development Index (HDI), and categories of the greenness indicator. Additional summaries and visualizations of the data are included as tables ('.csv') and graphics ('.tiff'). Fig. 1 presents the global distribution of included urban centers in each included country. Fig. 2 presents the population-weighted annual peak NDVI for each of the included urban areas, indicated by the color of each point. Information in Table 1 summarizes the NDVI measures for  each level of the greenness indicator. Fig. 3 provides the global distribution of the greenest cities for 2020 (high or very high greenness levels). Fig. 4 charts the change over time in populationweighted peak levels between 2010 and 2020, shown as percentages of the total number of urban areas for each of the greenness levels. For each year, mean unweighted and weighted peak and annual NDVI values were calculated and displayed in Fig. 5 .
The dataset is summarized by specific subcategories that may be of interest to end users, including climate region and level of development. Fig. 6 charts the mean population-weighted peak NDVI for cities over time and information in Table 2 summarizes the NDVI measures by HDI category. After classifying each urban area by the Köppen-Geiger Climate Classification category, the average population-weighted peak NDVI is calculated for each of the five categories: polar, arid, temperate, continental, and tropical (as seen in Fig. 7 ) and mean population-weighted peak NDVI is summarized in Table 3 .  Fig. 3. Distribution of cities with a high or very high level of greenness based on its population-weighted peak NDVI in 2020.

Experimental Design, Materials and Methods
Data on population size and density for 2010, 2015 and 2020 were collected from the Gridded Population of the World, Version 4 (GPWv4): Population Density, Revision 11 at Columbia University's Center for International Earth Science Information Network (CIESIN) [5] . Urban area spatial extents were taken from the Global Human Settlement Urban center Database R2019A (GHS) [6] . Urban centers with populations larger than 50 0,0 0 0 were included in the analysis. For countries lacking urban areas with populations greater than 50 0,0 0 0, the most populated urban area in the country was selected for inclusion. Based on these methods, 1042 major ur-ban centers across 174 countries were identified. However, remote sensing data were unavailable (either due to cloud cover or equipment malfunction) for two of the countries for all seasons, which resulted in our final dataset comprised of 1038 urban centers across 172 countries.
The Human Development Index (HDI) classifications for each country were provided by the United Nations Human Development Report Office (UN HDRO) [7] . In addition, climate regions were defined using an updated version of the Köppen-Geiger climate classification system at 1 km spatial resolution [8] . For this study, the five general classes of climate were used, which include tropical, arid, temperate, cold and polar. Stratification using this system allows for increased understanding and tracking changes in greenness by region over time.
NDVI was derived from the joint National Aeronautics and Space Administration (NASA)/U.S. Geological Survey (USGS) Landsat program [9] . Landsat 8 images were used to calculate NDVI in 2015 and 2020 and Landsat 7 for 2010. While we could have used Landsat 7 images to calculate NDVI for all years, a hardware equipment failure in 2003 resulted in large amounts of missing data since then. As a result, Landsat 8 images were used, when possible, to minimize data gaps.
ArcGIS Pro 3.0.4 was utilized to subset the GHS layer by cities with populations of at least 50 0,0 0 0 [10] . Shapefiles of these cities were then imported into Google Earth Engine (GEE). In GEE, remotely sensed data was downloaded for each year and applied the ee.Algorithms.Landsat.simpleComposite() method and the default parameters to create a Landsat composite for each season. GEE's built-in function to compute NDVI was then applied. With this approach, all non-cloud-contaminated satellite images collected during our specified time periods were used in averaging the NDVI for each season and city. Seasons were defined based on northern hemisphere seasonal parameters: December 1 to February 31 as "winter", March 1 to May 31 as "spring", June 1 to August 31 as "summer", and September 1 to November 30 as "fall". The same dates were used to denote southern hemisphere seasons but reversed to reflect known seasonal patterns: June 1 to August 31 as "winter", September 1 to November 30 as "spring", December 1 to February 31 as "summer", and March 1 to May 31 as "fall". Any negative NDVI values, indicating water, were set to zero. Finally, mean NDVI for each city and season were determined using the reduceRegions method. The function reduceRegions was applied over the cloud-free Landsat composite per season, reduced them over the shapefile of GHS cities, and set the reducer to ee.Reducer.mean, and the scale to 100 m. This was repeated using the same method for each year to generate output including the mean NDVI values for each season and city in the years 2010, 2015, and 2020.
NDVI values above 0 were classified the level of greenness into 7 indicator levels as outlined in Table 4 , indicating NDVI values at exceptionally low, very low, low, moderate, high, very high, and exceptionally high levels. Using GEE, population-weighted NDVI for each season and year was calculated to assess population exposure to green space within each urban center. The equation used for the population-weighted NDVI is shown below: where i is an individual pixel, and n is the total number of pixels within an urban center. Using the cloud-free Landsat composites with negative NDVI values set to zero, each NDVI value was multiplied by the population size (from CIESIN/GPWv4) of the corresponding year within the same 1 × 1 km pixel. Applying reduceRegions, the numerator (i.e., sum of the weighted values) was generated with the same parameters as above, except the reducer was changed to ee.Reducer.sum. For the denominator (i.e., sum of the weights), images were loaded from "CIESIN/GPWV4/population-density/2010 or 2015 or 2020." The reduceRegions command was applied to the population density layer using the same parameters as the numerator. The sum of the weighted values and the sum of the weights were determined for each urban area and all seasons per year using GEE. We did not calculate the final population-weighted NDVI in GEE. The remainder of the analysis was done using R Statistical Software (R) to compute variations of the population-weighted averages, as well as to process and analyze data [11] . Outputs in GEE were generated for each urban area in the dataset, including NDVI by season per city per year, population-weighted NDVI by season per city per year, and population of each city. Using R, four metrics were calculated per year per city based on the GEE outputs, including peak NDVI (maximum NDVI across the four seasons), annual mean based on the four-season average NDVI (annual), population-weighted peak NDVI (pop-weighted peak), and populationweighted annual mean NDVI (pop-weighted annual).
While our GEE script selected only images with minimal cloud cover, remote sensing data can have missing information for several reasons. [ 12 , 13 ] As a result, some cities were missing NDVI data for one or more seasons. Other missingness reflected a known equipment failure on Landsat 7 (used for 2010 values, yielding 303 missing values in 2010. Using Landsat 8, there were 16 and 21 missing values in 2015 and 2020, respectively. Our R script removed those missing values and calculated the metrics described above based on available data. Future iterations of the data will include additional years and we will consider the addition of urban blue space to form a hybrid indicator.