A comprehensive generalizability assessment of data-driven Urban Heat Island (UHI) models

Data-driven models are valuable tools for understanding and tackling the UHI phenomenon, and they can provide user-friendly platforms that help urban planners incorporate UHI considerations into their decisions. This study aims to assess the generalizability of data-driven UHI models at street-level resolution, particularly considering various degrees of similarity between the urban contexts of the training and testing cities. Five cities from three countries were selected to encompass a diverse range of similarities in this comparative study. Five Random Forest models were developed. The lowest-performing model has an R² value of 0.56 and an MAE of 0.07, and the highest-performing model has an R² of 0.71 and an MAE of 0.05. While these models proved to be accurate for the cities they were trained for, cross-validation of the models in different cities revealed low generalizability, irrespective of the degree of similarity between training and testing datasets. Small changes in feature importance resulted in significant variation in UHI derivation mechanisms and behavior, which contributes to the models' low generalizability. The findings of this research indicate that universal mitigation strategies may not yield consistent outcomes worldwide, and a one-size-fits-all approach may be inefficient in addressing UHI. Hence, it is vital to tackle UHI locally.


Introduction
Urban areas are expanding horizontally and vertically as more people migrate to cities (Jain et al., 2020). The United Nations anticipates that cities will host 68% of the world's population by 2050 (United Nations, 2019). Creating and maintaining a sustainable, safe, and livable environment in future cities would require city expansion to be accompanied by a complete transformation of the urban planning mindset. That is because previous research has indicated that the tactless expansion of dense urban settlements will intensify the climate change problem (The World Bank Group, 2022). Among other reasons, this can be attributed to the fact that materials commonly used in urban areas increase the storage of solar radiation in cities (Salvati et al., 2022). Also, anthropogenic heat released from buildings, which is generated due to human activities, and the poor evaporation of moisture from hard surfaces, contribute to temperature rise in cities (Rizwan, Dennis & Chunho, 2008). The phenomenon of the temperature differential between urban and rural areas due to excessive heat generation/storage in urban areas is commonly known as the urban heat island (UHI) effect (Oke, 1988). Human health, living comfort, and the local economy are significantly affected by the UHI phenomenon (Akbari & Kolokotsa, 2016). Among the most vulnerable segments of society are, for example, elderly or very young people (Aflaki et al., 2017), people living in low-income housing (Sakka, Santamouris, Livada, Nicol & Wilson, 2012), and also people that are performing long-lasting physical work in warm environments (Acharya, Boggess & Zhang, 2018). In spite of the negative impacts on human health and the environment, UHI is often not given sufficient attention in urban planning due to the challenges involved in modeling it (Parsaee, Joybari, Mirzaei & Haghighat, 2019;Wong, Jusuf & Tan, 2011). 
Conventionally, physics-based and numerical simulation models are used to analyze the UHI effect at the micro-scale (i.e., individual buildings or street segments) and the macro-scale (i.e., neighborhoods). These physics-based approaches rely on governing thermodynamic principles, such as thermal convection, solar radiation exchange, and air ventilation around buildings (Mirzaei, 2015). EnergyPlus (Crawley et al., 2001), Envi-met (Huttner & Bruse, 2009), and Urban Weather Generator (Nakano, Bueno, Norford & Reinhart, 2015) are some examples of simulation software packages that are commonly used by urban climate researchers and sustainability engineers to evaluate UHI. Physics-based modeling approaches, however, have a significant limitation in that their applicability is restricted by the computational capacity required. Creating these models requires extensive representations of building geometries and properties of building materials (Bherwani, Singh & Kumar, 2020; Parsaee et al., 2019). As a result, simulations are often limited to simplified scenarios, such as a single building or an urban canyon. Furthermore, the process of developing these models can be time-consuming, and the resulting simulation data can be overly complex for average city planners to interpret and use in their decision-making processes (Condon, Cavens & Miller, 2009; Schindler & Dionisio, 2021; Wong et al., 2011). To address this limitation, data-driven UHI assessment models have gained popularity over the past few years (Jato-Espino, Manchado, Roldán-Valcarce & Moscardó, 2022; Lyu, Wang, Han, Catlett & Wang, 2022; MacLachlan, Biggs, Roberts & Boruff, 2021; Mohammad, Goswami, Chauhan & Nayak, 2022; Vulova, Meier, Fenner, Nouri & Kleinschmit, 2020).
The increasing availability of urban and satellite imagery data makes it possible to study correlations between urban characteristics (i.e., low level of detail in the properties), meteorological data, and heat generation, without the need to model the heat exchange process. Many machine learning (ML) methods have been used to analyze buildings' energy performance and UHI-related issues for specific urban environments (Gobakis et al., 2011; Miles & Esau, 2020; Nutkiewicz, Yang & Jain, 2018; Sun, Gao, Li, Wang & Liu, 2019). The recent work of the authors (Pena Acosta, Faridaddin, Santos, Hammad & Doree, 2021a) implemented both random forest (RF) and decision tree (DT) approaches to distinguish between five UHI intensity levels at the street level, using publicly available datasets carrying geospatial information. While data-driven models have been shown to be very accurate in predicting UHI, they are commonly tested on the same environment/urban context that was used for training the model. It should be noted that while the application of data-driven models is easy and user-friendly, the training of an accurate model (including the collection and processing of the data) can be complex and time-consuming for urban planners and decision-makers. It would, therefore, be ideal if an accurate model developed for a given urban context could be readily used for other contexts as well. However, this would require models to be generalizable (Demuzere, Bechtel & Mills, 2019). Generalizability in this context is defined as the ability of a data-driven model to accurately predict the UHI phenomenon for cities other than the one used for the training of the model. Generalizable data-driven predictors of UHI intensity could be of great value for urban planners, as they would yield high accuracies in UHI prediction without the need to employ location-specific data to train the model in advance.
The generalizability of data-driven models is known to be an intricate issue that requires thorough analysis since models can be very sensitive to a myriad of parameters (e.g., the scope of the dataset, hyperparameters, and selected features), and for complex urban phenomena, these parameters can be case-sensitive (Sun, Fung & Haghighat, 2022). However, the majority of data-driven research in the domain of urban analysis seems to overlook this important aspect of the data-driven models, which has great relevance to the applicability and practicality of the models in urban decision-making.
The preliminary earlier research of the authors (Pena Acosta et al., 2021a) indicated low generalizability of data-driven UHI models. However, this research only investigated generalizability for an extreme case where the two studied cities (i.e., Montreal in Canada, and Apeldoorn in the Netherlands) differed significantly in terms of size, locality, weather condition, urban morphology, and socioeconomic factors. The logical question that arose at the end of the previous research was how the generalizability would change if the studied cities were more similar. In other words, the correlation between the degrees of similarity of the urban contexts and the extent to which data-driven UHI models are generalizable is not known. Also, it is not fully understood how the deriving mechanism of UHI would depend on the context. That is because most researchers looked into the importance level of features in the data-driven models, whereas very little attention was paid to understanding how the feature importance changes across different cities and also in relation to the degrees of similarities between the cities.
To this end, this research pursues a twofold objective: (1) to perform a comprehensive generalizability assessment of data-driven UHI models considering the degrees of similarity between training and testing datasets, and (2) to investigate the correlation between degrees of similarity in urban contexts and the generalizability of data-driven UHI models.
The remainder of this paper is structured as follows: the next section looks at the current literature on data-driven approaches to UHI modeling, followed by Section 3 which describes the methodology for assessing the generalizability of the data-driven UHI models. Thereafter, the results of the generalizability assessment are presented. The paper closes with a discussion and conclusions drawn from the research results.

Literature review
Data-driven approaches for UHI modeling have become increasingly popular due to recent advancements in computing power, urban data, and ML techniques. This section explores the current trends in data-driven approaches and their limitations, and offers insights into the present state of the art concerning the generalizability and transferability of these models. Fig. 1 summarizes the trends in data-driven approaches to UHI modeling from 2011 to the present. Based on a review of the literature, four main trends can be described: (1) mechanisms, implications, and trends over time, (2) mitigation and sustainable city planning, (3) quantification, patterns, and drivers, and (4) remote sensing and high-resolution databases.
The first category (i.e., mechanisms, implications, and trends over time) focuses on the underlying causes and mechanisms of UHI. These studies analyze the processes driving UHI, the factors that contribute to its formation and intensity, and the potential impacts on the built environment, such as increased energy consumption, discomfort, and health risks (Cakmakli & Rashed-Ali, 2022; Lemoine-Rodríguez, Inostroza & Zepp, 2022). The research in this category explores how the phenomenon changes over time in relation to urbanization (Manoli et al., 2019; Shen et al., 2022) and climate change (Yang & Yao, 2022; Yang, Huang & Tang, 2019), providing insights into potential future scenarios. However, most efforts have focused on studying UHI at the city scale, with little investigation into the microscale mechanisms driving UHI. The second category (i.e., mitigation and sustainable city planning) explores strategies to minimize the effects of UHI and promote sustainable urban development (Phelan et al., 2015; Tsoka, Tsikaloudaki, Theodosiou & Bikas, 2020). The research trend in this category investigates various approaches to reduce UHI intensity, such as incorporating green spaces (Marando et al., 2022) and implementing cool roofing materials (Tsoka, 2020). Furthermore, these studies often use case studies to illustrate successful mitigation strategies and highlight the importance of incorporating UHI considerations into urban planning and design (Haddad et al., 2020; Lamb, Callaghan, Creutzig, Khosla & Minx, 2018, 2019; MacLachlan et al., 2021; Phelan et al., 2015; Pierer & Creutzig, 2019). While research in this category has made significant progress, further investigation is needed to understand how to effectively implement UHI mitigation strategies in different urban contexts, considering the unique conditions of each built environment.

In the third category (i.e., quantification, patterns, and drivers), the focus is on the measurement, spatial distribution, and factors influencing UHI. These studies use various methods, such as remote sensing, geographic information systems (GIS), and ML, to quantify UHI intensity and its spatiotemporal patterns (Yang et al., 2019). These studies also examine the relationships between UHI and other factors, such as land use, population density (Manoli et al., 2019), and climatic zones (Mohammad & Goswami, 2021), to understand the primary drivers of UHI formation and variation within the built environment (Li & Zha, 2019; Yang & Yao, 2022). However, there is a need for further research to explore the dynamic and complex relationships between UHI and other urban elements, especially social and economic features that contribute to UHI.
The last category (i.e., remote sensing and high-resolution databases) groups the research utilizing remote sensing technologies and high-resolution databases to study UHI (Benz, Davis & Burney, 2021; Creutzig et al., 2019; Wang, Meng, Fu, Pei & Xu, 2018). These studies leverage satellite imagery to map and monitor UHI, assess its spatial distribution (Niu et al., 2021), and analyze the relationships between UHIs and various urban factors (Mohammad & Goswami, 2021). Remote sensing techniques have enabled researchers to examine UHIs at different scales and resolutions, improving the understanding of their formation, intensity, and implications (Manoli, Fatichi, Bou-Zeid & Katul, 2020; Niu, Tang, Jiang & Zhou, 2020; Yang et al., 2019). However, one of the most recurrent problems in this category is that remote sensing data may not always capture the granular details of UHI at the micro-scale level.
Despite significant progress in applying ML to UHI modeling, the question of whether these models are generalizable, transferable, or scalable remains unanswered. As discussed in the introduction, generalizability is arguably central to the practicality of these models. Although, to the best of the authors' knowledge, no research directly investigates the generalizability of ML-based UHI models, there have been extensive efforts to define universal Local Climate Zones (LCZs) that classify neighborhoods based on their urban and morphological characteristics (Ching et al., 2018). In other words, LCZs decompose cities into neighborhoods that have more or less similar urban characteristics, such as compact low-rise or compact high-rise neighborhoods. Although it can be argued that microclimates within each LCZ have similar characteristics, the concept of LCZ does not directly address urban classification based on different UHI profiles. As a result, it can be argued that even neighborhoods within the same LCZ may have different UHI profiles. Several authors have highlighted the significance of investigating the transferability of ML-based LCZ classifiers (Yang, Huang & Li, 2017). Recently, Demuzere et al. (2019) looked into the transferability of ML-based LCZ classifiers. They found that while strong LCZ classifiers can be developed for the same urban contexts, the models generally perform poorly when used to classify LCZs of cities not included in the training dataset. Although this study considered transferability, there are a few inherent limitations and differences that distinguish the present research from the previous work. (1) LCZs are defined at the neighborhood level. However, as the authors argued in their previous work, the smallest unit of urban decision-making is commonly the street (Pena Acosta, Faridaddin, Santos, Hammad & Doree, 2021b).
Therefore, it is essential to study UHI at this level of resolution to provide decision-makers with tools that can accommodate the smallest unit of their decision-making; (2) as stated before, LCZs inherently look into the constituent elements of climate zones rather than directly classifying neighborhoods based on their UHI profile. However, the previous work of the authors indicated that significant variation could take place between streets that are morphologically very close and geographically belong to the same neighborhood. This can be partly because UHI and local climate zones are also partially impacted by socioeconomic factors (e.g., urban land use) that can vary within the same neighborhood. On this premise, the authors believe that UHI is better studied at the street level, with streets classified directly based on their temperature profiles. To summarize, despite significant advancements, there remains a need to investigate the generalizability and transferability of data-driven UHI models at street-level resolution across different cities or regions.

Methodology
Fig. 2 presents an overview of the research methodology. This research included three main phases, namely data collection, data processing, and data analysis. In a nutshell, five cities with different degrees of similarity were used in this research. The data collection phase concerned the collection of the data required for the training of data-driven models for these five cities. Next, the collected data were processed to build a consistent database that allows easy development of the models. Finally, the generalizability was assessed through (1) performing feature engineering to select balanced training data from the dataset, (2) developing an optimized model for each city, (3) assessing the degrees of similarity between each pair of cities, and (4) applying cross-validation of the UHI model of each city on all the remaining cities. Each of these phases is explained in detail in the following sections.

Data collection
This phase consisted of three main tasks, namely (1) city selection, (2) feature selection, and (3) data collection.

City selection
The core hypothesis that this research aims to test is that the degree of generalizability is correlated with the degree of similarity between cities. Similarity, as it pertains to this study, refers to the extent to which various urban features from different urban contexts (e.g., buildings, population, land use) are comparable to each other. This similarity is hypothesized to allow the model to utilize the knowledge it has obtained from a particular dataset to predict new, previously unseen data. Therefore, five different cities were selected, with the main selection criterion being a balanced distribution of levels of similarity between the cities. Special attention was paid to ensuring that the selected cities cover the whole spectrum of similarity levels, ranging from very similar to very dissimilar. As will be explained later in Section 3.3.3, the precise assessment of similarity levels requires the collection of the data and development of the model. Therefore, it was not possible to conduct a quantitative assessment of similarities prior to data collection. Consequently, the preliminary assessment of similarity was carried out solely based on examining the population, size, urban morphology, and climatic region of different cities. Accordingly, three cities were initially targeted in each of two distinct climatic regions, namely the Maritime and the Humid Continental regions. The aim was to include one large and two mid-sized cities in one region, and the opposite configuration (two large and one mid-sized city) in the other. This approach was taken because larger cities within the same climate zone tend to have better matching characteristics than smaller cities. In the Maritime climatic region, Enschede and Apeldoorn, two mid-sized cities, were selected. Both cities share similar patterns in their urban fabric and size, although their density and greenery vary. Rotterdam was chosen as the large-sized city for this climatic region due to its urban geometry, density, and population.
For the Humid Continental region, Montreal and New York were selected as large-sized cities. Quebec City was initially chosen as a mid-sized city; however, due to data unavailability, it was not included in the study. As a result, the study focused on Montreal, New York, and Rotterdam as examples of large-sized cities in two different climatic regions and Enschede and Apeldoorn as mid-sized cities. The similarities derived from the feature distributions are quantified after the data has been collected and processed. Table 1 lists the five cities that were selected for this research. The climatic region, size, and population are presented as indicators for the environmental, urban morphological, and socioeconomic factors respectively, as proposed in the previous work of the authors (Pena Acosta et al., 2021a).

Feature selection and data availability
The data used in this study were available through open data sources, mainly operated by local governmental institutions. Hence, the cities in this study did not share the same data sources. To enable a comparison between UHI models, the characteristics and structure of the data were kept consistent across all cities. The data that were gathered as input for the data-driven model included eleven features, i.e., explanatory variables that influence the UHI intensity, and one dependent variable, i.e., the Land Surface Temperature (LST). The selection of these features (e.g., building geometries, vegetation, waterbodies, height/width (H/W) ratio, population density, and land use) was based on the findings from the earlier work of the authors (Pena Acosta et al., 2021a). The ground surface elevation, retrieved from Digital Elevation Models (DEM), was added to the ML model as it is known to have a significant impact on UHI intensity (Geng, 2023). Table 2 lists the features and the corresponding sources of the data that are used in this research. The LST data for all cities were derived from Landsat 8 thermal bands, as provided by the U.S. Geological Survey (2021). The estimation of LST was based on multiple images taken during the summers of 2019 to 2021. To ensure the accuracy and reliability of the data, only satellite images with a cloud cover of less than 30% were used. This step helped reduce the noise in the thermal images, resulting in a cleaner and more accurate dataset. Next, the top-of-atmosphere (TOA) spectral radiance was calculated using thermal band 10, and the TOA radiance was then converted to Brightness Temperature (BT). Bands 4 and 5 were used to calculate the normalized difference vegetation index (NDVI), which was then used to calculate the proportion of vegetation in each city. Based on the NDVI, the emissivity was determined, which was then used to calculate the LST for each city.
Finally, the average of all images per city was calculated to obtain a representative LST measurement per location.
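The LST retrieval chain described above (TOA radiance, BT, NDVI, proportion of vegetation, emissivity, LST) can be sketched for a single pixel as follows. This is a minimal illustration and not the processing code used in this study: the calibration constants are typical Landsat 8 band 10 values that should be read from each scene's metadata file, and the emissivity relation, the effective wavelength, and all function names are assumptions based on a commonly used NDVI-threshold workflow.

```python
import math

# Assumed Landsat 8 band 10 constants; read the actual values from each
# scene's MTL metadata file rather than hard-coding them.
RADIANCE_MULT_B10 = 3.342e-4
RADIANCE_ADD_B10 = 0.1
K1_B10 = 774.8853
K2_B10 = 1321.0789
WAVELENGTH_UM = 10.895   # assumed effective wavelength of band 10 (micrometres)
RHO_UM_K = 14388.0       # h * c / k_B expressed in micrometre-kelvin


def toa_radiance(dn_b10):
    """Convert a band 10 digital number to TOA spectral radiance."""
    return RADIANCE_MULT_B10 * dn_b10 + RADIANCE_ADD_B10


def brightness_temperature_k(radiance):
    """Convert TOA radiance to brightness temperature in kelvin."""
    return K2_B10 / math.log(K1_B10 / radiance + 1.0)


def ndvi(red_b4, nir_b5):
    """Normalized difference vegetation index from bands 4 (red) and 5 (NIR)."""
    return (nir_b5 - red_b4) / (nir_b5 + red_b4)


def vegetation_proportion(ndvi_value, ndvi_min, ndvi_max):
    """Proportion of vegetation from NDVI scaled between scene extremes."""
    scaled = (ndvi_value - ndvi_min) / (ndvi_max - ndvi_min)
    scaled = min(max(scaled, 0.0), 1.0)  # clip to [0, 1]
    return scaled ** 2


def emissivity(pv):
    """Assumed linear emissivity relation; other formulations exist."""
    return 0.004 * pv + 0.986


def lst_celsius(dn_b10, red_b4, nir_b5, ndvi_min, ndvi_max):
    """Emissivity-corrected land surface temperature in degrees Celsius."""
    bt_k = brightness_temperature_k(toa_radiance(dn_b10))
    pv = vegetation_proportion(ndvi(red_b4, nir_b5), ndvi_min, ndvi_max)
    eps = emissivity(pv)
    lst_k = bt_k / (1.0 + (WAVELENGTH_UM * bt_k / RHO_UM_K) * math.log(eps))
    return lst_k - 273.15


def mean_lst(per_image_lst):
    """Average the LST of one location over all usable images."""
    return sum(per_image_lst) / len(per_image_lst)
```

In practice the same arithmetic would be applied to whole rasters (for example with array operations) and the per-image results averaged per location, as described above.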

Data processing
The processing of the data was done in ArcGIS mapping software. The general workflow for data processing was consistent with the authors' previous work (Pena Acosta et al., 2021a). However, for the sake of completeness, a brief explanation is provided in the following sections.

Specify street buffers
The urban feature data and LST data that have been collected for each city had to be rearranged at the street level. This is because urban planning decisions are best made at the street and neighborhood levels (Jacobs, 2016). For UHI modeling to be useful for urban decision-making, the resolution of the model must match that of decision-making. By considering UHI at the street level, it is possible to capture the impact of the core street features (e.g., dominant land use, the average height of the buildings, the width of the streets, the density of vegetation, etc.) on UHI at the micro level. This would help the cause of developing micro mitigation strategies at the street level to incrementally tackle the problem at the city level.
In order to develop the model at the street level, the local socioeconomic and urban morphological data, and the LST of each street needed to be determined. To this end, a buffer was created around the center line of each street segment as shown in Fig. 3. The buffer makes it possible to identify data points that belong to a street and then group them and estimate the relevant average values. The buffer distance in this research was kept at 15 m, as was also the case in the previous work of the authors (Pena Acosta et al., 2021a).

Extract features
The features that were used to develop the databases take different forms and required different computations. Table 2 presents the overview of features considered in this research and the sources from which these features were extracted. The densities per street segment (i.e., building density, vegetation density, and water density) were computed by taking the proportion of the buffer area that overlaps with buildings, vegetation, or water, respectively, as shown in Fig. 3. Subsequently, the predominant land use of each street was represented by the land use that is the most dominant within the buffer area, as illustrated in Fig. 4(a). The population data for all cities were mapped to a grid structure that represented the number of people per cell. The mean population per street was computed as the weighted average (i.e., taking the proportions of cells that are within the buffer into account) of the population in each cell, as illustrated in Fig. 4(b). The mean building height was also derived from the weighted average within the buffer. The elevation was obtained by taking the average value of the DEM within the buffer area.
Fig. 3. Buffer area used to capture urban features.
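The per-street aggregations described above reduce to a few simple operations once the GIS overlay has produced the overlap areas. The sketch below assumes those areas are already available as plain numbers; the function names and the choice of overlap area as the weight are illustrative assumptions, not the ArcGIS workflow used in this study.

```python
def coverage_density(overlap_area_m2, buffer_area_m2):
    """Share of the street buffer covered by buildings, vegetation, or water."""
    if buffer_area_m2 <= 0:
        raise ValueError("buffer area must be positive")
    return overlap_area_m2 / buffer_area_m2


def dominant_land_use(area_by_land_use):
    """Land-use class covering the largest area within the buffer."""
    return max(area_by_land_use, key=area_by_land_use.get)


def weighted_mean(values, weights):
    """Weighted average, e.g. population per grid cell weighted by the
    area of each cell that falls inside the buffer (assumed weighting)."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical street segment: 600 m2 buffer, 150 m2 of it under buildings,
# two population cells overlapping the buffer by 3 and 1 area units.
building_density = coverage_density(150.0, 600.0)
land_use = dominant_land_use({"residential": 300.0, "commercial": 120.0})
mean_population = weighted_mean([100.0, 200.0], [3.0, 1.0])
```

The same `weighted_mean` helper would serve for the mean building height, with building footprints inside the buffer as the weighted units.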

Estimate ΔLST
The intensity of UHI is generally expressed in terms of the LST differential (ΔLST) between the urban area and its rural environment (Oke, 1973), which is commonly represented by a reference point. In this research, the LST was estimated from the Landsat 8 thermal bands of the USGS Landsat Level-1 product, as discussed in Section 3.1.2. The availability of satellite images between 2019 and 2021 per city varied from a minimum of five for the city of Enschede to a maximum of nine for the cities of Apeldoorn, Rotterdam, New York, and Montreal. The value for ΔLST in each observation was calculated as the difference between the average LST within the created buffer and the averaged reference LST measured at three independent locations in the rural environment of the city. Fig. 5 shows an LST map of the city of Enschede after processing the Landsat 8 images, and the locations of three reference points in the rural environment of the city. It is worth mentioning that many studies on the causes and mitigation strategies of UHI have pointed out that the UHI intensity is correlated with the baseline temperature. This makes comparison of ΔLST between cities from different environments inappropriate. For instance, the average observed LST for New York City was 37.0 °C, while it was 24.3 °C for the city of Apeldoorn. This difference between the observed urban environments introduces noise into the training datasets that are used for both cities. This is important because the goal of this research is to investigate the generalizability of the model for different urban contexts, which requires a one-to-one comparison of the cities. It is, therefore, important to normalize the ΔLST in a way that allows cross-validation and assessment. Consequently, this research used the average percentage difference ΔLST [%] for each observation, relative to the reference LST in the local rural environment, as the dependent variable.
Thus, the magnitude of ΔLST per city, in percentage, was used to account for environmental differences between cities. The ΔLST [%] for each observation was averaged using the created buffer areas. The computed values for the features and the ΔLST [%] of each observation were combined, and the data were cleaned (i.e., duplicate removal and fixing structural errors), resulting in five datasets. Table 4 presents an example of how the data is structured.
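The normalization described above can be illustrated with a short sketch. It assumes that LST values are expressed in °C and that ΔLST [%] is the difference to the mean rural reference LST divided by that reference; the function name and the example values are hypothetical.

```python
def delta_lst_percent(street_lst, reference_lsts):
    """Percentage LST difference of a street buffer relative to the mean
    of the rural reference locations (three per city in this study)."""
    reference = sum(reference_lsts) / len(reference_lsts)
    if reference == 0:
        raise ValueError("reference LST must be non-zero")
    return (street_lst - reference) / reference * 100.0


# Hypothetical street with a mean buffer LST of 33.0 and three rural
# reference points at 29.0, 30.0 and 31.0 (all in degrees Celsius).
example = delta_lst_percent(33.0, [29.0, 30.0, 31.0])
```

Because the result is relative to each city's own rural baseline, streets from a warm city and a cool city are placed on a comparable scale before the models are cross-validated.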

Data selection
The study aimed to investigate the generalizability of data-driven UHI models across different urban contexts. However, due to differences in the size of the studied cities, the amount of data available per urban context varied. Generally, the performance of the ML models improved as the data population used for training increased. To compare the models of cities and avoid bias in the assessment, a consistent number of data instances was used for all cities. The number of instances used for each city was set equal to that of the urban context that yielded the smallest dataset. However, random selection could result in unbalanced datasets in terms of ΔLST [%], especially for larger populations. Therefore, to maximize the range and variance in terms of ΔLST [%] for each city, 5,000 data instances were chosen for the training dataset of each city with the intention of making the data as uniformly distributed as possible. This approach ensured a more uniform distribution of data points and improved the accuracy of the trained models. Fig. 6 presents the histograms of the datasets used for training and testing the RF regressor. The mixed dataset was created by combining 5,000 data points from each city; its distribution was left non-uniform so that all cities are represented fairly.
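The text does not specify the exact selection procedure, so the following is only one possible way to obtain an approximately uniform subset: the ΔLST [%] values are grouped into equal-width bins and instances are drawn from the bins in round-robin order until the target size is reached. The function name, the number of bins, and the seed are assumptions made for illustration.

```python
import random


def select_uniform(values, n_select, n_bins=20, seed=0):
    """Return indices of up to n_select instances whose values are spread
    as evenly as possible over equal-width bins."""
    if n_select >= len(values):
        return list(range(len(values)))
    low, high = min(values), max(values)
    width = (high - low) / n_bins or 1.0  # guard against a zero range
    bins = [[] for _ in range(n_bins)]
    for index, value in enumerate(values):
        bin_id = min(int((value - low) / width), n_bins - 1)
        bins[bin_id].append(index)
    rng = random.Random(seed)
    for members in bins:
        rng.shuffle(members)  # random choice within each bin
    selected = []
    # Round-robin over the bins: sparse tail bins are exhausted first,
    # dense central bins keep contributing until the target is reached.
    while len(selected) < n_select:
        for members in bins:
            if members and len(selected) < n_select:
                selected.append(members.pop())
    return selected
```

With this scheme the rare high and low UHI intensities are kept in full, while the over-represented values near the mean are thinned out, which is the behavior the selection step aims for.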

Individual models
An RF regressor was used to develop the UHI model. RF was selected mainly because it is known to be less prone to overfitting due to the use of multiple randomly generated trees. Also, the previous work of the authors indicated that RF is the best-performing model in terms of accuracy (Pena Acosta et al., 2021a). Finally, the fact that RF is an ensemble method makes it more suited for generalization problems (Breiman, 1996; Geng, 2023). The model development was done in Python using the Scikit-learn library (Pedregosa et al., 2011).
The selected data populations were randomly split into subsets with a 70:30 training and testing ratio. Two methods for hyperparameter tuning were used from the Scikit-learn library. First, RandomizedSearchCV was applied to narrow down the range for each hyperparameter (i.e., number of estimators, minimum samples per split, minimum samples per leaf, maximum depth, and bootstrapping). Subsequently, GridSearchCV was used to obtain the hyperparameter settings of the best-performing models. The performance of the RF regressor was assessed by means of the Mean Absolute Error (MAE) and the Mean Absolute Percentage Error (MAPE) in the prediction of the dependent variable (i.e., ΔLST). For each model, the goodness-of-fit is estimated by means of R-squared (R²).
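A minimal sketch of this two-stage tuning procedure is given below. It runs on synthetic data standing in for one city's dataset; the search ranges, the number of iterations, the number of folds, and the scoring metric are illustrative assumptions and not the settings used in this study.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error, r2_score)
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     train_test_split)

# Synthetic stand-in for one city's street-level data (NOT the study data).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 4))
y = 20.0 + 10.0 * X[:, 0] - 6.0 * X[:, 1] + rng.normal(0.0, 0.5, size=300)

# 70:30 split between training and testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

# Stage 1: randomized search over wide (assumed) hyperparameter ranges.
wide_space = {
    "n_estimators": randint(20, 80),
    "min_samples_split": randint(2, 10),
    "min_samples_leaf": randint(1, 5),
    "max_depth": [None, 5, 10, 20],
    "bootstrap": [True, False],
}
coarse = RandomizedSearchCV(RandomForestRegressor(random_state=0), wide_space,
                            n_iter=8, cv=3, scoring="r2", random_state=0)
coarse.fit(X_train, y_train)
best = coarse.best_params_

# Stage 2: exhaustive grid search in a narrow neighbourhood of stage 1.
narrow_grid = {
    "n_estimators": [best["n_estimators"], best["n_estimators"] + 20],
    "min_samples_split": [best["min_samples_split"]],
    "min_samples_leaf": [best["min_samples_leaf"]],
    "max_depth": [best["max_depth"]],
    "bootstrap": [best["bootstrap"]],
}
fine = GridSearchCV(RandomForestRegressor(random_state=0), narrow_grid,
                    cv=3, scoring="r2")
fine.fit(X_train, y_train)

# Evaluate the tuned model on the held-out 30%.
y_pred = fine.best_estimator_.predict(X_test)
metrics = {
    "MAE": mean_absolute_error(y_test, y_pred),
    "MAPE": mean_absolute_percentage_error(y_test, y_pred),
    "R2": r2_score(y_test, y_pred),
}
```

The randomized stage keeps the cost of exploring a large space low, while the grid stage guarantees that every combination in the reduced space is evaluated.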

Similarity index
To better understand the generalizability of the model for the different urban contexts, a similarity index (SI) was defined for every combination of two cities. In this research, SI is defined in terms of how similar the features are, considering the importance of the features. In other words, similarity in more important features carries more weight than similarity in less important features. It should be highlighted that because SI depends on the feature importance, and given that feature importance is a unique characteristic of the model developed for a given context, the direction of the similarity assessment is important. In other words, the SI of City A to City B is not equal to the SI of City B to City A, because the importance of features in the models of City A and City B is not the same. To calculate SI, as proposed in this research, first, the relative contribution of the features was investigated.
Fig. 6. Histogram of all ΔLST data for each city.
To this end, the feature importance was extracted from the best-performing RF regressor of each city. Subsequently, the (dis-)similarities in the distributions of each feature between the two cities were assessed using a two-sample Kolmogorov-Smirnov (KS) test. KS is a nonparametric test used to quantify the distance between two empirical distributions, considering the KS-statistic (D) as a measure of the dissimilarity. The distance between the given distributions of feature F(x) of two cities y and z was obtained following Eq. (1):

$$D_{yz} = \sup_x \left| F_y(x) - F_z(x) \right| \qquad (1)$$

where D_yz is the KS-statistic corresponding to cities y and z, and sup_x is the supremum (i.e., largest absolute difference) of the cumulative functions F_y(x) and F_z(x). D_yz varies between 0 and 1, where the closer the value is to 0, the greater the similarity between the distributions. KS was selected to assess the similarity between distributions because it does not require any particular assumption on the nature of the underlying distributions. The feature importance of the best-performing model of city y operates as the weighting factor to calculate dissimilarities. The SI was calculated by taking the sum of all feature distances, multiplied by the corresponding feature importance. The calculation of the SI is expressed in Eq. (2):

$$SI_{yz} = \sum_{i=1}^{n} D_i \cdot FI_i \qquad (2)$$

where SI_yz is the similarity index for the best-performing model of city y cross-validated to city z, D_i is the KS-statistic for feature i (obtained from Eq. (1)), FI_i is the feature importance of feature i, and n is the number of features used in the RF regressor. Table 5 presents an example of the SI calculation.
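The SI calculation can be sketched in a few lines, reading the description above as an importance-weighted sum of per-feature KS statistics (so that, under this reading, smaller values indicate more similar cities). The function names and the toy feature values are assumptions for illustration only; in practice the importances would come from the fitted RF model of the training city.

```python
from bisect import bisect_right


def ks_statistic(sample_y, sample_z):
    """Two-sample KS statistic: largest absolute difference between the
    empirical cumulative distribution functions of the two samples."""
    ys, zs = sorted(sample_y), sorted(sample_z)
    d_max = 0.0
    # The supremum of a difference of step functions is attained at a
    # data point, so it is enough to check every observed value.
    for x in ys + zs:
        f_y = bisect_right(ys, x) / len(ys)
        f_z = bisect_right(zs, x) / len(zs)
        d_max = max(d_max, abs(f_y - f_z))
    return d_max


def similarity_index(features_y, features_z, importance_y):
    """Sum over features of KS distance times the feature importance
    taken from the model of the training city y."""
    return sum(
        importance_y[name] * ks_statistic(features_y[name], features_z[name])
        for name in importance_y
    )


# Toy data for two hypothetical cities (not values from this study).
city_y = {"height": [1, 2, 3, 4], "vegetation": [0.1, 0.2, 0.3, 0.4]}
city_z = {"height": [3, 4, 5, 6], "vegetation": [0.1, 0.2, 0.3, 0.4]}
importance_y = {"height": 0.7, "vegetation": 0.3}
importance_z = {"height": 0.2, "vegetation": 0.8}

si_yz = similarity_index(city_y, city_z, importance_y)
si_zy = similarity_index(city_z, city_y, importance_z)
```

The two results differ even though the feature distributions are the same in both directions, which illustrates why the direction of the assessment matters: each direction uses the feature importances of a different model.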

External cross-validation
The best-performing model for each city in terms of R 2 was then used to predict ΔLST [%] of the streets of the other four cities, resulting in 20 external cross-validations in total. Finally, the generalizability of the RF regressor was assessed for different scenarios, considering the varying levels of (dis-)similarity, expressed in terms of SIs.
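The external cross-validation loop can be sketched as follows. This is an illustrative outline only: the city labels, the synthetic data generator `make_city`, and the hyperparameters are assumptions, and scikit-learn's `RandomForestRegressor` stands in for the tuned RF regressors of the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

rng = np.random.default_rng(0)
cities = ["city_1", "city_2", "city_3", "city_4", "city_5"]  # hypothetical labels


def make_city(n_samples=300, n_features=4):
    """Synthetic stand-in for one city's street-level dataset."""
    X = rng.uniform(size=(n_samples, n_features))
    weights = rng.normal(size=n_features)  # city-specific feature response
    y = X @ weights + rng.normal(scale=0.05, size=n_samples)
    return X, y


data = {city: make_city() for city in cities}

# One model per city (assumed here to be the already-tuned, best-performing one).
models = {
    city: RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
    for city, (X, y) in data.items()
}

# Apply every city's model to each of the other four cities: 5 x 4 = 20 runs.
results = {}
for train_city, model in models.items():
    for test_city, (X_test, y_test) in data.items():
        if test_city == train_city:
            continue
        y_pred = model.predict(X_test)
        results[(train_city, test_city)] = {
            "r2": r2_score(y_test, y_pred),
            "mae": mean_absolute_error(y_test, y_pred),
        }

print(len(results), "external cross-validations")
```

Keying the results by the ordered pair (training city, testing city) keeps the direction of each validation explicit, which matters because the SI is also directional.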

Study results
This section presents the results of the generalizability study, which was explained in the previous section. As briefly explained in Section 3.3.1, the collection and processing of the data resulted in five independent datasets, structured according to Table 4. As mentioned before, due to differences in the size of the cities, the datasets varied in size. As shown in Table 6, the city of Montreal yields the smallest number of observations (slightly over 5,000). In contrast, the data for New York City add up to a total of 83,000 instances, making this dataset the largest. Table 6 also presents the mean and standard deviation of each feature for the entire dataset. The histograms of the data in terms of ΔLST [%] for all cities are presented in Fig. 6. In addition to the individual cities, the accuracy of the mixed model, i.e., the model that contains the data of all the cities, was also examined.
To make the comparison between different models fair, the training dataset of each city was reduced to approximately 5,000 data points (approximately the size of the smallest dataset), selected from the total data populations. Random selection, however, results in unbalanced datasets in terms of ΔLST [%], especially for larger populations like New York and Rotterdam. Since only a small proportion of the total population is selected for these cities (5,000 out of 83,000 and 15,000, respectively), it is more likely that instances close to the mean are selected, as there are considerably more observations in this range compared to the lower and higher values of ΔLST [%]. As a result, the data instances with relatively high and low UHI intensities are overlooked. Moreover, the RF regressor gains the most information from a more varied dataset (Lan, Hu, Jiang, Yang & Zhao, 2020). The higher and lower UHI intensities in particular are of major interest to city developers for evaluating mitigation strategies. Therefore, the 5,000 data instances were picked from the total population such that the range and variance in terms of ΔLST [%] were maximized for each city. In other words, the final distribution of data points for each city was made as uniform as possible. It should be noted that it was difficult to obtain very uniform distributions for cities with fewer data points, because this could only have been achieved by making the dataset smaller than 5,000 data points, which would then have had an impact on the accuracy of the trained model. Fig. 7 presents the histograms of the datasets that were used for training and testing the RF regressor. Please note that the mixed dataset was built by combining 5,000 data points of each city. Therefore, the distribution of the mixed dataset is not uniform. If this distribution were to be made uniform, the cities would not have been represented fairly in the mixed dataset.
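The text does not specify the exact selection procedure, so the following is only one possible way to draw a subsample whose ΔLST distribution is as flat as the data allow. The function name `flatten_sample`, the number of bins, and the synthetic population are assumptions. The idea is to split the ΔLST range into equal-width bins, give every bin the same quota, take sparse bins in full, and redistribute the unused quota over the remaining bins.

```python
import numpy as np


def flatten_sample(values, n_target, n_bins=20, seed=0):
    """Return indices of n_target points spread as evenly as possible over the value range."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    if n_target > values.size:
        raise ValueError("n_target exceeds the population size")

    # Assign each value to one of n_bins equal-width bins (ids 0 .. n_bins - 1).
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    bin_ids = np.digitize(values, edges[1:-1])
    members = [np.flatnonzero(bin_ids == b) for b in range(n_bins)]
    counts = np.array([m.size for m in members])

    # Share the target evenly; bins that run out pass their share on to the others.
    quota = np.zeros(n_bins, dtype=int)
    remaining = n_target
    while remaining > 0:
        open_bins = np.flatnonzero(counts > quota)
        share = max(remaining // open_bins.size, 1)
        for b in open_bins:
            take = min(share, counts[b] - quota[b], remaining)
            quota[b] += take
            remaining -= take
            if remaining == 0:
                break

    chosen = [
        rng.choice(m, size=q, replace=False)
        for m, q in zip(members, quota)
        if q > 0
    ]
    return np.concatenate(chosen)


# Hypothetical bell-shaped population standing in for one city's ΔLST values.
population = np.random.default_rng(1).normal(size=20000)
idx = flatten_sample(population, n_target=5000)
print(population.std(), population[idx].std())
```

For a bell-shaped population, the selected subset keeps all of the rare extreme values and thins out the crowded centre, so its spread is larger than that of the full population.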
After building the training and testing datasets, the individual models for each city were trained using the best hyperparameter settings (i.e., yielding the highest value for R 2 ). Fig. 8 presents the scatterplots of the best-performing models for each city and also for the mixed dataset. As shown in this figure, all models have similar performances. The lowest-performing model is that of Rotterdam, where R 2 is 0.56 and MAE is 0.07. Even in this case, as indicated by the results, the error rate is very low. So, in general, it can be concluded that all models performed well in predicting the same context as the training dataset. This observation is consistent with the previous work of the authors (Pena Acosta et al., 2021a). It is interesting to see that the mixed model also performed quite well, and its performance does not deviate greatly from that of the city-specific models. The feature importance analysis was carried out for each model, as shown in Table 7. In this table, the magnitude of impact and the ranking of each feature for each model are shown. In addition to the six models, Table 7 gives the average magnitude and ranking of each feature across the five city-specific models. Also, the range of variation in the rankings of each feature across different models is shown in Fig. 9. The rankings in the figure represent the relative importance of each feature across all models, and the range of variation in the rankings indicates how much the importance of each feature varies across the different models. As shown in Table 7 and Fig. 9, the ranking of features is fairly consistent across different models, with vegetation density, elevation (Marando et al., 2022), and predominant land use (Kwak, Park & Deal, 2020) being the most influential features. However, when looking at the magnitude of feature importance, there is considerable variability.
To further analyze the generalizability of the models, cross-validation of the models was performed as explained in the previous section. To this end, the best-performing model of each city was tested on the data from all remaining cities. Table 8 shows the results of this analysis. As shown in this table, all models performed very poorly when applied to other cities. All performance metrics show a significant decline. The negative R 2 values of the majority of the models suggest that the trained models do not follow the trend of the test datasets. In other words, the models exhibit very low generalizability. This can be partially explained by the high variability in the magnitude of feature importance within each model.
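To make the meaning of a negative R 2 concrete, the short sketch below computes the coefficient of determination from its definition, one minus the ratio of the residual sum of squares to the total sum of squares. The numbers are a toy example, not data from the study. A model that always predicts the mean of the test data scores exactly 0, and a model whose predictions run against the trend of the test data scores below 0.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Predicting the mean everywhere: SS_res equals SS_tot, so R2 = 0.0
r2_mean = r_squared(y_true, np.full(5, y_true.mean()))

# Predictions that reverse the trend: SS_res = 40, SS_tot = 10, so R2 = -3.0
r2_reversed = r_squared(y_true, y_true[::-1])

print(r2_mean, r2_reversed)
```

A negative value in Table 8 therefore indicates that the transferred model predicts the target city worse than a constant equal to that city's mean ΔLST would.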
As mentioned in Section 1, one objective of this research was to investigate the correlation between degrees of similarity and the generalizability performance of the models. To this end, the SIs of different combinations of the cities were estimated using the method explained in Section 3.3.3. Table 9 presents the results of the SI estimation. As mentioned before, the SI is solely concerned with UHI modeling and is thus estimated considering the varying importance levels of different features. Consequently, SIs do not necessarily represent the intuitive similarity assessment one may have before looking at the data. As shown in Table 9, geographic proximity played little to no role in determining SIs, which is counterintuitive. The same statement holds with respect to similarities in urban morphology and socioeconomic characteristics. As a result, the similarity levels between different cities did not vary significantly and ranged from 0.47 to 0.71. Despite this preliminary observation, the correlation between SIs and the performance of the cross-validated models was investigated. To this end, Pearson's correlation test was performed. As shown in Table 10, the high p-values and low Pearson's coefficients (r values) of all three performance metrics suggest a very weak correlation between SIs and model performances. This highlights the strong dependency of the developed models on the specificities of the context. In other words, no matter how similar the two contexts are, the data-driven UHI model of one context most probably cannot explain the other context without any modifications of the model parameters.
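The correlation test between SIs and cross-validation performance can be sketched with `scipy.stats.pearsonr`, which returns both the coefficient r and the two-sided p-value. The values below are random placeholders, not the entries of Tables 8 to 10, and only two metrics named in the text (R 2 and MAE) are included for brevity.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)

# Placeholder inputs: one SI and one score per directed city pair (20 pairs).
si_values = rng.uniform(0.47, 0.71, size=20)
metrics = {
    "r2": rng.normal(loc=-0.5, scale=0.5, size=20),
    "mae": rng.uniform(0.05, 0.30, size=20),
}

correlations = {}
for name, scores in metrics.items():
    r, p_value = pearsonr(si_values, scores)
    correlations[name] = (r, p_value)
    print(f"{name}: r = {r:.3f}, p = {p_value:.3f}")
```

A small absolute r together with a large p-value is the pattern the text reports: it gives no evidence of a linear relationship between the similarity of two cities and how well a model transfers between them.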
Fig. 9. Box and whisker plot representing the ranges of ranking of each feature across different models.
However, looking at the good performance of the mixed model, one can argue that if the training dataset includes data of more than one context (e.g., a model trained on the data of two cities), it may demonstrate better generalizability. To test this hypothesis, an extreme scenario was tested where models are developed based on four cities and then tested on the only city that was kept outside of the training dataset. The advantages of mixing the datasets are twofold. First, the model is trained on a larger dataset, which generally improves the accuracy and generalizability of RF regressors. Second, a mixed dataset captures a more averaged relationship between UHI intensity and the feature values of different cities, reducing the chance of overfitting the training data from one specific urban context. In the five additional scenarios, the model is trained on a mixed dataset containing approximately 20,000 data instances from four out of the five cities, and then cross-validated on the 5,000 data instances from the city that was not included in the mixed data population. Table 11 shows the results of this investigation. As shown in this table, the mixed models performed slightly worse than the individual models, yet all performed consistently well within the training contexts. However, when the models were applied to the context that was excluded from the training, the performances again declined significantly. This, once more, suggests that the data-driven models are very dependent on and sensitive to different urban contexts, no matter how extensive the scope of the training dataset is.
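The leave-one-city-out scenario can be outlined as below. As with any sketch of this kind, the helper name `leave_one_city_out`, the city labels, the synthetic data, and the hyperparameters are assumptions made for illustration, and scikit-learn's `RandomForestRegressor` stands in for the tuned models.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score


def leave_one_city_out(data, **rf_kwargs):
    """Train on all cities but one and score the model on the held-out city."""
    scores = {}
    for held_out in data:
        X_train = np.vstack([X for city, (X, _) in data.items() if city != held_out])
        y_train = np.concatenate([y for city, (_, y) in data.items() if city != held_out])
        model = RandomForestRegressor(**rf_kwargs).fit(X_train, y_train)
        X_test, y_test = data[held_out]
        scores[held_out] = r2_score(y_test, model.predict(X_test))
    return scores


# Hypothetical datasets: each city has its own feature-to-target relationship.
rng = np.random.default_rng(0)
data = {}
for city in ["city_1", "city_2", "city_3", "city_4", "city_5"]:
    X = rng.uniform(size=(200, 4))
    weights = rng.normal(size=4)
    data[city] = (X, X @ weights + rng.normal(scale=0.05, size=200))

scores = leave_one_city_out(data, n_estimators=50, random_state=0)
for city, r2 in scores.items():
    print(f"held out {city}: R2 = {r2:.2f}")
```

Each held-out score measures transfer to a context that the mixed model never saw, which is the quantity reported for the excluded city in Table 11.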

Discussion and future work
As climate change has become an increasingly pressing issue, cities have taken on a greater role in addressing it. As reflected by Creutzig et al. (2019), it is imperative to build transferable knowledge across cities, so that city-to-city learning can occur, and knowledge-based climate solutions can be shared. This research generated insights into the generalizability aspect of data-driven UHI models at the micro level (i.e., street-level) resolution, by looking at how data-driven UHI models developed based on the data from streets of one city would perform when applied to predict UHI in streets of other cities. A comprehensive dataset was collected that encompassed data from five different cities with diverse urban morphological and socioeconomic characteristics. Additionally, the correlation between the models' cross-validation performances and the degrees of similarity among cities was studied.
Recent research has pointed to the usability of ML approaches for analyzing complex urban environments effectively across different cities, for example in urban land-use mapping (Mao, Lu, Hou, Liu & Yue, 2020;Yang et al., 2017). Similarly, progress has been made in developing a global, culturally-neutral framework for classifying and delineating urban landscapes, applicable across various regions and cities (Bechtel et al., 2015;Demuzere et al., 2019;Demuzere, Kittner & Bechtel, 2021). However, to the best of the authors' knowledge, this study marks the first comprehensive examination of the generalizability of data-driven street-level UHI models. In line with the findings of Demuzere et al. (2019), who reported a low generalizability of LCZ classifiers, this study revealed that the same trend is observed when UHI is studied directly and at the street level. The most significant finding of this research is that the hypothesis that street-level data-driven UHI models can be generalized to other similar cities (i.e., similar in terms of the features that account for UHI) is shown to be invalid. It appears that the UHI mechanism is very context-sensitive and context-dependent. This is shown to be the case despite the fact that the features that account for UHI have similar rankings across different cities. It can be interpreted that small changes in the magnitude of feature importance can result in significant variation in the driving mechanism and behavior of UHI. By extension, it can be inferred that blanket/universal mitigation strategies may not deliver similar outcomes all over the world and may be the wrong approach to dealing with UHI. Instead, it is imperative that UHI is studied and addressed in a context-specific manner, considering the specific driving forces of UHI for any given context. Of course, in the development of mitigation strategies, it is equally important to consider the cost and feasibility of changing the features. For instance, while changing the H/W ratio of streets requires long-term investment, changing the land use and adding vegetation can be far more feasible in the short term. But, as shown in this research, it is equally important to consider the actual contribution of each feature in the specific context of one city, because a change in one feature could have different degrees of impact in different cities.
This research provides further evidence that the framework introduced in the previous work by the authors (Pena Acosta et al., 2021a) has great potential in revealing the UHI mechanism in each city and that it is fairly easy to implement in different cities, in spite of differences in how the required data are structured and stored. This indicates the strong possibility and value of using this approach in urban planning decision-making. Nevertheless, the authors strongly believe in the need for the development of a standardized method to store and structure UHI-related data at the national and international levels. This is mainly because, although the models show very low generalizability, it is shown that mixing the data in a universal model can generate reliable outcomes. So, the development of a UHI ontology can be very useful for researchers and urban planners to be able to study and address the phenomenon at the global level.
On a similar note, this research only considered the application of the UHI model of one city to another city in an as-is state, i.e., without retraining the model for the new context. Although this resulted in the low generalizability of the models, the observation may change if transfer learning techniques are applied to adjust the existing model based on a small and limited amount of data from a new city. The authors believe that transfer learning can potentially yield much better outcomes in terms of generalizability, given that mixed datasets perform relatively well. The authors are already pursuing this research avenue and will present the results in future work. It is in light of efforts such as this that the development of standardization for UHI data becomes crucial. If transfer learning is shown to be promising, which seems a likely outcome, then being able to readily integrate data becomes very valuable and necessary. A standardized data structure would also open the way for automation pipelines that can potentially relieve urban planners and city decision-makers of the need to collect data and develop UHI models.
It should be highlighted that this research solely looked at surface temperature data, which describe what is known as the surface urban heat island (SUHI). While correlated, the canopy urban heat island (CUHI) is known to have a different mechanism from SUHI. It is imperative that the distinction between the two types of UHI is more thoroughly studied and that the generalizability assessments and feature importance analyses are done in view of the possible differences between CUHI and SUHI. This forms part of the future work of the authors. Finally, this research analyzed Landsat 8 images that were available during the summer months over a period of three years. Given the low frequency of the data and the less-than-desirable consistency in the timing of the data, it is important to consider a more consistently-built dataset for this type of research that also considers UHI at various times of the day/year. Although this limitation exists, it must be highlighted that the authors have used the most common source of data for data-driven UHI research, which has been used by many researchers before. The statement about the limitation caused by data should not be viewed as an attempt to invalidate the findings. Instead, it should be seen as a call to action for governments and municipalities to invest in better data collection strategies, which can, in turn, improve the reliability and applicability of future research in this area. The authors are also developing a framework for more systematic data collection and storage that better supports the data-driven modeling of UHI.

Conclusions
This study aimed to investigate the generalizability of data-driven UHI models across different urban contexts. In order to do so, this study collected and processed five independent datasets for five different cities. The dataset for Montreal had slightly over 5,000 observations, while the dataset for New York City was the largest, with 83,000 instances. The RF model was then trained and tested using these datasets. The RF regressor performed well for all cities and showed similar performance for the mixed model, with the lowest-performing model having an R 2 value of 0.56 and an MAE of 0.07, while the highest-performing model had an R 2 of 0.71 and an MAE of 0.05. However, when cross-validated, the models performed poorly on data from other cities, suggesting low generalizability. The feature importance analysis showed that vegetation density, elevation, and predominant land use were the most influential features in the predictions of the models. The similarity index (SI) was estimated to investigate the correlation between the degrees of similarity and the generalizability performance of the models. The results showed no significant correlation between SIs and model performances, indicating the developed models' strong dependency on the context's specificities. Moreover, testing the mixed models on the city excluded from the mixed dataset showed that the models were still very sensitive to different urban contexts, underscoring that data-driven UHI models are context-specific and require modifications of the model parameters for different urban contexts.
Finally, this research emphasizes the importance of context-specific approaches to understanding and mitigating UHI, as the generalizability of data-driven UHI models across different cities is found to be limited. The following two main conclusions can be made:
1. The comprehensive generalizability assessment of data-driven UHI models, considering the degrees of similarity between training and testing datasets, reveals that these models exhibit limited generalizability when applied to different urban contexts. This finding highlights the need for context-specific approaches to understanding and mitigating UHI.
2. The investigation into the correlation between degrees of similarity in urban contexts and the generalizability of data-driven UHI models shows no significant correlation. This indicates that the models' performance strongly depends on the specificities of the urban context, and it emphasizes the importance of tailoring UHI mitigation strategies to each urban environment's unique elements and requirements.
Based on the above, the main contribution of this research is to provide evidence for the low generalizability of ML-based UHI models.
To the best of the authors' knowledge, this is the first time the generalizability of ML-based UHI models has been studied at this scale and extent, and although the findings invalidated the initial hypothesis of the authors, they provide an insight into the inner mechanisms of street-level UHI modeling. This work also offers a roadmap for future research because, as shown here, ML-based UHI models cannot be transferred across cities in an as-is state. However, this may change if transfer learning methods are applied to these models.
Based on the findings, the following four recommendations are proposed for future street-level data-driven UHI studies and data-driven urban planning:
1. The implementation of UHI mitigation/adaptation strategies should be context-specific, because the unique driving factors of UHI in each city, at a local level, result in different outcomes when employing a one-size-fits-all mitigation approach.
2. There is an urgent need for more consistently-built datasets and systematic data collection methods. To enhance the accuracy and reliability of data-driven UHI modeling, researchers should strive to create more consistent datasets and develop frameworks for systematic data collection and storage. This is because, while the mitigation of UHI is context-specific, building models that can be reused globally is certainly an avenue worth exploring, and consistent datasets can aid these efforts.
3. Further research should investigate the effectiveness of transfer learning for improving the generalizability of data-driven UHI models when applied to new cities, as this approach may yield better outcomes compared to using models in their as-is state.
4. Future studies should examine the distinctions between the two types of UHI, Canopy UHI and Surface UHI, and assess the generalizability of models and the feature importance analysis in light of these differences, as well as their combined effect in a street-level analysis.