Machine learning for spatial analyses in urban areas: a scoping review

The challenges for sustainable cities to protect the environment, ensure economic growth, and maintain social justice have been widely recognized. Along with digitization and the availability of large datasets, Machine Learning (ML) and Artificial Intelligence (AI) promise to revolutionize the way we analyze and plan urban areas, opening new opportunities for the sustainable city agenda. Urban spatial planning problems in particular can benefit from ML approaches, leading to an increasing number of ML publications across different domains. What is missing is an overview of the most prominent domains in spatial urban ML along with a mapping of the specific approaches applied. This paper aims to address this gap and guide researchers in the field of urban science and spatial data analysis to the most used methods and unexplored research gaps. We present a scoping review of ML studies that used geospatial data to analyze urban areas. Our review focuses on revealing the most prominent topics, data sources, ML methods and approaches to parameter selection. Furthermore, we determine the most prominent patterns and challenges in the use of ML. Through our analysis, we identify knowledge gaps in ML methods for spatial data science and data specifications to guide future research.


Introduction
Cities are facing tremendous environmental, infrastructural and social challenges that are unprecedented in scale, scope, and complexity (Meerow & Newell, 2019). To become sustainable, cities need to accommodate a growing population, meet greenhouse gas targets, adapt to a changing climate, and ensure fair and equal living conditions for all. To address these challenges and to improve urban efficiency, justice and quality of life, sustainable smart cities use information and communication technology (Colding et al., 2020). The associated rise of sensors, crowd sourcing and real-time monitoring has tremendously increased the availability of large spatial datasets. Advances in urban geographic information sciences and spatial data analytics have opened new avenues to analyze and visualize spatial data (Goodchild & Haining, 2004). Leveraging these advances and benefiting from the increasing digital innovation in our cities has been identified as one of the key transformations needed for achieving the Sustainable Development Goals (Sachs et al., 2019).
Today, Artificial Intelligence (AI) and Machine Learning (ML) most prominently provide new opportunities to better monitor, understand, and predict the (sustainable) development of urban areas. As such, urban analytics and modeling have become increasingly prominent to deal with the complex sustainability challenges that cities grapple with (Batty, 2008). Studies such as Nosratabadi et al. (2020) and Vinuesa et al. (2020) have used machine learning to improve sustainability and achieve the sustainable development goals. Here, we follow the vision of Elmqvist et al. (2019) in defining a sustainable city via the "integration of all sub-systems in an urban region in ways that guarantee the wellbeing of current and future generations" (Elmqvist et al., 2019). As such, we review subsystems that relate to the social, economic and environmental aspects of sustainable cities, as well as the infrastructural systems that shape the interactions between the different elements (see also Section 3.2 for an overview of categories).
Machine learning (ML) has gained popularity in many research fields. The foundations of ML were already laid in 1959, when Arthur Samuel, a pioneer in AI, coined the term (Samuel, 1959). In a nutshell, ML is a method to train algorithms to understand patterns inherent in data and predict outcomes based on statistical analysis. ML methods are data-driven: they extract meaningful information from data, instead of a priori modeling causal links. The 'learning' aspect herein implies that the better an algorithm performs in a specific task, the better it learned from that experience (Mitchell, 1997).
ML algorithms are divided into two main groups: supervised and unsupervised learning. Supervised learning uses a training set of examples with correct responses (targets) (Hastie et al., 2009; Marsland, 2014). In contrast, in unsupervised learning, correct responses are not provided. Instead, the algorithms aim to identify similarities between inputs and group them (Celebi & Aydin, 2016). Moreover, natural language processing (NLP) has developed techniques that aim to extract a fuller meaning representation from free text (Kao & Poteet, 2007). Studies can combine algorithms from supervised, unsupervised and NLP methods.
In the 1990s, Openshaw and Openshaw (1997) published one of the first books about ML applications in geography. Since then, ML has contributed to the fields of geography and spatial analyses generally, and urban systems more specifically. Spatial ML primarily uses geospatial data, which refers to data containing a geographic component that identifies locations (e.g., coordinates, addresses, and postcodes) or indicates geographically referenced features and conditions, such as the population of a district, seasonal weather of a region, number of vehicles passing a highway intersection, and geo-tagged social media data (Boulos et al., 2019). Moreover, urban spatial ML analyzes different aspects of the urban system, consisting of multiple tangible (e.g., infrastructures, land use) and intangible aspects (e.g., social equality, gentrification).
Recently, GeoAI was proposed as a framework for analyzing data-driven problems in geographic information science (Janowicz et al., 2020; Li, 2020). GeoAI aims to integrate artificial intelligence, in particular deep learning techniques, with geospatial big data and high-performance computing to investigate geospatial problems. In GeoAI, spatially explicit models are viewed as a significant research direction. Those models fulfil at least one of these four requirements: the results are not invariant under the relocation of studied phenomena (invariance test), the models contain a spatial representation of the studied phenomena (representation test), the models make use of spatial concepts in their implementation (formulation test) and the spatial forms of input and outcomes differ (outcome test) (Goodchild, 2001; Janowicz et al., 2020). Clear steps to build spatially explicit models, shifting from general ML models to designing more complex ones, are not yet well evaluated.
While spatial data collection has been accelerated through technological innovations (such as social and remote sensing), the availability of the data is not equally distributed throughout the world (Guigoz et al., 2017; Leyk et al., 2019). At locations where data is available, local statistical data are related to different areas of a municipality, which can vary among organizations and over time. Because of the heterogeneous nature of data sources and availability, spatial analyses need to integrate data from different sources and spatial granularities to establish a comprehensive understanding (Cheng et al., 2006). Due to the intense data collection and processing requirements, the reuse of spatial data has become a new norm (Janowicz et al., 2020). Lack of standards and unclear data collection procedures pose a potential risk in the development of reliable datasets.
Reflecting the increasing popularity of ML methods, several reviews were published in the fields of geography and urban analysis (see Table 1, footnote 1). While there are publications that focus on specific areas of application or ML algorithms, there is no comprehensive overview across urban domains that allows researchers to compare and choose the most adequate methods for their topic, nor is it possible to understand the potential overlaps and synergies, or to leverage the insights from one field for another. Moreover, a discussion about the types of spatial data used for urban ML analyses, or methods for choosing parameters, is missing. We address this gap by conducting a scoping review of the fields and domains in urban analysis that have priority in ML research, along with a mapping of the specific approaches, algorithms, or data sets and their fit to specific applications.
As indicated in Table 1, there are already numerous reviews on remote sensing (for example see Lary et al., 2015; Ma et al., 2019; Maxwell et al., 2018; Zhu, Tuia et al., 2017). These reviews show that support vector machines (SVM), random forests (RF), and boosted decision trees (DTs) are very powerful methods for classification of remotely sensed data. However, all remote sensing studies aim to detect and monitor the physical surface of the world by using remotely sensed images. What is missing, though, is the relation of the physical features of a city to its functions and sustainability. Therefore, in our scoping review, we focus on studies that primarily use geospatial data for urban sustainability. We explain the eligibility criteria in depth in Section 2.1.
The remainder of this paper is organized as follows. Section 2 explains the material and methods used for this scoping review. Subsequently, the paper provides insights into (i) the main themes and domains of applications of ML in urban analytics (Section 3.2), (ii) the data sources used (Section 3.3), (iii) the ML algorithms applied (Section 3.4) and (iv) the approaches for parameter selection (Section 3.5). The paper continues with a discussion of the main gaps and presents a research agenda to address these gaps. We conclude with the main findings.

Material and methods
This section describes the process and methods that have been followed in this review. As our objective here is to scope the field and its many applications for sustainable cities, we opted for a scoping review. Scoping reviews have been developed as a methodology to develop a mapping of study domains, data sources, approaches, and methods (Peters et al., 2015). While scoping reviews are still relatively new as compared to systematic reviews, they have been described as an ideal tool to determine the scope or coverage of an (emergent) body of literature on a given topic and provide an overview of its focus (Munn et al., 2018). A scoping review is especially suitable because the number of publications on ML applications for urban analyses has grown rapidly in the past years. Therefore, it is impossible to conduct a rigorous systematic review without excluding aspects of the field. Moreover, systematic reviews are not immune to exclusions of relevant papers (Biljecki & Ito, 2021).
Methodologically, our review process falls into the conventional three steps of a scoping review (Peters et al., 2015): (i) planning the review by developing eligibility criteria; (ii) identifying relevant literature through a database search, screening and selection; (iii) conducting the review and charting the results.

Table 1
Previously published reviews on machine learning applications for geography and urban analysis.

Authors (year)                          Field of study
Biljecki and Ito (2021)                 Street view imagery
Chaturvedi and de Vries (2021)          Urban land use planning
Grekousis (2019)                        ANN and deep learning in urban geography
Hegde and Rokseth (2020)                Engineering risk assessment
Ibrahim et al. (2020)                   Computer vision
Lary et al. (2015)                      Remote sensing
Ma et al. (2019)                        Remote sensing
Maxwell et al. (2018)                   Remote sensing
Milojevic-Dupont and Creutzig (2021)    Climate change mitigation
Nikparvar and Thill (2021)              Spatial data
Toch et al. (2019)                      Mobility data
Zhu et al. (2017)                       Remote sensing

1 In addition, Kamel Boulos, Peng, and Vopham (2019) presented works in GeoAI for healthcare topics, which might have applications for urban sustainability. However, it is not a formal review and is therefore not included here.
Y. Casali et al.

Planning the review: eligibility criteria
ML and urban analysis for sustainability are broad fields. We used the following criteria to select papers that are relevant for our analysis of ML applications that use spatial information in urban areas:
1) Papers mainly used ML algorithms to solve urban problems. We included supervised and unsupervised ML and natural language processing methods. We excluded papers that used solely linear regression or that discussed the theory of ML.
2) Papers primarily focused on urban scales ranging from neighborhoods to counties. The broad scoping allows us to include applications on smaller areas that could scale to cities or metropolitan regions.
3) Papers used geospatial datasets, i.e., data series, vector, or raster datasets when they are used in conjunction with a geographic location stored by coordinates or by indexes. For example, we included papers that used satellite images in combination with feature datasets as references for land use and land cover.
4) We excluded papers that solely focused on remote sensing, detection of geospatial objects and features from remote sensing images, image processing, image classification, computer vision, or urban street images.
5) Papers are published in journals or peer-reviewed conferences and written in English.
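The five criteria can be read as a conjunction of inclusion and exclusion checks. The following is a minimal sketch of such a screening filter, not the authors' actual procedure; the `Paper` class, its field names, and the two example records are hypothetical and only illustrate how the criteria combine.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    # All fields are hypothetical screening flags, not the review's schema.
    title: str
    uses_ml: bool = True                  # criterion 1: ML is the main method
    only_linear_regression: bool = False  # criterion 1: excluded if True
    urban_scale: bool = True              # criterion 2: neighborhood to county
    uses_geospatial_data: bool = True     # criterion 3
    remote_sensing_only: bool = False     # criterion 4: excluded if True
    peer_reviewed_english: bool = True    # criterion 5


def is_eligible(p: Paper) -> bool:
    """Return True only if the paper passes all five criteria."""
    return (
        p.uses_ml
        and not p.only_linear_regression
        and p.urban_scale
        and p.uses_geospatial_data
        and not p.remote_sensing_only
        and p.peer_reviewed_english
    )


# Two made-up records: the second one is a pure remote sensing study.
papers = [
    Paper("POI-based functional zones"),
    Paper("Satellite image classification", remote_sensing_only=True),
]
selected = [p.title for p in papers if is_eligible(p)]
print(selected)
```

Encoding each criterion as a separate flag keeps the reason for an exclusion traceable, which is useful when screening decisions have to be reported.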

Database search and screening
To identify an initial pool of literature for this study, the Web of Science was used to ensure the highest academic standards and validity of the articles, and for its broad and multi-disciplinary coverage. The Web of Science (WoS) is the oldest, most widely used, and authoritative database of publications (Birkle et al., 2020), and in a recent comparative study has been shown to guarantee reproducible results (Gusenbauer & Haddaway, 2020).
In the literature, the terms urban areas, cities and urban environment are often used interchangeably. Therefore, we included each term in the search. Moreover, we included the keywords 'urban spatial analysis' and 'land use change' to aim for papers with a spatial analytical component. As a result, we used as keywords 'urban area', 'cities', 'urban environment', 'urban spatial analysis', 'land use change' and 'machine learning' for our database search (search string: (('urban area' OR 'urban spatial analysis' OR 'land use change' OR 'cities' OR 'urban environment') AND 'machine learning')). We screened the literature by following three approaches to lower the risk of bias. First, we looked for papers that included the keywords in the abstract and that were highly cited according to Web of Science statistics to ensure inclusion of publications with high impact. Second, we looked for papers that were published in 2021 to ensure that the most recent trends and developments are covered. Third, we identified additional papers by snowballing with Google Scholar. In total, we screened 245 papers and selected 162 papers that met all eligibility criteria. We collected this set of papers in December 2021.
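To make the structure of the search string explicit, the sketch below assembles it from the keyword list. The helper name `build_query` is an assumption introduced for illustration; how the string is submitted to a database interface is not covered here.

```python
# Place-related keywords, in the order they appear in the reported search string.
place_terms = [
    "urban area",
    "urban spatial analysis",
    "land use change",
    "cities",
    "urban environment",
]
method_term = "machine learning"


def build_query(places, method):
    """Join the place terms with OR and combine them with the method term via AND."""
    quoted = " OR ".join(f"'{term}'" for term in places)
    return f"(({quoted}) AND '{method}')"


query = build_query(place_terms, method_term)
print(query)
```

Keeping the keywords in a list makes it straightforward to rerun the search with an added or removed term and to document exactly which variant was used.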

Review and analysis
After selecting the articles based on the eligibility criteria, we analyzed the body of literature. We selected key information in the papers: title, authors, year of publication, the purpose of the study, place of the case study, the method used, data reported, training-testing information, and hyperparameter or parameter information. For our mapping of themes and methods, we collected the information in tables by analyzing each paper. If a paper did not provide any information about a specific detail, we reported it as missing.
Our analysis covered five perspectives.
1) We investigated the spatial and temporal distribution of papers. For the spatial analysis, we used the locations of the case studies and grouped them into seven regions: Africa, Asia, Europe, North America, Central and South America, Middle East, and Oceania.
2) We mapped out the topics studied in papers to identify priority research areas and gaps. We developed four categories of studies that represented specific urban sub-systems: land use and urban form, socioeconomic, environment and infrastructures.
3) To identify patterns of data (sources), we investigated the type of data that papers used to develop their models. We distinguished data stored in tabular form (e.g., csv) and spatial data in vector and raster forms as well as remote sensing data.
4) To map out the most prominent ML methods in each category of study, we analyzed the methods used. We distinguished ML methods based on supervised, unsupervised, a mix of unsupervised and supervised, and natural language processing algorithms.
5) We analyzed the training-testing and the hyperparameter-parameter information reported to study how authors implemented their analyses and reported the associated information.

Spatial and temporal distribution
A total of 159 papers reported the location of the case studies. Fig. 1 shows the distribution of case studies by country and over time, clearly highlighting the discrepancy between regions. Most cases are located in China and the US, followed by the UK. Overall, 31% of the case studies were in Europe, 29% in Asia, and 27% in North America. Seven papers include multiple case studies in different countries, four of them on different continents. If a paper covered multiple case studies, we counted each case study separately and assigned it to the respective continent. In Fig. 1, the bars show the year of publication of papers. 84% of the papers were published between 2014 and 2021, indicating the increasing popularity of the field.
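The counting rule for multi-case papers can be sketched as follows. The records are invented for illustration and do not reproduce the review's data; `region_shares` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical records: each paper lists the region of every case study.
case_study_regions = {
    "paper_a": ["Europe"],
    "paper_b": ["Asia", "North America"],  # multi-case paper on two continents
    "paper_c": ["Asia"],
    "paper_d": [],                         # location not reported
}


def region_shares(records):
    """Count every case study separately and return rounded percentage shares."""
    counts = Counter(region for regions in records.values() for region in regions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {region: round(100 * n / total) for region, n in counts.items()}


shares = region_shares(case_study_regions)
print(shares)
```

Under this rule the denominator is the number of case studies rather than the number of papers, so a paper with two case studies contributes twice and a paper without a reported location contributes nothing.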

Categories
To derive topical categories, we built on the conceptual model of a sustainable urban system developed by Meerow and Newell (2019). As we focused on spatial attributes, we omitted the governance layer, and included those categories that characterize the urban system in terms of people, physical features, and services: (i) land use and urban form, (ii) socioeconomic, (iii) infrastructures. To also capture the importance of environmental factors and hazards on the urban system, we put forward a fourth category on environment.
We found that 34% of the studies (55 papers) are dedicated to infrastructure, 24% to socioeconomic topics (39 papers), 23% to land use and urban form (38 papers), and 18% to environmental topics (30 papers). In the following sections, we present a summary of selected papers based on each category. We will discuss these categories and findings in Section 4.

Land use and urban form
Fig. 2 shows the main topics studied in this category. We distinguish papers dedicated to (A) land use (29/38 papers) from studies related to (B) urban form (9/38 papers).
A Studies on the land use focus on (A.1) land use detection (15 papers) and (A.2) land use change (14 papers).
A.1 Land use detection characterizes areas in cities spatially. We identified four categories (see Fig. 2): land use detection from social sensing, functional areas, urban identities, and informal settlements. Of these, the identification of functional areas was the most prominent topic of study (6/15 papers).
1) Social sensing based on mobile phone or social media data is used in two papers to map urban areas based on the activity patterns (Cranshaw et al., 2012; Toole et al., 2012).
2) Functional areas were identified by specific activities and mobility patterns (Hu et al., 2020; Yan et al., 2017; Yao et al., 2017; Yuan et al., 2012; Zhai et al., 2019; Zhang et al., 2018).
3) Urban identities were studied by exploring the physical or spatial characteristics of public spaces (Chang et al., 2017; Chang et al., 2018), or the characteristics of attributes of urban blocks (Laskari et al., 2008).
4) Informal settlements studies focus on mapping (Fallatah et al., 2020; Jochem et al., 2018), detecting (Mahabir et al., 2020) or understanding the growth (Badmos et al., 2019) of urban informal settlements and slums. Because these studies are conventionally situated in the Global South, the papers also present approaches to address the problem of limited geographic data.
A.2 Land use change is used to analyze urban evolution over time. Predicted changes can be the transition from non-urbanized to urbanized (Huang et al., 2009; Pijanowski et al., 2002, 2014) or between land use classes (Chan et al., 2001; Petrović et al., 2017; Sangermano et al., 2010; Zubair et al., 2017). ML helps to detect urban change with cellular automata models (Feng et al., 2016; Moghaddam & Samadzadegan, 2009). Further studies focus on evaluating densification potentials in neighborhoods (Eggimann et al., 2021), analysing the dynamics of urban change from building alteration activities (Lai & Kontokosta, 2019), investigating the land use intensity from new masterplans (Gong et al., 2014), looking at the abandonment of residential areas (Xu et al., 2019), and studying the evolution of the urbanization level in a metropolitan area (Grekousis et al., 2013).
B For the urban form, studies analyze the spatial structure of cities. Urban morphology was investigated at architectural scales (Gil et al., 2012; Hanna, 2007; Li et al., 2020; Thomas et al., 2010). Urban areas were delineated by their vertical extensions (Arribas-Bel et al., 2019) or by their land cover extension (Liu et al., 2019). Other publications predicted the height of buildings (Biljecki et al., 2017) or derived building heights over time (Farella et al., 2021). Lee et al. (2017) studied map generalization tasks in cartography.

Socioeconomic
The papers covering socioeconomic aspects in urban areas were categorized into (A) socioeconomic attributes, (B) land economy, and (C) social issues (see Fig. 3). The large majority considered social issues (26/39 papers), with only a small number addressing socioeconomic attributes (3/39 papers).
A The earliest paper aiming to detect socioeconomic attributes, by Grove and Roberts (1980), studied the social and economic variation of British towns. Later, socioeconomic attributes were predicted in neighbourhoods (Dong et al., 2019) and GDP was investigated in relation to geographic predictors (Chen et al., 2020).
B Land economy falls almost equally into the prediction of retail attributes and real estate prices. When looking at the retail attributes, some studies predicted locations of stores (Satman & Altunbey, 2014; Xu et al., 2016). Other publications analyzed success indicators of retail store locations (Karamshuk et al., 2013) and hotels (Yang et al., 2015). For real estate prices, publications predicted market values of houses (Kauko, 2009; Xue et al., 2020), rent prices of residential units (Santibanez et al., 2015), or the real estate prices in different cities from the same country (Tchuente & Nyawa, 2021). Two publications studied the factors or amenities that drive prices for green building projects (Ma & Cheng, 2017) or land prices (Gao & Asami, 2007).
C Social issues included (a) social inequality, (b) gentrification, (c) social vulnerability and (d) crime prediction (see Fig. 3). Of these, crime prediction was the most prominent topic (11/26 papers).

Environment
For the environment, we distinguished studies on (A) physical system (6/30 papers) from (B) hazard and risk (see Fig. 4).
A Studies on the environment as a physical system are categorized into weather (5/9 papers) and ecological aspects. (Zhang et al., 2021), air pollution on roads by using traffic and meteorological data (Arnaudo et al., 2020; Suleiman et al., 2019). Studies investigated the relations between land use and air quality in urban areas (Brokamp et al., 2017; Champendal et al., 2014; Liu et al., 2015). Related to COVID-19 disease, studies analyzed the relationship between pollution levels and COVID-19 spread (Magazzino et al., 2020; Mirri et al., 2021) or analyzed the changes in the air quality from lockdowns (Shi et al., 2021). For other kinds of pollution, studies investigated chemical pollution from industrial areas in air and water (Shi & Zeng, 2014) and noise pollution (Hernandez-Jayo & Goñi, 2021; Torija & Ruiz, 2015).

Infrastructure
The investigated infrastructures were predominantly (A) transport (33/55 papers), followed by (B) energy, (C) water and sewer system, and (D) waste. Gas was only considered by one publication. Other networked infrastructures, such as information and communication technologies, have not been considered in the urban ML literature thus far (see Fig. 5).
A Studies on transportation infrastructures mainly focus on (A.1) mobility and behavior (31 out of 33 papers), while 2 out of 33 papers analyze (A.2) physical infrastructure (see Fig. 5).
A.1 From the mobility and behavior perspective, studies detect transportation system properties.
A.2 From the physical infrastructure perspective, two papers look at the structural characteristics of transportation networks. The topology of road networks was compared in different cities (Strano et al., 2013) and the road network vulnerability was analyzed against river flooding (Abdulla & Birgisson, 2020).
Cooper, 2015), modelled water demand to optimize water distribution (Rozos, 2019). Other authors investigated the vulnerabilities of the water supply infrastructure. They investigated leakage scenarios in the urban water supply system (Candelieri et al., 2013) and predicted pipe failure and breakage (Konstantinou & Stoianov, 2020; Kutyłowska, 2017; Winkler et al., 2018). The only study that looked at the sewer system was that of Liu et al. (2021), who predicted groundwater infiltration into the sewer network.
C Two studies examined how to predict waste in cities. They investigated the amount of solid waste (Ayeleru et al., 2021) and looked at how much municipal waste could be used to produce energy (Kaya et al., 2021).
D Li et al. (2019) predicted vulnerabilities of the underground gas pipeline network in a city.

Data
Machine learning studies reveal critical and hidden information in datasets. For our analysis of the underlying data, we start with an overview of how frequently different types of data were used.
We distinguish numeric (e.g., csv) data, vector data, remote sensing and raster maps. Fig. 6 shows a heatmap of the numeric and vector data across the different topical categories. We listed data used in at least two papers in alphabetical order. We calculated each percentage as the number of papers that used a data type divided by the total number of reviewed papers.
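The tabulation described above (keep data types used in at least two papers, sort alphabetically, express counts as a share of all reviewed papers) can be sketched as follows. The inventory is a made-up four-paper example, not the review's dataset, and `data_type_shares` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical inventory: paper id -> set of data types used in that paper.
data_by_paper = {
    "p1": {"POI", "Road"},
    "p2": {"POI", "Demographic"},
    "p3": {"Demographic", "Wi-Fi"},
    "p4": {"Road", "Demographic"},
}


def data_type_shares(inventory, min_papers=2):
    """Share of papers per data type, for types used in at least `min_papers` papers."""
    total = len(inventory)
    counts = Counter(dtype for types in inventory.values() for dtype in types)
    kept = sorted(dtype for dtype, n in counts.items() if n >= min_papers)
    return {dtype: round(100 * counts[dtype] / total) for dtype in kept}


shares = data_type_shares(data_by_paper)
for dtype, pct in shares.items():
    print(f"{dtype}: {pct}%")
```

Because one paper can use several data types, the shares computed this way do not need to sum to 100%.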
The most frequently used data were demographic data (29%), which describe the size of the population in an area. Points of Interest (POI) data (28%), which report locations and labels of public services and private businesses in the city, were primarily used for land use and socioeconomic analyses, but also in environment and infrastructure (Fig. 6). Not surprisingly, socioeconomic data (22%; incl. GDP per capita, ownership, income, education, employment, subsidy, and tax assessment characteristics) was most prominent in the socioeconomic category. Road data (24%; information about road network structure) was most often used for land use, and less frequently for infrastructure studies.
Although social media and telecommunication datasets are gaining prominence, especially when it comes to privacy (de Montjoye et al., 2018), we found that such data is still not frequently used: 4% of the papers used Twitter and 3% Foursquare. Twitter, Foursquare and Flickr (1%) were the only social media platforms in this review. 2% of the papers used data from the search engine Baidu, and 4% used mobile phone data. Only one publication used Wi-Fi connections (Xu et al., 2016).
Fig. 7 shows remote sensing data and raster data, which were used in 13% of the reviewed papers. Generally, remote sensing and raster data were used most in land use and environment, and less in the infrastructure category. The most common datasets in land use and environment studies were digital elevation models (DEM) (7%). Satellite imagery of the land surface (7%) and topographic maps (4%) were important for land use; night light data (4%) and vegetation indexes (2%) had applications in different categories. Other data accounted for 1% to 2% of papers.
Next, we compared the underlying data in each topic. In all topics, we found that there is a large heterogeneity in datasets used even for similar problems. In addition, papers often chose data without reporting a systematic methodology that guided the selection processes. However, we can still identify one or two common datasets in most topics. For example, in detection of functional areas, papers rely mainly on POI data, to which researchers have been adding different types of mobility data. Taxi trajectories were used by Yuan et al. (2012), bicycle stations and their rental records were used by Zhang et al. (2018), while mobile phone data and origin-destination (OD) data of trucks were used by Zhai et al. (2019). Similarly, for land use change detection, research is conventionally built on land use or land cover data as it represents the real distribution of land use at a certain time, and then complemented by datasets such as DEM, road and population data, or demographic information.

ML methods
In this section, we analyze the methods adopted per topic. We group the studies into four categories: 1) supervised, 2) unsupervised, 3) a mix of unsupervised and supervised or 4) natural language processing methods. Fig. 8 shows the distribution of methods per category. We calculated the ratio over the total number of methods. To avoid distortions, we omitted the paper of Spadon et al. (2019), who used 34 supervised algorithms, from the count as an outlier.
We find that supervised methods dominate across topics, with infrastructure most prominently, followed by environment, and equally, by land use and environment. Unsupervised methods were mainly adopted by socioeconomic topics, followed by land use, infrastructure, and environment topics. A mix of unsupervised and supervised methods is mainly used for socioeconomic topics, followed by infrastructure, environment and land use. Natural language processing was used mostly by the land use category, whereas socioeconomic and environmental problems do not use it.
Because of their prominence, we provide a closer analysis of the most popular algorithms for supervised and unsupervised ML. Fig. 9 (left) shows the number of times papers used specific supervised algorithms. For supervised learning, we listed algorithms used in at least two papers. Despite the wide range of algorithms, papers tended to use mainly a few. Neural networks (NN), random forests (RF), support vector machines (SVM), gradient boosting decision trees (GBDT), decision trees (DT), k-nearest neighbour (KNN) and logistic regression were the most frequently used supervised algorithms. Less frequently, papers combined supervised ML algorithms with cellular automata analyses. Studies that adopted only unsupervised ML algorithms (Fig. 9, right) used mostly PCA for data selection and k-means for clustering purposes.
Further, we analyzed the link between topics and methods. Table 3 shows our results. As for the datasets, we see a broad variety and heterogeneity of methods. Few algorithms were used several times per topic, with neural networks (NN) in energy use prediction as the top algorithm (7 papers in the topic), followed by NN in land use change and air pollution modeling (5 papers). The greatest heterogeneity in selection of supervised algorithms was found within the infrastructure category. Of the studies that used a mix of supervised and unsupervised ML, principal component analysis (PCA) was the most used unsupervised algorithm for feature selection as input to supervised ML methods, followed by the k-means algorithm. Natural language processing was prominently used when studies investigated functional areas in the land use category. These papers use Word2Vec, Place2Vec, DMR, TF-IDF, and LDA topic models to discover the thematic structure of spatial data.
Fig. 6. Heatmap of numeric tables and vector datasets, in alphabetical order. We reported the type of data used and the number of papers that reported their use by each category. We highlighted the social media and telecommunication data.

Patterns in parameter selection
In this section, we analyze patterns in parameter selection in supervised and unsupervised ML. In supervised learning, we investigated the training and testing information reported. Although parameter choices have an important impact on the results, we found that authors often did not report any information about the selection of training and testing parameters. Those that did mostly divided the data into two datasets: the training dataset usually comprised 70%-80% of the data, and the testing dataset the remaining 20%-30%. Some papers instead divided the training and testing data by years because they were developing temporal analyses (Dash et al., 2018; Yang et al., 2018). Few papers used three datasets for training, validation, and testing purposes (Arnaudo et al., 2020; Kutyłowska, 2017; Lee et al., 2017).
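The three partitioning schemes described above can be sketched as follows. This is a minimal illustration and not the procedure of any reviewed paper: the function names, the 70/15/15 proportions of the three-way split, the cut-off year, and the toy records are all assumptions made for the example.

```python
import random


def random_split(records, train_frac=0.8, seed=0):
    """Shuffle the records and split them into training and testing subsets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(round(train_frac * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def temporal_split(records, year_of, last_train_year):
    """Train on observations up to last_train_year, test on later years."""
    train = [r for r in records if year_of(r) <= last_train_year]
    test = [r for r in records if year_of(r) > last_train_year]
    return train, test


def three_way_split(records, train_frac=0.7, val_frac=0.15, seed=0):
    """Split into training, validation, and testing subsets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = int(round(train_frac * len(shuffled)))
    n_val = int(round(val_frac * len(shuffled)))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


# Hypothetical records: (year, observation id), ten records per year.
records = [(2010 + i % 10, i) for i in range(100)]

train, test = random_split(records, train_frac=0.8)
early, late = temporal_split(records, year_of=lambda r: r[0], last_train_year=2016)
tr, val, te = three_way_split(records)
print(len(train), len(test), len(early), len(late), len(tr), len(val), len(te))
```

Reporting the split fraction, the random seed, or the cut-off year alongside the results is what makes such a partition reproducible.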
Few papers systematically report the selection of hyperparameters, and thus far there is no common standard. For example, Xu et al. (2016) listed the hyperparameters for different algorithms in bullet points. Satman and Altunbey (2014), Grekousis et al. (2015), Kontokosta et al. (2017), Ma and Cheng (2016) and Li et al. (2019) reported hyperparameters or information about the architecture of NN in tables. However, the vast majority of papers failed to report hyperparameters and details about the model architecture. If authors reported hyperparameters, those mainly appeared in the body text rather than in a detailed and systematic fashion via tables or figures, making the information not easily readable. This finding was in line with earlier criticism of the lack of reporting of hyperparameters in artificial neural networks (Grekousis et al., 2019).
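As an illustration of tabular reporting, the sketch below renders hyperparameters as a plain-text table. The helper `hyperparameter_table`, the model names, and all values are hypothetical and chosen only for the example; they are not taken from any reviewed paper.

```python
def hyperparameter_table(models):
    """Render {model: {hyperparameter: value}} as an aligned plain-text table."""
    rows = [("Model", "Hyperparameter", "Value")]
    for model in sorted(models):
        for name in sorted(models[model]):
            rows.append((model, name, str(models[model][name])))
    # Column widths follow the longest cell in each column.
    widths = [max(len(row[i]) for row in rows) for i in range(3)]
    lines = []
    for idx, row in enumerate(rows):
        cells = (cell.ljust(width) for cell, width in zip(row, widths))
        lines.append(" | ".join(cells).rstrip())
        if idx == 0:
            lines.append("-+-".join("-" * width for width in widths))
    return "\n".join(lines)


# Hypothetical settings, for illustration only.
reported = {
    "RF": {"n_trees": 500, "max_depth": 12},
    "NN": {"hidden_layers": 2, "units_per_layer": 64, "learning_rate": 0.001},
}
table = hyperparameter_table(reported)
print(table)
```

Sorting the models and hyperparameter names gives a stable row order, which makes tables from different studies easier to compare.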
For unsupervised ML, we analyze how clustering and PCA were used. For clustering algorithms, we distinguish two approaches: (i) the optimal number of clusters is determined systematically by an algorithm, or (ii) the number of clusters is determined by users and algorithms assign data to each cluster. For the first category, for instance, Grekousis et al. (2013) selected the optimal number of clusters by using the partition coefficient and classification entropy. Singleton et al. (2020) used a clustergram, which plots a series of potential k values, to identify clusters for k-means. Shi and Zheng (2014) evaluated the optimal number of k-means clusters by using the silhouette coefficient. In the second category, authors typically selected the number of clusters based on empirical evaluations of the case studies (e.g., Chang et al., 2018; Aksela & Aksela, 2011; Lehmann & Gross, 2017).
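The first approach, choosing k systematically, can be sketched with the silhouette coefficient. The example below is a simplified illustration and does not reproduce the implementation of any cited study: it uses a plain Lloyd's k-means with a deterministic farthest-first seeding (an assumption made to keep the example reproducible) and synthetic two-dimensional points grouped around three hypothetical centers.

```python
import math
import random


def farthest_first(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, iterations=50):
    """Lloyd's k-means; returns one cluster label per point."""
    centers = farthest_first(points, k)
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1 or len(clusters) < 2:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Synthetic data: 20 points around each of three hypothetical centers.
rng = random.Random(42)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
points = [(cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
          for cx, cy in blob_centers for _ in range(20)]

scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(best_k)
```

Reporting the range of k values tested and the resulting scores, rather than only the selected k, documents how the number of clusters was justified.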
For principal component analysis (PCA), we found that different methods were used when selecting the number of principal components (PCs). For example, Cutter and Finch (2008), Wang and Zhang (2017) and Gao and Asami (2007) used the Kaiser criterion. Champendal et al. (2014) used the Kaiser, Jolliffe and Cattell criteria. Other papers selected the number of PCs that captured the majority of the total variance without fixing a number a priori or following any specific criterion (Lalloué et al., 2013; Dong et al., 2020; Ke et al., 2020; Reades et al., 2019; Suleiman et al., 2019).
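Two of these selection rules can be written down compactly. The sketch below assumes that the eigenvalues of the correlation matrix are already available; the eigenvalues and the 80% variance threshold are hypothetical values chosen for illustration, since the reviewed papers did not fix a common threshold.

```python
def kaiser_components(eigenvalues):
    """Kaiser criterion: keep components whose eigenvalue exceeds 1.
    Assumes the eigenvalues come from a correlation matrix."""
    return sum(1 for ev in eigenvalues if ev > 1.0)


def variance_components(eigenvalues, threshold=0.8):
    """Smallest number of leading components whose cumulative share of the
    total variance reaches the threshold (the threshold is an assumption)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, ev in enumerate(ordered, start=1):
        cumulative += ev
        if cumulative / total >= threshold:
            return count
    return len(ordered)


# Hypothetical eigenvalues of a six-variable correlation matrix.
eigenvalues = [2.9, 1.4, 0.8, 0.5, 0.3, 0.1]

n_kaiser = kaiser_components(eigenvalues)
n_variance = variance_components(eigenvalues, threshold=0.8)
print(n_kaiser, n_variance)
```

In this example the two rules disagree (two components versus three), which illustrates why the criterion used should be stated explicitly.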

Discussion
In this section, we discuss the implications of the results from the review, identify research gaps and objectives for future research across the dimensions of our scoping review.

Spatial distribution
We found that case studies were mainly located in Asia, North America, and Europe (Fig. 1). The lack of studies in other regions, and of comparative studies, may be driven by limited access to and availability of data, which strongly affects computational studies. This gap presents an opportunity to develop methods for data sparse environments (Brajard et al., 2020; Nikparvar & Thill, 2021), and comparative studies that identify the impact of different data sets and granularities on the results. One of the most promising research avenues in data sparse contexts is the creation of satellite Earth observations (EO). Related approaches have proven successful in monitoring agri-food systems (Nakalembe et al., 2021) and in understanding urban sprawl and land-use change (Sankhala & Singh, 2014). Other methodologies improve spatial data collection in some regions. For the social dimension of sustainable urban development, the DesInventar methodology is an example, with a focus on disaster losses (Panwar & Sen, 2020). However, this methodology still has important limitations regarding the level of disaggregation of urban data and consistent coverage (Osuteye et al., 2017).
Digital technologies can be beneficial for sustainable development goals (Nosratabadi et al., 2020; Sachs et al., 2019; Vinuesa et al., 2020). However, there are risks and downsides that countries must identify and tackle through integrated strategies and a focus on the leave-no-one-behind principle (Sachs et al., 2019). Some of these risks concern ethical issues, for example, the loss of jobs for lower-skilled workers, the theft of digital identities, invasion of privacy by governments or businesses, and discrimination based on personal data. Therefore, responsible implementation and use of AI methods should address these topics and principles. Furthermore, the interpretability of AI algorithms must be addressed jointly with requirements and constraints related to data privacy, model confidentiality, fairness, and accountability (Barredo et al., 2020).
As a result, there is a need to conduct research that focuses on:
• Developing ML applications for prominent urban or rapidly urbanizing regions in Southeast Asia, India, or Africa. This entails the development of the appropriate data fusion, assimilation, or sampling techniques to generate databases that are fit for machine learning applications in data sparse contexts.
• Comparative studies across continents, validating findings from the different areas of study while acknowledging the diversity of underlying data sets (see also Data).

Categories
Although a broad range of topics has been investigated by using ML (see Figs. 2-5), some topics are underrepresented. In the socioeconomic category, studies are missing that investigate social attributes, for example, in urban cultures and the labour market. Social justice issues were partially treated with respect to inequalities and gentrification studies. Given the prominence of discussions related to accessible and affordable housing in fast-growing areas (Kramer, 2018; Rodríguez-Pose & Storper, 2019), and equality and equity in access to public urban services and infrastructures (Martínez et al., 2018; Modai-Snir & van Ham, 2018), more research in these domains is urgently needed. In the environmental field, which accounts for only 18% of the publications, more contributions are needed in evaluating more kinds of disaster risks (e.g. sea-level rise), natural resource consumption and ecosystem services. In the infrastructure category, crucial subjects such as waste management, logistics, and renewable energy systems were underrepresented, which is surprising given the growing interest in circularity for cities (Sachs et al., 2019).
Further, there is a lack of cross-domain publications. Two prominent areas that require research across urban systems are sustainability and resilience. Cities will be exposed to an increasing number of extreme events and will have to balance long-term sustainable and green development with resilience to hazards (Elmqvist et al., 2019). For both areas, infrastructural, social and environmental aspects need to be combined. Therefore, ML approaches are particularly promising. Sustainable urbanism is increasingly becoming smart and data-driven (Bibri, 2020). However, our findings on a lack of cross-cutting publications in sustainability and resilience confirmed Milojevic-Dupont and Creutzig (2021), who found that despite their potential, ML tools were not common in climate change research communities. Similarly, despite urban resilience becoming a research and policy priority (Krishnan et al., 2021; Meerow & Newell, 2019), we identified only one paper that used ML to evaluate resilience (Knippenberg et al., 2019), while resilience was discussed by Cutter and Finch (2008).
In sum, the most pressing research needs are related to:
• Developing machine learning applications targeting new areas of research such as (i) labour market and cultural attributes of cities; (ii) accessible and affordable housing in fast-growing areas and equity or fairness in access to urban services; (iii) a circular urban economy, including waste management and logistics; and (iv) climate change and related extreme events.
• Designing applications that focus on the interplay between the different urban environmental, infrastructural and socioeconomic settings with land use, especially in the areas of sustainability and urban resilience.

Data
Despite the popularity of data sources such as demographic data and POI, our findings showed a significant heterogeneity in the datasets used, even within the same topical category (see Table 2). Further, the rationale for the selection of datasets was often unexplained. We assume that many case studies followed a pragmatic approach built on the availability of data. With the significant differences in data sets both in scale and content, the results of ML algorithms are not comparable, hampering the development of urban analytics and evidence-based urban planning. While data availability is an issue, we argue that standards and explicit methods to select data for the different topical areas will help produce more generalizable and comparable results. The first promising way ahead is the integration of human sensing data. Surprisingly, urban ML still does not significantly rely on data from mobile phones or social media, confirming the findings of Grekousis (2019) on ANNs. While such data is promising to understand interaction patterns or the use of urban services, there are also challenges related to privacy and data protection that need to be addressed.
Table 2 (excerpt).
References: Hu et al., 2020; Yao et al., 2016; Yan et al., 2017; Yuan et al., 2012; Zhai et al., 2019; Zhang et al., 2018.
Detection of land use change. Data: Land use, roads, DEM. Land use, roads, population, DEM. Land use, roads, population. Landsat - land use, roads, population, POI. Land cover. SPOT - land cover, roads, population, tax-break development areas. References: Pijanowski et al., 2014; Pijanowski et al., 2002; Huang et al., 2009; Petrović et al., 2017; Sangermano et al., 2010; Zubair et al., 2017.
Prediction of store location. Data: POI, retail stores data. POI, user data from Baidu and Wifi connection data.
Another promising research avenue is investigating approaches to integrate data from various sources. This issue has been recently raised within the data fusion and Big Data research domain (Favaretto et al., 2020; Kar & Dwivedi, 2020; Yang et al., 2020). For the geographical domain, most Big Data are produced with space and time stamps. These are samples of sequential observations from various remote, in-situ, mobile and human sensing systems or simulations, which lead to an increased need for cross-scale data fusion, including integration across various sources and interpolation across spatiotemporal domains (Yang et al., 2020). The use of a large amount of data without solid reasoning can lead to misinterpretation of findings.
In sum, research on data is needed that focuses on:
• Analyzing the impact of the different dataset choices for urban ML problems within and across the different topical categories. Based on this, standards and explicit methods can be developed to select data for the various topical areas.
• Integrating sensing data such as from mobile phones or social media, while respecting privacy and data protection.
• Investigating new methods that merge data from different sources to derive meaningful results, especially in the context of data sparse environments (see also Spatial distribution).

Methods
In this scoping review, we distinguish four categories of methods (see Section 4.4). Natural language processing was primarily used in land use topics, even though it may also have promising applications in other domains. While we found that most papers lean on supervised ML (Table 3), there is a broad variety of supervised, unsupervised, or mixed methods. As the method selection depends on the scientific problem to analyze, future research should compare different methods within specific topics. As with the data category, systematic comparisons are beneficial to understanding the significance and output variety of using specific algorithms.
We found that NN, RF, SVM, gradient boosting DT, DTs, KNN and logistic regression were the most popular supervised ML algorithms, while PCA and k-means were the most popular unsupervised ML algorithms. In terms of the rationale for methodology selection, different reasons are put forward. Overall, some algorithms performed well in predicting or classifying data for specific problems and therefore gained a strong reputation over time. Supervised ML methods are often chosen based on their complexity, overfitting properties, parameter requirements, data requirements, and interpretability of results. Looking at the complexity of methods, RF and SVM are more complex than logistic regression, ordinary least squares regression or LASSO because they account for non-linear relationships (Cichosz, 2020; Knippenberg et al., 2019; Kontokosta & Tull, 2017). NN, in turn, are efficient predictors because they have higher computational complexity embedded in their network topology (Grekousis, 2019; Ma et al., 2017). When looking at overfitting properties, RF is often chosen because it avoids the overfitting of data (Chen et al., 2020; Jochem et al., 2018; Xu et al., 2019). For the parameter requirements, KNN does not require input parameters for classification problems (Lee et al., 2017; Ma et al., 2017). For the interpretability, DT is often selected for the easy interpretability of the results (Lee et al., 2017). When looking at the data requirements, NN was not used because of an insufficient amount of data (Knippenberg et al., 2019), or a NN based on an ensemble method was used to deal with data scarcity (Zhang et al., 2021). When studying unsupervised ML methods, PCA is often selected because it helps to synthesize datasets into a few principal components (dimensionality reduction) and still preserves interpretability through the loadings (Cutter & Finch, 2008; Laskari et al., 2008; Wang & Zhang, 2017).
New methods are needed that link the research on ML algorithms to urban science. Kitchin (2014) already discussed the challenges of new epistemologies and paradigm shifts that the use of big data and data-driven analytics might bring, highlighting the need for critical reflection on the epistemological implications of the data and analytical revolution. Falco (2015) demanded a human-centered approach. We argue that studies in ML and urban science should be aware of these challenges when developing appropriate methods that connect analytical frameworks with the broader urban science and policy. Knowledge transfer in particular is a promising concept that fosters and supports collaborations between research organizations, business entities, and the public sector (Heinimann & Hatfield, 2017).
Therefore, there is a need for:
• In-depth explorations of using natural language processing for fields other than land use topics.
• Comparing different machine learning algorithms within specific topics to study potential differences in results and their relationship.
• Developing new methodological frameworks that go beyond the mere application of ML and establish novel ways to explain, translate and transfer the results from ML to urban sciences, practice, and policy.

Patterns in parameter selection
We found that most papers tended to train and test the models for supervised learning, while only some authors included validation in their research. Often, papers did not report the parameters selected to build the models or information about the training and testing phases, leading to a lack of reproducibility. Although tables and figures are beneficial to the reader, only a few studies presented parameter information in these formats. These gaps were already identified by Grekousis (2019) for ANNs. We argue that a consistent way of reporting parameters is vital to increase reproducibility and advance the state-of-the-art.
Furthermore, we found that papers select the number of clusters and principal components for unsupervised learning by using different approaches. Often, the approach that authors used to define the number of clusters is not appropriately justified.
Benchmarking analyses might help to prepare better standards. This need is confirmed also by other reviews about the use of deep learning applications (Grekousis, 2019; Ma et al., 2019), which means it is a recurrent need in the field. For benchmark analyses, there should be benchmark datasets accessible to everyone. Therefore, the problem of transparency of pattern selection is linked to the accessibility of data.
Thus, there is an urgent need for research on:
• Studying protocols for reporting parameters in publications for supervised and unsupervised algorithms.
• Analyzing the impact of the ML algorithm results across different topical categories to define joint standards and increase reproducibility.

Conclusions
In this paper, we set out to review the state-of-the-art in machine learning based on spatial data for sustainable cities. Since this is an emerging and highly dynamic research field, we conducted a scoping review to (i) map out the most prominent topics, data sources, ML algorithms, and approaches to parameter selection, (ii) determine the most prominent patterns and challenges in the use of ML, and (iii) identify knowledge gaps to guide future research. We reviewed papers covering different ML algorithms across all aspects of sustainable urban systems, which are divided into the categories of land use and urban form, socioeconomic, environment, and infrastructure. Overall, the analyses helped to create a classification of ML approaches according to topics, methods, and data sources.
There are three main takeaways from this study. First, there are still ample opportunities to evolve this research field. This can be achieved by investigating missing topics or by working on cross-domain or comparative case studies (see Section 4). As ML and AI are gaining momentum, we expect that applications of these technologies will serve to solve contingent problems around pressing issues pertaining to sustainability, such as circularity and resilience.
Second, there is still a need to standardize the selection of data, algorithms, and parameters. Systematic comparisons of the data and algorithm selection can help in exploring the significance of these methods and the impacts of the results, while systematically reporting parameters increases the reproducibility of works and the transparency of the analytical process (see Section 4). Grekousis (2019) partially confirmed these findings for ANNs. This lack of transparency and systematic comparison of ML methods also hinders application. Sustainable urban planning decisions and policies that influence the urban environment require concrete reasoning and clarity.
Third, spatial ML will benefit from shifting attention to the creation and types of datasets. There are limitations in developing spatial ML studies in data-sparse areas (such as the Global South). Moreover, studies often use heterogeneous data from different data sources. Studying how to integrate and merge data will help spatial data-driven analyses to become more meaningful. Along with these themes, ethical studies about the role of technology and its possible risks for communities and individuals (e.g. the loss of jobs for lower-skilled workers, the theft of digital identities, invasion of privacy by governments or businesses, discrimination based on personal data) should be conducted to develop a more conscientious use of spatially-explicit technology.
This scoping review has some limitations related to the scope and the focus on spatial urban data. Although studies related to street view images have been on the rise recently, we did not include papers that adopted images as the sole source of information to develop urban analyses. We refer the reader to literature reviews in computer vision for urban analytics (Biljecki & Ito, 2021; Ibrahim et al., 2020). For ML applications in the field of remote sensing, we refer to Lary et al., 2015; Ma et al., 2019; Maxwell et al., 2018; and Zhu, Tuia et al., 2017. Another limitation is that we developed an initial mapping of an emerging field, while a systematic review would include all published papers in the literature by following a protocol (e.g. the PRISMA protocol by Moher et al. (2009)).
The scientific community can use this review as a guideline to understand which approaches and data sets have been used for which type of urban problem. Further, our analyses help shape a comprehensive understanding of the combined use of ML and geospatial data. Moreover, we identified several promising areas for future research in all domains, ranging from the need for more comparative studies to an improved understanding of the impact of the selection of data sets, algorithms, or parameters. We especially stress the need to foster explainable machine learning approaches and invest in knowledge transfer to create impact and help equip cities with the tools they need to address the many challenges they are facing.

Fig. 1. Spatial and temporal distribution of papers. The map shows the distribution of case studies per country; the bars show the years of publication.

Fig. 2. Topics studied in the land use and urban form category. The tree-plot shows the main research themes. We reported the number of papers per category in brackets.

Fig. 3. Topics studied in the socioeconomic category. The tree-plot shows the main research issues investigated in papers of the socioeconomic category. We showed the number of papers in brackets.
For the weather, studies analyzed mostly temperatures in cities by studying land surface temperature in relation to land use attributes (Osborne & Alvares-Sanches, 2019; Sun et al., 2019), assessing heatwave thresholds (Park & Kim, 2018) or mapping Local Climate Zones (Bechtel et al., 2019). A method to predict turbulent air flows in the urban environment was developed by Xiao et al. (2019). In ecology, studies investigated the occurrence of ravens (Baltensperger et al., 2013), the ecological footprint of urban areas (Yao, 2012), the carbon storage of urban trees (Strohbach & Haase, 2012), and socio-ecological indicators of urban soil (Bonilla-Bedoya et al., 2021).
B For analyzing hazards and associated risks in urban systems, we distinguish flood risk prediction and detection of pollution. A variety of studies analyzed flooding mainly for classification and prediction purposes. The studied topics were the classification of the severity of flood events based on rainfall intensities (Ke et al., 2020), susceptibility maps (Tehrany et al., 2019; Zhao et al., 2019) and flood risk maps of cities (Darabi et al., 2019; Eini et al., 2020; Motta et al., 2021). For pollution in urban areas, most papers focus on predicting air pollution. Studies predicted PM2.5 based on meteorological data (Banga et al., 2021; Deters et al., 2017), CO2 emissions from meteorological and socioeconomic variables (Li & Sun, 2021), carbon emission from urban blocks
identified suitable locations to place bike stations (Chen et al., 2015), predicted the number of available bikes and free bike slots (Collini et al., 2021)
4 When looking at electric vehicles, studies predicted locations of charging pools (Straka et al., 2020) and investigated the charging behavior by predicting the departure time and energy needs (Shahriar et al., 2021).
5 More broadly, Oke et al. (2019) studied urban typologies based on different urban dimensions to investigate the relationships between mobility and environmental sustainability.
6 Most studies (13/31 papers) analyzed traffic characteristics for predicting traffic speed (Ma et al., 2017; Magalhaes et al., 2021), traffic congestion spots (Awan et al., 2021; Majumdar et al., 2021; Qin et al., 2020; Saldana-Perez et al., 2019), traffic flows (Moretti

Fig. 4. Topics studied in the environmental category. The tree-plot shows the main research issues investigated in papers of the environmental category. We show the number of papers per topic in brackets.

Fig. 5. Topics studied in the infrastructure category. The tree-plot shows the main research issues investigated in papers of the infrastructure category. We showed the number of papers for topics in brackets.

Fig. 7. Heatmap of remote sensing and raster data. We reported the type of data used and the number of papers that reported their use by each category.

Fig. 8. Machine learning methods per topic. The histogram shows the ratio of papers that used supervised, unsupervised, a mix of unsupervised and supervised algorithms, or natural language processing (NLP) methods per category.

Table 2 List of data used in similar studies.
The table shows the data adopted by papers, grouped by topic of study.
DEM, rain data, land use/land cover, slope percent, curve number, distance to river, distance to channel, and depth to groundwater, urban density, quality of buildings, age of buildings, population, socioeconomic conditions divided in levels.Inundated areas, weather data.Pollution data in points, land use, distance to roads, airports, hydrographic networks, population, maximal power of heating systems, traffic.Pollution data in points, land use, expressway and major roads, weather variables.Pollution data in points, land cover, traffic intensity, DEM, population density, greenspace, emission point sources.

Table 3 Machine learning algorithms used in urban analyses.
For each topic, we report the algorithms used in the analyses. The number of publications is given in brackets for algorithms with more than one application in the topic. If a paper uses more than one method, all are reported under the same topic. Abbreviations: Decision Trees (DT), Density-based spatial clustering of applications with noise (DBSCAN), Dirichlet Multinomial Regression (DMR), Gradient Boosting Decision Tree (GBDT), k-nearest neighbors (KNN), Latent Dirichlet Allocation (LDA), Least absolute shrinkage and selection operator (LASSO), Neural Network (NN), Principal Component Analysis (PCA), Support Vector Machines (SVM), Term frequency-inverse document frequency (TF-IDF), Topical Word Embeddings (TWE).