Evaluation of 30 urban land surface models in the Urban-PLUMBER project: Phase 1 results

Accurately predicting weather and climate in cities is critical for safeguarding human health and strengthening urban resilience. Multimodel evaluations can lead to model improvements; however, there have been no major intercompar-isons of urban-focussed land surface models in over a decade. Here, in Phase 1 of the Urban-PLUMBER project, we evaluate the ability of 30 land surface models to simulate surface energy fluxes critical to atmospheric meteorological and air quality simulations. We establish minimum and upper performance expectations for participating models using simple information-limited models as benchmarks. Compared with the last major model intercomparison at the same site, we find broad improvement in the current cohort’s predictions of short-wave radiation, sensible and latent heat fluxes, but little or no improvement in long-wave radiation and momentum fluxes. Models with a simple urban representation (e.g., ‘slab’ schemes) generally perform well, particularly when combined with sophisticated hydrological/vegetation models. Some mid-complexity models (e.g., ‘canyon’ schemes) also perform well, indicating efforts to integrate vegetation and hydrology processes have paid dividends. The most complex models that resolve three-dimensional interactions between


INTRODUCTION
Over a decade has passed since the first International Urban Land Surface Model Comparison Project (PILPS-Urban) evaluated 32 models at two urban sites (Grimmond et al., 2010(Grimmond et al., , 2011)).Since then, new urban models have been developed, existing models have increased capabilities, and a new generation of modellers are using them.Expectations that urban schemes be integrated within weather and climate models have also grown: simulations are undertaken at finer spatial scales, and the wider modelling community recognises the importance of simulating meteorological conditions within cities (Masson et al., 2020;Sharma et al., 2021).Therefore, it is timely to undertake a new evaluation of land surface models used in meteorological simulations over urban areas.This project, Urban-PLUMBER, focusses on the local-scale (order 0.1-5 km) energy exchange between the urban land surface and the atmosphere.The last intercomparison with a similar focus, PILPS-Urban (Grimmond et al., 2010(Grimmond et al., , 2011) ) established that knowledge of an urban site's surface cover fractions significantly improved model performance.'Urban' models can include impervious surfaces (e.g., buildings, roads) and pervious surfaces (e.g., vegetation, bare earth), but not all urban models include both.In PILPS-Urban, models that neglected vegetation or porous ground performed poorly in latent, sensible, and radiant heat fluxes.This may have been expected at a suburban site with about 40% vegetation fraction (Grimmond et al., 2011); however, their performances were also poorer at an urban site nearly devoid of vegetation (Grimmond et al., 2010).PILPS-Urban concluded that models with simpler urban geometry (i.e., with fewer parameters describing built up areas) generally performed better than more complex models, as simpler models were better able to use provided site information.Further analysis of the suburban-site results (Best and Grimmond, 2015) concluded the dominant physical processes that urban models should capture, by importance, are: (a) bulk surface albedo during the day; (b) trapping of long-wave radiation between urban structures at night; and (c) evapotranspiration over diurnal and seasonal timescales.
Urban-PLUMBER builds on PILPS-Urban, which in turn drew on the methods of PILPS (Project for the Intercomparison of Land Surface Parameterisation Schemes).Since the 1990s, PILPS projects have undertaken land surface model evaluation and comparison (Henderson-Sellers et al., 1996, 1995;Slater et al., 2001;Bowling et al., 2003).A coordinating group defines the project framework (the protocol) and provides participating modelling groups with both meteorological data to drive land surface models and surface characteristics parameters to configure models.After running a model on their computers, participants submit their outputs to coordinators.Coordinators analyse outputs and communicate results.More generally, model intercomparison projects (MIPs) have been undertaken across all Earth system spheres, and have become a foundational element of climate science (Eyring et al., 2016).Together with PILPS-Urban, two additional MIPs have been influential in the design of the current project.
PLUMBER (Protocol for the Analysis of Land Surface Models Benchmarking Evaluation Project) (Best et al., 2015) demonstrated the benefit of using benchmarks to set the performance expectations for models.In traditional modelcomparison, models are ranked by various error metrics for select observed outputs.Although this helps identify outlying model performances it does not help determine whether the cohort overall is performing well or poorly.Furthermore, it may lead to subjective assessment of models being (un)fit for purpose and misdirecting subsequent model development priorities (Best et al., 2015).The benchmarks used in PLUMBER were simple empirical and physically-based models with far fewer inputs than the participating land surface models.Comparing models with benchmarks indicates the strengths and weaknesses of the cohort and hence areas for future development.Unsettlingly, PLUMBER found very simple empirical models such as a linear regression driven by short-wave radiation observed at other sites (i.e., trained out-of-sample) outperformed all participating land surface models across a suite of standard metrics when predicting sensible heat fluxes over 20 sites.The authors of the PLUMBER project concluded that complex and computationally expensive land surface models were not effectively using the information available in the forcing data when determining surface-atmosphere turbulent fluxes, arguing this challenged broadly accepted concepts used to model the surface energy balance.In Urban-PLUMBER we adopt a similar benchmarking approach but apply it for the first time in an urban setting.
ESM-SnowMIP (Earth System Model-Snow Model Intercomparison Project) (Menard et al., 2021) found widespread human errors affecting model performance but, unlike some earlier comparisons, it encouraged resubmissions where initial results showed unexpected behaviour.As such, most modellers resubmitted their results when errors were identified.Errors included incorrectly configuring model start times, using input data from the wrong sites, incorrectly formatted model outputs (variable name or sign), and hardcoded bugs (i.e., coding errors in model parameterisation).In the same way, Urban-PLUMBER aims to reduce human errors via an initial assessment and resubmission process to better focus on intended model functionality.
The Urban-PLUMBER project involves 30 models (Table 1, Appendix A1), 20 urban sites (Lipson et al., 2022a), 50 site-years of meteorological observations, 200 site-years of synthetic data, and 55 model output variables.Here, in Phase 1 of the project, we focus on evaluating model performance of five observed surface-atmosphere fluxes at one suburban site (Preston, Melbourne, Australia) over 16 months.The same site and observational data were used in PILPS-Urban (Grimmond et al., 2011), allowing direct comparison with those results; hence, our objectives here are to (a) evaluate land surface model performance in an urban setting using a benchmarking methodology; and (b) assess how the current cohort of models compare to earlier participants of PILPS-Urban.

Overview of modelling approaches
Many urban land surface models exist to parameterise urban surface-atmosphere exchanges (Grimmond et al., 2009;Garuma, 2018;Nazarian et al., 2023) and are developed for various purposes including to predict lower boundary conditions for weather, climate or air quality simulations; forecast environmental conditions within the urban canopy (e.g., between buildings at pedestrian level); test interventions intended to improve these conditions; and predict anthropogenic feedbacks relating to energy and water use or thermal comfort.
Although there is effectively a continuum of models with different levels of complexity for different physical processes (Figures 1 and 2, Appendix A1), here we broadly classify models into one of five cohorts (Figure 2) based on the representation of urban impervious surfaces (buildings, roads etc): • Non-urban schemes (participants in cohort n = 2): Most global and some regional weather and climate models lack an explicit urban scheme (Best, 2006;Oleson et al., 2018;Daniel et al., 2019;Zhao et al., 2021).Rather they simulate these areas using bare earth, rock or vegetation.Including models in this class helps determine the importance of using an urban scheme at a suburban site.
• One-tile (slab) schemes (n = 5): These treat built areas as a homogenous flat surface with parameters modified to represent the bulk influence of all urban elements.Some one-tile urban schemes represent built urban elements only (buildings, paving, roads etc), while others include the effects of vegetation and other surface types (water, bare soil, etc).Therefore, optimal effective bulk surface parameters are model-, site-and output-specific (Salamanca et al., 2009).Methods to estimate effective surface parameters include tuning to appropriately scaled observations (Best et al., 2006), from more detailed models (Martilli et al., 2015), or from more detailed input data (Wouters et al., 2016).
• Two-tile schemes (n = 5): These resolve two urban surface facets (e.g., roofs and 'street canyons') with different thermal and radiative properties, and therefore different surface energy balances.Best et al. (2006) suggested two-tile schemes provide benefit because one-tile heat capacity values could not be selected which provide both the correct amplitude and phase

F I G U R E 1
Model schematics of the main built, hydrological and behavioural attributes for participating models.Here models are categorised into five cohorts (left column) based on the geometric representation of buildings, with built, hydrological and behavioural attributes used to refine a 'total complexity' (Figure 2).Block array, statistical distribution and building-resolved models are grouped together into a 'complex' cohort in later analysis.

F I G U R E 2
Participating model capabilities.Grouped into cohorts (non-urban, one-tile, two-tile, canyon, complex: defined by approach to the built part of the urban area) and sorted from lower to higher 'total complexity' as calculated by assessing the included built, hydrological and behavioural attributes of submissions (green cells).Blue cells indicate a model is capable of representing the process but was not used in this submission.Frequency of approaches are indicated in right column.The 'complexity score' for each process is subjective.It is intended to be indicative only, helping to distinguish models within cohorts.
for observed sensible heat fluxes.While most two-tile schemes have surface parameters constant throughout a simulation, some parameterise the radiative and thermal effects of canyons from sun angle and morphology (e.g., MORUSES -Porson et al., 2010;CHT-ESSEL_U -McNorton et al., 2021).In this project these are classified as two-tile rather than canyon schemes, as they resolve two surface energy balances only.
• Canyon schemes (n = 13): These resolve the energy balance for roof, wall and ground surfaces separately (Masson, 2000).Radiation reflection and trapping are simulated in two dimensions with an infinite canyon assumption.The details of canyon schemes vary widely, with single or multiple atmosphere layers, sub-facets (e.g., multifaceted walls), fixed or averaged building orientation, independent facet thermal and radiative properties, constant or distributed building heights, and those that include pervious ground, low vegetation and/or street trees between buildings (Figures 1 and 2).
• More complex schemes (n = 5): These resolve three-dimensional interactions between urban facets using a variety of approaches.Repeated cuboids allow for two perpendicular streets while retaining some of the computational efficiency of a canyon approach (Kanda et al., 2005).Statistical distributions can characterise realistic urban environments and have been used to determine three-dimensional radiative interactions between buildings and urban vegetation similar to three-dimensional radiative interactions between clouds (Hogan, 2019a(Hogan, , 2019b)).This allows complex urban environments to be simulated in a computationally efficient manner (Stretton et al., 2022).Buildingand tree-resolving models represent three-dimensional interactions more explicitly, allowing microclimate conditions to be resolved, but at a larger computational cost.
Models can be further distinguished by how or if hydrological and anthropogenic processes are addressed, again with a large variety of approaches (Figure 1).How models represent built, hydrological and anthropogenic processes are used here to obtain a measure of each model's 'total complexity' (Figure 2).Participating model's parameterisations are individually summarised in Appendix A1.

2.2
Experiment design and data

Site description
Simulations are undertaken for the Preston area in Melbourne, Australia (AU-Preston; Lipson et al., 2022a), the same site used in PILPS-Urban Phase 2 (Grimmond et al., 2011).The site area includes 1-2-storey detached residential buildings, some row-style 1-2-storey commercial buildings, and substantial tree and lawn cover (Figure 3).The neighbourhood is classed as an open low-rise (LCZ6) Local Climate Zone (Stewart and Oke, 2012;Demuzere et al., 2022).The region is classified as having a temperate oceanic climate (CfB) under the Köppen-Geiger system (Beck et al., 2018).

Observational and forcing data
Observations for the AU-Preston site were gathered using sensors mounted on a telecommunication tower 40 m above ground to measure local-scale conditions (i.e., rather than microscale; Coutts et al., 2007a).Measurement height is 6.25 times mean building height (Table 3) and is thus assumed to be within the constant flux layer and inertial sub-layer.Raw data were obtained over 474.4 days (12 August 2003 to 28 November 2004) at high frequency (1)(2)(3)(4)(5)(6)(7)(8)(9)(10), which are then quality-controlled and averaged to 30-min with period-ending timestamps.Quality control removes periods unsuitable for eddy covariance observations (e.g., strongly stable conditions or periods subject to flow interference), along with significant outliers and unphysical values (Coutts et al., 2007a(Coutts et al., , 2007b;;Lipson et al., 2022a).The site observations are split into: (a) forcing data: provided to participants to drive models; and (b) analysis data: withheld from participants and used to evaluate model performances (Table 4).Analysis data are not gap-filled; models are evaluated against observed data only, and not analysed during periods with gap-filled short-wave down (SW down ; except where SW down = 0 at night, which is assumed valid).SW up is not analysed at night.After quality control and periods of equipment failure, remaining analysis data are well spread between day and night, and across the four seasons (Table 4).Additional processing description, observational data and plots are included in Lipson et al. (2022c).
The forcing dataset is gap-filled since it needs to be continuous for models.Small gaps (≤2 hr) are filled by linearly interpolating from available data.Larger gaps are filled using ERA5 global reanalysis (Hersbach et al., 2020) hourly data on single levels at 0.25 • spatial resolution (Hersbach et al., 2018).As gridded data differ from point observations (Martens et al., 2020)  ( McNorton et al., 2021), diurnal and seasonal adjustments are applied to bias-correct ERA5 data using available site observations and nearby rain gauges before gap-filling (Lipson et al., 2022a).
Systematic and random errors are present in any observations used to force and evaluate models.Random errors in flux observations over forested areas generally scale with the magnitude of the flux (Hollinger and Richardson, 2005;Richardson et al., 2006).Flux observations at urban sites have reported random and systematic uncertainties in the same range as observed over vegetated ecosystems (Järvi et al., 2018).At this site, daytime flux errors have been estimated to be up to 10% (Best and Grimmond, 2015).Evaluating models over extended periods reduces the effects of random errors.However, we cannot account for systematic errors if they exist, nor can we assess if surface energy closure is achieved with the available observations.Annualised rainfall during the analysis period (682 mm) was near the long-term average.However, preceding drought conditions and ongoing restrictions on domestic irrigation led to lower moisture availability and higher Bowen ratios during the study period (Coutts et al., 2007b).Conditions were otherwise reasonably representative of typical local climatology (Lipson et al., 2022c).

2.2.3
Spin-up strategy Soil wetness at the beginning of a simulation (the initial conditions) can strongly influence the modelled surface energy balance.Most land surface models require years to reach a hydrological equilibrium when forced by local meteorology (Yang et al., 1995;Best and Grimmond, 2014).
As soil states are model-dependent, initial conditions cannot simply be transferred between models nor set to one state across models (Koster et al., 2009).Ideally, each model would reach their own equilibrium during a spin-up period which is not analysed, with 10 years considered generally sufficient across a wide range of land surface models (Best et al., 2015;Best and Grimmond, 2016b).As model-forcing observations are rarely available to allow such a long spin-up at urban sites, past evaluation strategies include discarding some initial observations as spin-up (Grimmond et al., 2011), repeating a single year of observations several times (Best et al., 2015), using global reanalysis products such as ERA5 (Hersbach et al., 2020) or reanalysis data with bias corrections applied from gridded observations, such as WFDE5 (Cucchi et al., 2020).Using reanalysis for spin-up represents interannual variability prior to the analysis period and allows observations to be used for analysis.However, gridded reanalysis data (with grid spacing of order 30 km or coarser) may be unsatisfactory if local urban effects are not captured.To address this, we use site-bias-corrected ERA5 time series for 10 years prior to analysis (Lipson et al., 2022a).This provides meteorology (precipitation, solar radiation, temperature, wind etc.) over a sufficiently long period for soil states to equilibrate with local conditions prior to the analysis period.Of the 30 participating models, five did not use the full spin-up period (ASLUMv2.0,ASLUMv3.(Hengl, 2018c) Note: Parameters 1-9 were provided to participants for use in the 'baseline' experiment, while the 'detailed' experiment allowed the use of all parameters.For detailed definitions see Lipson et al. (2022a).
BEPCOL, K-UCMv1, TARGET) because a long spin-up was not deemed necessary by those participants.

Baseline and detailed experiments
To assess how site-specific information impacts model performance, two experiments are undertaken.First, as a baseline, participants configured their models using only land cover and location information (parameters 1-9, Table 3) and their default model configurations.This is designed to evaluate models configured with information typically obtainable from global high-resolution land cover datasets.Second, for a detailed submission, participants could use all parameters in Table 3.This is designed to evaluate if performance improves with parameters that are more challenging to obtain and not typically globally available (e.g., building height, canyon aspect ratio, and a breakdown of hard surfaces into building, road and paved fractions).
The previous intercomparison at the same site (PILPS-Urban: Grimmond et al., 2011) included four stages with increasingly detailed site information for participants.The baseline experiment in the current project is most similar to PILPS-Urban Stage 2, and the detailed experiment to PILPS-Urban Stage 4 (for which model Note: Forcing data are gap-filled with bias-corrected reanalysis data (Lipson et al., 2022a).Analysis data are used for model evaluation without gap-filling.DJF=December, January, February (summer); MAM: March, April May (autumn); JJA: June, July, August (winter); SON: September, October, November (spring).
outputs are reanalysed and compared with the current cohort in Section 3: Results).

Requested model outputs
Of the 55 variables participants are asked to return, here we analyse four surface energy fluxes -upward short-wave (SW up ) and long-wave (LW up ) radiation, sensible (Q h ) and latent heat flux (Q le ) -as well as the momentum flux (Q tau ) (Table 4b).The additional 50 variables are collected to undertake more detailed analysis in future studies and for error checking purposes (e.g., to check input forcing aligned with output time steps).
Variable names and formats follow the conventions of the Assistance for Land-surface Modelling Activities (ALMA) (Bowling and Polcher, 2001), as used in previous PILPS projects to facilitate data exchange in (non-urban) land surface model intercomparisons projects.Variables requested include both the ALMA 'mandatory' and additional urban-specific variables [e.g., anthropogenic heat (Q anth ) and water (Q irrig ) fluxes, storage heat flux (Q Stor ), roof, wall and road surface temperatures (RoofSurfT, Wall-SurfT, RoadSurfT), bulk air temperature within buildings and in street canyons (T airBuilding , T airCanyon ), and urban canopy albedo (U Albedo )].Outputs are further described in the modelling protocol (Lipson et al., 2020).No submission included all requested outputs (Figure 4).

Submission and feedback
Submissions were accepted through a web portal (https: //modelevaluation.org) that stores data and undertakes comparison with observation (Abramowitz, 2018).Various automatic and manual checks (Table 5) are undertaken to diagnose human errors in model configuration and outputs, as these cause poor performance that prevents model design or parameterisation from being appropriately assessed (Menard et al., 2021).On submission, immediate feedback is provided to participants to inform of basic file formatting errors.Subsequently, time series and energy closure plots are provided to participants by project coordinators.Short-wave radiation is chosen as a focus for feedback because it is a relatively simple flux to model, and has an instantaneous response, making timing issues between forcing and outputs more obvious.Also, correctly representing the bulk albedo is known to be important for urban model performance, as the net short-wave radiation dominates the energy balance (Best and Grimmond, 2015).Following feedback, participants had an opportunity to resubmit prior to more complete analysis and final Variables analysed here are defined in Table 4, with others defined in the Urban-PLUMBER modelling protocol (Lipson et al., 2020), following the ALMA naming conventions (Bowling and Polcher, 2001).Two models (CABLE and CHTESSEL) submitted only the baseline experiment, as they could not use the detailed urban parameters (Table 3).A1.

TA B L E 5
error statistics being shared.The number of submissions (including the first) varied from one to nine.Initial checks identified human errors including incorrect start times, mislabelling of outputs, variable sign errors, forcing interpolation errors (where model time step was shorter than forcing) and errors in model source code causing unphysical or unexpected behaviour.Results were presented to participants through the project website (https://urban -plumber.github.io/),including time series plots of all submitted variables and individual model results and metadata (archived in the supporting information).

Benchmarks
Following PLUMBER (Best et al., 2015), we use benchmarks (Figure 5) to guide performance expectations that are both physically (e.g., Manabe, 1969; bucket model) and empirically based.The empirical benchmarks are determined by statistical regressions using observational data independent of the site (so-called 'out-of-sample'), meaning that data from the site being tested are not used to establish regression parameters.PLUMBER used out-of-sample benchmarks to provide a lower bound on performance expectations.Here we include two 'in-sample' empirical benchmarks (i.e., derived using test site data) to give an expected upper bound on flux predictability.More complex empirical models with lagged inputs can improve benchmarks further (Haughton et al., 2017); however, the benchmarks described below are sufficient for our analysis and allow direct comparison with those used in PLUMBER.
The out-of-sample regressions are trained using meteorological data: downward short-wave radiation (SW down ), air temperature (T air ), and relative humidity (RH) from 20 urban sites (Table A2).The sites are from Europe, the Americas, Asia and Australia, and have different regional climates, urban surface characteristics and observational period (Lipson et al., 2022a).All empirical benchmarks rely only on contemporaneous meteorological data (i.e., do not draw on data from previous periods to make predictions).
Six benchmarks are categorised into three groups.The ALMA short-names (Table 4) are used to denote model driving variables of empirical benchmarks.
Group one includes a single physically-based benchmark: • Manabe_1T: A simple 'slab and bucket' model (Figure 5a) based on physical principles (i.e., conservation of energy, mass and momentum).The impervious (built) fraction is simulated using a one-tile slab scheme (Best, 2005).For the pervious fraction a simple representation allows precipitation to fill a store which overflows when full, and otherwise freely evaporates (Manabe, 1969).At each time step, the impervious and pervious tile outputs are calculated and aggregated with a weighted mean.Manabe_1T is configured using baseline parameters (Table 3: parameters 1-9).Additionally, Manabe_1T is treated as a participating model (i.e., evaluated against other benchmarks) through a secondary configuration using the detailed site parameters (Table 3).
Group two has three out-of-sample empirical benchmarks: • REG1-SW down : Linear regression with one variable (SW down , Figure 5b) is used separately to predict SW up , LW up , Q h , Q le , and Q tau .At night all predicted values are constant because SW down = 0 W⋅m −2 .
• REG2-SW down -T air : Two-variable (SW down and T air ) linear regression (Figure 5c) provides some information at night and provides benefit for variables strongly dependent on temperature (e.g., LW up and Q h ).5d).To use this benchmark, at each time step the proximity of the forcing data to one of the 27 training cluster centroids is determined, and then that cluster's regression is applied to form a prediction.This benchmark equates to PLUMBER's EMP3KM27 (Best et al., 2015), which outperformed all participating land surface models when predicting sensible and latent heat fluxes across 20 sites based on common metrics.
Group three has two in-sample empirical benchmarks: • KM3-IS-SW down -T air -RH: Following the previous k-means clustering method, but trained with in-sample data only (i.e., AU-Preston).This will outperform an equivalent out-of-sample model, but performance is expected to degrade if applied to dissimilar conditions (i.e., another site) because of overfitting.
• KM4-IS-SW down -T air -RH-Wind: k-means is applied incorporating a fourth variable (wind speed), increasing the clusters to 3 4 following the above rationale.Wind speed provides information to help predict turbulent heat and momentum fluxes.
Benchmark time series data are openly available (Lipson and Best, 2022).

Error metrics
Following PLUMBER (Best et al., 2015), we use statistical measures in three groups: • Commonly used model comparison statistics: mean absolute error (MAE) measures average error; mean bias error (MBE) for overall bias; normalised standard deviation (nSD) compares the variance of model output to that of the observations; correlation coefficient (r) measures pattern errors.
• Extremes of the observed distribution: absolute error at the 5th and 95th percentile of observed and modelled outputs.
• Shape of the distribution compared to observations: skewness measures differences in symmetry of the distributions; kurtosis measures differences in the weight of the tails of the distribution; overlap indicates the closeness of fit across the two distributions.
We also separately use centred root-mean-square error (cRMSE) as a measure which combines variance and pattern errors, but does not capture bias errors (Taylor, 2001).For aggregated scoring, error statistics (e.g., MBE) are redefined into error metrics (m MBE ) to be positive with perfect score of zero (Table A1).For benchmark scoring, MAE gives identical results to the normalised mean error (NME) used in PLUMBER.All statistics and metrics are defined in Appendix Table A1.

2.3.3
Benchmark scoring PLUMBER used a simple rank-based score to evaluate models and benchmarks (Best et al., 2015).However, simple ranking may give a false impression of difference where metric results are nearly identical (Haughton et al., 2016).
Relative scoring allows relative performance to be shown (Sabot et al., 2020).Furthermore, if global extrema are used across all models and benchmarks, this ensures a single benchmark has the same relative score across models.Thus, our scoring differs from PLUMBER.For each participating model i, variable v and metric m, a score S is calculated for the i th model using the minimum and maximum metric result across all models and benchmarks for that variable: This gives a score of 0 for the best-performing model or benchmark, and 1 for the poorest, with all others scaled relative to the range of results.Rescaling scores between 0 and 1 ensures that no metric has greater weight when aggregated with others.Different metric scores (Section 2.3.1) are aggregated into groups with: where n m is the number of metrics in group being aggregated (e.g., for the extremes group n m = 2).

RESULTS
To build our understanding of the model's performance, we initially consider one error statistic, the MAE (Section 3.1: Figure 6).Although no single measure can fully characterise model skill (Jackson et al., 2019), MAE provides a simple and unambiguous measure of average error (Willmott and Matsuura, 2005) and allows comparison of error magnitude across fluxes in natural units (W⋅m −2 ).Subsequently, three statistics (for correlation, variance and difference errors) are analysed in a Taylor diagram (Taylor, 2001) (Section 3.2: Figure 7).Aggregated benchmark performance scores (Section 2.3.3) are then analysed in benchmarking diagrams, using common error metrics (Section 3.3: Figure 8).Finally, all metrics (common, extreme and distribution) are used to compare models with benchmarks (Section 3.4: Figure 9).The PILPS-Urban Phase 2 Stage 4 (Grimmond et al., 2011) (hereafter G11) anonymised model outputs are reanalysed here conforming to this project's metrics and analysed periods.

Assessment using the mean absolute error
Individual model MAE results are combined into boxplots (Figure 6) for three experiments (G11, baseline and detailed) with results also analysed by model cohort (Section 2.1).The performance of the ensemble mean (i.e., the mean of participating model outputs at each time step; rightmost column) and the benchmarks (coloured horizontal lines) are also shown (Figure 6).
For upward short-wave radiation (SW up ), the detailed site information (e.g., albedo) improved all cohort performance where utilised (non-urban models did not submit detailed simulations).Most one-tile, two-tile and canyon models outperform the physical and out-of-sample benchmarks when given detailed information, whereas complex models did not use this information as effectively.The ensemble means perform similarly across the three experiments (G11, baseline and detailed), matching the best-performing individual models.The relatively low MAE for all benchmarks (2.3-7.0W⋅m −2 ) indicates this flux can be well simulated with few inputs.
For upward long-wave radiation (LW up ), providing more detailed site information reduced the MAE for one-tile models, but performance changed little for other model types (in some cases becoming poorer).Most models outperform the physical benchmark but only some beat the empirical benchmarks.The ensemble mean time series outperforms the physical and out-of-sample benchmarks.The LW up benchmark MAE values are larger (5.2 -22.4 W⋅m −2 ) than for SW up , indicating the flux is more challenging to predict with the information available to benchmarks.
For sensible heat flux (Q h ), providing detailed information broadly improves performance, particularly for two-tile, canyon and complex cohorts.Models with initially large baseline anthropogenic heat fluxes benefitted from knowing this site's relatively small flux magnitude (11 W⋅m −2 annual mean).All cohort mean and median MAE outperform the physical and out-of-sample empirical benchmarks for the detailed simulations, with the ensemble mean able to outperform all benchmarks (including in-sample benchmarks).The larger benchmark errors (18.5-32.9W⋅m −2 ) indicates that the flux is more challenging to predict than either radiative flux.
For latent heat flux (Q le ), more detailed information provided little benefit for reducing MAE, and performance degrades slightly with more complex cohorts (based on built geometry, not vegetation or hydrology attributes).This suggests the more detailed information (mostly related to urban morphology) may not be in a form useful for models.However, the non-urban models do most poorly as, without any impervious surface fraction, they vastly overestimate evapotranspiration.The detailed ensemble mean outperforms all individual models and the in-sample empirical benchmarks.Q le has a relatively high benchmark range (18.6-26.1 W⋅m −2 ), indicating a greater challenge to predict than radiative fluxes.
In summary, the range of MAE is lower for the current models than for the G11 models, indicating better performance of urban models in the present intercomparison.Differences in mean MAE for models in G11 and detailed experiments (which have comparable site information) are statistically significant for SW up , Q h and Q le , but not for LW up (t-test, p < 0.05).The range of MAE for the detailed simulations is generally smaller than for the baseline, indicating models benefitted from additional site information.The difference in the mean MAE for baseline and detailed experiments reaches significance in SW up only.

Taylor diagram evaluation
A Taylor diagram (Taylor, 2001) combines three error statistics: (a) a difference error metric (centred root-mean-square error: cRMSE); (b) a variance error metric (modelled standard deviation normalised by observed standard deviation: σ); and (c) a correlation error metric (Pearson's correlation coefficient: r; all defined in Table A1).Taylor diagrams use the centred RMSE (the RMSE after mean bias is removed) because it has a geometric dependence with the other two (independent) metrics, allowing the construction of the diagram (Figure 7).For each model, a marker shows where the three metrics intersect.The cRMSE of benchmarks is indicated by the concentric dashed lines.A model that would perfectly align with observations is indicated with a star at the figure base.The G11 PILPS-Urban Phase 2 Stage 4 results (small dots) are compared with the detailed experiments, as the site information available in each is comparable.
For SW up , most Urban-PLUMBER (UP) models and benchmarks are grouped tightly around the observation star (Figure 7a).Some UP models (e.g., 22, 29) have high correlations, but different variances than observed, indicating errors in bulk albedo.Others (e.g., 09) captured the observed variance well, but had lower correlation, indicating a potential time-of-day issue with SW up (either a time offset, or in this model's case, an asymmetrical diurnal profile).Using cRMSE as a metric, 23 of 30 UP models outperform at least one benchmark, while only 18 of 31 of the G11 models do (Table 6).The spread in σ and r indicates G11 models had greater albedo and time-of-day errors than UP models (Figure 7).
For LW up (Figure 7b), many participating models have larger variance than observed because of an overprediction in the diurnal range of LW up .Some UP models (e.g., 22, 20) are high-end outliers, with σ of approx.1.7 (i.e., 170% of observations), greater than the cRMSE of all benchmarks.G11 outliers are larger still (0.5 to 2.0).The LW up ensemble mean in UP and G11 performs similarly, as do the number of models that outperform benchmarks (Table 6).
For Q h (Figure 7c), correlation and variance statistics (and hence cRMSE) improved substantially in UP compared with G11.For UP, 28 of 30 models outperform at least one benchmark in cRMSE (cf.18 of 31 in G11, Table 6).Twelve UP models outperform all out-of-sample benchmarks (cf.six G11 models).One UP model (14: Lodz-SUEB) outperforms the four-variable in-sample benchmark, which is the upper limit of performance expectations.The UP ensemble mean also performs very well, outperforming all benchmarks nSD, R and cRMSE.

F I G U R E 7
Taylor diagram combines the normalised standard deviation ( σ), correlation coefficient (r) and cRMSE (defined in Table A1).Models (coloured numbers) have better performance if closer to star at diagram base, with cohort colours: non-urban (dark greens), one-tile (blues), two-tile (greens), canyon (orange to reds), complex (purples).Benchmarks models (coloured symbols) with their cRMSE contours (concentric dashed lines), and the PILPS-Urban Phase 2 Stage 4 (G11: small black circles).The UP ensemble mean variance is nearly identical to observations ( σ = 1.00), while for G11, 26 models had higher variance than observations and the ensemble mean σ = 1.23.Higher variance in G11 models indicates an overprediction in maximum Q h values or a general overestimation of the variability in Q h .As cRMSE is 'centred' it does not measure bias error (Taylor, 2001).For Urban-PLUMBER, the MBE for Q h ensemble mean is 5.2 W⋅m −2 (cf.12.1 W⋅m −2 in G11), indicating the UP models have improved partitioning of available energy into Q h .
Compared with other fluxes, the poorer cRMSE of the six benchmarks for Q le (Figure 7d) indicates this flux is more challenging to predict, or that it requires other inputs to improve performance (e.g., precipitation, soil states or vegetation characteristics).Most UP and G11 models underestimate the variance of this flux.The ensemble mean's σ = 0.70 (i.e., 70% of observations standard deviation).However, this is an improvement over the G11 ensemble mean ( σ = 0.60).The G11 results exclude six models that did not provide Q le output (i.e., some assumed Q le = 0 W⋅m −2 ).The ensemble mean for Urban-PLUMBER MBE (−4.1 W⋅m −2 ) is better than for G11 (−8.2 W⋅m −2 for G11 models that explicitly resolved Q le ).Combined with the improved ensemble mean Q h MBE, this indicates the UP models are better at partitioning available energy  into Q h and Q le .No model in either project outperformed in-sample empirical benchmarks for Q le (Table 6).
Eleven models provided the simulated momentum flux (Q tau ) from both Urban-PLUMBER and G11 (Figure 7e).Performance in both projects is similar, cf.benchmarks (Table 6).Benchmarks without wind information (i.e., one-, two-and three-variable empirical benchmarks) did not perform well.All models ranked between the three-variable and four-variable in-sample benchmarks, and all were able to beat out-of-sample empirical benchmarks.

Benchmarking evaluation: Common metrics
We can evaluate model results relative to the benchmarks for various experiments using the aggregated scores (Equation 1and 2) from four common metrics (MAE, MBE, nSD, R).A lower relative score indicates better performance (Figure 8).Of the 30 models in this project, 11 (in an earlier form) also participated in G11, so allow direct comparison.In Figure 8, the models are ordered in increasingly complex cohorts (Section 2.1), and within cohorts by the 'total complexity' (Figure 2), which includes hydrology and anthropogenic related characteristics.
For SW up , providing more detailed site information consistently improves the aggregate scores.The models in the simpler cohorts (one-tile, two-tile) benefit more from the more detailed information (square marker in each model column), where eight of 10 outperformed both physical and out-of-sample empirical benchmarks (dark green, model column base).Almost all canyon models outperform a benchmark, and in the detailed experiment three outperform all physically-based and out-of-sample empirical benchmarks.Only one complex model outperforms a benchmark for SW up (excluding G11 results), indicating the complex cohort have more difficulty using provided site information.The ensemble mean (last column) for G11 and UP performs well, beating all out-of-sample benchmarks (dark green) in the more detailed experiments.
For LW up , additional site information does not always improve performance, and sometimes degrades it.Performances of cohorts are inconsistent, with some models in non-urban, one-tile, two-tile and canyon categories outperforming all physical and out-of-sample empirical benchmarks, but others beat none.The canyon and complex models, and some two-tile models with radiation parameterisations, should be able to improve their LW up performance by utilising the detailed information provided on building morphology (e.g., representing canyon long-wave trapping), but appear unable to do so.The most complex urban schemes again are not able to effectively use provided information, with none outperforming a multivariate empirical benchmark.The ensemble mean performance changes little between G11 and UP.
For Q h , the non-urban, and most one-tile, two-tile and canyon models outperform all out-of-sample empirical benchmarks (dark green).This is in stark contrast to PLUMBER (Best et al., 2015) when no model outperformed even the one-variable linear regression at non-urban sites.Again, fewer complex models are able to beat empirical benchmarks, but all outperform the physically-based benchmark.Some one-tile ( 14), two-tile (11, 16) and canyon (01, 07, 19) models beat the three-variable, but not the four-variable in-sample benchmarks, which we consider as an upper performance expectation.Providing additional site information more frequently improves, rather than degrades, performance.The biggest improvements occur in models that assumed initially large baseline anthropogenic heat fluxes (e.g., models 21, 24, 26), drawing on the detailed characteristic estimates of mean anthropogenic flux.Substantial improvement in the ensemble mean occurs from G11 to UP baseline to detailed experiments, with the latter outperforming all benchmarks, including those trained in-sample (purple).Q h is the only variable where the ensemble mean beats all the benchmarks for common metrics.
Although additional site information degrades Q le model performances as often as it improves it, the improvements are larger in magnitude.Hence, the ensemble mean improves across the three experiments, in each case outperforming all the out-of-sample and physical benchmarks.Non-urban models are unable to outperform any benchmark because they overestimate Q le magnitudes.The three best-performing models in Q le (10,11,12) are all JULES-based models used with different urban schemes.Some one-tile (10, 14, 18), two-tile (11, 12), canyon (07, 13, 23, 27, 28) and complex (09, 25) models outperform all physical and out-of-sample benchmarks in detailed simulations.It is unclear at this stage why these models are performing well in Q le , as they all differ in their approaches in representing vegetation and hydrology processes.Further analysis across multiple sites may be informative.No detailed information for vegetation is provided that may have improved model performance further, e.g., leaf area index (LAI) phenology or stomatal conductance.
The 11 models that submitted Q tau results perform better for the detailed than the baseline simulations but have a slightly degraded ensemble mean than for G11.Of the benchmarks, the four-variable in-sample benchmark (i.e., including wind speed) performs best, followed by the physically-based benchmark (Manabe_1T).Other benchmarks, without lacking the critical wind speed information for the momentum flux, perform poorly.No clear pattern is evident by modelling approaches.

Benchmarking evaluation: All metrics
The prior metrics (MAE, MBE, nSD, R and cRMSE) are focussed on central tendency rather than the extremes and/or the distribution.Ability to predict high-impact weather events (historical, current, future climate) makes performance skill for extremes important.We therefore expand the 'common' benchmark scorecard results from the base of Figure 8 with 'extreme' and 'distribution' error metrics in Figure 9. Relative performances (as in Figure 8) for metric groups are provided in the supporting information (Figures S1-S3).
For Q h and Q le , some models outperform in-sample benchmarks (purple), with many able to outperform all out-of-sample benchmarks (green).An exception is that all model cohorts find predicting the 95th percentile for Q h challenging (white or blue) because they overestimate the upper Q h tail.For Q tau , models generally perform very well compared with in-sample empirical benchmarks, although this flux is heavily reliant on instantaneous wind information which is only provided in the most complex benchmark (KM4-IS-SW down -T air -RH-Wind).Most models assessed for Q tau perform similarly to the simple physically-based benchmark (Figure 8).
The ensemble mean time series performs strikingly well across all fluxes, beating in-sample benchmarks for many metrics in Q h , Q le and Q tau , and beating most out-of-sample benchmarks in SW up and LW up .The ability of the ensemble mean to outperform even empirical models trained in-sample suggests the participating models adequately span the range of uncertainty from the parameterisation of processes.DISCUSSION PILPS-Urban established that correctly representing the ratio of impervious to pervious surfaces is of first-order importance for urban model performance (Grimmond et al., 2011), so we do not investigate that here and provide land fraction information in the first 'baseline' experiment (Table 3a).We instead assess the impact of secondary information (e.g., three-dimensional morphology, bulk albedo, anthropogenic heat) in a 'detailed' experiment (Table 3b).We also compare outputs from Urban-PLUMBER (UP) models with those from PILPS-Urban Phase 2 Stage 4 (G11) (Grimmond et al., 2011), which used the same site and observations.Current models show reduced errors across four energy fluxes (Figure 6), with a lower MAE range, and lower mean MAE for both the baseline and detailed experiments compared with G11 (however not reaching statistical significance for LW up ).For cRMSE (Table 6), we find broad improvement for upward short-wave (SW up ), sensible (Q h ) and latent (Q le ) heat fluxes, but little or no improvement in upward long-wave (LW up ) and momentum (Q tau ) fluxes.When assessing performance using four common evaluation metrics (Figure 8), the ensemble mean has clearly improved for Q h and Q le but is little changed or slightly degraded for SW up , LW up and Q tau .
These results suggest the current generation of models is performing better than the G11 models for Q h and Q le .In the last decade considerable community effort has been applied to improve existing and develop new models with particular focus on better resolving vegetation and soil processes after these were found to be highly important for performances in previous intercomparisons (Grimmond et al., 2010(Grimmond et al., , 2011)).This implies the better performance seen here is from model development, but model application (i.e., configuration) may have also improved.Compared with G11, participants' previous experience modelling the site, the provision of more site-specific data (Table 3), the additional rapid automatic feedback identifying human errors, and/or improved spin-up strategy may also have enhanced performance, rather than model parameterisation improvements alone.
The poorer correlation of G11 models compared with current models (Figure 7) indicates time-of-day errors (i.e., human rather than model errors).However, when all models with SW up correlations below 0.99 are excluded from the analysis (i.e., retaining only those with good timing), G11 models still perform poorer for LW up , Q h and Q le compared with the current cohort (Figure S4), implying other factors have led to performance improvements.
Feedback provided to participants prior to final evaluation (Table 5) significantly reduces SW up and LW up errors for some models, and to a lesser degree also reduced Q h and Q le errors (Figure S5).Models that resubmitted more times are generally able to improve their performances relative to their first submission but did not typically outperform those that submitted only once.When models are categorised by previous experience modelling with this site (e.g., via the G11 project), more experienced groups tend to have lower errors, particularly in the initial submissions (Figure S6).This suggests that resubmission helped 'level the playing field' for those with less experience with the application of a model at this site.
Those models that undertook the extended 10-year spin-up tend to have better performance in detailed simulations than those that did not (Figure S7), however, this effect was small.Models with a closed surface energy budget have better performance in radiative fluxes, but not in turbulent fluxes (Figure S8).Fully separating the various influences (better models, more experienced modellers, better spin-up strategy) will require additional investigation to assess their relative impacts.However, in synthesis, urban model performance has improved since the last major urban model intercomparison over a decade ago.
Flux magnitudes and dominant processes vary diurnally, and so separately analysing day and night periods provides additional insight.Compared with standard benchmarking scores (Figure 8), daytime only (Figure S9) and nighttime only (Figure S10) results are presented in the supporting information.For LW up most models' daytime results degrade in comparison with benchmarks, with fewer outperforming the two-variable out-of-sample regression.However, at night, most models beat all out-of-sample benchmarks.The complex model cohort benefits most at night, with all but one complex geometry model beating all out-of-sample benchmarks, and nearly matching the performance of the in-sample benchmarks.Best and Grimmond (2015) found a similar result, concluding that the more complex geometric representations were able to account for nighttime long-wave trapping between urban structures that simpler schemes could not.In this project, some models in every cohort perform well for LW up at night, but the most complex cohort was most consistently improved, and had the best overall detailed experiment results, implying they were better able to use the additional morphology information provided at that later stage.For Q h , Q le , and Q tau most models across all cohorts perform very well at night, typically beating one or both in-sample benchmarks, and nearly every model easily surpassing the out-of-sample empirical benchmarks.For nighttime Q h , some models in the urban canyon and complex categories performed better than all non-urban, one-tile and two-tile models, again implying more complex geometry is beneficial at night.The same consistent benefit from more complex geometry was not apparent in daytime periods, nor in the overall assessments.
The use of benchmarking helps to guide performance expectations (Best et al., 2015).For example, without benchmarking Q le appears to be poorly modelled according to the Taylor plot statistics (Figure 7), as was concluded in the earlier PILPS-Urban study (Grimmond et al., 2010(Grimmond et al., , 2011)).However, the benchmark results show that it is more challenging for models to minimise the Q le errors than, for example, SW up errors (Figure 7).Benchmark assessment of extremes and distribution skill finds Q le to be one of the better modelled fluxes, with models able to outperform the in-sample empirical benchmarks in many instances (Figure 9: purple).Likewise, Q h is well modelled for the common (Figure 8) and other (Figure 9) metrics, compared to benchmarks.Fewer models outperform benchmarks for LW up (Figures 7 and 8), particularly in the daytime (Figure S9).More site information generally degraded the ensemble mean skill in LW up (Figures 6 and  8), except for more complex models at night (Figure S10).The overall poor performance in LW up compared with benchmarks indicates an area for which model development may prove beneficial.This may be difficult, as LW up is dependent on surface temperature, itself a result of the surface energy balance, so is sensitive to errors in all other surface energy fluxes (and related parameterisations).The evaluation of LW up is additionally complicated by the fact that the footprints of radiative observations from a flux tower differ from the footprint of the turbulent fluxes (Schmid et al., 1991;Sailor, 2011), so may be poorly represented by site parameters, which were intended to capture the larger turbulent-flux footprint.
A key PILPS-Urban (Grimmond et al., 2010(Grimmond et al., , 2011) ) finding was that simpler models generally performed as well or better than more complex models.Similarly, we find the 'complex' models (Section 2.1) are often outperformed by one-tile, two-tile and canyon models (Figures 6-9).However, model complexity needs to consider many aspects of urban environments, including morphological, hydrological, vegetation and anthropogenic influences.Some of the most complex 'built' representations have the simplest soil, water and vegetation approaches (Figure 2).The simpler models within a cohort (i.e., left side of each cohort, Figures 8 and 9) often had poorer intracohort results.Thus, the hydrological and vegetation complexity is important.Many simpler PILPS-Urban built schemes benefitted from being coupled to more sophisticated vegetation land surface models, performing well, as they do here.However, the two participating non-urban models (with sophisticated hydrology) significantly overpredict Q le , performing worse than benchmarks and all other model cohorts in this flux (Figures 6-8).This indicates the representation of impervious surfaces is important even at this suburban site.Canyon models have improved compared with earlier evaluations, with some performing here as well as the best one-tile and two-tile schemes.This implies that community efforts in model development over the last decade have paid dividends, particularly the focus on integrating soil hydrology and/or vegetation into canyon models.
In stark contrast with PLUMBER (Best et al., 2015), this project's submissions are often able to outperform all out-of-sample empirical benchmarks for Q h and Q le , and in some cases for the three-or four-variable in-sample benchmarks (Figures 6-9).For common metrics, PLUMBER (Best et al., 2015) found no model able to outperform a single-variable linear regression using SW down for Q h , or a three-variable regression for Q le in applications (over non-urban terrain).An explanation such as Urban-PLUMBER simply having better models than PLUMBER is unlikely as some models (CABLE, CHT-ESSEL) or their vegetation components (NOAH, JULES) participated in both projects.Analysis coding errors are unlikely as we confirmed we could recreate the PLUMBER results using their site, model data and aggregation methods.These results suggest models for urban areas perform better than those for non-urban areas when assessed against the same empirical benchmarks.
Urban sites are highly diverse, and only one case is considered here.The AU-Preston site could be unrepresentative of other urban sites used for training, leading to poorer performing regression using out-of-sample data.This is supported by the fact that the out-of-sample three-variable regression (KM3-SW down -T air -RH) performed poorer than simpler regressions for Q h and Q le (Figures 6-8), indicating overfitting.However, some models outperform the regressions trained in-sample (i.e., using only AU-Preston data), and therefore good model performances are not simply related to the site's (un)representativeness.Alternatively, models may have performed well here because we provide participants with more site-specific information (Table 3) than in PLUMBER.For the latter, participants were provided with a single plant functional type descriptor (e.g., grassland).However, some Urban-PLUMBER models are outperforming benchmarks in the 'baseline' experiment when only minimal surface information is given, so better model performance is not simply from the surface descriptions provided in this project.
Ultimately, models participating in Urban-PLUMBER are performing better against benchmarks than the PLUMBER project land surface models were able to.This implies the complexity of urban surfaces benefits from the more complex modelling techniques used to address urban areas, compared with the natural landscapes evaluated in PLUMBER.A multisite evaluation is required to confirm these initial results (now under way).
In Urban-PLUMBER, we focus on bulk local-scale surface-atmosphere exchanges as these variables and scale act as the lower boundary conditions for weather, climate and air quality modelling.They may also act as the upper boundary conditions for more detailed models used for applications in cities (e.g., pedestrian thermal comfort).Some modellers using the latter type of models declined to participate in this project as their models require more detailed surface information than we could provide, and are more computationally intensive, making long (e.g., 10+ years) simulations unfeasible.Some models are not intended to predict the bulk land-atmosphere exchanges assessed here, but for predicting other details within the urban canopy.Other MIPs have encountered this challenge (bulk vs detail).ESM-SnowMIP (Menard et al., 2021) found comparatively complex models developed for specific purposes, and tested rigorously for their intended use, are outperformed by simpler bulk models when bulk variables are assessed.Thus, intended model use is a key consideration when evaluating performance.
Following ESM-SnowMIP (Menard et al., 2021) and our earlier experience (e.g., Grimmond et al., 2010), that human errors can be widespread in intercomparison projects, we provide rapid automatic checks with feedback to participants, and follow up with manual checks (Table 5).Allowing resubmission where human errors are identified enables this evaluation to focus more closely on intended model performance.Identified human errors included: start times, output labels, variable sign, and forcing interpolation errors.These, plus model source coding mistakes, all impacted initial results.Our initial feedback focussed on SW up to check that forcing and output timing aligned, and we link this to the net improvement in SW up performance seen (cf.G11, Figure 6).Best and Grimmond (2015) previously established that correctly modelling the bulk surface albedo is critical for model performance for all surface energy fluxes.Ensuring albedo is simulated better in this project helps focus evaluation on other aspects of model design, such as the impact of hydrology, vegetation and anthropogenic influences.
While considerable efforts are undertaken to compare models rather than users, the different application of models will impact results, and undoubtedly some human errors remain.Hence, individual model results presented here should be interpreted with caution.We highlight broad patterns, but cannot untangle whether individual model performances are a result of: (a) aspects of model design; (b) user model configuration, or 3) model assessment methods (e.g., variables, metrics, spatial and time scales).A multisite evaluation will provide more certainty for model performances.
Despite the limitations of any model comparison project, they remain one of the foundational elements of climate science (Eyring et al., 2016).Model intercomparisons help define common working practices amongst disparate modelling groups, identify broad strengths and weaknesses of different modelling approaches, build the knowledge and skills of participating scientists and help direct future community efforts to improve the skill and application of models.

CONCLUSIONS
An international group of 45 scientists have evaluated the performance of 30 land surface models at a suburban site in Melbourne, Australia.Participating models vary in the complexity of their built geometry, hydrology and anthropogenic representations.Ten error metrics are used with both physically-based and empirical benchmarks to assess the models performance.

Key study findings
• Compared to the earlier PILPS-Urban model comparison at the same site (Grimmond et al., 2011), there is broad improvement in modelling upward short-wave radiation (SW up ), sensible (Q h ) and latent (Q le ) heat fluxes, but little/no improvement in upward long-wave radiation (LW up ) and momentum (Q tau ) fluxes.
• As in PILPS-Urban, the ensemble mean time series performs very well across all fluxes, suggesting participating models adequately span the range of uncertainty from the parameterisation of processes.
• As in PILPS-Urban, some one and two-tile urban schemes (particularly when coupled to sophisticated soil/vegetation land surface schemes), performed well across all fluxes.
• Some canyon models also perform well, indicating the integration of hydrology and vegetation into canyon models after PILPS-Urban has paid dividends.
• 'Complex' urban models are generally outperformed by others, but their overall performance is likely penalised by having simpler hydrological and vegetation approaches.
• Schemes that do not represent impervious surfaces (i.e., non-urban models), as well as urban models with simplistic hydrology/vegetation performed poorly in Q le , confirming that representing both pervious and impervious surfaces is important in suburban locations.
• Detailed site information broadly improves turbulent heat fluxes but has little impact on daytime radiant fluxes.
• A two-variable out-of-sample regression outperforms most models for daytime LW up , thus indicating an area for which future model development may prove beneficial.
• Many models outperform the non-linear three-variable empirical benchmarks for Q h , with some even beating in-sample non-linear benchmarks (i.e., exceeding expected predictability using contemporaneous information).This is in stark contrast to the PLUMBER (Best et al., 2015) results where no model outperforms simple SW down linear regression derived from 20 non-urban sites for standard statistical metrics.It is not clear from this study if model design, model configuration, spin-up strategy and/or poorer performing benchmarks explain this.
• The empirical benchmarks may be less effective in urban locations because of anthropogenic (human behavioural) influences on fluxes, or non-contemporaneous information (e.g., memory effects of surface heat storage) being more important at urban sites, particularly at night.This implies more complex modelling techniques (i.e., land surface models rather than simple empirical models) may provide greater benefit in urban landscapes.
• Results are based on a site previously used in evaluation studies.When the details of a site are not known and not previously modelled, we should not expect such a high level of performance.
Recommendations and lessons learnt from this project: • We recommend the use of benchmarks when evaluating models to help guide performance expectations.
In this project, simple information-limited models set minimum expectations, while more complex in-sample empirical models helped indicate an upper bound for performance expectations.
• Model evaluations traditionally consider observational and modelling errors caused by parameterisation design decisions but should also explicitly consider errors caused by human factors (communication or coding errors in model or postprocessing code).
• Human errors can be reduced (but probably not eliminated) by providing participants with initial feedback and allowing resubmission prior to final analysis.We recommend the use of web-based analysis portals (e.g., modelevaluation.org)that can provide immediate feedback to participants (plots and error statistics), particularly: -Indicating variables that exceed expected physical limits, as well as checks on energy closure, as this helps identify model numerical errors, or errors in submitted variable's identification, units or sign.
-Correlation of modelled vs observed short-wave radiation, as this flux varies nearly linearly with forcing, helping indicate time-of-day human errors.
• Model configuration files (e.g., parameter namelists) and model revision numbers should be submitted with model outputs to help ascertain why outputs have changed between submissions, and allow submissions to be reproducible.Chow et al., 2014;Chow, 2017) Note: For the out-of-sample benchmarks, data from the AU-Preston site are not used to train the empirical models.For the in-sample benchmarks, only AU-Preston data are used to train the model.Tower data are openly available (Lipson et al., 2022a(Lipson et al., , 2022c)).

Participating model descriptions
For each model (Figures A1-A30) characteristics (Figure 1) are summarised.F I G U R E A1 ASLUMv2.0 (Arizona State University Single-Layer Urban Canopy Model) (Wang et al., 2021(Wang et al., , 2013) is a single-layer canyon model that analytically resolves surface temperatures and conductive heat fluxes based on Green's function (Wang et al., 2011b), and explicitly resolves subfacet heterogeneity and urban vegetation.ASLUMv2 incorporates detailed hydrology, multilayer soil/ground and roof vegetation (via a multilayer green roof model), and enables simulations of irrigation, anthropogenic heat, and urban oasis effect (Yang et al., 2015).

F I G U R E A3
BEPCOL is a multilayer canyon model based on the building effect parameterization (BEP; Martilli et al., 2002) with parameterizations for drag coefficient and the length scales used for turbulent transport and turbulence dissipation (Simón-Moral et al., 2017).BEPCOL does not explicitly consider vegetation and the non-urban fraction is computed by the bare soil model from RAMS (Tremback and Kessler, 1985).
F I G U R E A4 CABLE (Community Atmosphere-Biosphere Land Exchange model) (Kowalczyk et al., 2006;Wang et al., 2011a) is used in regional and global climate models including ACCESS (Kowalczyk et al., 2013).CABLE has a one-layer, two-leaf canopy vegetation scheme with up to five tiles (vegetation types, bare soil and ice) but no urban tile.In this project CABLE uses four soil layers, up to three snow layers, and models impervious urban surfaces as bare soil.
F I G U R E A5 CHTESSEL (Carbon-Hydrology Tiled ECMWF Forecasts Scheme for Surface Exchanges over Land) is the land surface model used in the Integrated Forecast System (IFS).This is used by ECMWF for weather forecast and to create reanalysis products (Balsamo et al., 2009;Boussetta et al., 2013;ECMWF, 2020).F I G U R E A8 CM is a multilayer urban canopy model with roofs (impervious roof and vegetation), walls (impervious wall and vegetation) and roads (impervious road and vegetation) (Kondo and Liu, 1998;Kondo et al., 2005).These impervious tiles consider water content and ponding for the present comparison.CM considers an urban block in which buildings stand on a lattice array.This horizontal arrangement of buildings is defined using the average length of the building and distance between buildings.CM accounts for building drag, anthropogenic heat release (prescribed), and three-dimensional radiation interactions and distribution of the height of buildings.A new parameterization for mixing length is introduced (Kondo et al., 2015).
F I G U R E A9 CM-BEM is CM (Figure A8) coupled to a building energy model (BEM) (Kikegawa et al., 2003(Kikegawa et al., , 2006)).It is embedded within WRF (WRF-CM-BEM: Kikegawa et al., 2014).It has three urban categories: office, and two residential types.The BEM, box-type heat budget model, simulates heating, ventilation, and air conditioning (HVAC) system energy consumption and the resulting anthropogenic heat release including sensible and latent heat components.BEM can consider whether the outdoor units are air-cooled or water-cooled.CM-BEM can integrate social big data such as real-time population and estimate the impacts of human behaviour changes on urban temperature and energy consumption (Takane et al., 2022).

F I G U R E A10
JULES_1T is a one-tile urban scheme (Best, 2005) within the Joint UK Land Environment Simulator (JULES) (Best et al., 2011).JULES is a community land surface model forming part of the UK Met Office's Unified Model (UM).JULES has nine non-urban surface tiles (five vegetation types, inland water, bare soil and ice).Tiles can be covered with up to three snow layers.Soil hydro-thermodynamics are modelled through four soil layers, with the top layer coupled to the urban tile through long-wave radiation only.

F I G U R E A11
JULES_2T is a two-urban tile (roof, canyon) (Best et al., 2006) version of JULES_1T (Figure A10).
F I G U R E A12 JULES_MORUSES, has the same two urban tiles (roof, canyon) as JULES_2T (Figure A11) but uses MORUSES (Porson et al., 2010).MORUSES has only two facets (roof and canyon) but includes a parameterisation for canyon radiative effects which updates canyon bulk values such as albedo, emissivity, heat capacity and aerodynamic resistance.This provides benefits of canyon schemes with a computational efficiency gain important for use in operational numerical weather prediction.
F I G U R E A13 K-UCMv1 (Klimaat Urban Canopy Model v1) solves the local surface energy balance based on land cover input and regional meteorological forcing.To calculate conduction, the urban facets (ground, roof, walls) have 10 layers.Effects of low vegetation are accounted for in the surface energy balance without shadowing effects.Vegetation is perfectly watered and non-vegetation surfaces store no water.
F I G U R E A14 Lodz-SUEB (Łód ź SUrface Energy Balance model) (Fortuniak, 2003) is a bulk scheme that aggregates urban and natural surfaces parameters based on surface fraction.The ground heat flux assumes a 12-layer slab.This is a single snow layer and any surface liquid water or snow, if present, assumed to completely cover the slab.Moisture content on the slab is constrained between site-specific limits, with excess leading to runoff and supply from deeper layers during dry periods.
F I G U R E A15 Manabe_1T uses a one-tile urban scheme (Best, 2005) with a simple Manabe bucket model for non-urban fractions (Manabe, 1969).A Manabe bucket model has no heat conduction into the soil and accumulates precipitation until it freely evaporates or exceeds storage capacity as runoff (Pitman, 2003).We use this 'slab and bucket' scheme as a physically-based benchmark.

F I G U R E A16 Manabe_2T only differs from Manabe_1T
(Figure A15) in that it uses a two-tile urban scheme (roof, canyon) (Best et al., 2006).
F I G U R E A17 MUSE (Microscale Urban Surface Energy) (Lee and Lee, 2020) is a building-resolving microscale urban surface model for real urban meteorological and environmental applications.It represents urban buildings on a three-dimensional Cartesian grid and solves urban physical processes of short-wave and long-wave radiative transfer, turbulent exchanges of momentum and heat, and conductive heat transfer into urban subsurfaces.The effect of urban vegetation is parameterized based on a simple Bowen ratio method in calculating the radiative and turbulent sensible/latent heat fluxes.
F I G U R E A18 NOAH-SLAB uses a bulk one-tile urban scheme (Liu et al., 2006) with the Noah land surface model (Noah-LSM) (Chen and Dudhia, 2001).Separate urban and non-urban energy and water balances are simulated, then weighted by surface fraction.Although Noah-LSM has up to 27 land-use tiles (including urban), here the urban and one dominant non-urban land-uses are used.For the non-urban surface, parameters (e.g., albedo, roughness length) are set to urban values provided, allowing evaporation from vegetation for an otherwise urban surface.
F I G U R E A19 NOAH-SLUCM uses Noah-LSM as in NOAH-SLAB (Figure A18) but uses the Single-Layer Urban Canopy Model (Kusaka et al., 2001;Chen et al., 2011) rather than the urban slab scheme.SLUCM separates the urban tile into three facets (roof, road, wall) using a two-dimensional canyon approach but without street orientation or varying building heights.A diurnally varying anthropogenic heat flux is prescribed.
F I G U R E A20 SNUUCM (Seoul National University Urban Canopy Model), a single-layer model, parameterises short-wave and long-wave radiation absorption and reflection, energy and moisture turbulent exchanges between surfaces and adjacent air, and conductive heat transfer through sublayers (Ryu et al., 2011).It calculates canyon wind speed using regression equations based on CFD model simulations.Here the non-urban area fluxes are simulated by the Noah land surface model v3.4.1 (Chen and Dudhia, 2001).
F I G U R E A21 SUEWS (Surface Urban Energy and Water Balance Scheme) (Järvi et al., 2011;Ward et al., 2016) has two impervious (buildings, paved areas) and five pervious (evergreen trees and shrubs, deciduous trees and shrubs, grass, bare soil and water) surface types, underneath which is a single vertical layer for soil model with lateral flow between surfaces (except water tile).Storage heat fluxes can be calculated using empirical relations with net all wave radiation (Grimmond et al., 1991) while the latent heat flux is calculated as the integrated resistance network of all the surfaces.There is one snow layer but with clearance activities between surfaces.Anthropogenic heat emissions, irrigation-related fluxes and snow-clearing are either modelled with empirical relations or prescribed with observed values.It has a dynamic leaf area index model to allow phenology to change through the year and between years.

F I G U R E A25
TEB-SPARTCS is builds on TEB-CNRM (Figure A23) by incorporating the SPARTACUS-Urban radiative transfer within the urban canopy using a discrete ordinate method (Hogan, 2019a(Hogan, , 2019b)), which assumes an exponential distribution of wall-to-wall distances and allows varying building heights (Hogan, 2019a).In this project the buildings all have the same height and tree height is limited to building height.

F I G U R E A26
TERRA_4.11 uses the bulk urban scheme TERRA_URB (Wouters et al., 2015(Wouters et al., , 2016) ) with TERRA-ML (Schulz et al., 2016) for non-urban surfaces that are characterized by eight soil layers and a one-layer snow scheme.As the latter does not account for urban features (e.g., urban pollution, snow removal or a change in effective albedo due to snow-free walls, roads), the urban fraction is considered as completely snow-covered in the presence of snowfall.TERRA_URB uses the Semi-empirical URban canopy dependencY algorithm (SURY) to condense the three-dimensional urban canopy information to a limited number of bulk properties (Wouters et al., 2016;Varentsov et al., 2020).Aggregated diurnal and seasonal anthropogenic heat fluxes (traffic, industry, etc. combined) are prescribed by equations proposed in Flanner (2009).The TERRA version used here is a standalone version, which differs from the official TERRA version embedded in the recent COSMO(-CLM) version (Rockel et al., 2008;Garbero et al., 2021) (Rockel et al., 2008;Garbero et al., 2021).Where possible, features from the online version are approximated using the same underlying data sources.For more information on this submission see https://github.com/matthiasdemuzere/urban-plumber-terra-pub.

F I G U R E A27
UCLEM (Urban Climate and Energy Model), with integrated street vegetation and building energy/waste heat (Lipson et al., 2018;Thatcher and Hurley, 2012), is used in the stretched grid global climate model CCAM (McGregor and Dix, 2008).The four urban facets (roof, road, two walls) have four layers/ five nodes for heat conduction (Lipson et al., 2017), with single-layer snow on a fraction of roof and road surfaces.Low (grass and shrub) canyon and roof vegetation use a reduced set of prognostic variables with a simple bucket hydrology.Irrigation is assumed to occur when soil moisture approaches wilting point.

F I G U R E A28
UT&C: (Urban Tethys-Chloris) (Meili et al., 2020) combines an urban canyon approach with ecohydrological principles of Tethys-Chloris (Fatichi et al., 2012).Vegetation can occur on roofs and within the canyon (i.e., ground vegetation and/or street trees).Separate soil columns occur below impervious, bare and vegetated ground facets.Transpiration is modelled as a function of plant photosynthetic activity and environmental conditions (Meili et al., 2021).Irrigation can be prescribed at the soil surface or through preserving soil moisture in deep soil layers.Wall facets are split into upper and lower parts to partition their contribution to near surface heat fluxes.Snow and water bodies are currently not modelled in UT&C v1.0.

F I G U R E A29
VTUF-3D (Vegetated Temperatures of Urban Facets in 3D) (Nice et al., 2018) resolves energy transfers in three dimensions by combining TUF-3D (Krayenhoff and Voogt, 2007) with the MAESPA tree model (Duursma and Medlyn, 2012).The vegetation shading and physiological processes are directly integrated with building and urban effects, allowing the role vegetation and water to be assessed in human thermal comfort in urban areas.

F I G U R E A30
VUCM (Vegetated Urban Canopy Model) (Lee and Park, 2008;Lee, 2011;Lee et al., 2016) is the first mesoscale urban canopy model that parametrizes radiative/dynamic/thermodynamic/hydrological processes of urban vegetated area (tree, grass, soil) interactively with urban artificial surfaces (roof, wall, road), which has been developed based on an integrated framework of a two-dimensional single canyon and a new single tree canopy concept.

F
I G U R E 3 Study area, AU-Preston: (a) location within Melbourne, Australia with the extent of the ERA5 (Hersbach et al., 2020) grid cell used for gap-filling observations (red rectangle) (image: © OpenStreetMap contributors); and (b) aerial imagery around the flux tower site (yellow cross with circle of 500 m radius) (image: © State of Victoria [Department of Environment, Land, Water and Planning]).

F
I G U R E 5 Example benchmark diagrams.(a) Manabe_1T, (b) REG1-SW down , (c) REG2-SW down -T air , (d) KM3-SW down -T air -RH with cluster centroids (black crosses), and (e-h) predictions using corresponding (a-d) benchmarks.A two-week period of upward long-wave radiation (LW up ) is used as an example of benchmark output (coloured lines, with observations in black).Error statistics are calculated over all available periods, for mean absolute error (MAE), mean bias error (MBE), normalised standard deviation (nSD) and the coefficient of determination (r 2 ), all defined in Table

F
I G U R E 6 Mean absolute error (MAE) boxplot results.Individual models (dots) are split into cohorts (Section 2.1) with benchmarks (horizontal lines) for four fluxes: upward short-wave radiation (SW up ), upward long-wave radiation (LW up ), sensible heat flux (Q h ) and latent heat flux (Q le ).Boxplots are shown for model experiments (left to right in each column): (1) PILPS-Urban Phase 2 Stage 4 (G11: Grimmond et al., 2011; grey); (2) Urban-PLUMBER baseline (blue); and (3) Urban-PLUMBER detailed (orange).The level of information in G11 (the final stage of PILPS-Urban) is comparable to the detailed experiment in this project.Boxes show the 25th and 75th percentile, median (horizontal line), full range (whiskers) and mean (open circle).

F
I G U R E 8 Benchmarking assessment for 'common' set of metrics (MAE, MBE, nSD, R) showing benchmarks (coloured lines) and models (black markers).Lower scores are better (Equation 1 and Equation 2).Model columns include up to three markers for different experiment submissions (1, PILPS-Urban Phase 2 Stage 4 [G11]; 2, baseline; and 3, detailed).Colours at the base of each column indicate the benchmarks a model outperforms per experiment (colour, lower legend).Models are ordered by nominally increasing complexity (Figure 2).For Q tau , grey indicates no submission.

F
I G U R E 9 All metric benchmarking results.Models are ordered nominally by increasing complexity (Figure 2) showing benchmark scorecard of individual and aggregated metrics based on ability to outperform benchmarks (colour) for both the baseline (left) and detailed (right) simulations.Note a single box indicates only one submission made.For Q tau , grey indicates no submission.TA B L E 6 Number of models outperforming the cRMSE benchmarks for Urban-PLUMBER detailed simulations (UP) and PILPS-Urban Phase 2 Stage 4 (G11)(Grimmond et al., 2011) CHTESSEL can tile up to six non-urban surfaces (high and low vegetation, bare soil, intercepted canopy water, shaded and sunlit snow).It has four soil and one snow layer.Tile fractions in this project are based on global surface cover databases as used in IFS products, i.e., without urban surfaces.F I G U R E A6 CHTESSEL_U (Urbanised Carbon-Hydrology Tiled ECMWF Forecasts Scheme for Surface Exchanges over Land) a two-tile (roof, canyon) urban scheme (McNorton et al., 2021) to CHTESSEL (Figure A5).It follows MORUSES' (Figure A13) infinite canyon assumptions for radiative effects.The urban surfaces (cf.CHTESSEL) have increased runoff and reduced soil infiltration.F I G U R E A7The CLMU (Community Land Model Urban) is a single-layer urban canopy model that consists of roofs, walls (sunlit and shaded) and impervious and pervious canyon floor(Oleson et al., 2010; Oleson and Feddema, 2020).It features a simple building energy model to explicitly represent space heating and air conditioning of building interiors and a pervious canyon floor to approximate evaporation from vegetated surfaces within the urban canyon.Radiation parameterizations account for trapping of short-wave and long-wave radiation inside the urban canyon.Roof and canyon floor hydrologic processes including snow accumulation and melt, liquid water ponding, and runoff are simulated.CLMU is embedded within a global climate model, the Community Earth System Model (CESM), incorporating three urban tiles or land units, categorized by density of development, within each model grid cell.Urban extent and thermal, radiative, and morphological properties are prescribed from the global dataset ofJackson et al. (2010) as modified byOleson and Feddema (2020).
, and ERA5 does not use a model with urban climate effects

TA B L E 4 Observational data description and availability Variable Description Units Positive All Day Night DJF MAM JJA SON
Submission error checks.Feedback was provided to participants, who were able to resubmit prior to final analysis ManySWnet (average) Simulated midday albedo: Midday albedo provided for detailed experiment Some Anthropogenic flux Simulated anthropogenic flux: Expect mean magnitude provided for detailed experiment Some Urban) and all those involved in model development since then.We also acknowledge all scientists involved in collecting and providing observations used to derive benchmarks in this study.Thank you to Belinda Roux, Asiful Islam and Vinod Kumar (Bureau of Meteorology) for reviewing this manuscript.This project uses modified Copernicus Climate Change Service Information.Open access publishing facilitated by University of New South Wales, as part of the Wiley -University of New South Wales agreement via the Council of Australian University Librarians.Site locations of tower observational data used to generate the empirical benchmarks for this study