Surrogate modelling of solar radiation potential for the design of PV module layout on entire façade of tall buildings

This research investigated the performance of a surrogate modeling approach for the simulation of solar radiation potential on the vertical surfaces of tall buildings. Surrogate modeling is used to approximate the input–output behavior of an existing simulation model. The Random Forest (RF) machine learning approach was used to investigate three different scenarios, namely (1) Random variation, (2) Grid variation, and (3) Uniform variation, and a Genetic Algorithm was used to optimize the hyperparameters. A case study was performed to investigate the performance of the surrogate models using a building in the Sir George William (SGW) campus of Concordia University in downtown Montreal, Canada. The results suggest that even with only a small sample of random solutions, surrogate modeling can achieve up to 94% accuracy in the prediction of solar radiation potential. Of the three scenarios, the best accuracy was obtained with the Random variation method. In short, solar radiation simulation is very complex and highly sensitive to the location and shadow effect. Therefore, these factors cannot be simplified when approximating the solar radiation potential. In addition, the RF surrogate was 16 times faster than the existing simulation model. © 2023 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
Buildings are responsible for the annual consumption of more than 30% of the total energy used across developed countries [1,2]. A large portion of this energy is still fossil-fuel-based, which is inevitably limited and depleting. Furthermore, worldwide energy consumption records show that the demand for electrical energy will continue to increase [3]. In recent years, researchers and building planners have begun to focus on creating decentralized energy generation where each building can (partly) supply its own energy [4]. Photovoltaic (PV) solar energy is one of the best sources of clean energy used in the built environment to substitute fossil-fuel-based energy and can be used to supply local energy [5]. Solar panels, also known as PV panels, are commonly installed on the building's rooftop or on other horizontal or tilted surfaces. Nevertheless, the current PV panel market allows installation on various building surfaces. For instance, building-integrated photovoltaics (BIPV) enables solar energy harvesting from buildings' façades, as shown in Fig. 1. BIPVs can reduce the overall material cost because they serve multiple functionalities [6,7]. This means vertical surfaces, too, have the opportunity to produce a high amount of energy if panels are strategically installed [8]. It was shown that PV panel installation on the vertical surfaces of tall buildings is promising [9,10]. The effective use of vertical surfaces for harvesting clean energy on a tall building is essential because it is shown that the average height of buildings in an area has a negative impact on energy consumption due to increases in population density [11,12]. A tall building in the context of this study refers to a building with ample unshaded vertical surface area suitable for the economically feasible installation of solar panels.
Reiter [13] and Trlicik [14] suggested that from an urban design perspective, a tall building should have at least double the height of the average surrounding buildings.
Despite the potential, vertical surfaces of tall buildings are seldom leveraged for harvesting solar energy. This is partly because tall buildings are usually located in dense residential or commercial areas, where many other buildings surround them. Because of the shadow effect of buildings on one another, vertical surfaces receive considerably less solar radiation than horizontal surfaces. Consequently, the efficiency of vertical solar panels depends heavily on how their layout is designed [15], and a detailed solar simulation of the building surfaces that considers the surroundings is required [16]. The PV layout is typically determined by multiple factors such as the location, sizes, and orientation of panels [17].
There are many different approaches for the simulation and assessment of solar radiation potentials of urban surfaces [18][19][20][21]. Conventionally, physics-based numerical models, e.g., the Atmospheric and Topographic Model (ATM), were used to assess the solar potential of surfaces in an urban environment [22]. But these earlier methods were only applicable on large scales and could not support detailed surface-level analysis. With the rising popularity of Geographical Information Systems (GIS), solar radiation analysis can now be done on a more granular level using 2D models developed based on Digital Elevation Model (DEM), Light Detection and Ranging (LiDAR), and photogrammetric approaches [23][24][25]. However, GIS-based methods of solar radiation analysis are normally limited to horizontal surfaces or flat rooftops because the available models lack a sufficient level of 3D details [26,27].
This situation has changed in recent years with the advent of 3D modeling methods [28]. Nowadays, software packages like RADIANCE can perform solar analysis on volumetric 3D models and curved geometries using accurate ray-tracing algorithms [29]. Although these tools and methods use 3D models, they still take an indiscriminate approach toward different surfaces. They are, therefore, not applicable to cases where the PV modules under consideration can only be used on specific types of surfaces. This limitation is a significant factor, especially for installing PV modules on vertical surfaces, because the diversity of vertical surfaces (e.g., windows, walls, curtain walls, balconies) requires surface-specific simulation.
With the improvement of PV module technology and supporting simulation tools, researchers started to consider the vertical surfaces of buildings in urban areas for solar radiation analysis [27,30,31]. In recent years, with the rising popularity and availability of Building Information Models (BIM), it has become possible to leverage the semantic data embedded in the 3D model of buildings to perform surface-specific solar simulation of building surfaces. Several recent studies have demonstrated this possibility [10,32,33]. It is shown that these models can be integrated with meta-heuristic optimization methods, e.g., Genetic Algorithm (GA) [34,35], to find the best layout design, a methodology commonly referred to as generative design [36,37]. While proven effective, the meta-heuristic models are sensitive to the number of evaluated solutions, i.e., the higher the number of solutions, the greater the chance of optimality. This makes the simulation platform computationally expensive. This is especially important because PV layout optimization is only one of many criteria that need to be considered for the optimal design of building façades. Other criteria, such as insolation and aesthetics, also need to be considered. Therefore, in the general practice of building design, a computationally very expensive pipeline for the PV layout design may push designers to forgo the analytical approach to façade design and resort to heuristic-based methods.
One potential approach to address this problem is to use surrogate or meta-models that can substitute the computationally expensive simulation model, as indicated in Fig. 2 [38]. These models can mimic the behavior of the original simulation model based on a statistical approximation of the original model. A widely applied method for developing a meta-model is to use Machine Learning (ML). ML models can learn from a large volume of training data and identify relationships and patterns between a set of dependent and independent variables, i.e., inputs and outputs of the simulation model [39]. Surrogate models are shown to be very useful for reducing the computational intensity of simulation models [40]. Bornatico et al. [41] applied surrogate modeling to optimize PV systems and demonstrated that the computation time could be reduced by a factor of 150 compared with physics-based simulation. Similar results were reported by Perera et al. [42]. However, these studies only considered the optimization of the PV system from the mechanical perspective and not the layout design. Xu et al. [43] also showed promising results in applying surrogate models for the building façade design. However, only the impact of using a default BIPV system on the cost of the overall design was considered and not the detailed layout design.
To the best knowledge of the authors, the surrogate modeling approach for the design of solar panel layout on the building façades has not been considered before.

Research objective and scope
On the premise of the above research gap, this study aims to investigate the feasibility and effectiveness, i.e., in terms of prediction accuracy, of using a meta-model to assess the solar radiation potential of building façade. To this end, first, a framework will be developed to apply the concept of surrogate modeling for the approximation of the PV layout parametric model developed previously by Salimzadeh et al. [10]. Different approaches for the development of these surrogate models will be explored. Finally, the performances of these surrogate models are assessed through comparison with the results of the parametric model. This newly developed framework is expected to provide the building façade designers with an insight into how data-driven metamodeling techniques can help incorporate a better analytical approach to building design. It should be highlighted that this research focuses only on developing the surrogate model for the vertical surfaces of the building, and thus the rooftop PV modules are not considered. Nevertheless, it is expected that the same method can be applied to the rooftop as well.
The remainder of this paper is organized as follows. Section 2 presents the proposed framework and various metamodeling approaches. Section 3 elaborates on implementing the proposed method in a case study, followed in Section 4 by the corresponding results. Finally, the conclusions, limitations, and future work are presented in Section 5.

Proposed method
This section presents the overall method applied in this research to develop the surrogate model of solar radiation potential. It is worth noting that the surrogate model is only concerned with building substitutes for the physics-based solar radiation simulator. As in the case of solar radiation simulation engines, the technical feasibility of the simulated solution is the responsibility of the users who want to assess a layout design. The simulator engine only applies physics-based modeling of solar radiation potential. The developed surrogate model also transfers that responsibility to the user of the model. Fig. 3 shows the overview of this method. Overall, the proposed method consists of three phases. The first phase is allocated to building the dataset that will be used to develop the machine learning model. In this phase, several datasets will be generated based on different strategies for approximating solar panel behavior. Next, in Phase 2, an ML method is used to develop the surrogate model for multiple datasets. Then, the performance of the developed surrogate model is evaluated through comparison with the physics-based simulation model.

Building datasets
As shown in Fig. 3, the first phase is dedicated to building the dataset used for the training and validation of the surrogate model. To this end, a previously developed parametric model of the façade PV module is used [10]. For completeness, a brief overview of this parametric model is given below. Fig. 4 shows a schematic representation of how the PV module layout parametric model functions. Fig. 4(a) shows that the potential candidate locations for installing PV modules are first selected on the external surfaces. In this step, only feasible surfaces are considered. Therefore, if there are elements on a specific surface, e.g., a mechanical unit, that hamper the installation of the PV module, the surface is excluded. Next, the parametric model allows the user to specify the geometric specification of the PV module in terms of width (W_i), length (L_i), and tilt angle (θ_i), as shown in Fig. 4(b). Finally, the user-specified layout is determined in terms of panels that are installed on the desired location (P_i), as shown in Fig. 4(c).

Data structure for machine learning
In preparing the data for ML development, it is vital to identify the relevant features that will be used for the training of the ML model. The selection of features determines the parameters deemed suitable for predicting the PV module radiation potential. In this research, the following features are considered: (1) location of the PV module: the location of the PV module, as shown in Fig. 4(a), is an important factor in determining the radiation potential. The location of the panel captures the impact of geographical location, building orientation, as well as the morphology of the surroundings of the building. The location can be expressed in terms of the Cartesian coordinates of the center point of the installation (i.e., x, y, z) or using ordinal data that represents the index of the installation point, e.g., point "1" represents (x = 10, y = 15, z = 25). Both possibilities will be evaluated in this research; (2) size of the panel: the size of the panel, i.e., W and L in Fig. 4(b), has an impact on the amount of solar irradiation; (3) orientation of the PV module: this represents the tilt angle of the PV module in terms of θ; (4) size and orientation of panels above the PV module: the size and orientations of panels above any given PV module have an impact on the radiation potential of the PV module because of the shading effect they have on the panel. However, due to the impact of building orientation on the incident angle of radiation, it is not only the panel immediately above the studied PV module that may cast shade but also panels further away on the left and right, as shown in Fig. 5. Therefore, this study considers the size and orientation of the panels above, top left, and top right of the PV module. It should be highlighted that depending on the geographic location and orientation of the building, even panels further away might have a shading effect.
However, in this research, it is assumed that the immediately adjacent panels capture the majority of the panel-induced shading effect. Therefore, the impact of panels further away can be ignored. If any of these locations does not have a PV module, W, L, and θ are set to zero, as shown in Fig. 4(c).
As for the labels (or output variables) of the machine learning model, the Annual Solar Radiation (ASR) can be used in terms of MWh. Ultimately, each data point in the dataset has a structure shown in Table 1. It should be highlighted that if the panel size is considered the same for all the panels, the pertinent variables can be ignored for the development of the machine learning model.
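The feature and label structure described above can be sketched as a small record type. This is only an illustration: the class and field names (`Panel`, `DataPoint`, `asr_mwh`) are hypothetical and not taken from Table 1, and the sample values simply reuse the example coordinates given in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Panel:
    """Geometry of one PV module; all zeros mean no module is installed."""
    width: float = 0.0   # W (m)
    length: float = 0.0  # L (m)
    tilt: float = 0.0    # tilt angle theta (degrees)


@dataclass
class DataPoint:
    """One row of the dataset: location, target panel, three upper neighbours."""
    x: float
    y: float
    z: float
    target: Panel
    top: Panel = field(default_factory=Panel)
    top_left: Panel = field(default_factory=Panel)
    top_right: Panel = field(default_factory=Panel)
    asr_mwh: Optional[float] = None  # label: Annual Solar Radiation (MWh)

    def features(self, fixed_size: bool = False) -> List[float]:
        """Flatten to a feature vector; drop W and L when all panels share one size."""
        row = [self.x, self.y, self.z]
        for p in (self.target, self.top, self.top_left, self.top_right):
            if not fixed_size:
                row += [p.width, p.length]
            row.append(p.tilt)
        return row


# Hypothetical example: a target module with a single neighbour directly above.
sample = DataPoint(10, 15, 25, Panel(1.5, 2.0, 30.0), top=Panel(1.5, 2.0, 45.0))
```

With `fixed_size=True`, the vector keeps only the location and the four tilt angles, which mirrors the remark that size variables can be ignored when all panels have the same size.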

Development of different scenarios
The common practice in the development of surrogate models is to generate a large number of random solutions that are well-balanced and distributed within the design space and use them as the training dataset. However, in the context of this research, a number of other scenarios can be envisioned to develop the ML model, mainly to simplify the dataset development process. In total, three different scenarios are considered in this research, as shown in Fig. 6. In Scenario 1, as shown in Fig. 6(a), a number of entirely random solutions are generated using the parametric model explained above. It should be noted that random layout solutions are only built on technically feasible locations, as shown in Fig. 4(a). Therefore, this scenario places a panel on each of the feasible locations and assigns a random tilt angle. This scenario is called Random Variation, and the same data structure presented in Table 1 is used to build the dataset. In Scenario 2, as shown in Fig. 6(b), the design space is discretized into larger cells. It is assumed that the behavior of PV modules in that portion of the façade can be represented by the four-panel layout shown in Fig. 5. To this end, first, different cells are formed on the façade of the building. The size of each cell is a function of the size of PV modules, the geometry of the façade, and the morphology of the surrounding buildings. As will be shown later, the smaller the size of the cell, the more accurate the model. But this comes at the cost of an increased number of total cells and thus increased computational intensity. Then, PV modules are placed at the center of the cell as well as above, top left, and top right of the cell. Next, each panel is tilted from 0° to 90° in increments of 10°. This procedure creates a combinatorial set consisting of 10,000 variations (i.e., 10 possible orientations for 4 panels generate 10^4 variations for the layout). This scenario is called Grid Variation.
The structure presented in Table 1 is slightly modified for this scenario, in the sense that the grid label replaces the point label, and the x, y, and z of the location are replaced with those of the center of the cell.
This adjustment has a consequence in the Estimation phase and in how the validation dataset needs to be prepared for this scenario, as will be explained in Section 2.1.4. It should be noted that since this grid is an approximation of the behavior of the modules on the façade, the typical constraints, such as mechanical units, are disregarded in the placement of the representative modules in the grid. For instance, PV modules are placed in the center of the window frame in the schematic Fig. 6(b). This has no impact on the feasibility of the final solution because only the feasible PV modules belonging to each cell will be considered during the Estimation phase. It is hypothesized that the size of the cell has a negative impact on the accuracy of the surrogate model, meaning the larger the cell size, the lower the accuracy. This hypothesis will be tested in the case study by comparing the performances of ML models for three different sizes of cells.
In Scenario 3, shown in Fig. 6(c), it is assumed that the relationship between the tilt angles of top panels and the ASR of the target PV module is relatively linear and, therefore, can be approximated by linear functions. Thus, all panels are tilted in a uniform manner, meaning that all panels always have the same angle. In this scenario, all panels are tilted from 0° to 90° in increments of 5°. This scenario generates 19 different variations for the layout.
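The three scenarios differ only in how tilt angles are enumerated, which can be sketched as follows. The function names are hypothetical, and drawing continuous uniform angles for the Random Variation scenario is an assumption; the text does not state how the random angles are sampled.

```python
import itertools
import random


def grid_variations(step=10):
    """Grid Variation: every tilt combination for the 4-panel cell layout."""
    angles = range(0, 91, step)  # 0, 10, ..., 90 -> 10 values
    return list(itertools.product(angles, repeat=4))


def uniform_variations(n_locations, step=5):
    """Uniform Variation: all panels share one tilt angle per layout."""
    return [[angle] * n_locations for angle in range(0, 91, step)]


def random_layout(n_locations, rng):
    """Random Variation: one independent tilt per feasible location.

    Assumption: angles are drawn uniformly from [0, 90] degrees.
    """
    return [rng.uniform(0.0, 90.0) for _ in range(n_locations)]


grid = grid_variations()
uniform = uniform_variations(n_locations=4)
layout = random_layout(n_locations=8, rng=random.Random(0))
```

The counts follow directly from the enumeration: 10 angles over 4 panels give 10^4 grid variations per cell, while a 5° step over 0° to 90° gives 19 uniform layouts.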

Splitting the datasets
Once datasets of different scenarios are built, they need to be split into training and validation datasets. While the training dataset is developed for each scenario, the validation dataset only contains solutions from Scenario 1, i.e., the Random Variation scenario. This is because regardless of how the ML model is developed, it is intended to approximate the solar radiation of any possible layout. Therefore, the model must be tested in cases where PV modules can be installed at any given location with any configuration.
However, some adjustments are required to prepare the validation dataset for the Grid Variation scenario, as shown in Fig. 7. First, an extra feature needs to be added for each data point to represent the cell to which each panel belongs. This can be done through a simple point-in-polygon algorithm. Once the hosting cell of each panel is identified, the x, y, z of the PV module will be changed to the x, y, z of the cell. This is because the model in this scenario is built based on the cell coordinates.
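The cell assignment step can be sketched with a standard ray-casting point-in-polygon test. This sketch assumes that panels and cells are expressed in a 2D coordinate system on the façade plane (u, v) and that the cell center is the mean of its corner points; both are simplifications introduced here, and the cell labels are hypothetical.

```python
def point_in_polygon(pt, polygon):
    """Ray-casting test: True if pt lies inside polygon (list of (u, v) corners)."""
    u, v = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        u1, v1 = polygon[i]
        u2, v2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (v1 > v) != (v2 > v):
            u_cross = u1 + (v - v1) * (u2 - u1) / (v2 - v1)
            if u < u_cross:
                inside = not inside
    return inside


def assign_cell(pt, cells):
    """Return (cell_label, cell_center) of the first cell containing pt."""
    for label, poly in cells.items():
        if point_in_polygon(pt, poly):
            cu = sum(p[0] for p in poly) / len(poly)
            cv = sum(p[1] for p in poly) / len(poly)
            return label, (cu, cv)
    return None, None


# Two hypothetical adjacent rectangular cells on one façade.
cells = {
    "A": [(0, 0), (4, 0), (4, 3), (0, 3)],
    "B": [(4, 0), (8, 0), (8, 3), (4, 3)],
}
```

For a validation data point, the returned label becomes the extra feature and the returned center replaces the panel's own coordinates before prediction.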

Development of surrogate model
The main steps of the proposed GA-based surrogate modeling method are shown in Fig. 3. This research proposes the use of Random Forest (RF) as the ML method. Although the RF algorithm is well established [44], a brief overview is presented for completeness [45]. This method is selected mainly due to its demonstrated superiority in handling multi-dimensional and imbalanced datasets [46][47][48]. Besides, RF offers a way to assess the relative importance of variables, select the most important ones, and reduce dimensionality. Moreover, RF has few parameters and is computationally lighter than other machine learning methods, i.e., RF can be computed with only the number of decision trees (n_tree) and the number of input variables (m_variables) specified [49]. RF combines several individual decision trees to form the so-called forest. Every tree {h(x, Θ_T), T = 1, 2, ...} is grown using the training set and the value of an independently sampled random vector Θ_T, which is identically distributed across all trees in the forest. The training data subset for each tree is created through a procedure called bootstrapping. Bootstrapping creates training data by randomly resampling the original dataset with replacement, i.e., without removing the selected data from the input sample. This process makes the model more robust to slight variations in the input data. Therefore, better prediction stability and higher accuracy can be achieved.
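As a minimal illustration of the bootstrapping step, the following sketch resamples a dataset with replacement so that each tree would receive a training subset of the same size as the original. The function name and toy data are hypothetical.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement; some rows repeat, some are left out."""
    n = len(rows)
    return [rows[rng.randrange(n)] for _ in range(n)]


rows = list(range(10))  # stand-in for ten training data points
sample = bootstrap_sample(rows, random.Random(0))
```

Because selected rows are not removed from the pool, each tree sees a slightly different view of the data, which is the source of the robustness mentioned above.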
When the target variable is continuous, RF uses the sum of errors or weighted variance as the criterion for branching the tree at a node [50,51]. For each selected feature, the possible split points are evaluated to find the one with minimum variance. For each possible split, the variance of each child node is calculated individually. Then, the variance of the split is computed as the weighted average variance of the child nodes. The split with the lowest variance is selected as the best split [52]. At each step of the tree growth, the child nodes are checked to determine whether they are a "leaf", i.e., the end of a tree branch. A node is considered a leaf when no candidate split improves on leaving the node unsplit. If it is still possible to split the node into two new nodes, it is not a leaf yet. This procedure is repeated until there are no more unbranched nodes and no more features left. The procedure of growing a regression tree is repeated for T trees in the forest. After all trees are generated, the final prediction of RF is the average of the predictions from each tree [44].
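The weighted-variance split criterion can be sketched for a single feature as follows. This is a simplified illustration rather than the implementation used in the study: it uses midpoints between sorted feature values as candidate thresholds, which is a common convention assumed here, and the toy data are hypothetical.

```python
def variance(values):
    """Population variance; zero for an empty list."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(xs, ys):
    """Return (threshold, weighted_variance) of the best split on one feature.

    The threshold is None when no split beats the unsplit node (a leaf).
    """
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best_threshold, best_score = None, variance(ys)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * variance(left) + len(right) * variance(right)) / n
        if score < best_score:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_threshold, best_score


# Toy data: tilt angle as the feature, a two-level response as the target.
tilts = [0, 10, 20, 60, 70, 80]
target = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]
split = best_split(tilts, target)
```

In this toy case the split between 20° and 60° separates the two response levels perfectly, so the weighted variance of the children drops to zero.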

Random forest hyperparameters
Based on the aforementioned development process of RF, a general structure of the RF can be represented as in Fig. 8, where several model configurations, i.e., hyperparameters, can be identified, as listed in Table 2.
Because the performance of the RF is subject to the values of the hyperparameters, it is, therefore, essential to fine-tune the model hyperparameters until the optimal performance is achieved [53][54][55]. In this research, a GA-based ML development approach is adopted to optimize the hyperparameters of the ML model.

Proposed GA-based surrogate modeling
As demonstrated in Fig. 3, the GA-based ML development approach is an iterative process. In this method, a population consisting of a random set of possible solutions is first generated. Each solution in this population is called a chromosome, which is denoted by i in Fig. 3. As shown in Fig. 8, the chromosome can be divided into several segments (i.e., the genes), where each gene represents a certain hyperparameter as indicated in Table 2. Therefore, in each hyperparameter gene, values from the given range are paired with the values of the other hyperparameters. This allows multiple configurations to be evaluated.
Table 2 Hyperparameters of RF.
Hyperparameter | Description
n_estimators | The number of trees in the forest
max_depth | The maximum depth of each MTRT in the forest
max_features | The number of features to consider when looking for the best split
min_samples_leaf | The minimum number of samples required to be at a leaf node

Fig. 9. A general framework of k-fold cross-validation iteration.

Conventionally, the dataset used for ML model development is randomly divided, without repetition, into a training set and a testing set, where the former is used for building the model structure and the latter is used as the "unseen" data to evaluate the performance of the model. However, the evaluation of the obtained ML model can thus be sensitive to the random choices made in splitting the subsets. To avoid this, the k-fold cross-validation method is used to train and test the ML model. The training dataset is divided into k subsamples, and each subsample is used exactly once as the testing data while the remaining k − 1 subsamples are used for training. Then, as shown in Fig. 9, the results from each of these iterations (denoted by j) are averaged to estimate the final performance of each chromosome [56]. The fitness function for the evaluation of the RF model is the accuracy of the prediction. In this research, the Mean Absolute Percentage Error (MAPE) is used to measure accuracy. Eq. (1) presents the MAPE calculation method.

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \qquad (1)$$

where n is the number of samples, y_i is the actual value, and ŷ_i is the predicted value.

The GA-based method stops if the improvement in the fitness function between two generations is smaller than a threshold or when the maximum generation number defined by the user (denoted by C in Fig. 3) has been reached. The final RF model represents the optimum feature subset and hyperparameters. If the stopping criteria are not yet satisfied, another population of solutions will be generated through selection, crossover, mutation, and replacement.
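The k-fold evaluation with MAPE as the fitness measure can be sketched as follows. This is an illustrative outline under stated assumptions: `fit` is a hypothetical callable standing in for RF training, `fit_mean` is a trivial placeholder model that predicts the training mean, and folds are formed without shuffling for brevity.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent; actual values must be non-zero."""
    n = len(actual)
    return 100.0 / n * sum(abs((y - p) / y) for y, p in zip(actual, predicted))


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for j in range(k):
        test = folds[j]
        train = [i for f in folds[:j] + folds[j + 1:] for i in f]
        yield train, test


def cross_validated_mape(xs, ys, fit, k=5):
    """Average MAPE over k folds; `fit` returns a prediction function."""
    scores = []
    for train, test in k_fold_indices(len(ys), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        preds = [model(xs[i]) for i in test]
        scores.append(mape([ys[i] for i in test], preds))
    return sum(scores) / k


def fit_mean(train_x, train_y):
    """Placeholder model (not an RF): always predicts the training mean."""
    mean = sum(train_y) / len(train_y)
    return lambda x: mean


example_mape = mape([100.0, 200.0], [90.0, 220.0])
cv_score = cross_validated_mape(list(range(10)), [2.0] * 10, fit_mean, k=5)
```

In the GA loop, the averaged score returned by `cross_validated_mape` would play the role of the fitness value of one chromosome, with lower values being better.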

Estimation
Next, the ML model is used to estimate the solar radiation of the validation dataset. As stated in Section 2.1.4, the validation dataset is a randomly selected subset of data from Scenario 1, i.e., Random Variation, and is excluded from the training process of the RF models. Each data point includes a Cartesian coordinate (x, y, z), geometry (L and W of modules), the random tilt angle of the PV module, and the random tilt angles of the top, top left, and top right panels, as shown in Table 1. Besides, as explained in Section 2.1.4, for the Grid Variation scenario this validation set is adjusted by adding one more feature to associate the target panels with the corresponding cells.
Finally, this validation dataset is fed into the best RF model developed in Phase 2 to predict the solar radiation amount. The accuracy of the prediction is then calculated using MAPE.

Case study
A case study is conducted to test the performance and feasibility of the developed methodology. The case used in this research is the John Molson School of Business (JMSB) building on the Sir George William (SGW) campus of Concordia University in downtown Montreal, Canada. This building stands 55 m tall with 15 stories above the ground and an all-glass curtain wall as the façade [57]. This is the same building used in the earlier research of the authors [10]. Having the same case study allows direct comparison of the results and, therefore, better identifies the effect of the proposed method in improving the PV module layout design.
Although the process of preparing the 3D model for the solar radiation simulation was explained in the authors' previous work [10], a brief overview is provided for completeness. As shown in Fig. 10(a), Revit [58] was used to model the building in an object-oriented fashion, i.e., the BIM model of the building. This model was then integrated with the CityGML model of the surrounding buildings [59] to consider the shadow effect of the neighboring buildings in solar simulation in the Revit environment, as shown in Fig. 10(b). Inside Revit, Dynamo visual programming [60] was used to develop the PV layout parametric model. The implementation details of this parametric model are provided in the previous work of the authors [10].
It should be noted that after a careful study of the surfaces of the JMSB building, it was found that the northeast façade of the building has scant solar radiation potential because of the surrounding buildings. Therefore, this façade was not considered for the installation of PV modules. The final configuration created a total of 1137 potential points for the installation of PV modules on the vertical surfaces in this study. It is essential to mention that in all considered scenarios, the panel size was fixed at 2 × 1.5 m. Because of this simplification, the features pertinent to the size of the panel were removed from the dataset, as they were uniform across all the layouts in the dataset.
As discussed in Section 2.1.3, three different scenarios were studied in this research. Table 3 shows the details of the three scenarios and their configurations. As shown in this table, for the Random Variation scenario, 2200 random layouts were generated. For these random layouts, the option of not putting a PV module on a potential location was not considered, because the absence of a PV module at a location results in zero radiation, which can be determined by simple logic and thus does not need to be included in the dataset or in the ML model development. In addition, the impact of an empty location on the surrounding panels can also be represented by cases where the tilt angle of the panels is 0°. Therefore, the random solutions only include the random variation of the PV modules' tilt angles in the range of 0° to 90°. Of the 2200 generated solutions, 200 were set aside as the validation dataset used to validate the performance of the three scenarios of the surrogate model.
Four different configurations of the first scenario were considered to test the impact of the dataset's size on the surrogate model's performance, using 500, 1000, 1500, and 2000 of the random solutions as the training dataset.
In Scenario 2, three different sizes of grids were considered, as shown in Fig. 11. As explained in Section 2.1.4, the location coordinates of panels were adjusted. These grid sizes were first built based on the surface given by Dynamo nodes topology.vertices and vertex.pointgeometry. These nodes allow the user to take a list of geometry surfaces containing a layout and export the corner point coordinates of the surfaces. The first surface export gives the corner points of the cells for the large grid, as given in Fig. 11(a). For the medium grid, the sizes of the cells are reduced to 50% of the large grid, except for façade segments that can already host less than 50 modules (because they are already assumed to be in medium range size). For the small grid, the size of the cells is reduced to 33% of the large grid, except for façade segments that can already host less than 25 modules. These thresholds created 26, 37, and 56 cells for large, medium, and small grids, respectively. As explained in Section 2.1.3, each cell contained one PV module in the center and three PV modules on the top, creating a 4-panel system that is expected to approximate the behavior of PV modules in each cell. It is worth noting that all the panels in every cell have the same size.
Scenario 3, as mentioned in Section 2.1.3, includes uniformly tilted panels with the tilt angle varying from 0° to 90° in steps of 5°, generating a total of 19 distinct solutions for the ML training.
It should be noted that as mentioned in Section 2.1.2, each configuration of the training dataset was used to develop two surrogate models, one with the Cartesian coordinates and the other with the index representing the specific panel location, as shown in Table 1.
As explained in Section 2.2, a GA-based RF model is proposed in this research. Table 4 presents the ranges of RF hyperparameters explored in this research. Concerning the minimum number of samples required for a node to be a leaf, it is not recommended to use the default value of 1 because it can cause overfitting in cases where the training dataset is very large [61]. The "auto" in max_features means that all features are considered when looking for the best split.
The configuration of the GA used for the optimization of the RF is presented in Table 5. The maximum number of generations was set to 20, which means that the GA ran 20 iterations to find the near-optimum RF model. In each generation, 50 individuals were generated. A 5-fold cross-validation structure was used, with MAPE as the objective function to be minimised.
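The search loop described above can be sketched as follows, using only the Python standard library. This is an illustrative sketch under stated assumptions, not the implementation used in this research: the search space values are hypothetical (the actual ranges are those of Table 4), the selection, crossover, and mutation operators are generic choices, and placeholder_cv_mape is a stand-in for the mean MAPE over 5 cross-validation folds of an RF trained with the given hyperparameters.

```python
import random

# Hypothetical search space; the ranges actually explored are listed in Table 4.
SEARCH_SPACE = {
    "n_estimators": list(range(10, 501, 10)),
    "max_depth": list(range(2, 31)),
    "min_samples_leaf": list(range(2, 21)),
    "max_features": ["auto", "sqrt", "log2"],
}

def placeholder_cv_mape(params):
    """Stand-in objective. A real run would train an RF with `params`
    and return the mean MAPE over the 5 cross-validation folds."""
    penalty = abs(params["max_depth"] - 12) + abs(params["min_samples_leaf"] - 4)
    return 5.0 + 0.5 * penalty + 0.001 * params["n_estimators"]

def random_individual(rng):
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}

def crossover(parent_a, parent_b, rng):
    # Uniform crossover: each hyperparameter is copied from one parent at random.
    return {name: rng.choice((parent_a[name], parent_b[name])) for name in SEARCH_SPACE}

def mutate(individual, rng, rate=0.1):
    # Each hyperparameter is resampled from its range with probability `rate`.
    return {
        name: rng.choice(SEARCH_SPACE[name]) if rng.random() < rate else value
        for name, value in individual.items()
    }

def run_ga(fitness, generations=20, population_size=50, seed=0):
    """Minimise `fitness` and return (best individual, best value per generation)."""
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness)
        history.append(fitness(ranked[0]))
        parents = ranked[: population_size // 2]  # truncation selection
        children = [ranked[0]]                    # elitism: keep the best unchanged
        while len(children) < population_size:
            parent_a, parent_b = rng.sample(parents, 2)
            children.append(mutate(crossover(parent_a, parent_b, rng), rng))
        population = children
    best = min(population, key=fitness)
    history.append(fitness(best))
    return best, history

best, history = run_ga(placeholder_cv_mape)
print(best, history[-1])
```

Because the best individual of each generation is carried over unchanged, the best objective value can never get worse from one generation to the next, which makes the 20-generation budget easy to monitor.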

Results and discussion
As stated in Section 2.3, the accuracy of the developed ML models is assessed using MAPE. However, in regression problems, researchers usually also use R² to demonstrate the accuracy and performance of the ML model's prediction [44,49,62]. Therefore, to confirm the accuracy of each model, this research uses two measures, i.e., MAPE and R². Unlike MAPE, a value of R² closer to 1 indicates higher accuracy.

Table 3 The configurations of different scenarios used in this research (columns: Scenario, Training dataset, Validation dataset).

Table 6 presents the optimal configuration of RF hyperparameters for each scenario. As shown in this table, the optimum hyperparameters are relatively consistent across the different scenarios. Interestingly, the maximum performance was achieved with the smallest number of trees, i.e., estimators, in the RF. Regarding the number of features considered to obtain the best performance, most of the optimized configurations returned "auto", which means that those scenarios consider all features when looking for the best split and no features were excluded. However, for two scenarios, i.e., Random variation with 1500 layouts and 2000 layouts, the number of considered features is only four. Both are cases where the coordinates of the points are used in the training.

Table 7 shows the feature importance of the developed models, where, for each scenario, the most important feature is highlighted in blue and the least important feature(s) are marked in red. This table shows which features contribute most and which contribute little or almost nothing to the prediction in each scenario. The most important features for all of the models are related to the location of the point, but the height, i.e., the z coordinate, has less impact on the prediction. For some models, the height is even less important than the tilt angle of the panel. Also, for most scenarios, the tilt angles of the surrounding panels are the least important in predicting the solar radiation amount.

Table 8 presents the estimation results of the various surrogate models developed in this research, and Figs. 12 to 14 show the regression plots of the different surrogate models. Since all the models use the same validation dataset, the regression plots for generated solar radiation are normalized. Of the three scenarios, all configurations of the Random variation scenario perform decisively better than those of the other scenarios. This result is interesting because the dataset size for this scenario is orders of magnitude smaller than that of the Grid variation scenario, as explained in Section 2.1.3.
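For reference, the two accuracy measures used in this section can be sketched as follows. The standard textbook definitions are assumed here (the exact formulation is the one given in Section 2.3), and the sample values are made up purely for illustration.

```python
def mape(actual, predicted):
    """Mean absolute percentage error in percent; lower is better."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Coefficient of determination; values closer to 1 are better."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total sum of squares
    return 1.0 - ss_res / ss_tot

# Illustrative values only, not results from this research.
simulated = [100.0, 200.0, 400.0]
surrogate = [110.0, 190.0, 400.0]
print(mape(simulated, surrogate), r_squared(simulated, surrogate))
```

Note that MAPE is undefined when a simulated value is zero, so points with zero radiation would need separate handling.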
Table 4 Range of hyperparameters to be optimised.

Table 5 Configuration of the GA used for the optimization of RF.
GA Parameters | Description | Value
Generations (C) | Number of iterations to run the pipeline optimisation process | 20
Population size | The number of individuals that will be evaluated and selected | 50
Objective function | Function used to evaluate the quality of a given pipeline for the regression problem | MAPE
The number of folds (k) | Number of folds in cross-validation strategy | 5

Table 7 Results of feature importance.

In the Random variation scenario, the size of the dataset appears to have little impact on the performance of the ML model, especially when the coordinates of the points are used for training. When training with the index of points, the size of the dataset became more influential, with the larger datasets performing better in terms of both R² and MAPE. While the use of the index instead of the coordinates had a negative impact on the accuracy of the models in the Random variation and Uniform variation scenarios, it had a minimal positive impact in the Grid variation scenario. In the Random variation scenario, the negative impact of using an index instead of coordinates became smaller as the size of the training dataset increased, showing that index-based training is more sensitive to the size of the dataset. The fact that even the smallest dataset in the Random variation scenario had a high performance is promising because it indicates that a reliable and accurate surrogate model can be developed even with a small number of randomly generated layouts.

Despite having massive training datasets, all configurations of the Grid variation scenario showed low accuracy. Nevertheless, an increase in the number of cells, i.e., a smaller grid, proved to have a slight but positive effect on the model's performance. A few points in the regression plots with the greatest deviation were analyzed to better understand why this scenario did not perform well. It was observed that these deviations belong to locations on the façade that, based on the grid-based approximation, were expected to generate a large amount of radiation because the surrounding panels had little to no tilt angle.
However, when projected to the actual layout, it was observed that these panels received considerable shadow effects from the building façade or surrounding buildings. Although the result of this scenario is not favorable, it provides the valuable insight that the simulation of solar radiation is very sensitive to the location and configuration of the panels, so an accurate approximation cannot be made by applying zoning on the façade. Nevertheless, as shown in the first scenario, simulating a random set of solution samples for each potential installation point allows a high-accuracy prediction of the solar radiation potential.

The above observation about the complexity of the solar radiation simulation is further corroborated by the very low performance of the Uniform variation scenario. An accurate prediction cannot be made using a small dataset consisting of a uniform variation of tilt angles, which shows the high sensitivity of the solar radiation simulation to the shadow effect of the surrounding panels.

In terms of simulation time, generating the radiation amount for one random configuration in Dynamo requires approximately 19 s. Using the RF model, exporting the solar radiation amounts for 200 random solutions from the validation dataset requires 236 s, i.e., 1.18 s per solution. Therefore, the RF model is approximately 16 times faster than the simulation-based parametric model.
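The speed-up factor follows from the timings reported above, as the short calculation below shows. The variable names are illustrative.

```python
# Timings reported in the text.
simulation_seconds_per_layout = 19.0   # one random configuration in Dynamo
surrogate_seconds_total = 236.0        # RF model, whole batch
surrogate_layouts = 200                # size of the batch

surrogate_seconds_per_layout = surrogate_seconds_total / surrogate_layouts
speedup = simulation_seconds_per_layout / surrogate_seconds_per_layout

# Prints: 1.18 s per layout, speedup 16.1x
print(f"{surrogate_seconds_per_layout:.2f} s per layout, speedup {speedup:.1f}x")
```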

Conclusion
This research investigated the performance of a surrogate modeling approach for the simulation of solar radiation potential on the vertical surfaces of tall buildings. Surrogate modeling was used to approximate the input-output behavior of the existing simulation model. The RF machine learning approach was used in investigating three different scenarios, namely (1) Random variation, (2) Grid variation, and (3) Uniform variation. GA was used to optimize the hyperparameters of the RF model. A case study was performed to investigate the performance of the surrogate models. The case study used a building in downtown Montreal, Canada.
It was demonstrated that, in general, surrogate modeling has great potential to accurately approximate the simulation of solar radiation on the vertical surfaces of tall buildings. It was shown that an accuracy of up to 94% could be achieved even with a small sample of data, i.e., 500 random layouts. Moreover, the surrogate model can return the solar radiation amount 16 times faster than the existing simulation model. This development can tremendously reduce the computational intensity of optimization-based PV module layout design. However, it was observed that the best approach to developing the surrogate model is to use a number of random layout designs rather than more guided strategies, such as grid-based approximation or uniform variation of tilt angles. This attests to the fact that, while surrogate modeling is very promising and applicable, solar radiation simulation is very complex and highly sensitive to the location and shadow effects. Because of this sensitivity, these factors cannot be simplified to approximate the solar radiation potential. Nevertheless, even a small sample of random design layouts that captures the diversity of panel configurations for all the potential locations can be used to predict solar radiation potential accurately.
However, there are a few limitations to this research. First, this study only considers Random Forest for developing the ML models. Other types of ML methods, such as neural-network-based methods, could also be used. Second, the PV panel size in this study was fixed at 2 × 1.5 m. Other panel sizes could be added to see how different sizes affect the prediction and performance of the ML models. Finally, although the developed surrogate model could easily be used to optimize the PV layout, i.e., to perform a generative design, this was outside the scope of this research. In the future, the authors intend to perform generative design based on this surrogate model and then compare the results of the design optimization with those of simulation-based optimization to investigate to what extent the use of the surrogate model can contribute to finding a better layout design with less computational effort.

Data availability
Data will be made available on request.