Constrained, mixed-integer and multi-objective optimisation of building designs by NSGA-II with ﬁtness approximation

Reducing building energy demand is a crucial part of the global response to climate change, and evolution- ary algorithms (EAs) coupled to building performance simulation (BPS) are an increasingly popular tool for this task. Further uptake of EAs in this industry is hindered by BPS being computationally intensive: optimisation runs taking days or longer are impractical in a time-competitive environment. Surrogate ﬁtness models are a possible solution to this problem, but few approaches have been demonstrated for multi-objective, constrained or discrete problems, typical of the optimisation problems in building design. This paper presents a modiﬁed version of a surrogate based on radial basis function networks, combined with a deterministic scheme to deal with approximation error in the constraints by allowing some infeasible solutions in the population. Different combinations of these are integrated with Non- Dominated Sorting Genetic Algorithm II (NSGA-II) and applied to three instances of a typical building optimisation problem. The comparisons show that the surrogate and constraint handling combined offer improved run-time and ﬁnal solution quality. The paper concludes with detailed investigations of the constraint handling and ﬁtness landscape to explain differences in performance. The Authors. the CC


Introduction
The building and construction industry is one which offers large potential for reduced environmental impact and improved sustainability. Buildings are a large contributor to world carbon emissions, particularly due to their heating, cooling and lighting energy demands (for example, over 50% of UK carbon emissions are related to building energy consumption [1]). Usual practice, driven by expectation and building function, is for each individual building to be designed afresh, losing the benefit of mass-production where a single optimisation process applies to thousands or millions of units. This means that improvements in the building design process to aid designers in reducing building carbon emissions are an important area of research. Computer-based optimisation methods are of rapidly-increasing importance in this area.
Several characteristics of building energy optimisation present a challenge. Each specific problem has a large decision space, with a mixture of continuous and discrete variables determining varied aspects of a building's design. Typically there are multiple are evaluated by the full simulation while others are evaluated by a much cheaper model offers potential to improve run times without losing accuracy. There are, however, a limited number of works featuring surrogates for problems with discrete decision spaces, and still fewer looking at problems with constraints or multiple objectives. This has led to minimal adoption of surrogates in building optimisation, where the problems often exhibit these characteristics. The need for greater efficiency of EAs for building optimisation motivates further research into surrogates that can simultaneously handle constraints, multiple objectives and discrete or mixed variable types. If successful, such surrogates will improve the practicability of EAs for this application area.
Given the above points, this paper explores the use of radial basis function networks (RBFN) to reduce the number of calls to the full building performance simulation, by approximating objective and constraint values. Several contributions are made. The first is that an existing approach [12] is adapted for multi-objective and constrained problems, with a change to the way that categorical variables are handled. A separate RBFN for each constraint and objective allows for retraining on each independently. The paper's second contribution is Deterministic Infeasibility Sorting (DIS), a scheme where infeasible solutions are retained in the population to deal with two issues. One issue is that estimation of constraints is difficult because an approximation error can mean that the algorithm discards feasible solutions that are in truth feasible: particularly problematic with many constraints. The second issue is that approximation errors can also affect the dominance relationship between solutions in multi-objective problems. The paper's third contribution is the comparison of several possible approaches to integrating the surrogate and DIS within Non-Dominated Sorting Genetic Algorithm II (NSGA-II) [13]. Several variants of NSGA-II are applied to three versions of a mixed-integer, constrained, multi-objective building optimisation problem. Comparisons are made in both the quality of the final solution sets and convergence speed. It is found that the best performing NSGA-II variant offers a performance improvement for two of the three problems (with no significant difference on the third). The final contributions are a detailed analysis of the constraint handling, to allow greater understanding of the impact of the techniques, and a study of the problem's fitness landscape to aid analysis of the results.
The rest of the paper is structured as follows. Section 2 describes related work in this area. Sections 3 and 4 describe the building optimisation problem and the proposed modifications to NSGA-II: the RBFN surrogate; the constraint handling approach; a number of minor changes to improve performance; and the resulting variants of NSGA-II. Section 5 describes and discusses experiments tuning and comparing the algorithm variants. Sections 6 and 7 provide deeper analysis of the constraint handling and fitness landscape, and Section 8 concludes the paper.
Of particular interest for the present work is the combination of fitness approximation for problems with discrete variables, multiple objectives and constraints. Several of the above papers reported success with surrogates for multi-objective problems. A recent review [27] on constraint handling approaches in evolutionary algorithms (EAs) suggests that the use of surrogate approaches for constraints remains quite rare, and all for problems with continuous decision variables.
There has, however, been some work on fitness approximation for discrete problems without constraints, including [28][29][30][31][32]. Tresidder et al. [33] and Zemella et al. [34], make use of surrogates in EAs for discrete building design problems. Bajer and Holeňa [35] also used a RBFN surrogate, for a mixed continuous and discrete problem. In contrast with the present paper, the RBFs were defined over only the continuous variables, with centres located in clusters determined by Hamming or Jaccard distance on the discrete variables.
In [12], an RBFN is used as a surrogate for a mixed-integer search space. The model is used as a filter; too many offspring are created each generation and the model is used to choose promising ones which are then evaluated with the true fitness function and retained. The technique was applied to intravascular ultrasound image analysis; this approach is the basis of the method proposed in this paper.
This paper also describes an operator which allows infeasible solutions to remain in the population as a means of improving performance for constrained problems. There are many examples of work in this area, including [36][37][38][39][40].

Application
Variants of NSGA-II were applied to a real-world simulationbased building optimisation problem; this will now be described in detail. The aim is to find the trade-off between operational energy use (for heating, cooling and lighting) and construction cost for a mid-floor of a small commercial office building (Fig. 1), subject to constraints relating to the comfort of the building's occupants. The floor is divided into five rooms, treated as separate zones for energy and thermal comfort simulation. The two end zones are 24 m × 8 m, and the three middle zones 30 m × 8 m. Floor to ceiling height throughout is 2.7 m. Building energy consumption strongly depends on local climate and the experiments in this work placed the building in three locations: Birmingham (UK), Chicago (IL, USA), and San Francisco (CA, USA). A building situated in Chicago (hot summer, cold winter) would be expected to need better insulation and more artificial heating and cooling than Birmingham, and even more so San Francisco (more temperate, flatter climates). Consequently these represent three quite different instances of the same optimisation problem, albeit sharing the same objectives, variables and constraints.
The optimisation takes place at the building's detailed design stage, with overall form and layout fixed. Construction variables to be optimised are building orientation, glazing type and windowto-wall ratio, and wall/floor material types. The optimisation also includes variables for the operation of the building's HVAC systems. This allows for the final design to be tuned to a particular operational pattern, for example exchanging an increased capital  cost for a building which can be economically run at a lower temperature range. These are the temperature and air flow set-points (thresholds beyond which HVAC systems are switched on) and start-stop times for heating and cooling systems. Overall there are 8 integer, 30 continuous and 12 categorical variables (the latter taking one of 2 or 3 values), summarised in Table 14 (A). Gray coding is used for crossover and mutation operations, allowing the same operators to apply across all variables. This means that the continuous variables are discretised for mutation and crossover; the step size is also noted in the table.
Energy use is determined by the freely-available whole-building energy simulation package EnergyPlus (E+) [41]. This models heating, cooling and lighting loads required to meet predefined setpoints for measures such as temperature and light level. Energy use covers a whole year of operation given the weather patterns at each location. The construction cost objective uses a simple function based on the Spon's [42] costs for the materials used and equipment size (determined by E+). A run of the E+ simulation takes 1-2 min on an Intel i7 2.7 GHz CPU, depending on the specific building configuration. Allowing up to six simulations in parallel, a budget of 5000 evaluations completes in approximately 24 h (a practical duration for real-world use).
The optimisation is also subject to 18 inequality constraints based on occupant comfort, determined by E+, and summarised in Table 1. They cover hours of overheating during summer months, carbon dioxide (CO 2 ) concentration ("freshness" of internal air), predicted mean vote, and maximum air change rate due to natural ventilation. Predicted mean vote is an estimation of the mean response of a group of people according to the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) thermal sensation scale, in which +3 is hot, zero is neutral and −3 is cold.
Several assumptions are made about the final working conditions for the building that impact on operational energy use and comfort. Working hours are 9:00 to 17:00, outside which the building is unoccupied. Each zone has typical design conditions of 1 occupant per 10 m 2 floor area and equipment loads of 11.5 W/m 2 floor area. Lighting is controlled to provide an illuminance of 500 lux at two reference points located in each perimeter zone; visible as points on the floor of Fig. 1. Infiltration is 0.1 air changes per hour, and ventilation rates are 8 l/s per person. Heating and cooling are modelled using an idealised system that provides sufficient energy to offset the zone loads and meet the zone temperature set-point during hours of operation; free-cooling is available through natural and mechanical ventilation.
A simplified version of the same problem formed the basis for experiments describing an approach to multi-criteria decision making and visualisation in [43].

Algorithms
This section describes the proposed surrogate, constraint handling technique, a number of minor changes to improve performance, and methods by which all of these can be integrated within NSGA-II.

RBFN surrogate
Radial basis function networks (RBFN) are fully connected feedforward artificial neural networks which can be used for function interpolation [44], providing a mapping from a set of real-valued inputs to a set of real-valued outputs. They have been applied as surrogates for evolutionary algorithms in several previous works including [45,12,46,47]. Li et al. [12] noted that an RBFN can be used for approximation in many different decision spaces, as long as an appropriate distance measure can be defined. This makes RBFNs particularly suitable for a problem with mixed variable types such as the example building problems.
For the distance measure, the present research adopts the Heterogeneous Euclidean-Overlap Metric (HEOM) approach suggested by Wilson and Martinez in [48] to combine distances for different parameter types. Li et al. [12] also used the measure within a RBFN meta-model for a mixed-integer search space. Solutions are divided into three components; real/continuous, integer and non-numeric or categorical parameters. Continuous and integer parameters are normalised to the range (0,1) prior to calculating the distance metric. The total raw distance is the combination of the Euclidean r , Manhattan z and Hamming d distances between these parameters respectively: In normalising the variables to (0,1), categorical variables are over-weighted compared to integer and continuous variables. While the mean distance between two normally distributed random variables with a range of (0,1) is 1/3, if the variables are restricted to 0 or 1, the mean distance is 1/2. To correct for this, a weight that will be termed the Hamming weighting ω is introduced, changing the distance measure to (2).
ω is 2/3 so the mean contribution to the total distance that a categorical variable makes is 1/3, equal to that for the other variable types. This could also apply to the integer variables but the problem is much reduced because the lowest cardinality for the integer variables in the example building optimisation problem is 7, for which the mean difference between a pair of variables is 0.38. An experiment exploring the impact of Hamming weighting further is described in Section 5.4.

Training
Training of the RBFN takes three steps: determining the centres for the hidden layer nodes, the widths for the Gaussian functions, and the output layer weights. This work adopts the common practice that the first two steps are unsupervised and the last step is supervised. For determining the ideal cluster size and determining when to retrain the RBFN, the fitness prediction correlation (FPC) [49] is used to measure the model's accuracy. The FPC is Spearman's rank correlation [50], between the true constraint/objective values and the surrogate predicted values, for a solution set.
Typically, a clustering algorithm such as k-means is used to locate the centres for RBFN hidden layer nodes, or the centres are simply the set of training points as in [12]. However, training time for the output layer weights increases rapidly with the number of training points. Karakasis and Giannakoglou [51] also suggested that using the training patterns as centres is suboptimal for multiobjective problems. The present work uses a clustering algorithm to determine the centres. k-medoid rather than k-means clustering is used, so the centres represent real points in the discrete decision space. This is necessary as it is impossible to interpolate between points separated by categorical values.
To find a suitable number of clusters, 1000 solutions were generated at random, and the network was repeatedly trained on random subsets of solutions (subset sizes matched the population sizes in Section 5). The FPC was calculated for objective and constraint predictions of the remaining solutions, and was highest when the number of clusters was one third of the population size.
The width for each hidden node RBF is the root mean square distance from that node's centre to the centres of the two nearest neighbours. Least squares fitting is used to determine the output layer weights.

Integrating the surrogate with NSGA-II
For convenience, the core of NSGA-II [13] is reproduced in Fig. 2. The important modification is at step 11: too many offspring are generated, and the surrogate filters promising solutions for inclusion in the new population (Fig. 3).
The surrogate is initially trained using the starting population P 0 . Then, in place of the original offspring creation step, standard crossover and mutation operators are used to generate a surrogate population S, of size |S|, a multiple of the population size |P|. The surrogate approximates objective and constraint values in S, and NSGA-II's fast-non-dominated-sort ranks S by feasibility and nondominance, with the highest ranking solutions inserted into the offspring population R. R is evaluated using the full fitness function (the building performance simulation), and model accuracy is checked to determine if retraining is needed. NSGA-II continues to the next generation as usual.

Model updates
A separate RBFN is used for each objective and constraint, allowing each to be retrained as needed. FPC determines model accuracy for each objective and constraint every generation, using the filtered offspring solutions (Q t+1 ), after evaluation by the full simulation. (This approach is termed evolution control [14].) The threshold FPC for retraining was 0.7, representing a strong statistical correlation between predicted and true values. The solutions of the most recent generation (both parents and offspring) were used as training data. 1

Constraints
In applying an EA to a constrained problem, an important consideration is how constraints are handled alongside the objective values of solutions. This is further complicated by the possibility of approximation error caused by the surrogate resulting in solutions being incorrectly classified as (in)feasible.
Surveys of EA constraint handling approaches [52,53,27] have identified four categories: penalty functions, decoders, special operators and separation of objectives and constraints, the last being used in NSGA-II (classed as a feasibility rule in [27]). The concept (originally proposed by Deb [13,54]) is simple; when comparing solutions (either in tournament selection, or elitism): Mezura-Montes and Coello [27] describes several new categories of approach, and of these, the present work builds on stochastic ranking and use of multi-objective concepts. A number of studies have shown that inclusion of infeasible solutions in the population is beneficial (see Section 2). The argument can be made that often infeasible solutions which are almost feasible are valuable stepping stones between feasible regions of the search space. Further, when approximating constraint values, there is a risk that some solutions will be incorrectly classified as infeasible, and it is better to increase the chance that they will be retained.
The method proposed here takes a deterministic approach, referred to as Deterministic Infeasibility Sorting (DIS). The elitist component of NSGA-II was modified to ensure that a fixed proportion ˛ of the population is taken from the set of infeasible solutions, unless the set is smaller than the ˛-proportion, in which case all are copied into the new population. Like [38,39], after the population is evaluated, it is divided into feasible and infeasible solutions, and the feasible solutions are non-dominating sorted (with crowding distance applied). The infeasible solutions are sorted using NSGA-II's fast non-dominated sort, ignoring the magnitude of their constraint This is similar to the approach in [38,39], although simpler. DIS does not substitute constraint violation with an extra objective, instead ignoring constraint violation completely in the infeasible population. Additionally, selection for reproduction still favours feasible solutions. DIS can be applied in two places; in the elitist step of the core algorithm, and in the surrogate filtering step. Both approaches are explored.

Algorithm variants
The experiments covered several variants of NSGA-II, including different approaches to surrogate and constraint handling. NSGA-II refers to the original algorithm of Deb et al. [13]. Use of the surrogate is denoted with suffix -S. DIS, if present in the elitist step of NSGA-II, is denoted by subscript c, and if present in the surrogate filtering step, is denoted by subscript d. Overall, we have: Minor changes were made to the base NSGA-II algorithm and all variants. To maximise efficient use of the 5000 evaluation quota, population size was small. An external archive was added to the algorithm to store the Pareto front, allowing a bigger front than the population size to be returned. The algorithm attempts to insert every solution evaluated into the archive, and the same dominance rules used by NSGA-II are used to determine which solutions stay in the archive.
An internal cache of previously-evaluated solutions was maintained for each run, to avoid re-evaluating identical solutions.
Solutions drawn from the cache did not count towards the limit of 5000. Six threads were used to conduct evaluations in parallel.
Optimisation variables were handled as Gray-encoded bitstrings during crossover and mutation to simplify operator choice (bit-flip mutation and uniform crossover). The surrogate operated directly on the unencoded continuous, integer and categorical variable values.

Experiments and results
Three sets of experiments compared the performance of the different approaches to constraint handling and the surrogate. Firstly, the algorithm variants were tuned on the Birmingham problem. Secondly, all variants were compared on the Birmingham problem. Thirdly, a subset were compared on the Chicago and San Francisco problems. All runs were restricted to 5000 true evaluations, and comparisons were made of the final Pareto fronts returned. The goal was to find the algorithm setup returning the highest hypervolume and lowest spread metric for the same number of evaluations (i.e., better quality solutions for the same effort). Also given are comparisons of the effort required to reach the same quality of solutions.

Parameter tuning
Preliminary runs aimed to find good parameters for the algorithm and variants: the parameters tuned and the value for each yielding the highest mean hypervolume is given in Table 2. Each run was limited to five repeats (limited by the time needed for a single run: around one day).

Comparisons for Birmingham
Each NSGA-II variant was run 30 times: Table 3 gives the mean hypervolume and spread for the Pareto fronts found. To avoid issues with multiple t-tests, a Dunnett test for significance using plain NSGA-II as the base-case was performed: the resulting p-values are also given. Analysis of variance also found that P < 0.05 for the null hypothesis that the results from all runs were drawn from the   Fig. 4 gives the corresponding median summary attainment curves [55]: the curve of points in the objective space reached by at least half of the repeat runs. A comparison was also made with random search, seeking the set of non-dominated feasible solutions from 5000 randomly generated solutions. None of 30 runs of random search found a feasible solution, whereas all runs of the NSGA-II variants found solutions meeting all constraints within 1984 evaluations.
Although an improvement in final solution quality is useful, at least as important is reduced overall run-time while achieving the same quality of solutions. This is particularly true for the building industry, where timely production of plans for tendering is critical, and a speedup of 10-20% could allow an optimisation run to complete overnight rather than postponing a morning's work. For this reason, the original results were reprocessed, taking the median hypervolume reached by NSGA-II (0.8486) as a target, and measuring the number of evaluations taken by each algorithm to find a  Pareto set with that hypervolume. The results are given in Table 4; where a run did not reach the target hypervolume within 5000 evaluations, it was classed as failed.
There are several key points to note. None of the methods improve or worsen spread significantly. This is important because it would be less useful to have a set of solutions with improved hypervolume if they were unevenly distributed, reducing the choices available to the decision maker.
DIS (in NSGA-II c ) reduces the performance of NSGA-II, with lower mean hypervolume and higher mean spread, although neither difference is statistically significant. This is reflected in a lower success rate (fewer runs reached the target hypervolume within 5000 evaluations). Further, the median attainment curve for NSGA-II c , while dominating part of that for NSGA-II, is dominated at the high capital cost end by the latter. This point of the front is where the comfort constraints are more likely to be broken, further analysed in Section 7. This indicates that DIS improves search efficiency in parts of the search space with a balanced mixture of feasible and infeasible solutions, but DIS decreases efficiency where the local search space is mostly infeasible. Given that all algorithm variants found feasible solutions within a few hundred evaluations, it would appear that the feasible region of the search space is large and contiguous enough (further observed in Section 7) to avoid needing to retain infeasible solutions to help jump gaps in the feasible region. Instead, DIS adds drag to the search: removing feasible solutions from the population effectively reduces its size with no countering benefit, lowering the search speed.
The mean hypervolume of the sets found by NSGA-II-S is higher than for NSGA-II, though not significantly. This is reflected by an increased success rate, and the median attainment curve of NSGA-II-S dominating that of NSGA-II.
More importantly, NSGA-II-S was further improved by adding DIS -particularly at the filtering stage (NSGA-II-S d ), where the difference over NSGA-II is significant. This would appear to be because DIS compensates for constraint approximation errors made by the surrogate: further explored in Section 6. Adding DIS at the elitist step (NSGA-II-S c ) also improves final hypervolume, but not significantly. Adding DIS at both the filtering and elitist steps (NSGA-II-S cd ) is sub-optimal, except at the low-cost end where the curve for NSGA-II-S cd dominates that for NSGA-II-S d (similar to the slight benefit shown for NSGA-II c vs NSGA-II). A similar effect occurs when DIS is applied at the elitist step, but when it is already applied at the filtering stage this benefit is lost due to the drag effect described earlier. The median attainment curves show no similar benefit for NSGA-II-S c . It would appear that the issues related to constraint approximation have more impact on efficiency than retaining infeasible solutions in the population.
In practical terms, the real-world benefit of the improved fronts found by NSGA-II-S d is substantial. Taking the median summary attainment curves for NSGA-II and NSGA-II-S d , and choosing a capital cost about half way along the curve (£380,000), NSGA-II found a solution with an annual energy use of 487,65 kWh, and NSGA-II-S d found a solution with an annual energy use of 422,34 kWh. This saving of 6531 kWh (13%) per annum will reduce carbon emissions and running costs over the lifetime of the building.

Comparisons for Chicago and San Francisco
Further experiments applied NSGA-II, NSGA-II-S, NSGA-II-S d and NSGA-II-S cd to the same building, using the Chicago and San Francisco weather, for 30 independent runs. As noted in Section 3, while the basic problem is similar, the search space and optima for these problems are expected to be quite different to those for Birmingham. Compared to the Birmingham climate, Chicago is more demanding, requiring high levels of heating in winter and active/mechanical cooling in summer (the building has to work across the range). San Francisco has a local micro-climate resulting in almost no need for heating and mechanical cooling, the building remaining comfortable with passive operation. Tables 5 and 6 give the mean hypervolume and spread of the Pareto sets found by each algorithm within the 5000 evaluation budget for the two problems. Table 7 give the success rate for each algorithm (proportion of runs achieving the median hypervolume reached by NSGA-II: 0.6095 for Chicago and 0.7334 for San Francisco). Summary attainment curves for Chicago are given in Fig. 5; those for San Francisco are excluded as they were all overlaid over each other and add little to the discussion.
For Chicago, all NSGA-II-S variants found fronts with statistically significantly higher mean hypervolumes than NSGA-II, and NSGA-II-S d found fronts with significantly lower (better) spread values. This is particularly clear in the median attainment curves; the curves for all variants dominate that for NSGA-II. As with the Birmingham problem, NSGA-II-S d performs best, with NSGA-II-S and NSGA-II-S cd performing similarly, both better than NSGA-II. The issue seen with Birmingham where the attainment curves coincided for low cost solutions does not appear for Chicago. This is because the more extreme weather conditions require all the Table 7 Optimisation results for Chicago and San Francisco. SR is the percentage of runs for each algorithm variant that were successful (i.e., reaching the target hypervolume within the 5000 evaluation budget), and Evals is the mean number of evaluations taken by successful runs to reach a front with the target hypervolume. buildings to have heavier construction types (e.g. all solutions have heavy weight external walls, rather than a variety) in order to meet the comfort constraints. This moves the buildings from running close to constraint violation with the lighter construction types to running safely within the constrained region with heavier types, making the impact of DIS less than for Birmingham. This also comes with a higher capital cost for all solutions in the trade-off than for Birmingham. For San Francisco, all algorithm variants performed similarly. There is no statistically significant difference in the hypervolumes of the fronts found, and success rates are also similar (although NSGA-II-S d found the target hypervolume within the 5000 evaluation budget slightly more often than the other variants). Indeed, all the algorithms converge at approximately the same rate, demonstrated for NSGA-II and NSGA-II d in Fig. 6.
In part, the difference in algorithm behaviour for the three problems is due to the constraints. Table 8 shows the mean number of evaluations taken by runs of NSGA-II to find solutions meeting the individual constraints (this is not the time taken to find feasible solutions, which is much longer as all constraints must be simultaneously met). Most constraints for Chicago are harder to solve than for Birmingham, and in turn are harder than those for San Francisco. This implies that where the constraints are easiest, there is least to gain from use of the surrogate.

Hamming weighting
The weighting for all NSGA-II-S variants was 2/3, justified in Section 4.1. The optimisation experiment for Birmingham was repeated with NSGA-II-S and NSGA-II-S d , using a surrogate with a Hamming weighting of 1 (the same as [12]). The mean hypervolume and spread of the Pareto fronts found in 30 independent runs are given in Table 9, alongside results for NSGA-II and NSGA-II-S with the Hamming weighting of 2/3, repeated for convenience. ttests between the hypervolumes are given in Table 10 (none of the differences for spread were significant).
While hypervolume is higher for the approaches with ω = 2/3 compared to those with ω = 1, the difference is not significant; the change also does not affect whether the hypervolume reached by NSGA-II-S or NSGA-II-S d is significantly higher than for NSGA-II. There is no significant difference in the spread for the fronts found by any of the algorithms; for NSGA-II-S, setting ω = 2/3 improves spread, but for NSGA-II-S d setting ω = 2/3 results in poorer spread.
The algorithms with ω = 2/3 did converge more reliably to a front with the target hypervolume of 0.8486 within the 5000 evaluations budget. NSGA-II-S and NSGA-II-S d reached the target in 83% of the 30 runs. With ω = 1, this dropped to 57% and 67% respectively. This provides some justification beyond the theory in Section 4.1 for using the Hamming weighting of 2/3.

Analysis of the constraint handling
This section explores in more detail the success of the surrogate in handling constraints. Section 4.1.2 described how, within NSGA-II-S, the surrogate assigns values to each objective and constraint for each solution in the surrogate population. A fast-nondominated-sort is conducted, and the top ranking N solutions are evaluated using the full simulation, then combined with the parent population by the elitist step of NSGA-II. To explore this further, for one run of NSGA-II-S on the Birmingham problem, the full simulation was run for all solutions evaluated by the surrogate. This amounted to an assessment of the surrogate's performance over 14,940 solutions. These solutions were divided into four categories for each constraint: • Correct fail -predicted to have violated the constraint, which the simulation confirmed; • Incorrect fail -predicted to have violated the constraint, but had not; • Correct pass -predicted to have met the constraint, which the simulation confirmed; • Incorrect pass -predicted to have met the constraint, but had not.
These can be compared with: • True fail -violates constraint, according to the simulation; • True pass -meets constraint, according to the simulation; • Predicted fail -violates constraint, according to the surrogate; • Predicted pass -violates constraint, according to the surrogate.
The counts of solutions in each category were used to calculate, for each constraint, two measures borrowed from the information retrieval community, precision P and recall R [56]: Correct passes or fails Total predicted passes or fails (3) Recall = Correct passes or fails Total true passes or fails (4) Precision is the proportion of surrogate predictions which were correct, and recall is the proportion of true passes or fails that were predicted as such. Ideally, both measures should be 1 meaning that all true passes and fails were correctly predicted as such. P p and R p refer to precision and recall for passes, and P f and R f for fails. The results are given in Table 11, together with the FPC and number of times the surrogate for each constraint was rebuilt over the run. The abbreviated constraint names are given in Section 3. P p exceeds 0.85 for all constraints, except the PMV constraints, which have a low mean FPC (<0.3) and a high number of rebuilds (>95: around one rebuild every third generation). In contrast, P f is low for most constraints; exceeding 0.5 for only three of the ACH constraints (note that for these, FPC is high and rebuilds are low). This is in part because there are far more true passes than fails, so with a similar error rate, more passes will be incorrectly classed as fails than the other way round. The converse of this is shown for the WPMV and EPMV constraints, where the number of true passes is much closer to the number of true fails, and P p drops (with an increase in P f ).
The surrogate clearly performs better with some categories of constraint than others: • For PMVs, pass precision is worse than for the other constraints but not low, and fail precision is not high but better than for SOH and CO 2 . • For SOH and CO 2 , pass precision is high, and fail precision is low.
• For ACH, pass precision is high, and fail precision is not high but better than for SOH and CO 2 . • Where mean FPC is low, as expected, rebuild count is high.

Table 11
Summary of the surrogate predictions of the constraints over one run of NSGA-II-S. T f and Tp are the number of true fails and passes for the constraint. Tp/T is the fraction of all solutions evaluated that passed the constraint (intended to give an impression of the constraint's difficulty). P f,p and R f,p are the precision and recall for solutions passing and failing to meet the constraint respectively. FPC is the mean fitness prediction correlation for the raw constraint value, and C is the model rebuild count for the surrogate on that constraint. For the ACH constraints, which have higher P f and around the same P p compared with the other constraints, the mean FPC is highest and the number of model rebuilds is low. The other constraints, which either have very low P f (SOH and CO 2 ) or lower P p (PMV), have lower mean FPC values, and correspondingly higher rebuild counts. This means that where the model is accurately predicting constraint passes and fails for solutions it is being kept, and where it is less successful it is being rebuilt in an attempt to improve accuracy. These results provide further evidence that the FPC is a reasonable measure of the surrogate's prediction capability, and suitable for choosing when to rebuild it.

Impact of DIS
As noted above, precision and recall are much higher for passes than fails. In simple terms, for this problem the surrogate is conservative in its prediction of constraint values, tending to falsely fail feasible solutions rather than allowing infeasible solutions into the population. While this may be preferable, it means a large number of solutions with useful genetic material are incorrectly labelled as infeasible and discarded. This motivated the introduction of DIS, which allows some of the solutions predicted infeasible, but also predicted high in the objective values, through the filtering on the good chance that they are (in reality) feasible.
The two DIS variants were: allowing some infeasible solutions to pass from one generation to the next at the elitist step (NSGA-II c ); and allowing some solutions predicted infeasible through the surrogate filtering step (NSGA-II-S d ). To explore the impact of the latter, the experiment described at the beginning of this section was repeated, but limited to solutions only included in the population because of DIS ("filtered-in" solutions), and the solutions which they replaced ("filtered-out" solutions). As there were far fewer solutions affected by DIS than were presented to the surrogate overall, these results were aggregated from all 30 runs of NSGA-II-S d . Table 12 shows the success of the model in approximating constraints for these two groups of solutions as precision and recall figures. Figures in bold show where the value is lower than that for the whole run in Table 11.
Among filtered-out solutions, relative to the overall figures for precision and recall: 14/18 constraints had lower P f ; 12/18 Table 12 Summary of the surrogate predictions of the constraints for solutions that were lost because they were replaced with infeasible solutions by DIS. P f,p and R f,p are the precision and recall related for solutions passing and failing to meet the constraint respectively. Values in bold are lower than the corresponding values in Table 11.

Constraint
P  had lower R f ; 8/18 had lower P p and 9/18 had lower R p . Among filtered-in solutions, relative to the overall figures for precision and recall: 14/18 constraints had lower P f ; 13/18 had lower R f ; 8/18 had lower P p and 9/18 had lower R p . This means that for the solutions nearest to the feasible/infeasible boundary, most constraints were less accurately predicted by the surrogate than was the case for all solutions. This is hardly a surprise: boundary solutions were likely to be the point at which any error in approximation would manifest itself. It does, however, provide some justification for using DIS to swap some solutions at the feasible/infeasible boundary: if the model is less likely to accurately classify these solutions then they may as well be randomly labelled feasible or infeasible, avoiding any unhelpful bias introduced by modelling error. As noted above, the most common error is predicting feasible solutions as infeasible. An effect of this is that not enough solutions are predicted feasible for DIS to filter them out (enough of the filtered population is already infeasible). Where DIS does filter solutions in and out, comparing true values for filtered-in solutions with those filtered-out showed that the former were higher in overall constraint violation (as intended), higher in the cost objective and lower in the energy objective. 71% of the DIS filtering operations led to an increase in the mean true overall constraint violation for the population; 52% led to a decrease in the mean true energy use of buildings in the population; 53% led to an increase in the mean true capital cost. Mean figures for all filtered-in and filtered-out solutions are given in Table 13. The lower energy objective values among filtered-in solutions appears to contribute to improved performance when using DIS (NSGA-II-S d ), compared with using the surrogate alone (NSGA-II-S). The summary attainment curve for the Birmingham problem (Fig. 4) showed that NSGA-II-S c was able to find fronts with lower energy values but not as low capital costs as NSGA-II-S. While the difference is slight, the impact of slightly better solutions would be amplified as the algorithm runs over many generations.

Analysis of the fitness landscape
It is helpful to understand the fitness landscape for the problem to determine how difficult it is to model accurately. For singleobjective problems, common approaches include fitness distance plots [57] and autocorrelation [57]. Both use a single solution, typically the global optimum, as a starting point. For multi-objective problems, there are multiple solutions in the Pareto front rather than one single global optimum, so the picture is more complex. In the present study, a modified fitness distance plot is used to illustrate the area immediately around the Pareto front.
As the true Pareto front is unknown, a global Pareto front for the problem was constructed by taking the set of non-dominated solutions from the union of all Pareto fronts generated during the original experimental runs. These experiments covered approximately two million function evaluations (480 experimental runs of 5000 solutions, with some duplicate solutions). The resulting front consisted of 120 solutions. For each member of the front, all possible values were tried for each of its variables (given the discretisation caused by Gray encoding), one variable at a time. This amounts to an exhaustive walk around the locality of the Pareto front, covering 55,101 solutions. The results are given in Fig. 7. The points are separated using shape and colour into feasible, infeasible and the original front.
A deeper analysis of the constraints was performed by sorting the solutions generated by the stepwise traversal by ascending capital cost. This set was divided into 10 bands representing increasing cost. The fraction of solutions in each band that were feasible are plotted in Fig. 8.
A number of observations about the landscape can be made.
1. No large improvements in either objective can be found near the global Pareto front without breaking a constraint. This means that the "global" Pareto front is at least near to locally optimal and is useful for comparisons. However, without an (impractical) exhaustive search it is impossible to say that the front is truly, globally, optimal.   ous. This provides some evidence for the conclusion (Section 5) that NSGA-II c did not offer improved performance because there were not enough gaps in the feasible part of the search space for including infeasible solutions in the population to give much benefit. 4. One variable (ceiling/floor construction type) has a large impact on cost, resulting in two duplicates of the Pareto front which are much higher in cost than most of the other solutions in the plot. 5. There are large changes in the objective values (particularly energy) by simply changing one variable. This means it is harder to model than a "flatter" landscape would be; so more difficult for a surrogate to reproduce. 6. Within the vicinity of the Pareto front, as capital cost increases, solutions are more likely to be infeasible. This is of relevance to the discussion of DIS in Section 5.2.
This experiment was repeated for the San Francisco problem, to attempt to further explain the different optimisation results. The plot is not shown for space reasons, but revealed two points. Firstly, the range in the energy objective across the solutions is much lower for the San Francisco problem meaning that once close to the Pareto front, crossover and mutation are less likely to generate solutions far from it (which would be wasted effort). Secondly, a slightly higher proportion (51.7%) of the solutions around the Pareto front are feasible. If these factors simplify the problem, there may be little room for improvement in efficiency over plain NSGA-II.

Conclusions
This paper has described a constrained, multi-objective problem, with mixed variable types, from the building design domain. This application area forms a crucial part of the global response to climate change, and given the time-constrained nature of the industry it is important that optimisation approaches are able to find a high-quality solution or set of solutions as quickly as possible. To meet this demand, the paper has proposed several variants of NSGA-II, incorporating RBFN surrogates and an approach to allowing infeasible solutions to remain in the population. These variants were applied to three instances of the building problem (using different locations for the building weather data). Optimisation runs were limited to a budget of 5000 function evaluations, and compared both on the quality of solutions found (how close the set was to the Pareto-optimal set) and on the rate of convergence relative to plain NSGA-II.
The paper has made a number of contributions. It has proposed using a separate RBFN surrogate for each constraint and objective to allow for retraining. To control the retraining the Fitness Prediction Correlation (FPC) is used. An existing approach to using RBFNs as surrogates has been modified using the Hamming weighting and k-medoid clustering, to better accommodate categorical variables. Deterministic Infeasibility Sorting (DIS), a method for including infeasible solutions in the population, has been proposed, and also used to mitigate surrogate approximation errors. It was found that the optimal configuration for this problem was NSGA-II-S d : NSGA-II, with the surrogate, and DIS applied at the filtering step. This reliably found better sets of solutions than NSGA-II, more quickly, for two of the building problem instances. None of the algorithms showed much difference in performance for the third (easier) problem instance. The paper has also explored in more detail the effect on optimisation of the constraint handling approach and the Hamming weighting, and explored the landscape of the fitness functions to add context to the optimisation results.
There remain some open areas for future work to expand on that described here. Aside from the obvious routes of exploring more applications or alternative algorithms to NSGA-II, one area which will be more difficult to consider is adapting the surrogate based algorithm to cope with equality constraints. There are very few examples of this in the literature, one exception being [58]. It would be interesting to explore further the frequency with which model building for each of the objectives and constraints recurs and either use this to switch to different surrogate model types or to weight the constraints at the surrogate filtering stage to help mitigate for approximation error. Glazing type Types from [59]; applies to all zones Standard, low-emissivity a The weight refers to the effect that the materials have on the thermal response of the building, heavy being slow and light being fast.