A simple formulation for early-stage cost estimation of building construction projects

This study aims to develop a formula that enables easy, accurate, and fast estimation of the Early-Stage Cost of Buildings (ESCE). The formula was developed by the authors based on artificial neural networks and gene expression programming. A quantity survey was conducted for a hundred construction projects


Introduction
On construction projects, investors wish to accurately estimate the necessary financing so as to arrange the budgets of their investments and to make sure that the project is profitable. In this process, cost calculation is an important step that should be taken into consideration by all parties, including project owners and contractors [1, 2]. For this purpose, prices of similar finished works, archived data of experienced contractors, and price determinations from other institutions, such as market research or related occupational chambers and universities, are used to obtain correct data for the Early-Stage Cost Estimation of Building Construction Projects (ESCE). In the case of construction projects, and especially in public investments tendered with limited resources, the accurate estimation of construction costs is one of the most important issues in terms of construction management at the pre-tendering stage. Public institutions acting as employers wish to obtain accurate results in cost estimation calculations in the shortest period of time so as to obtain the necessary financing and to arrange the budgets of their investments, whereas the bidders and potential contractors wish to calculate their costs and profits accurately and to give the best offer in a competitive environment. During feasibility studies, at the preliminary design stage, or where the bidding period is limited, approximate costs may need to be calculated as soon as practicable. While bidding for a tender, if a detailed cost analysis is not possible due to time constraints, checking the early cost calculation with a simple ESCE method may be quite convenient for competing companies and public utility companies. An early-stage project cost calculation can also vary from one individual to another. For these reasons, there is a clear need for methods that enable accurate estimation of ESCE, both in practice and in theory.

Since 1998, researchers have conducted many modelling studies in order to estimate the costs of buildings at an early stage. Elhag and Boussabaine [3] studied the cost estimation of construction projects and developed two artificial neural network models using 14 variables. Lowe et al. [4] conducted a study using multiple regression analysis to predict construction costs. Hwang [5] proposed two distinct regression models for price estimation on construction projects. Cheng et al. [6] integrated artificial intelligence techniques in their study for the conceptual cost estimation of construction projects in Taiwan. Kim et al. [7] conducted a study to measure the performance of artificial neural networks, support vector techniques, and regression analysis methods in the early determination of construction costs. In addition, Bostancioglu [8], Gunaydin and Dogan [9], Akinbingol and Gultekin [10], Dogan et al. [11], Nan et al. [12], Sonmez [13], Arafa and Alqedra [14], Kuruoglu et al. [15], Cho et al. [16], Latief et al. [17], El-Sawalhi and Shehatto [18], Bayram et al. [19], Coloma et al. [20], and Dimitrijević et al. [21] conducted modelling studies in order to estimate the costs of buildings at an early stage. The literature review is detailed in Table 1.

Artificial intelligence (AI) is known to be an effective modelling technique in various fields, including the specialty areas within civil engineering [22]. Artificial Neural Network (ANN), Genetic Algorithm (GA), Gene Expression Programming (GEP), and simulation have recently been used, either combined or separately, in various fields of civil engineering. According to Table 1, researchers have generally used Artificial Intelligence Techniques (AI) and Regression Analyses (RA) for the estimation of building cost at an early stage and have investigated the efficiency of these techniques in predicting ESCE. Studies in the literature confirm the validity of AI in the assessment of overhead construction costs. To the best of the authors' knowledge, no formula that estimates ESCE has so far been presented in the literature. Therefore, the first aim of this study is to propose a formula to calculate ESCE in a rapid, easy, and accurate manner. Another aim is to show that ESCE models of satisfactory precision can in fact be created with AI. For this purpose, a new formula is proposed as a result of an integrated ANN-GEP model with a specific number of parameters selected from architectural and static projects. The proposed ESCE formula not only helps investors and bidders calculate costs faster and more easily, but also prevents differences in ESCE calculations that may arise from individual approaches.

Methodology
The authors' aim was to develop a formula that enables easy, accurate, and fast estimation of building cost. The methodology shown in Figure 1 was applied to obtain this formula. In this methodology, after an extensive literature review, 18 independent variables were evaluated and grouped according to their ability to predict ESCE using the ANN method. Then, the algorithm of the group with the highest ability to predict ESCE was designed using GEP, and the GEP model was created. Finally, a formula that calculates ESCE accurately was created. Stage 1 is described in Section 3. Detailed information about the formula and the other stages is given in Section 4.

Problem analysis and establishment of data set
The Early-Stage Cost of Buildings is one of the most critical components of the employer's and contractor's budget accounts on a construction project. Employers wish to obtain accurate results in cost estimation calculations in the shortest period of time in order to arrange the budgets of their investments, whereas contractors wish to calculate their profits accurately. The traditional estimation of building costs based on a quantity survey calculation for the whole project is accurate but time-consuming. The cost calculation at an early stage of a project can vary from one individual to another. At the preliminary design stage, or where the bidding period is limited, building costs need to be calculated in an expedited manner. While bidding for a tender, if a detailed cost analysis is not possible due to time constraints, checking the early cost calculation with a simple ESCE method may prove quite convenient for bidding companies and public utility companies. For these reasons, there is an apparent need for methods to estimate ESCE accurately, both in theory and in practice.

The studies of Kim et al. [7], El-Sawalhi and Shehatto [18], and Gunaydin and Dogan [9] on "early stage building costs" were taken as a basis for the selection of the independent variables subject to the quantity survey. The eighteen parameters selected in this context are shown in Table 2. The quantities of these parameters were calculated for each of the projects from exact, individually performed measurements.

General features of Artificial Neural Networks (ANN) method and analysis of independent variables
Artificial Neural Networks (ANN) are one of the artificial intelligence techniques that learn from existing examples, modelled on the learning abilities of the human brain. ANN provides a contemporary and accurate solution approach based on artificial intelligence [25]. ANN, which is not algorithmic and which can carry out parallel operations efficiently, can solve complicated and non-linear problems in a rapid and convenient manner [26]. The selection of the right network is the most important stage for training the network. Network topology, summation function, activation function, and learning strategy are the elements that distinguish one network model from another [27]. The most fundamental part of an ANN is the nerve cell (neuron), which is also called the "processor". The processor comprises one or more inputs, weights associated with the inputs, an integration function (total connections), a transfer (activation) function, and a single output.

The integration function provides the net input by combining the inputs coming to the cell with their weights. Among the functions that can take the form of summation, multiplication, maximum, minimum, majority, etc., the most commonly used is the summation function [27]. The activation function, which is also called the "learning curve", is used to create the output value [28]. Several types of activation functions are used, such as sigmoid, linear, step, sine, and hyperbolic tangent. These functions are selected by the user who creates the network [27]. Sigmoid and hyperbolic tangent functions are the most widely used transfer functions [28].
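The processor described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Matlab code: the function names, the sample inputs and weights, and the optional bias term are assumptions introduced here.

```python
import math

def logsig(x: float) -> float:
    """Logistic sigmoid; the output lies strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights, bias=0.0, activation=logsig):
    """Summation (integration) function followed by an activation function.

    The bias term is an assumption of this sketch; the text does not mention one.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    net = sum(x * w for x, w in zip(inputs, weights)) + bias  # net input
    return activation(net)  # single output of the processor

# Hypothetical inputs and weights, chosen only for illustration.
out_sigmoid = neuron_output([0.5, 0.2], [0.8, -0.4])
out_tanh = neuron_output([0.5, 0.2], [0.8, -0.4], activation=math.tanh)
print(out_sigmoid, out_tanh)
```

Swapping the `activation` argument is all that is needed to compare the sigmoid and hyperbolic tangent transfer functions on the same net input.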
Depending on the number of layers, two types of ANN can be distinguished: single-layer networks and multi-layer networks [29]. A single-layer ANN comprises input and output layers, whereas a multi-layer ANN comprises three kinds of layers: the input layer, hidden layers, and the output layer. Multi-layer ANN, which can provide 95 % results for engineering problems, are the models most frequently used today [27].

Table fragment (columns not recoverable): Concrete frame x; Iron and concrete frame x; Stone exterior wall cladding x; Plaster exterior wall cladding x; Number of consoles in the building x; Soil structure-topography x; Construction type x
For the ANN analyses, the dependent variable (ESCE) and the independent variables were linearly normalized using Equation (2). The objective of this procedure is to facilitate learning and prevent errors [30].

$$y_i = \frac{y}{y_{max}} \qquad (2)$$

Here $y_i$ refers to the result of the normalization, $y$ refers to the value to be normalized, and $y_{max}$ denotes the largest value of the variable. The ANN analyses were performed with code written in Matlab R2018a. In the ANN model, the sigmoid (logsig) function, which gives positive values in the range [0, 1], was selected as the activation (transfer) function, whereas the summation function, which is the most commonly used in the literature, was selected as the integration function. Supervised learning was selected as the learning method. The back propagation (BP) algorithm is the most frequently used algorithm for multi-layer feed-forward networks [31]. BP was used as the learning algorithm because it can be easily verified and is preferred in the studies covered by the literature review; Scaled Conjugate Gradient (SCG) was also used, as it yields successful results in problems where BP is used.
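The normalization step can be illustrated with a short Python sketch. It assumes that Equation (2) divides each value by the largest value of its variable, which is how the symbol definitions read. The function names are hypothetical, and the sample numbers are floor areas quoted elsewhere in the paper, used here only as inputs.

```python
def normalize_by_max(values):
    """Linear normalization: divide every value by the largest value of the variable."""
    y_max = max(values)
    if y_max <= 0:
        raise ValueError("the largest value must be positive")
    return [y / y_max for y in values], y_max

def denormalize(normalized, y_max):
    """Map normalized values back to the original scale."""
    return [v * y_max for v in normalized]

# Sample floor areas in square metres (illustrative inputs only).
areas = [141.0, 4440.0, 7947.0]
scaled, y_max = normalize_by_max(areas)
restored = denormalize(scaled, y_max)
print(scaled)
```

For positive data this keeps every value in the interval (0, 1], which matches the output range of the logsig activation function.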

Determination of variable groups with Artificial Neural Networks (ANN)
An important stage in ANN modelling is to find the optimal network, that is, the number of neurons in the hidden layer that gives the best coefficient of determination (R²).

During the first stage of the ANN analyses, the coefficient of determination (R²) for the estimation of ESCE was determined with each of the eighteen independent variables used individually as the input. In these analyses, the independent variables make up the input layer individually, whereas the output layer consists of the ESCE. The number of neurons in the hidden layer that gives the best solution was determined to be seven. The number of iterations was found by trial during each training run. It was observed that the best number of hidden-layer neurons differs for each variable.

Table 3 provides the ANN analysis results for the best hidden layer value of each independent variable. In the analyses, the maximum number of iterations was set to 10000. However, as a further reduction in the slope ratio of the training data would increase the error margin of the testing data, the network was stopped when the error curve for the training data reached the slope. As can be seen in Table 3, the variables with the highest R² value in the estimation of ESCE are wet areas and total indoor construction areas. During the first stage of the ANN analyses, the independent variables whose training results were 0.85 or above, as provided in Table 3, were selected. They were then analysed with the SCG algorithm. The corresponding values are shown in Table 4. Table 4 shows that the R² value for the total exterior wall area is 0.96 for the training data and 0.51 for the testing data. As the R² values of the testing and training data differed more than expected, when the independent variables were being named, the total exterior wall area variable was named y3 and the wet area variable was named y2. These names are shown in Table 5.

When determining the independent variable groups, ACG1 was first created with the total indoor area and the wet area (y1 and y2), the two variables with the highest R² value. Afterwards, the five remaining variables were added to ACG1 one by one, creating five new groups (ACG2-ACG6). The group with the best analysis result was selected among these five new groups. The four variables not yet used were then added to the selected group one by one to form new groups. This process was continued until ACG16 with seven variables was created. The creation of the groups is summarized in Table 6.
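The group-building procedure described above is a form of greedy forward selection, and a short Python sketch shows why it ends at ACG16. Everything here is illustrative: `build_groups`, `toy_score`, and the numbers in `TOY_GAIN` are made up for this sketch and stand in for an ANN training run that would return R² or SSE. They are not the paper's results.

```python
from typing import Callable, List, Sequence, Tuple

Group = List[str]

def build_groups(seed: Sequence[str],
                 candidates: Sequence[str],
                 score: Callable[[Group], float]) -> Tuple[List[Group], Group]:
    """Grow variable groups one variable at a time, keeping the best trial of each round."""
    groups: List[Group] = [list(seed)]   # first group (ACG1 in the paper's naming)
    current: Group = list(seed)
    remaining = list(candidates)
    while remaining:
        trials = [current + [v] for v in remaining]   # add each unused variable in turn
        groups.extend(trials)
        current = max(trials, key=score)              # best group of this round
        remaining = [v for v in remaining if v not in current]
    return groups, max(groups, key=score)

# Made-up contributions standing in for an ANN run; NOT the paper's R² values.
TOY_GAIN = {"y3": 0.04, "y4": 0.01, "y5": 0.03, "y6": 0.02, "y7": 0.005}

def toy_score(group: Group) -> float:
    """Hypothetical score: toy gains minus a small penalty per variable."""
    return 0.9 + sum(TOY_GAIN.get(v, 0.0) for v in group) - 0.012 * len(group)

groups, best = build_groups(["y1", "y2"], ["y3", "y4", "y5", "y6", "y7"], toy_score)
print(len(groups), len(groups[-1]))  # prints "16 7"
```

Starting from two variables and five candidates, the rounds produce 1 + 5 + 4 + 3 + 2 + 1 = 16 groups, the last of which contains all seven variables.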

Table 5. Independent variable names and abbreviations

Independent variable name      Abbreviation
Total indoor area              y1
Wet area                       y2
Total exterior wall area       y3
Total internal wall area       y4
Vertical bearer area           y5
Floor area                     y6
Number of vertical bearers     y7

The most successful group in the estimation of ESCE is ACG12, which has five input neurons, seven hidden neurons, and one output neuron. This group includes the independent variables of the total indoor area (y1), wet area (y2), total exterior wall area (y3), vertical bearer area (y5), and floor area (y6).

General features of Gene Expression Programming (GEP) method and analysis of independent groups
Gene Expression Programming (GEP), a technique developed by Ferreira, is based on the genetic algorithm (GA) and genetic programming [32]. The difference between these three methods arises from the structure of the chromosomes.

In the genetic algorithm, chromosomes are linear strings of fixed length, whereas in genetic programming they are non-linear entities that vary in size and shape. In gene expression programming, chromosomes are encoded as linear strings of fixed length and are then expressed as non-linear simple diagrams, or expression trees, of various sizes and shapes [32]. Chromosomes presented as expression trees are described in different forms and sizes by the operators found in GEP. Genetic operators such as replication, mutation, transposition, and recombination are applied to the linear chromosomes. As a result of these operators, non-linear variables in fixed numbers and lengths are converted into linear series with different sizes and forms, and functions are generated [32-34]. In this method, mathematical codes are used as the language for the genes and expression trees [32, 34-36]. This method defines all problems, from the simplest to the most complicated, with an expression tree. An expression tree comprises mathematical statements, constants, variables, and functions [32-34, 37, 38]. Furthermore, expression trees can be converted into nearly all programming languages [39].

The building blocks of gene expression coding are chromosomes and expression trees. The codes of solution models in gene expression programming are made up of genes with heads, tails, and constants, and of chromosomes that contain a binding function, which binds these genes together. When the solution architecture is prepared, the number of genes, the head length, and the binding function, which determine the largest size of each expression in the model, are selected [40].
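How a fixed-length gene with a head and a tail becomes an expression tree can be sketched in Python. This sketch follows the breadth-first reading of a gene used in the GEP literature and the usual tail-length relation t = h(n - 1) + 1, where n is the largest number of arguments of any function. The function set, the sample gene, and all names below are assumptions for illustration, not the settings used in the study.

```python
import operator

# Hypothetical function set: symbol -> (number of arguments, implementation).
FUNCTIONS = {
    "+": (2, operator.add),
    "-": (2, operator.sub),
    "*": (2, operator.mul),
}

def tail_length(head_length: int, max_arity: int) -> int:
    """Tail length that guarantees every function in the head can be supplied with arguments."""
    return head_length * (max_arity - 1) + 1

def decode(gene):
    """Read the gene level by level and build nested [symbol, children] nodes."""
    root = [gene[0], []]
    queue = [root]
    pos = 1
    while queue:
        node = queue.pop(0)
        arity = FUNCTIONS[node[0]][0] if node[0] in FUNCTIONS else 0
        for _ in range(arity):
            child = [gene[pos], []]
            pos += 1
            node[1].append(child)
            queue.append(child)
    return root  # symbols after position `pos` are non-coding

def evaluate(node, values):
    """Evaluate an expression tree for the given terminal values."""
    symbol, children = node
    if symbol in FUNCTIONS:
        args = [evaluate(child, values) for child in children]
        return FUNCTIONS[symbol][1](*args)
    return values[symbol]

# Head of length 3 ("+", "*", "a") followed by a tail of length 4 (terminals only).
gene = ["+", "*", "a", "b", "c", "a", "b"]
tree = decode(gene)  # encodes (b * c) + a
print(tail_length(3, 2), evaluate(tree, {"a": 1, "b": 2, "c": 3}))  # prints "4 7"
```

Only the first five symbols of the sample gene are expressed; the remaining tail symbols stay in the chromosome but do not appear in the tree, which is what lets fixed-length genes encode trees of different sizes.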
Genetic operators, on the other hand, are operations carried out to produce new generations with better qualities from the existing population; in this way, the scope of the search algorithm is expanded. Essentially, two general operators are known: transposition and mutation [41]. Various genetic strategies were created in GEP through various uses of these operations and random numerical constants (RNC) [42]. GeneXpro 5.0, which was used in this study, includes five different training strategies: optimal evolution, constant fine-tuning, model fine-tuning, subset selection, and custom [42]. Solution trees with long chromosomal structures are required for the solution of complicated problems. In GEP, they are coded into smaller structures (such as the genes in chromosomes) with sub-expression trees, and a hierarchical structure is created [33-37, 43]. The maximum weights and depths of the sub-expression trees are calculated for each gene based on Equation (4) and Equation (5). In Equation (4), w refers to the weight of the sub-expression tree, h refers to the head length, and n refers to the largest value of the parameters obtained by the functions in the function set [40]. In Equation (5), d refers to the depth of the sub-expression tree, h refers to the head length, and m refers to the smallest value of the parameters obtained by the functions in the function set [40]. Binding functions consisting of addition, subtraction, multiplication, and division operations are used for binding the sub-expression trees together [33, 35-37, 43].
Fitness functions used in GEP demonstrate the capability of the solution genes, as in the case of the genetic algorithm. Fitness functions such as the mean absolute error (MAE), mean square error (MSE), root mean square error (RMSE), relative square error (RSE), root relative square error (RRSE), relative absolute error (RAE), wrong balance, cost/revenue matrix, and positive correlation are used in regression analyses conducted with GEP [40]. Furthermore, GEP is a genotype/phenotype genetic algorithm used as a new method to produce formulas [44].

In this study, the GEP model was designed with the foregoing five variables as input data and the updated ESCE as output data. The ESCE values were revised with the iteration method using 2016 as a basis. Moreover, the training and testing data were determined randomly as 80 % and 20 % in the analyses. The root relative square error (RRSE) was used as the fitness function for the regression analysis within the gene expression programming. The RRSE fitness function (E_i) is shown mathematically in Equation (6) and Equation (7).

$$\bar{T} = \frac{1}{n}\sum_{j=1}^{n} T_j \qquad (6)$$

Here $T_j$ refers to the target value for j and n refers to the number of samples [42].

$$E_i = \sqrt{\frac{\sum_{j=1}^{n}\left(P_{ij} - T_j\right)^2}{\sum_{j=1}^{n}\left(T_j - \bar{T}\right)^2}} \qquad (7)$$

where $P_{ij}$ refers to the values estimated by the model, $\bar{T}$ refers to the mean value calculated with Equation (6), $T_j$ refers to the target value for j, and n refers to the number of samples [42].
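The RRSE measure can be illustrated with a short Python sketch: the root of the summed squared errors of the model divided by the summed squared deviations of the targets from their mean. The function names and sample values are hypothetical. The mapping from error to a fitness value with a maximum of 1000 is an assumption based on the form commonly documented for GEP tools; the paper itself only states that a maximum fitness of 1000 was adopted.

```python
import math

def rrse(predicted, target):
    """Root relative square error of predictions against target values."""
    n = len(target)
    mean_target = sum(target) / n
    squared_error = sum((p - t) ** 2 for p, t in zip(predicted, target))
    squared_spread = sum((t - mean_target) ** 2 for t in target)
    if squared_spread == 0:
        raise ValueError("target values must not all be identical")
    return math.sqrt(squared_error / squared_spread)

def fitness_from_error(error: float, max_fitness: float = 1000.0) -> float:
    """Assumed mapping: zero error gives the maximum fitness, larger errors give less."""
    return max_fitness / (1.0 + error)

# Hypothetical target and predicted values, for illustration only.
targets = [100.0, 200.0, 400.0]
predictions = [110.0, 180.0, 400.0]
error = rrse(predictions, targets)
print(error, fitness_from_error(error))
```

An RRSE of 0 means a perfect fit, while an RRSE of 1 means the model does no better than always predicting the mean of the targets.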
GeneXpro includes several fixed genetic strategies. The optimal evolution, constant fine-tuning, and model fine-tuning strategies were used in the study. These three genetic strategies and the proportions of the genetic operators in each of them are given in Table 7.

The analyses were started with optimal evolution. When the best fitness value stopped improving, the three strategies were used alternately. There are two general operators in the genetic algorithm: transition and mutation [41]. The other operators are named according to where and how these two general operators are applied to chromosomes or genes, cf. Table 7. The ratio of the constants for these three strategies is provided in Table 8. Moreover, a maximum fitness value of 1000 was adopted. For each run, the head length and the number of genes are selected. The type of binding function, the number of genes, and the length of each gene must be chosen a priori for each problem. It is therefore always possible to start with a single-gene chromosome and gradually increase the head length. If the head becomes too long, the number of genes can be increased and a function can be selected for binding. In other cases, however, another binding function may be more appropriate [32]. Moreover, the number of chromosomes, head size, number of genes, and binding functions that constitute the basis of the GEP architecture were determined through several trials in order to obtain the best analysis result in the study. The GEP analysis adjustments are presented in Table 9.

Creating the ESCE formula
In the analysis conducted to estimate ESCE with GEP, the five variables that have an impact on ESCE were used as input data and the updated ESCE values were used as output data. The results of the model that best determines ESCE are presented in Table 10. As a result of the model, an ESCE formula with five variables was created. This formula is shown in Equation (8).

According to Table 10, the determination coefficient (R²) of the ESCE formula was calculated as 0.90 for the training set and as 0.96 for the testing set. The Mean Absolute Percentage Error (MAPE) is a commonly used general measure for assessing prediction accuracy. According to Table 10, the MAPE of the ESCE formula was calculated as 0.24 for the training set and as 0.18 for the testing set. According to Lewis, a MAPE value between 0.20 and 0.50 is a reasonable forecasting value [47].
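The two accuracy measures reported above can be computed as in the following Python sketch. The paper does not state which definition of R² was used; the sketch assumes the common form of one minus the ratio of residual to total sum of squares, and expresses MAPE as a fraction (so 0.18 corresponds to 18 %). Function names and sample values are hypothetical.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned as a fraction of the actual values."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Coefficient of determination, assumed here as 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical actual and predicted costs, for illustration only.
actual_costs = [100.0, 200.0, 400.0]
predicted_costs = [110.0, 180.0, 400.0]
print(mape(actual_costs, predicted_costs), r_squared(actual_costs, predicted_costs))
```

Because MAPE divides by the actual value, it weights a given absolute error more heavily on small projects than on large ones, which is worth keeping in mind when the data set spans a wide range of building sizes.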
According to the relevant literature, there is no exact standard for the percentage accuracy to be achieved in building cost estimation. However, according to some studies, the accuracy of the estimation depends on the project information: it can range from +40 % to -20 % before preliminary design, and from +25 % to -10 % afterwards [48, 49]. A comparison of the real values with those obtained from the formula is shown in Figures 3, 4, and 5. Additionally, the percentage error distributions of the training and testing data for the formula are shown in Figures 6 and 7.
Table 10. GEP results for ESCE

In Equation (8), ESCE refers to the Early-Stage Cost of Buildings, CCI_x refers to the Construction Cost Index for the desired year, CCI_i refers to the Construction Cost Index for 2016, y1 refers to the total indoor area, y2 refers to the wet area, y3 refers to the total exterior wall area, y5 refers to the vertical bearer area, and y6 refers to the floor area. Models designed independently and by specialized applications are needed in construction planning and management [45].

For this purpose, the formula that estimates ESCE, derived from the algorithm determined with GeneXpro 5.0, was created with five independent variables. The values of these five independent variables, which are taken from the static and architectural designs of the project and differ from project to project, are entered by the user. As a result, the project's ESCE is estimated by the formula. The ESCE information calculated by the classical method is presented in the Electronic Public Procurement Platform (EKAP) [46].

Case study
To see the performance of the proposed formula, a simple application was made on a real 1-storey public reinforced concrete building occupying a total floor area of 4440 square meters.This building was selected randomly among the 2019 construction tenders in EKAP.General building and tender information are given in Table 11.

Table 11. General information of a real public reinforced concrete construction
The quantity survey study was conducted for the public reinforced concrete building using the five independent variables. The corresponding results are given in Table 12. According to the Central Bank of the Republic of Turkey 2019 exchange rate data, $1 was assumed to be 5.7 TL. The total indoor area of each floor was calculated by deducting the door and window lengths from the plan of each floor and multiplying by the wall thickness. The total indoor area of the building was determined by summing these values for each floor. The wet area of the building was calculated as the sum of the wet areas of each floor. The total indoor area of the building was calculated by multiplying the number of floors by the floor areas.

The vertical bearer area refers to the cross-sectional areas of the columns and shear walls on a single floor. The ESCE of the public reinforced concrete building calculated by the employer (a public institution) amounted to $271,064.82, while the corresponding amount calculated using the proposed formula was $263,994.40. When these two results are compared, it can be seen that the accuracy of the ESCE estimation using the proposed formula amounts to 97 %.
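The 97 % figure can be checked with a few lines of Python using the two amounts quoted above. The variable names are hypothetical. The last lines compute the ratio of the construction cost indices that the paper quotes for September 2019 (190.23) and for the 2016 average (111.92), on the assumption that the index adjustment enters the formula as CCI_x / CCI_i; the extracted text suggests this but does not show the equation.

```python
# Amounts quoted in the case study, in US dollars.
employer_estimate_usd = 271064.82
formula_estimate_usd = 263994.40

# Relative difference between the two estimates, and the resulting accuracy.
relative_error = abs(employer_estimate_usd - formula_estimate_usd) / employer_estimate_usd
accuracy = 1.0 - relative_error
print(round(accuracy * 100))  # prints 97

# Assumed form of the index adjustment: target-period index over base-year index.
cci_september_2019 = 190.23
cci_average_2016 = 111.92
cci_ratio = cci_september_2019 / cci_average_2016
print(cci_ratio)
```

The relative difference between the two estimates is about 2.6 %, which rounds to the 97 % accuracy stated in the text.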

Discussion and conclusion
The Early-Stage Cost of Buildings (ESCE) is one of the most critical components of an employer's and contractor's budget accounts on a construction project. Employers wish to obtain accurate results in cost estimations in the shortest time to arrange the budgets of their investments, whereas contractors wish to calculate their profits accurately. The traditional estimation of building cost based on a quantity survey calculation for the entire project is quite accurate but is also time-consuming. At the preliminary design stage, or where the bidding period is limited, building costs may need to be calculated in an expedited manner. According to Kim et al. (2004), it would be appropriate to create a genetic algorithm-based model to obtain optimum cost estimation parameters with an optimum NN architecture in building cost estimation [50].

In this study, a formula was proposed for the Early-Stage Cost Estimation of Building Construction Projects (ESCE) in a rapid, easy, and accurate manner, using data selected according to the construction design and employing the ANN and GEP methods. A quantity survey study was conducted on one hundred construction projects tendered between 2011 and 2016 in relation to the independent variables determined to have an impact on building costs, and a data set was created. The total floor area of the projects used in the data set varied between 141 m² and 7947 m². According to Khamis et al. (2005), outliers in the training and testing data reduce the accuracy of the model [51]. For this reason, setting limit values for the total floor area of the projects prevented the formation of extreme values and increased the learning performance of the model. The data set was examined with ANN analyses in order to determine the variables affecting the ESCE. As a result of these analyses, the variable group consisting of the total indoor area (y1), wet area (y2), total exterior wall area (y3), vertical bearer area (y5), and floor area (y6) was used to estimate ESCE, and successful results were achieved. The ESCE estimation coefficient of this group was found to be 0.99 for the training set. At the next stage, a model was configured with GEP, where the independent variables found as a result of the ANN analysis were the input data for GeneXpro 5.0. The formula of the model (the ESCE formula, Equation (8)) was created with these five independent variables. The determination coefficient (R²) of this ESCE formula was calculated as 0.90 for the training set and as 0.96 for the testing set. The testing set MAPE value of the ESCE formula was calculated as 0.18, which is within reasonable forecasting value limits.

In addition, a correlation and linear regression analysis was performed on the data set in order to compare this model (this formula) with regression analysis, which is a classical method. As a result of the analysis, it was determined that the building importance coefficient, vertical bearer area, and building height parameters were effective in early-stage building cost estimation, and the R² value was found to be 0.77. Kim et al. [7], Cho et al. [16], Latief et al. [17], and Kim et al. [50] showed that artificial intelligence techniques (such as ANN and neuro-fuzzy) perform better than regression analysis, which is a classical method for estimating construction costs. Finally, the formula was verified on a case study to assess its efficiency.
The following conclusions were drawn in this research:
- The proposed formula alleviates the burden of long-lasting quantity surveys for reinforced concrete construction projects.
- The proposed formula estimates ESCE rapidly and easily.
- It was observed during the study that the quantity survey calculation of these projects, even for ESCE, varied from one case to another. The use of the proposed formula in early-stage building cost calculations is important not only for faster and easier cost calculation but also to prevent any differences that may arise due to the individual making the calculations.
- The results of the research showed the validity of using ANN and GEP together in the calculation of ESCE.
- It was demonstrated that an ESCE formula with satisfactory precision can be created.
- Moreover, a significant contribution is made to the literature in the sphere of facilitating an easy, rapid, and accurate estimation of ESCE on construction projects.

Figure 2. Structure of ANN for ACG12

Figure 3. GEP result of testing data for ESCE

Figure 4. Training data distribution diagram for ESCE
Figure 5. Testing data distribution diagram for ESCE

Table 3. Results of ANN analysis where independent variables are used as a single output
The network was stopped when the error curve for the training data reached the slope, because a further reduction in the slope ratio of the training data would increase the error margin of the testing data.

Independent variable name; Number of neurons in hidden layer; Iteration number; R² (Training); SSE (Training, Testing)
The analyses were continued with the SCG algorithm. The number of neurons in the hidden layer and the number of iterations vary for each independent variable group. Error values were evaluated with the sum of squared errors (SSE), as shown in Equation (3).

$$SSE = \sum \left(y - y'\right)^2 \qquad (3)$$

Here SSE refers to the sum of squared errors, y refers to the real data, and y' refers to the analysis result data. The most successful group was identified by examining the R² and SSE values provided in Table 6.

Table 7. Operator values according to genetic strategies*
Table 8. Random constants*
Table 9. GEP analysis adjustments
* Definitions and more detailed information about the other parameters in the table can be found in [40, 42].
** The RIS and IS transposable elements of GEP are pieces of the genome that can be activated and can jump to another location in the chromosome.

Figure 6. Percentage errors distribution of training data for ESCE
Figure 7. Percentage errors distribution of testing data for ESCE
In the case study calculation, ESCE refers to the early-stage building cost of the real public reinforced concrete building, 190.23 is the construction cost index for September 2019, and 111.92 is the average construction cost index for 2016.