Construction cost prediction system based on Random Forest optimized by the Bird Swarm Algorithm

Construction cost prediction often suffers from low accuracy, limited applicability and low efficiency, owing to the complex composition of construction projects, the large number of personnel, long working periods and high levels of uncertainty. To address these concerns, a prediction index system and a prediction model were developed. First, the factors influencing construction cost were identified, a prediction index system including 14 secondary indexes was constructed and the methods of obtaining data were described in detail. A prediction model based on the Random Forest (RF) algorithm was then constructed. The Bird Swarm Algorithm (BSA) was used to optimize the RF parameters and thereby avoid the effect of randomly selected RF parameters on prediction accuracy. Finally, the engineering data of a construction company in Xinyu, China were selected as a case study. The case study showed that the maximum relative error of the proposed model was only 1.24%, which met the requirements of engineering practice. For the selected cases, the minimum prediction index system that met the required prediction accuracy included 11 secondary indexes. Compared with classical metaheuristic optimization algorithms (Particle Swarm Optimization, Genetic Algorithms, Tabu Search, Simulated Annealing, Ant Colony Optimization, Differential Evolution and Artificial Fish School), BSA determined the optimal combination of calculation parameters more quickly, on average. Compared with classical and recent forecasting methods (Back Propagation Neural Network, Support Vector Machines, Stacked Auto-Encoders and Extreme Learning Machine), the proposed model exhibited higher forecasting accuracy and efficiency. The prediction model proposed in this study could better support the prediction of construction cost, and the prediction results provide a basis for optimizing the cost management of construction projects.


Introduction
As a pillar industry of the country, particularly in developing countries, the construction industry plays a major role in domestic economic growth. Moreover, competition in the construction market has intensified. For construction enterprises, the cost management of construction projects is the center of all business and management activities, and forecasting construction cost is an important part of the entire cost management system [1]. Therefore, the accurate prediction of the cost of construction projects bears profound significance in reducing the construction cost and improving the efficiency of enterprises.
Currently, research results related to cost forecasting have been widely available, but two prominent problems remain. 1) Most research results are tantamount to developing the construction cost prediction index system for construction projects at the project level. This type of index system is highly detailed, and its scope of application is narrow, failing to provide reference and insight into the cost management of construction enterprises. 2) Current research methods are mainly static qualitative predictions, such as expert meetings, procedural investigation and subjective probability methods. These approaches are largely influenced by subjective factors, and the effective handling of the nonlinear characteristics of small samples in cost forecasting presents a challenge [2].
With the development of artificial intelligence, the application of machine learning algorithms to cost prediction has become a trend [3]; however, the related research results still have shortcomings. A decision tree (DT) requires considerable data preprocessing, and its prediction results easily fall into a local optimum [4]. A Support Vector Machine (SVM), a typical nonlinear modeling tool, is computationally complex and performs poorly on multiclass problems [5]. The Artificial Neural Network (ANN), the most commonly used nonlinear modeling tool, possesses excellent nonlinear modeling ability but has several disadvantages related to learning, local minima and slow convergence [6].
Proposed by Breiman in 2001, random forest (RF) [7] is an ensemble classification algorithm based on statistical learning theory. The basic ideas behind the application of RF to nonlinear modeling are as follows. 1) A plurality of weak classifiers is combined to form a strong classifier. 2) These weak classifiers play a complementary role. 3) By reducing the influence of the error of any single classifier, the classification accuracy and stability are improved. Compared with other classical nonlinear modeling techniques, RF exhibits robust data mining ability, high prediction accuracy and good tolerance for outliers and noise. In addition, RF is not prone to overfitting. At present, RF has been successfully applied in forecasting precious metal prices [8], the gross domestic product (GDP) [9] and power load [10].
In the RF model, an appropriate choice of the number of DTs and the number of split features can efficiently reduce the complexity of the RF and improve its computational performance, yielding an enhanced ensemble classifier. Classical metaheuristic optimization algorithms, such as Particle Swarm Optimization (PSO) [11] and Genetic Algorithms (GA) [12], are broadly used to determine the optimal calculation parameters of the RF. However, the global search capabilities of these two algorithms are not satisfactory.
Bird Swarm Algorithm (BSA) is a novel bio-inspired heuristic algorithm proposed by Meng et al. [13] in 2015. BSA simulates the biological behaviors of birds in nature, such as foraging, vigilance and migration. Compared with classical metaheuristic optimization algorithms, BSA features decentralized search, maintains population diversity and avoids falling into a local optimum. BSA has been successfully applied in the optimal parameter calculation of the Back Propagation Neural Network (BPNN) [14], the torsional capacity evaluation of RC beams [15] and the optimization of the vehicle powertrain [16]. To the best of our knowledge, no research has addressed the optimization of RF calculation parameters based on BSA. Therefore, to address inadequacies such as low prediction accuracy and efficiency, we developed a novel model for predicting the construction cost of a building. This study presents the following major contributions. 1) Most index systems of construction cost prediction in related studies were built at the project level, which brought disadvantages such as a narrow scope of application. In this study, the index system of construction cost prediction was constructed from the perspective of construction enterprises for the first time. The index system exhibited good adaptability and could be widely applied in construction cost management for construction enterprises. 2) A construction cost prediction model based on RF optimized by BSA was proposed. This model showed strong data mining ability, effectively enhancing the accuracy and efficiency of construction cost prediction.
The remainder of this manuscript is arranged as follows. Section 2 summarizes the research results associated with this study; Section 3 analyzes the factors influencing architectural project construction cost, and constructs the related prediction index system; Section 4 introduces the construction cost prediction method based on the RF optimized by BSA; and Section 5 presents a case study to verify the science and effectiveness of the proposed model. Section 6 compares the computational performance of different algorithms to emphasize the advantages of the model proposed in this study. Section 7 summarizes the research results and limitations of this study.

Related work
Kim et al. [6] proposed a prediction model of construction cost based on the Regression Comprehensive Moving Average Model (RCMAM) and ANN. The complexity and prediction workload of this hybrid model were larger than those of RCMAM or ANN alone. Whether the prediction accuracy improved because of the increase in model complexity or because of the advanced nature of the hybrid model itself was difficult to assess. Thus, the conclusion that the hybrid model proposed by Kim et al. exhibited enhanced calculation accuracy might be biased. In a study of concrete engineering costs in Egypt, Elfaham [17] emphasized the influence of inflation on the prediction results. A reasonable way to eliminate this effect was to convert the costs generated at different time points to the same time point. In the study by Elfaham, ANN was used to build the prediction model but was not compared with other nonlinear modeling methods. Cao and Ashuri [18] constructed a highway construction cost prediction model based on Long Short-Term Memory. However, their study only considered quantitative factors, such as the price of building materials and the wages of construction workers; it did not consider qualitative factors, such as the level of construction management and the ecological environment of construction. In project management practice in construction engineering, these qualitative factors also exert a significant effect on the construction cost. Using Complex Network Analysis (CNA), Mao and Xiao [19] developed a novel construction cost prediction model. The influencing factors of the construction cost were regarded as network nodes, and the cost was predicted by analyzing the relationships among the network nodes. However, CNA is a typical sociological research technique, which is easily influenced by the subjective judgment of managers.
Pierioch and Risse [8] used RF to construct a price prediction model for precious metals. Compared with classical methods, such as Multiple Linear Regression (MLR), RF showed higher prediction accuracy. In addition, multivariate and univariate prediction results were compared, and the multivariate prediction was found to be more accurate than the univariate prediction. Using the economic data of Japan from 2001 to 2018, Yoon [9] constructed an RF-based forecasting method for Japan's GDP. However, the effect of the initial RF calculation parameters on the prediction results was not analyzed. Fast and accurate forecasting of short-term power load has consistently been a difficult area in power management research. Accordingly, Dang et al. [10] developed a stochastic RF model to effectively quantify the uncertainty of power load forecasting. Compared with three other classical power load forecasting techniques, this model was shown to exhibit higher forecasting speed and accuracy.
Imitating the foraging, alert and flight behaviors of birds, Meng et al. [13] proposed BSA; tests on 18 classic benchmark problems showed that it more effectively avoided falling into local optima. Zhang et al. [14] successfully used BSA to optimize a BPNN for Quadrature Amplitude Modulation signal recognition in 5G communication systems. However, their study did not compare the computational performance of BSA with that of classical optimization algorithms. Using ANN as an example, Kaya [20] analyzed the optimization performance of 16 metaheuristic algorithms. Numerous tests showed that BSA achieved the highest speed and stability. Wu et al. [16] solved a typical multi-objective optimization problem for a vehicle powertrain by using BSA. Compared with other optimization algorithms, BSA was more suitable for multi-objective optimization.

Analysis of factors influencing the building project construction cost
The project construction cost is the sum of all costs incurred by construction enterprises to successfully complete construction projects. Construction projects are one-off and distinct, rendering the construction costs of different projects significantly different. The building project construction cost generally includes the material, labor, machinery, financial and management costs.
Regulations on the Construction Cost Management of Capital Construction Projects in China, together with previous research results [4,17-19], indicate that the influencing factors of project construction cost are determined from the following aspects.
1) Building scale ( ) Building scale refers to the size of the volume, pattern, or scope of building projects. The scale of the building is directly proportional to the materials, machinery and personnel needed for the project; thus, it is directly proportional to the construction cost. The scale of the building generally includes influencing factors, such as the type of structure, total height, area of the standard floor and basement area.
2) Project Management ( ) Project management is an important factor considered to determine whether the project cost control is economical. Contrary to other research results [4], project management is separately listed to more clearly analyze its main effect on the construction cost. Project management consists of four types: type of contract, the difficulty of resource scheduling, the proportion of managers and the difficulty of quality and safety management.
3) Site conditions ( ) On-site conditions determine the difficulty of the construction and thus also markedly influence the consumption of construction costs. This primary index includes the quality of life of employees and the quality of the construction environment, all of which influence the smooth progress of a project. In this study, site conditions were mainly divided into the distance from the location of the material supply, frequency of disasters and rationality of site layout.
4) Fluctuation of price ( ) The cost of project construction could be simply regarded as quantity multiplied by the unit price. The first three indicators mostly affect the project construction cost of buildings by influencing the quantity of work. Fluctuation in price was selected as an indicator representing the influence of price change on the project construction cost. Fluctuation in price mostly included fluctuations in material, labor and machinery costs.

Proposed prediction index system
On the basis of the principles of comprehensiveness, scientific soundness, timeliness, applicability and comparability, the prediction index system was constructed. Details are listed in Table 1. For the reader's convenience, the types and data characteristics of the indicators in Table 1 are briefly summarized here. Table 1 contains two kinds of indicators: qualitative and quantitative. The qualitative indicators (difficulty in resource scheduling, difficulty of quality and safety management and rationality of the site layout) were obtained by questionnaires or expert interviews, whereas all other indicators are quantitative and were obtained by statistical methods. For the type of structure and the type of contract, preset integers were assigned according to the project type. The data for the three qualitative indicators came from questionnaire survey results. See Section 3.3 for the qualitative explanation of the different scores.
The frequency of disasters was the average number of natural disasters in recent years. The material, labour and machinery price indexes could be calculated statistically or taken from the cost information issued by the local government. It should be emphasized that the rationality of the prediction index system of the construction cost has a significant effect on the subsequent prediction accuracy. Section 6.1 elaborates on the importance of each prediction index to reveal how the prediction index system affects the accuracy of prediction.

Data processing methods
a) Type of structure ( ) The type of structure determined the choice of construction technology and building materials, affecting the direct construction cost. The most common types were the frame structure 1), steel structure 2), shear wall structure 3), the brick-concrete structure 4) and tube structure 5). The values in brackets represented the index scores corresponding to the different types. The data point was obtained by consulting the project management information and was an index without the measurement unit. b) Total height ( ) The total height of the buildings was calculated as the height of the highest point to the outdoor ground as the reference. The higher the building, the more rigorous the performance requirements set on the materials and the higher the cost of construction management. The data point, the area of the standard floor ( ) and the area of the basement ( ) were determined by consulting design drawings. The measurement unit of was m, whereas that of and was m 2 . c) Area of the standard floor ( ) The area of the standard floor was a critical factor in the construction cost. The larger the area of the standard floor, the higher the labor, material and machinery costs needed. Notably, the cost estimation method commonly used in engineering practice is to approximate the construction cost based on the standard floor building area. d) Area of the basement ( ) The construction of the foundation and the basement entailed difficulty, comprising about 15-30% of the total construction cost. The construction cost of the foundation and the basement was approximately considered proportional to the basement area. e) Type of contract ( ) The type of contract determined the organization and management system of the project and the contracting method, which was crucial for the implementation of the project. If the same construction project adopted different contract types, the construction costs would vary. 
Common contracts included fixed lump-sum contract 1), fixed unit-price contract 2), variable lump-sum contract 3) and variable unit-price contract 4). The data point was obtained by consulting the project management information and was an index without a measurement unit. The values in brackets represented the index scores corresponding to the different types.
f) Difficulty in resource scheduling ( ) Resource scheduling is the rational allocation and mobilization of construction machinery and materials. Reasonable resource scheduling enables site coordination and reduces the construction cost.
In this study, the difficulty of resource scheduling was selected as a comprehensive qualitative index because of the complexity of resource scheduling at the construction site. The data for this index, for the difficulty of quality and safety management ( ) and for the rationality of the site layout ( ) were obtained through questionnaire surveys or expert interviews. These indexes have no measurement unit. The evaluation results for the difficulty of resource scheduling were divided into five grades: very easy, easy, general, difficult and very difficult. Very easy corresponded to the interval (80, 100]: the coverage of resources was narrow and the quantity was small, which would not cause trouble for project cost management. Easy corresponded to the interval (60, 80]: the resource coverage was moderate and the quantity was appropriate, and the demand could be met through simple scheduling. General corresponded to the interval (40, 60]: resources covered a wide range and were numerous, which required more detailed and comprehensive scheduling. Difficult corresponded to the interval (20, 40]: resources covered a wide range and were numerous, and scheduling required a high level of technology and management and was prone to bottlenecks. Very difficult corresponded to the interval [0, 20]: resources covered a very wide range and were numerous, and scheduling required superb technology and management experience to overcome various complicated problems and difficulties. Experts could quantify the index according to the actual situation of the project and the qualitative description of each grade.
g) Proportion of managers ( ) Organization and coordination refer to the organization and distribution of personnel inside and outside the project. A reasonable organizing staff can effectively avoid an increase in management costs caused by oversaturation. In this study, the proportion of managers was selected as an index describing the influence of personnel organization and coordination on the construction cost. is calculated as follows: where is the number of managers, and is the number of construction workers. During the construction stage, the staff mobility of the project management team and the construction operation team is considerably strong. Thus, the data points and should be observed and averaged several times. h) Difficulty of quality and safety management ( ) Quality and safety management are important contents of project management. If quality management is improper, the completed construction content can be easily reworked, and the construction cost could increase. The occurrence of a construction safety accident can result in casualties and property losses. In this study, the difficulty of quality and safety management was selected as a comprehensive qualitative index because of the complexity of quality and safety management at the construction site. The evaluation grade of was divided into five grades, which are very easy (80, 100], easy (60, 80], average (40,60], difficult (20,40] and very difficult [0,20]. The qualitative language description of very easy was that the quality and safety management in project implementation was very comfortable, easy to control and extremely low in risk. Qualitative language description of easy meant that the quality and safety management in project implementation was not difficult, could be well controlled and has low risk. 
The qualitative language of general described that the quality and safety management in project implementation was not difficult and needed to be controlled through reasonable management measures, and there were certain risks. The qualitative language description of difficult was that the quality and safety management in project implementation is difficult, which required more powerful management measures to be controlled and might face certain risks. The qualitative language description of very difficult was that the quality and safety management in project implementation was extremely difficult, requiring extremely powerful management measures to be controlled, and might face high risks. Experts could quantify the index according to the actual situation of the project and the qualitative language description of the index. i) Distance from the material supply place ( ) The transportation cost of materials was an important component of the construction cost. The farther the distance between the construction site and the material supply location, the greater the transportation cost. The material loss in transit and the cost of material preservation measures were also proportional to the distance.
is calculated as follows: where is the distance from the -th supplier of the -th material to the construction site; is the number of materials; and is the number of suppliers of the -th material. j) Frequency of disasters ( ) Natural disasters interrupted the construction process and caused casualties and property losses within the construction scope. In addition, the higher the frequency of natural disasters, the more disaster prevention and mitigation measures were needed, which increased the construction cost. The data point of was obtained by consulting local meteorological data and geological survey reports. This paper suggested that the annual frequency of natural disasters in recent 10 years should be the score of this index. k) Rationality of the site layout ( ) The layout of the construction site significantly affected the construction efficiency. The more reasonable the layout, the higher the construction efficiency, and the lower the amount of labor and machinery used. In this study, the layout of the construction site included too much content; thus, the rationality of the site layout was selected as a comprehensive qualitative index. The evaluation grade of was divided into five grades, very reasonable (80, 100], reasonable (60, 80], half (40, 60], unreasonable (20,40] and very unreasonable [0,20]. The qualitative language description of very reasonable was that the layout completely conformed to the regulations and standards, achieved the best design effect, maximized the space utilization and also considered the user's use needs and comfort. The qualitative language description of reasonable was that the layout basically conformed to the regulations and standards, the space utilization rate was high and the user's use needs were basically met. 
The qualitative language description of half was that the layout basically conformed to the regulations and standards, but the space utilization rate needed to be improved, and the user's use needs had been initially met. The qualitative language description of the unreasonable was that there were obvious violations or irrationalities in the layout, the space utilization rate was low and the user's use needs were not fully met. The qualitative language description of very unreasonable was that the layout was very unreasonable and there were great security risks, which could not meet the needs of the users. Experts could quantify the index according to the actual situation of the project and the qualitative language description of the index. l) Material price index ( ) The cost of materials comprised about 30-50% of the total construction cost; thus, the price fluctuation of materials significantly influenced the construction cost. The material price index was selected to reflect the change in material cost during construction. The data calculation method for is as follows: where is the number of materials, is the material planned consumption of the -th material during reporting; is the price of the -th material during the reporting period, and * is the price of the -th material during the reference period.
Notably, the data , and could also refer to the construction price index information published by the local cost management department. m) Labour price index ( ) The labour price index was selected to reflect the changes in the labour price during the construction process. The data calculation method for is given below: where is the number of the main types of work; is the labor planned consumption of the -th work during the reporting period; is the price of the -th labour during the reporting period; and * is the price of the -th labor during the reference period. n) Machinery price index ( ) The mechanical price index was selected to reflect the change in the mechanical price during the construction process. The data calculation method for is given below: where is the number of major construction machines; is the budgeted consumption of the -th machine during the reporting period; is the price of the -th machine usage fee during the reporting period; and * is the price of the -th machine usage fee during the reference period.
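The five-grade scales used above for the qualitative indexes share the same score intervals, so turning an expert score into a grade can be sketched with a single lookup. This is a minimal illustration only: the names `UPPER_BOUNDS`, `DIFFICULTY_LABELS` and `grade_from_score` are hypothetical and do not come from the study.

```python
from typing import Sequence

# Upper bounds of the intervals [0, 20], (20, 40], (40, 60], (60, 80], (80, 100].
UPPER_BOUNDS = (20, 40, 60, 80, 100)

# Labels for the difficulty-type indexes, ordered from the lowest interval up.
DIFFICULTY_LABELS = ("very difficult", "difficult", "general", "easy", "very easy")


def grade_from_score(score: float,
                     labels: Sequence[str] = DIFFICULTY_LABELS) -> str:
    """Map an expert score in [0, 100] to its qualitative grade."""
    if not 0 <= score <= 100:
        raise ValueError("score must lie in [0, 100]")
    # Each interval includes its upper bound, so the first bound that is
    # greater than or equal to the score identifies the grade.
    for upper, label in zip(UPPER_BOUNDS, labels):
        if score <= upper:
            return label
    raise AssertionError("unreachable for scores in [0, 100]")
```

For the rationality of the site layout, the same function could be called with a different `labels` sequence (very unreasonable up to very reasonable).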
The construction cost of a building has apparent time benefits. Thus, the construction cost of different time points should be converted to the same time point: where is the duration; is the benchmark interest rate; * is the present value of the construction cost; and is the present value of the construction cost.
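The material, labour and machinery price indexes in items l) to n) are each described in terms of a planned consumption, a reporting-period price and a reference-period price. The sketch below assumes a quantity-weighted ratio of reporting-period cost to reference-period cost. That form is an assumption, because the formulas themselves are not legible in this text, and the function and argument names are hypothetical.

```python
from typing import Sequence


def price_index(quantities: Sequence[float],
                report_prices: Sequence[float],
                base_prices: Sequence[float]) -> float:
    """Assumed form: sum(q_i * p_i) / sum(q_i * p_i_ref)."""
    if not (len(quantities) == len(report_prices) == len(base_prices)):
        raise ValueError("all inputs must have the same length")
    # Cost of the planned consumption at reporting-period prices.
    reported = sum(q * p for q, p in zip(quantities, report_prices))
    # Cost of the same consumption at reference-period prices.
    reference = sum(q * p0 for q, p0 in zip(quantities, base_prices))
    if reference == 0:
        raise ValueError("reference-period cost must be non-zero")
    return reported / reference
```

Under this assumed form, an index above 1 indicates that prices rose relative to the reference period for the planned mix of materials, labour or machinery.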

Introduction to RF
RF is a typical machine learning method [7]. It uses Bootstrap resampling to draw multiple samples from the original data and builds a classification and regression tree (CART) for each Bootstrap sample. The predictions of all trees are combined, and the final result is determined by voting. The calculation principle of RF is presented in this section, taking binary classification as an example.
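The voting step described above can be sketched in a few lines. The three threshold rules below are hypothetical stand-ins for trained CARTs, and `forest_vote` is an illustrative name rather than part of any library.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def forest_vote(trees: Sequence[Callable[[Sequence[float]], Hashable]],
                x: Sequence[float]) -> Hashable:
    """Return the class predicted by the largest number of trees for x."""
    votes = Counter(tree(x) for tree in trees)
    # most_common(1) yields the label with the highest count; ties fall back
    # to first-seen order, which is an arbitrary choice made for this sketch.
    return votes.most_common(1)[0][0]


# Hypothetical stump classifiers standing in for trained CARTs.
stumps = [
    lambda x: "high" if x[0] > 50 else "low",
    lambda x: "high" if x[1] > 10 else "low",
    lambda x: "high" if x[0] + x[1] > 70 else "low",
]
```

With these stumps, the feature vector `[60, 5]` receives one "high" vote and two "low" votes, so the forest returns "low".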
Supposing two classes, and , exist and is the data set at the current tree node, then is divided into and such that the condition = ⋃ is satisfied. and represent the sample data assigned to and , respectively; ( ) = | | | | is the proportion of in ; and The variogram in is as follows [7]: The Gini index is the weighted sum of the variograms ( ) and ( ). The calculation method for is as follows [7,10]: After constructing a complete forest with classification trees, it was used to verify or predict new known or unknown data. This forest synthesized the prediction results for trees and determined the data category by voting. The mathematical expression for is given below [7]: where is the classification result corresponding to the feature set ; is the number of classification trees; is the indicator function; , is the classification result for class by the tree ℎ ; and is the number of leaf nodes of ℎ .
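As a worked illustration of the split criterion described above, the sketch below computes the impurity of a node and the size-weighted Gini index of a split into two child nodes. It uses the standard Gini impurity (one minus the sum of squared class proportions). The function names are illustrative, and the paper's own symbols are not recoverable from this text.

```python
from typing import Sequence


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def gini_split(left: Sequence[int], right: Sequence[int]) -> float:
    """Gini index of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    if n == 0:
        raise ValueError("a split needs at least one sample")
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

A split that separates the two classes perfectly has a Gini index of 0, whereas a split that leaves both children evenly mixed has a Gini index of 0.5.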
The forest classification model ( ) is expressed as follows [7]: where ℎ is the -th taxonomic tree. In addition, some data were not selected in each sampling, and the remaining out-of-bag data (OOB) were used for internal error estimation. Each classification tree had an OOB error estimation, and the average value was used as the generalization error of the model.
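The out-of-bag idea can be made concrete with a small sketch of one Bootstrap draw: the indices never drawn form the OOB set for that tree. The helper name `bootstrap_indices` is hypothetical.

```python
import random
from typing import List, Tuple


def bootstrap_indices(n: int, rng: random.Random) -> Tuple[List[int], List[int]]:
    """Draw n indices with replacement and return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    # Samples that were never drawn are out-of-bag for this tree.
    out_of_bag = [i for i in range(n) if i not in drawn]
    return in_bag, out_of_bag
```

For large n, roughly 1/e (about 37%) of the samples are expected to be out-of-bag in each draw, which is what makes the internal error estimate possible without a separate validation set.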
When a certain number of trees was reached, the OOB error of the model was considered similar to that of the optimized model. The major mathematical processes included the following: The empirical margin function of the sample data set ( , ) is defined as [7,28], The generalization error of the classifier set ℎ is expressed as [7]: where ( , ) is the feature-response space composed of and . With an increase in , the number of classification and regression trees increases [7,11]: where is the generalization error of the classification tree corresponding to the parameter . In accordance with Eq (13), should approach a finite upper boundary infinitely with an increase in DT to prevent overfitting.
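For reference, the three quantities named in this paragraph have standard forms in Breiman's formulation [7]. The block below restates them as a sketch. The notation (h_k for the k-th tree, I for the indicator function, av_k for the average over trees, Θ for the random tree parameter) is assumed, since the paper's own symbols are not legible here.

```latex
% Assumed notation: h_k is the k-th tree, I(.) the indicator function,
% av_k the average over the trees, and \Theta the random tree parameter.
\begin{align}
  mg(X, Y) &= \mathrm{av}_k\, I\bigl(h_k(X) = Y\bigr)
              - \max_{j \neq Y} \mathrm{av}_k\, I\bigl(h_k(X) = j\bigr), \\
  PE^{*}   &= P_{X,Y}\bigl(mg(X, Y) < 0\bigr), \\
  PE^{*}   &\xrightarrow{\ \text{a.s.}\ }
              P_{X,Y}\Bigl(P_{\Theta}\bigl(h(X,\Theta) = Y\bigr)
              - \max_{j \neq Y} P_{\Theta}\bigl(h(X,\Theta) = j\bigr) < 0\Bigr)
              \quad \text{as the number of trees grows.}
\end{align}
```

The last line is the convergence result that underlies the statement that adding trees does not cause overfitting.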
According to the aforementioned analysis, the number of CARTs and the number of features largely determine the prediction accuracy of the RF method. The number of CARTs is the number of trees grown on the training samples and their feature factors. The number of features is the number of randomly selected features considered during each node splitting operation. The flow chart of RF for data prediction is shown in Figure 1.

Introduction to BSA
To gain a greater survival advantage, birds exhibit not only foraging behavior but also alert behavior. BSA is an optimization algorithm derived from this social behavior. The algorithm mostly simulates the division of labor and social interaction between different individuals in the sparrow population when it is looking for food (the target). In BSA, the flight position of each bird represents a potential solution. The scattered flight search of the birds not only maintains the diversity of the population but also effectively avoids locally optimal solutions. The flow of BSA is presented in Figure 2 [13].

[Figure 2. Flow chart of BSA. Steps: setting parameters and initializing the population; calculating the fitness values and initializing the global optimal solution; foraging or keeping alert; on flight, searching for targets as a Seeker or following a Seeker; calculating the fitness and updating the global optimal solution; checking the iteration termination condition.]

The searching behavior of birds in nature, as determined from their social behavior, is transformed into the following five rules [13]:
Rule 1. Individual foraging behavior and alert behavior are random.
Rule 2. Foraging behavior. When a bird is foraging, it can quickly record the best position of the target and update the best position of the target. This information is dynamically shared within the whole flock of birds.
Rule 3. Alert behavior. The alert behavior of birds is the tendency to move to the center of the flock. The more vigilant the bird, the greater the tendency to move.
Rule 4. Flight behavior. Birds periodically fly to another location. When birds fly to a new place, individuals in the flock switch identities between Seekers and Followers. The birds with high vigilance become Seekers, whereas those with low vigilance become Followers. The rest randomly choose between Seekers and Followers.
Rule 5. Seekers continue to forage, whereas Followers randomly follow a bird that becomes the individual search target of the Seeker.
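Suppose there are $N$ birds, with the $i$-th bird ($i \in [1, \cdots, N]$) at position $x_i^t$ at time $t$, foraging and keeping alert in a $D$-dimensional space.
1) Foraging behavior
Rule 1 provides a basis for the random decision-making process of birds. When a uniformly distributed random number in (0,1) is greater than the constant $P$, birds feed; otherwise they remain alert.
In accordance with Rule 2, every bird looks for its target using its own flight experience and the group experience. Thus, the position $x_{i,j}^{t+1}$ at time $t + 1$ is given below [16]:

$$x_{i,j}^{t+1} = x_{i,j}^{t} + (p_{i,j} - x_{i,j}^{t}) \times C \times \mathrm{rand}(0,1) + (g_{j} - x_{i,j}^{t}) \times S \times \mathrm{rand}(0,1)$$

where $j \in [1, \cdots, D]$; $\mathrm{rand}(0,1)$ represents a random number between 0 and 1; $C$ is the cognitive coefficient of the individuals; $S$ is the cognitive coefficient of the groups; $p_{i,j}$ is the best position of the $i$-th bird before its renewal; and $g_{j}$ is the best position of the group.
2) Alert behavior
In accordance with Rule 3, the alert behavior is described below [13]:

$$x_{i,j}^{t+1} = x_{i,j}^{t} + A_1 (mean_{j} - x_{i,j}^{t}) \times \mathrm{rand}(0,1) + A_2 (p_{k,j} - x_{i,j}^{t}) \times \mathrm{rand}(-1,1)$$

$$A_1 = a_1 \times \exp\left(-\frac{pFit_i}{sumFit + \varepsilon} \times N\right)$$

$$A_2 = a_2 \times \exp\left(\left(\frac{pFit_i - pFit_k}{|pFit_k - pFit_i| + \varepsilon}\right) \frac{N \times pFit_k}{sumFit + \varepsilon}\right)$$

where $k$ is a positive integer from 1 to $N$, satisfying the condition $k \neq i$; $a_1$ and $a_2$ belong to [0,2]; $pFit_i$ is the optimal fitness value of the $i$-th bird; $sumFit$ is the sum of the optimal fitness values of the colony; $\varepsilon$ is a considerably small constant that prevents the denominator from being 0; and $mean_{j}$ is the $j$-th element of the average position of the flock. In Figure 2, Keeping alert means that the bird performs alert behavior.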

3) Flight behavior
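In accordance with Rule 4, to avoid being chased or to look for food, birds fly to other areas regularly, and the migration cycle is set as $FQ$. When they arrive in another area, they feed again. For a Seeker, its flying behavior is described as follows [14]:

$$x_{i,j}^{t+1} = x_{i,j}^{t} + \mathrm{randn}(0,1) \times x_{i,j}^{t}$$

For a Follower, its flying behavior is as follows [14]:

$$x_{i,j}^{t+1} = x_{i,j}^{t} + (x_{k,j}^{t} - x_{i,j}^{t}) \times FL \times \mathrm{rand}(0,1)$$

where $\mathrm{randn}(0,1)$ is a standard Gaussian random number; $k \in [1, N]$ and $k \neq i$; and $FL$ ($FL \in [0,2]$) is the following coefficient, which scales how far a Follower moves toward the bird it follows when looking for food.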
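The foraging, alert and flight behaviors described above can be sketched in code. The following is a minimal illustration rather than the implementation used in this study (which was written in Matlab): the function name `bsa_minimize`, the default parameter values, the clipping of positions to the search bounds and the mapping of "high vigilance" to "best fitness" are all assumptions made for the sketch. It also assumes a non-negative fitness function such as the MSE, which keeps the exponential terms of the alert behavior bounded.

```python
import math
import random


def bsa_minimize(fitness, dim, bounds, n_birds=20, iters=100, fq=5,
                 p=0.2, c=1.5, s=1.5, a1=1.0, a2=1.0, seed=0):
    """Minimize a non-negative fitness function with a simplified BSA (sketch)."""
    rng = random.Random(seed)
    lo, hi = bounds
    eps = 1e-12  # small constant that keeps denominators non-zero

    def clip(row):
        return [min(max(v, lo), hi) for v in row]

    x = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_birds)]
    pbest = [row[:] for row in x]        # best position found by each bird
    pfit = [fitness(row) for row in x]   # fitness of each personal best
    g = min(range(n_birds), key=pfit.__getitem__)
    gbest, gfit = pbest[g][:], pfit[g]   # best position found by the flock
    history = [gfit]

    for t in range(1, iters + 1):
        new_x = []
        if t % fq != 0:
            mean = [sum(row[j] for row in x) / n_birds for j in range(dim)]
            sum_fit = sum(pfit)
            for i in range(n_birds):
                if rng.random() > p:
                    # Foraging: move toward personal and flock best positions.
                    row = [x[i][j]
                           + (pbest[i][j] - x[i][j]) * c * rng.random()
                           + (gbest[j] - x[i][j]) * s * rng.random()
                           for j in range(dim)]
                else:
                    # Alert: move toward the flock center, disturbed by bird k.
                    k = rng.choice([m for m in range(n_birds) if m != i])
                    big_a1 = a1 * math.exp(-pfit[i] / (sum_fit + eps) * n_birds)
                    sign = (pfit[i] - pfit[k]) / (abs(pfit[k] - pfit[i]) + eps)
                    big_a2 = a2 * math.exp(
                        sign * n_birds * pfit[k] / (sum_fit + eps))
                    row = [x[i][j]
                           + big_a1 * (mean[j] - x[i][j]) * rng.random()
                           + big_a2 * (pbest[k][j] - x[i][j]) * rng.uniform(-1, 1)
                           for j in range(dim)]
                new_x.append(clip(row))
        else:
            # Flight: best bird becomes a Seeker, worst a Follower, rest random.
            order = sorted(range(n_birds), key=pfit.__getitem__)
            is_seeker = [rng.random() < 0.5 for _ in range(n_birds)]
            is_seeker[order[0]], is_seeker[order[-1]] = True, False
            seekers = [i for i in range(n_birds) if is_seeker[i]]
            for i in range(n_birds):
                if is_seeker[i]:
                    row = [x[i][j] + rng.gauss(0, 1) * x[i][j]
                           for j in range(dim)]
                else:
                    k = rng.choice(seekers)
                    fl = rng.uniform(0.5, 0.9)  # following coefficient FL
                    row = [x[i][j] + (x[k][j] - x[i][j]) * fl * rng.random()
                           for j in range(dim)]
                new_x.append(clip(row))
        x = new_x
        for i in range(n_birds):
            f = fitness(x[i])
            if f < pfit[i]:
                pbest[i], pfit[i] = x[i][:], f
                if f < gfit:
                    gbest, gfit = x[i][:], f
        history.append(gfit)
    return gbest, gfit, history


def sphere(v):
    """Toy objective used only to exercise the sketch."""
    return sum(vi * vi for vi in v)


best, best_fit, history = bsa_minimize(sphere, dim=2, bounds=(-5.0, 5.0))
```

Because the flock best is replaced only when a strictly better position is found, `history` is non-increasing by construction; the sketch requires at least two birds so that a Seeker always exists during the flight step.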

Implementation of the proposed model
The flowchart of the prediction model proposed in this study is presented in Figure 3.

[Figure 3. Flowchart of the proposed prediction model.]

Step 1. Data collection and preprocessing. As mentioned in Section 3.3, the data for all secondary indicators could be collected. Two types of secondary indicators were identified: two discrete indicators and continuous indicators (the 11 other secondary indicators). The continuous indicators were normalized to reduce the complexity of the model and improve the prediction accuracy. After normalization, the index value $x^*$ is expressed as [29]

$$x^* = \frac{x - x_{min}}{x_{max} - x_{min}} \quad (20)$$

where $x$ is the value before normalization, $x_{min}$ is the minimum value and $x_{max}$ is the maximum value.
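The normalization of a continuous indicator, and the inverse mapping needed later to convert a normalized forecast back to a cost, can be illustrated as follows. This is a minimal sketch assuming ordinary min-max scaling to [0, 1]; the function names and the sample values are hypothetical and are not taken from Table 3.

```python
def min_max_normalize(values):
    """Scale a list of numbers to [0, 1]; also return the minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("a constant column cannot be min-max normalized")
    return [(v - lo) / (hi - lo) for v in values], lo, hi


def denormalize(value, lo, hi):
    """Map a normalized value back to the original scale."""
    return value * (hi - lo) + lo


# Hypothetical values of one continuous indicator.
areas = [1200.0, 1800.0, 3000.0]
scaled, lo, hi = min_max_normalize(areas)
restored = denormalize(scaled[1], lo, hi)
```

The minimum and maximum are returned together with the scaled values because the same two numbers are required to recover a prediction on the original scale.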
Step 2. Establishing the training set and the testing set. The training set was used to train the prediction model, and the testing set was used to check its computational performance. The training set was obtained by random sampling, and the remaining data of the sample set were used as the testing set. Common ratios of the training set to the testing set are 95%:5%, 90%:10%, 85%:15%, 80%:20% and so on.
Step 3. Determining the calculation parameters of RF and initializing the RF model. In RF, the range and initial value of the number of DTs and of the number of split features had to be set [7,10]. The RF model was initialized by bringing the testing set data and calculation parameters into Eqs (7)-(9).
The MSE of the training samples was selected as the fitness function. When the fitness function reached its minimum, BSA had found the optimal combination of the calculation parameters of RF. The calculation method for MSE is given below [30]:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 \quad (21)$$

where $n$ is the number of samples in the testing set; $\hat{y}_i$ is the predicted value and $y_i$ is the real value.
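A fitness function of this kind can be written in a few lines. The sketch below assumes the usual definition of the mean squared error (the average of the squared differences between predicted and real values); the function name and the sample numbers are hypothetical.

```python
def mse(predicted, actual):
    """Mean squared error between predicted and real values."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical normalized costs: only the third prediction is off (by 2).
fitness_value = mse([1.0, 2.0, 5.0], [1.0, 2.0, 3.0])
```

A smaller value indicates a better parameter combination, so the optimizer minimizes this quantity.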
Step 4. Determining the calculation parameters of BSA and initializing the BSA model. In BSA, the parameters had to be set, including the following: the population size $N$; the maximum iteration number $T$; the individual cognitive coefficient $C$; and the coefficients $S$, $a_1$ and $a_2$ [16]. The testing set data and the calculation parameters were input, and the BSA model was initialized.
According to the optimization flow (Figure 2), the optimal combination of the RF calculation parameters was determined. If the convergence condition was not met, the number of iterations was increased by one, and the individual position and group knowledge of the flock were updated to obtain the new number of DTs and split features. If the convergence condition was met, Step 5 was performed.
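Because a bird's position is continuous whereas the number of DTs and the number of split features are integers, each position has to be mapped to a valid parameter pair before the RF model is evaluated. The text does not state how this mapping was done; the sketch below assumes simple rounding followed by clipping to the ranges used in the case study (1 to 500 DTs and 1 to 5 split features), and the function name is hypothetical.

```python
def decode_position(position, max_trees=500, max_split_features=5):
    """Map a 2-D continuous position to (number of DTs, number of split features)."""
    n_trees = min(max(int(round(position[0])), 1), max_trees)
    n_split = min(max(int(round(position[1])), 1), max_split_features)
    return n_trees, n_split


inside = decode_position((123.6, 0.2))   # rounds to 124 trees, clips 0 up to 1
outside = decode_position((812.0, 7.9))  # clips both values to the upper bounds
```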
Step 5. Outputting the optimal combination of the calculation parameters of RF, including the optimal number of DTs and the optimal number of split features. The construction cost prediction model based on RF was constructed using the best parameter combination.
Step 6. Forecasting the construction cost. The testing set data were brought into the optimized construction cost prediction model. The samples for each DT were randomly drawn using the Bootstrap method, and the features were randomly selected to split the nodes down to the leaf nodes. The final prediction result of the construction cost of the test set was determined by taking the arithmetic average of the results of each DT.

Data acquisition and preprocessing
Construction cost data for 48 projects of a construction company in Xinyu City were selected as sample data. Quantitative indicators were obtained by referring to design documents, local yearbooks or project management data. For qualitative data acquisition, 50 engineering experts were invited to assign scores, and 43 valid questionnaires were collected. Personal information on the experts in the 43 valid questionnaires is listed in Table 2. The following conclusions can be drawn from Table 2.
1) All experts of the 43 valid questionnaires were from work units closely related to the management of the project construction cost and participated in construction cost management. These experts were closely related to the object of the case study.
2) All of the experts who participated in this questionnaire had working experience of more than 10 years; more than 70% of the experts had working experience of more than 20 years; 93.72% of the experts had attained a bachelor's degree or higher; and all of the experts held senior professional titles. These findings indicated that the experts had solid professional backgrounds in project construction cost management. Therefore, the questionnaire survey results for the 43 experts were considered qualitatively reliable.
Cronbach's α is the most commonly used method for reliability testing [31]. To quantitatively verify the reliability of the qualitative index data, all data for the 43 valid questionnaires were loaded into the SPSS 21.0 software, and the Cronbach's α coefficients of the three qualitative indexes were 0.7511, 0.8124 and 0.7202, respectively. All three coefficients exceeded 0.7, which met the general requirements of questionnaire reliability testing [31]. Thus, the results of this questionnaire passed the reliability test.
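For readers without access to SPSS, Cronbach's α can be computed directly from a score matrix. The sketch below uses the standard formula α = k/(k-1) × (1 - Σ item variances / variance of the row totals). How the questionnaire data were arranged into rows and columns is not stated in the text, so the layout here (one row per rated subject, one column per item) and the sample scores are assumptions.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a matrix with one row per subject and one column per item."""
    k = len(scores[0])
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical scores in which the three items agree perfectly.
consistent = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
alpha = cronbach_alpha(consistent)
```

With perfectly consistent items the coefficient equals 1; values above 0.7, as reported for the three qualitative indexes, are conventionally regarded as acceptable.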
The averages of the three qualitative indicators rated by the 43 experts were used as their scores. The original data pertaining to the 48 projects are listed in Table 3. The construction cost was expressed in millions. Only some data are included in Table 3 because of space constraints. In Table 3, almost all structural forms were frame structures (type 1) or shear wall structures (type 3), for the following reason. At present, in the field of building engineering, frame structures and shear wall structures are the common building structures. In newly built projects in developed countries, frame and shear wall structures account for about 80% and steel structures (type 2) for about 20%, whereas brick-concrete structures (type 4) and tube structures (type 5) rarely appear. In new construction projects in developing countries, frame and shear wall structures account for about 50%, steel structures for about 10% and brick-concrete structures for about 40%, and there are almost no tube structures. Among the 43 projects in this paper, there were only 2 steel structures and 5 brick-concrete structures, and the remaining 36 were frame or shear wall structures. We believed that the structural types of the 43 projects were representative.
The data for the continuous variables in Table 3 were loaded into Eq (20), and the data after normalization are presented in Table 4; the data for the two discrete indexes were not normalized.

Acquisition of the optimal combination of parameters
The target ratio of the training set to the testing set was 80%:20%. A total of 38 groups of data were randomly selected from Table 4 as the training set, and the remaining 10 groups formed the testing set. Thus, the actual ratio of the training set to the testing set in this study was 79.17%:20.83%. The 38 groups of training data were entered into a self-developed program based on Matlab software.
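The arithmetic behind the split can be checked with a short sketch. The helper below is hypothetical and assumes that the number of training samples is obtained by truncating 48 × 0.8 = 38.4 to 38, which reproduces the 38:10 split reported above; the random seed is arbitrary.

```python
import random


def split_indices(n_samples, train_ratio, seed=0):
    """Randomly split sample indices into training and testing index lists."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    n_train = int(n_samples * train_ratio)  # truncation: 38.4 -> 38
    return sorted(idx[:n_train]), sorted(idx[n_train:])


train_idx, test_idx = split_indices(48, 0.80)
actual_train_share = round(len(train_idx) / 48 * 100, 2)
actual_test_share = round(len(test_idx) / 48 * 100, 2)
```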
In the RF method, the conditions were set as follows: the maximum number of DTs, 500; the initial number of DTs, 1; the maximum number of split features, 5; and the initial number of split features, 1. In BSA, the conditions were set as follows: population size, 50; maximum number of iterations, 500; the individual cognitive coefficient C and the group cognitive coefficient S, 2; $a_1$ and $a_2$, 1; the convergence error, $10^{-8}$; and FL, 0.5-0.9. An extremely wide range for the calculation parameters would reduce the efficiency of the algorithm, whereas an extremely narrow range might exclude the optimal solution. Therefore, in this case study, a broad range was used for the calculation parameters of RF and BSA to ensure that the optimal calculation parameters would be determined.
The optimization results for BSA are presented in Figure 4. BSA determined the global optimal solution around the 120th generation. The details of the optimization of BSA are given in Table 5. With the termination iteration requirements considered, BSA found the best combination of RF parameters in the 118th generation. In this study, the optimization of the RF algorithm was repeated 1000 times. BSA found the best parameter combination in 123.941 generations on average; the fastest run converged in the 97th generation, and the slowest run converged in the 161st generation. The standard deviation of the convergent generation was only 14.747, and 99.96% of the 1000 calculations found the same optimal parameter combination.
On the basis of the aforementioned analysis, BSA was considered to have successfully determined the optimal solution. The optimal parameter combination of RF was as follows: the number of DTs, 124; and the number of split features, 1. The computational performances of BSA and other metaheuristic optimization algorithms are compared in Section 6.2.

Prediction of the construction cost
The 10 sets of testing data and the optimal calculation parameters of RF were brought into the RF model, and the prediction results of the 10 sets were obtained. When training the model, the normalized real cost was adopted; thus, the predicted value output by the model was also a normalized value. The forecast results were converted back through Eq (20), and the corresponding forecast costs were calculated (Table 6). The maximum errors appear in bold in Table 6. The absolute error is the absolute value of the predicted value minus the real value (Table 6). The relative error is the absolute error divided by the real cost. The prediction results in Table 6 show that the maximum absolute error of the 10 test sets was 1.154, and the maximum relative error was 1.24%. In this study, all errors in the prediction results met the requirements of engineering practice [32].
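The two error measures defined above can be expressed as a short sketch. The function name and the sample costs are hypothetical; they are not values from Table 6.

```python
def prediction_errors(predicted, actual):
    """Return (absolute error, relative error in %) for each prediction."""
    rows = []
    for p, a in zip(predicted, actual):
        abs_err = abs(p - a)
        rows.append((abs_err, abs_err / a * 100.0))
    return rows


# Hypothetical forecast and real costs (in millions).
errors = prediction_errors([101.0, 49.5], [100.0, 50.0])
max_abs_error = max(e[0] for e in errors)
max_rel_error = max(e[1] for e in errors)
```

In this toy case the first project has the larger absolute error (1.0 versus 0.5), yet both have the same relative error of 1%, which is why the two maxima are reported separately.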
To further analyze the prediction accuracy of the model, the coefficient of determination ($R^2$) was selected to analyze the correlation between the actual construction cost and the predicted construction cost. The calculation method for $R^2$ is given below [33]:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \quad (22)$$

where $n$ is the number of testing data; $\hat{y}_i$ is the predicted value of the $i$-th test set; $y_i$ is the real value of the $i$-th test set; and $\bar{y}$ is the average of the real values. The data in Table 6 were brought into Eq (22), and $R^2$ of the results predicted using the proposed model was 0.9997. The coefficient of determination between the predicted result and the real value was very close to 1, indicating that the predicted result was almost equal to the real value.
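The coefficient of determination can be computed as sketched below, assuming the standard definition (one minus the ratio of the residual sum of squares to the total sum of squares about the mean of the real values). The function name and the sample values are hypothetical.

```python
def r_squared(predicted, actual):
    """Coefficient of determination between predicted and real values."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for p, a in zip(predicted, actual))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])    # exact predictions
mean_only = r_squared([2.0, 2.0, 2.0], [1.0, 2.0, 3.0])  # always predicts the mean
```

Exact predictions give a value of 1, whereas a model that always outputs the mean of the real values gives 0, which provides a reference scale for the reported 0.9997.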
The Coefficient of Variation is also a commonly used error analysis tool, which is equal to the ratio of standard deviation to average value. After calculation, the Coefficient of Variation of the case study object was 6.329383364. This is very similar to the classic research result [34].
All samples were analyzed by tenfold cross-validation to verify the ability of the proposed model to extrapolate. In the tenfold cross-validation, the average absolute error was 0.974, and the average relative error was 1.05%. These results indicate that the proposed model possessed the ability to generalize, was effective, and could complete the forecasting task with high accuracy. The proposed model is compared with the classical and latest prediction models with respect to calculation performance in Section 6.3.
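The partitioning step of tenfold cross-validation can be sketched as follows. The text does not describe how the 48 samples were assigned to folds, so the shuffling, the seed and the interleaved fold assignment below are assumptions; model training and evaluation on each split are omitted.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Return k (train_indices, test_indices) pairs covering every sample once."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # interleaved, near-equal fold sizes
    return [(sorted(set(idx) - set(fold)), sorted(fold)) for fold in folds]


splits = k_fold_indices(48)
fold_sizes = sorted(len(test) for _, test in splits)
```

With 48 samples and 10 folds, eight folds contain five samples and two contain four, and every sample is used for testing exactly once.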
It is worth mentioning that the ratio of the training set to the testing set may affect the prediction results. Therefore, the ratios 43:5 (89.6%:10.4%), 34:14 (70.8%:29.2%) and 29:19 (60.4%:39.6%) were also examined. The results of 1000 repeated calculations for each ratio are shown in Table 7. In Table 7, MAE is the maximum average error, SDAE is the standard deviation of average errors, SDSD is the standard deviation of standard deviations and AVSD is the average value of standard deviations.
As can be seen from Table 7, the prediction accuracy gradually decreased as the proportion of the training set decreased. When the proportion of the training set was as high as 89.6%, the MAE in 1000 calculations was only 1.31%. When the proportion of the training set was 60.4%, the MAE in 1000 calculations increased to 9.23%. In addition, when the proportion of the training set was 70.8% or 60.4%, the proposed model might not pass tenfold cross-validation every time. Therefore, this paper suggests that the proportion of the training set should not be less than 80% when the BSA-RF model is adopted.
Each of the 1000 repeated calculations has its own average error, which is the average of the prediction errors of the test set in that calculation. Because the samples of the test set were randomly selected in each calculation, the average error differed from one calculation to the next. The MAE in 1000 calculations is the maximum of these average errors; it is not the average error over all results of the 1000 calculations, so it is not a fixed value.
To analyze the stability of the prediction results, the standard deviations of the average errors in 1000 calculations under different ratios of the training set to the testing set were calculated and are shown in Column 3 of Table 7.
The calculation results showed that the standard deviation increased rapidly from 0.000763471 to 0.026951901 as the proportion of the training set decreased. It could be considered that the prediction stability of the proposed model decreased as the proportion of the training set decreased. This was consistent with the results of similar studies [35,36].
It should be emphasized that the SDAE in Column 3 of Table 7 is the standard deviation of the average errors obtained in the 1000 repeated calculations; it measures the deviation between these average errors and their mathematical expectation. Specifically, when the ratio was 89.6%:10.4%, each repeated calculation produced five errors (one per test sample) and their average, and the SDAE was the standard deviation of the resulting 1000 average errors. Similarly, at 70.8%:29.2%, each of the 1000 calculations generated 14 errors and one average error, so the SDAE was again the standard deviation of 1000 average errors.
Studying the distribution of the standard deviations across the 1000 repeated calculations also helps to reveal the stability of the proposed model. For this reason, the SDSD in 1000 calculations was also calculated. Each repeated calculation yields one standard deviation of the test set prediction errors, which gives 1000 values; the SDSD is the standard deviation of these 1000 values and is therefore a single number. The relevant calculation results are in Column 4 of Table 7. Specifically, when the ratio was 89.6%:10.4%, each repeated calculation produced five errors. First, the standard deviation of the errors of the five test samples was computed in each calculation, and then the standard deviation of the 1000 resulting standard deviations was computed, which gave the value in Row 2 and Column 4 of Table 7. In Table 7, the SDSD in 1000 calculations increased as the proportion of the training set decreased. The calculation results showed that the stability of the prediction results of the proposed model was positively correlated with the proportion of the training set.
In addition, the AVSD in 1000 calculations under different ratios of the training set to the testing set was calculated. When the proportion of the training set was reduced from 89.6% to 60.4%, the AVSD in 1000 calculations increased from 0.00718178858 to 0.09305018740. This calculation also showed that the stability of the prediction results decreased as the proportion of the training set decreased.
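The four summary statistics in Table 7 can be made concrete with a short sketch. It follows the definitions given in the text (MAE as the maximum of the per-run average errors, SDAE as the standard deviation of those averages, SDSD and AVSD as the standard deviation and the mean of the per-run standard deviations). Whether the sample or the population standard deviation was used is not stated, so the sample form is assumed here; the function name and the three toy runs are hypothetical.

```python
from statistics import mean, stdev


def stability_metrics(runs):
    """Summarize repeated runs, each given as a list of test-set errors."""
    avg_errors = [mean(run) for run in runs]  # one average error per run
    std_devs = [stdev(run) for run in runs]   # one standard deviation per run
    return {
        "MAE": max(avg_errors),
        "SDAE": stdev(avg_errors),
        "SDSD": stdev(std_devs),
        "AVSD": mean(std_devs),
    }


# Three hypothetical runs with two test samples each.
metrics = stability_metrics([[1.0, 3.0], [2.0, 6.0], [3.0, 3.0]])
```

In the study each list would hold the errors of one of the 1000 repeated calculations (5, 14 or 19 values, depending on the ratio).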

Analysis of the importance of the prediction index
RF can overcome the interference of the possible complex linear relationship between the characteristic variables; however, the influence of the scale of the characteristic variables on the performance of the model still needs to be considered. The index system proposed in Section 3 was adjusted, based on the difference in the degree of importance of the variables, to determine the most appropriate variable scale.
The Mean Decrease Accuracy Index (MDAI) is one of the most common tools used in Variable Importance Analysis [37]. MDAI indicates the extent to which the prediction accuracy decreases when a randomly selected indicator is removed. The larger the MDAI, the stronger the effect of the index and the greater its importance. The flow of the index importance analysis in this study is shown in Figure 5.

Figure 5. Flowchart of indicator importance analysis by the MDAI.
The prediction accuracy requirement was set to 10%; that is, the average relative error had to fall within 10%. The research results in Section 5 indicate that the current predictive index system met the accuracy requirement, and no secondary indexes had to be added.
Each of the secondary indexes (Table 1) was deleted in turn to reconstruct the predictive index system. The newly constructed index system was adopted, and the prediction model described in Section 4 was used for iterative calculations. The calculation results are shown in Figure 6(a). The ordinate in Figure 6 represents the secondary index deleted in the corresponding step.
Figure 6. Ranking of index importance during gradual dimensionality reduction.
In Figure 6(a), the prediction accuracy of all 14 cases met the requirement when only one index was deleted. The index with the smallest average reduction accuracy, only 4.041%, was deleted on the premise of satisfying the prediction accuracy.
With this index deleted, each of the remaining secondary indexes in Table 1 was deleted in turn to reconstruct the predictive index system. The calculation results are shown in Figure 6(b). The prediction accuracy of all cases satisfied the requirement. The index with the lowest average reduction accuracy, 5.778%, was then deleted on the premise of satisfying the prediction accuracy.
With these two indexes deleted, one more secondary index was deleted in turn to reconstruct the predictive index system. The prediction calculation results are shown in Figure 6(c). All 12 cases met the requirement with respect to the prediction accuracy. The index with the lowest average reduction accuracy, 7.403%, was further deleted.
With these three indexes deleted, the deletion of one more secondary index (Table 1) proceeded. The prediction calculation results are shown in Figure 6(d). However, the average relative error was 12.34%, which failed to satisfy the preset accuracy requirement. Therefore, the smallest set of input parameters in the case study was the predictor system excluding the three deleted indexes.
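The stepwise reduction described above is a form of greedy backward elimination, which can be sketched as follows. The callback `evaluate_error` stands in for retraining the BSA-RF model on a subset of indexes and returning the average relative error in percent; it is hypothetical, as are the toy index names and weights. The stopping rule (stop when even the least harmful deletion exceeds the 10% threshold) is an assumption about the procedure in Figure 5.

```python
def reduce_indexes(indexes, evaluate_error, max_error=10.0):
    """Repeatedly delete the index whose removal hurts accuracy the least."""
    current = list(indexes)
    while len(current) > 1:
        # Try deleting each remaining index in turn and record the error.
        trials = [(evaluate_error([x for x in current if x != cand]), cand)
                  for cand in current]
        best_error, least_important = min(trials)
        if best_error > max_error:
            break  # no further deletion satisfies the accuracy requirement
        current.remove(least_important)
    return current


# Toy evaluator: the error equals the summed weight of the deleted indexes.
weights = {"A": 1.0, "B": 3.0, "C": 7.0, "D": 8.0}


def toy_error(subset):
    return sum(w for name, w in weights.items() if name not in subset)


kept = reduce_indexes(list(weights), toy_error)
```

In the toy run, A and B are deleted (errors of 1 and 4), whereas deleting C as well would raise the error to 11, so the procedure stops with C and D retained.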

Comparison of different optimization algorithms by performance
PSO [20], GA [20], Tabu Search (TS) [38], Simulated Annealing (SA) [39], Ant Colony Optimization (ACO) [40], Differential Evolution (DE) [41], Chicken Swarm Algorithm (CSA) [42], Artificial Bee Colony algorithm (ABC) [43], Covariance Matrix Adaptation Evolutionary Strategies (CMAES) [44], Wolf Pack Algorithm (WPA) [45], Whale Optimization Algorithm (WOA) [46] and Artificial Fish School (AFS) [47] were selected for comparison with BSA in terms of computing performance. The calculation parameters of these algorithms are presented in detail in the corresponding references. All optimization calculations were conducted on the same personal computer (i7-10510U, 1.8 GHz, acceleration frequency 4.9 GHz, quad-core, 8 MB, 512 GB SSD, DDR4-2400 8 GB). The results of 1000 repetitions of all optimization algorithms are listed in Table 8. As shown in Table 8, BSA determined the best parameter combination of RF in 123.941 generations on average, which was 77.987 generations fewer than PSO, the second-fastest algorithm. In the 1000 repeated calculations, the accuracy of BSA optimization, 99.96%, was also significantly higher than that of the other metaheuristic optimization algorithms. However, PSO exhibited the smallest standard deviation, indicating that the computational stability of BSA in this case study was slightly lower than that of PSO.
From the perspective of optimization, BSA combines the advantages of PSO and DE, which improves the search efficiency and gives relatively good average stability [14]. Therefore, the results in Table 8 are interpretable. The research results in this section are similar to previous research results [20], which supports both the correctness of the results and the more efficient computational performance of BSA.

Comparison of different prediction models by performance
Data forecasting methods, such as BPNN [14], SVM [5], Stacked Auto-Encoders (SAE) [48] and Extreme Learning Machine (ELM) [49], were selected for comparison in terms of computing performance. The parameter settings of these prediction models referred to the corresponding literature, and BSA was used to determine their best calculation parameters. In this study, the average value of the relative error, the standard deviation of the relative error, and the average calculation time were selected as the evaluation indexes of the calculation performance of the prediction model [28]. The results of 1000 calculations of the prediction models are listed in Table 9. The prediction models could effectively predict the project construction cost of buildings, but their prediction accuracies varied noticeably. The average relative error of the prediction model based on RF was only 1.05%. Relative to those of BPNN, SVM and ELM, the average relative error of RF was reduced by 2.82, 5.92 and 3.79%, respectively. The prediction accuracy based on SVM was the lowest, probably because SVM was not suitable for the prediction of the small sample data in the current study [5]. The standard deviation of the relative error of the prediction model based on RF was at least one order of magnitude smaller than that of the other models. Compared with those of BPNN, SVM and ELM, $R^2$ of RF was increased by 0.0494, 0.0713 and 0.0217, respectively. Among the prediction models, the average calculation time of ELM was the shortest, only 1122.95 s. The calculation principle of ELM is to randomly select the input weights and then analytically determine the output weights of the network. Therefore, ELM can provide the ultimate performance in learning speed [49]. On the basis of the aforementioned analysis, RF exhibited better computing performance than BPNN, SVM, or ELM.
In Table 9, the calculation errors of RF, ELM and BPNN were all less than 5%, which met the practical requirements. The main reason was that BSA was used to determine the optimal combination of calculation parameters of ELM and BPNN, which made their calculation accuracy very good. Therefore, this paper further adopted GA, a classical metaheuristic optimization algorithm, to determine the best calculation parameter combination of RF, ELM and BPNN. The average errors of GA-BPNN, GA-ELM and GA-RF were 6.47, 8.53 and 4.39%, respectively. This also supported the advantages of RF and BSA. As can be seen from Table 9, among the prediction algorithms, the SDSD in 1000 calculations based on RF was also the smallest. This implied that the prediction result of RF was more stable. In addition, the AVSD in 1000 calculations based on different forecasting models was analyzed. The average values of standard deviations in 1000 calculations based on RF, BPNN, SVM, SAE and ELM were 0.000699460, 0.010262891, 0.075292102, 0.006578722 and 0.006064134, respectively. The average value based on RF was the minimum, which also indicated that RF had better prediction performance.
Different algorithms might have different calculation results for different data sets. The calculation result of RF was better than that of BPNN, which is a classical supervised learning algorithm. There are great differences between RF and BPNN in their implementation methods and characteristics, which affect their performance in this case study [50]. RF makes the decision more robust by training multiple decision trees and voting or averaging. Each decision tree is trained with random samples and features, so it has good generalization ability and robustness. In contrast, the performance of BPNN depends on the structure of neurons, activation function, initial weight and other factors and the effect may be poor for more than two types of classification tasks [51].

Sensitivity analysis of indicators
The construction cost forecasting system is a typical multi-parameter nonlinear system. To improve the analysis efficiency and system performance, this section uses the Sobol index method to analyze the sensitivity of the indicators [52]. The Sobol index method is a global sensitivity analysis method proposed by the Russian scholar Sobol in 1993. The core of this method is to analyze the sensitivity of parameters by calculating the contribution of the variance of single parameters and of combined parameters to the total variance [53].
In this paper, the BSA-RF model was used as the benchmark model, and the quasi-Monte Carlo method was used for specific sensitivity analysis, and the sensitivity of each index was sorted, as shown in Table 10. In addition, BSA-SAE and BSA-BPNN were also used for sensitivity analysis. It could be seen from Table 10 that the global sensitivity analysis results of the three models were basically the same.
The indexes related to resource scheduling, basement area and site layout were the most sensitive. Combined with the construction practice of building engineering, the possible reasons why these three factors had the greatest influence on the construction cost error were as follows. 1) Reasonable resource scheduling could improve the construction efficiency, avoid the waste and shortage of resources and reduce the construction cost. However, if the resources were not properly scheduled, working procedures and the construction period would be delayed and resources would be wasted, thus increasing the construction cost. 2) The basement was an important part of the building, its construction was relatively difficult and it required many materials and human resources, so a change in the basement area had a great influence on the error of the construction cost. 3) Reasonable site layout could make full use of the site and optimize the structure and design of the building. In addition, three indexes were the least sensitive. In BSA-RF, their sensitivity coefficients were only 0.0026, 0.0321, and 0.0131, respectively. This was consistent with the analysis results of the importance of indicators in Section 6.1 of this paper.
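The idea of attributing shares of the output variance to individual inputs can be illustrated with a small first-order Sobol estimator. This is a sketch under stated assumptions rather than the procedure used in this study: plain pseudo-random sampling is substituted for the quasi-Monte Carlo sampling, the inputs are assumed to be independent and uniform on [0, 1], and a toy linear function replaces the BSA-RF model. The function names and the sample size are hypothetical.

```python
import random


def first_order_sobol(model, dim, n=50000, seed=0):
    """Estimate first-order Sobol indices for independent U(0,1) inputs."""
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(dim)] for _ in range(n)]
    b = [[rng.random() for _ in range(dim)] for _ in range(n)]
    ya = [model(row) for row in a]
    yb = [model(row) for row in b]
    mean_y = sum(ya + yb) / (2 * n)
    var_y = sum((y - mean_y) ** 2 for y in ya + yb) / (2 * n - 1)
    indices = []
    for i in range(dim):
        # Rows of A with column i replaced by the value from B.
        y_abi = [model(ra[:i] + [rb[i]] + ra[i + 1:]) for ra, rb in zip(a, b)]
        # Centering f(B) leaves the expectation unchanged and reduces noise.
        v_i = sum((fb - mean_y) * (fab - fa)
                  for fb, fab, fa in zip(yb, y_abi, ya)) / n
        indices.append(v_i / var_y)
    return indices


def toy_model(x):
    """Toy stand-in for the prediction model: the third input has no effect."""
    return x[0] + 2.0 * x[1] + 0.0 * x[2]


s1, s2, s3 = first_order_sobol(toy_model, dim=3)
```

For this toy function the exact indices are 0.2, 0.8 and 0 (the variances of the three terms are 1/12, 4/12 and 0), so the estimates should lie close to those values; an input with a near-zero index plays the same role as the least sensitive indicators discussed above.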

Conclusions
The complex and unpredictable construction costs of building projects were analyzed in this study. A prediction index system of the construction cost was established to address the complicated problem of construction cost. The system included 14 secondary indexes, such as the type of structure, total height and standard floor area. A novel prediction model of construction cost based on RF was proposed to solve the problem of low construction cost prediction accuracy. BSA was used to optimize the number of DTs and split features of RF, considering the tendency of randomly set parameters in RF to cause low prediction accuracy. A case study of a construction company in Xinyu, China showed that the optimal combination of RF calculation parameters was i) the number of DTs set to 124 and ii) the number of split features equal to 1. The maximum absolute error of the proposed model was 1.154, and the maximum relative error was 1.24%. These findings confirmed the feasibility of the prediction model in the construction cost prediction of building projects. Compared with the classical metaheuristic optimization algorithms, BSA could more quickly determine the optimal combination of the calculation parameters, on average. Compared with the classical and recent predictive methods, the proposed model presented advantages in prediction accuracy and generalization ability.
The major limitations of the current study were as follows. 1) The index system constructed in this study was only applicable to housing construction projects, and a more general index system of construction cost prediction will be constructed in the future. 2) The case study presented in this article included a small data sample, and the prediction accuracy of the proposed model for large sample data is yet to be determined. 3) The relationship between the prediction index and the construction cost is highly complex. Thus, a hybrid model combining linear and nonlinear methods will be built in the future.

Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.