A Review of Data Mining Strategies by Data Type, with a Focus on Construction Processes and Health and Safety Management

Increasingly, information technology facilitates the storage and management of data useful for risk analysis and event prediction. Studies on data extraction related to occupational health and safety are increasingly available; however, due to its variability, the construction sector warrants special attention. This review was conducted under the research programs of the National Institute for Occupational Accident Insurance (Inail). Objectives: The research question focuses on identifying which data mining (DM) methods, among supervised, unsupervised, and others, are most appropriate for certain investigation objectives, types, and sources of data, as defined by the authors. Methods: Scopus and ProQuest were the main sources from which we extracted studies in the field of construction, published between 2014 and 2023. The eligibility criteria applied in the selection of studies were based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA). For exploratory purposes, we applied hierarchical clustering, while for in-depth analysis, we used principal component analysis (PCA) and meta-analysis. Results: The search strategy based on the PRISMA eligibility criteria provided us with 63 out of 2234 potential articles, 206 observations, 89 methodologies, 4 survey purposes, 3 data sources, 7 data types, and 3 resource types. Cluster analysis and PCA organized the information included in the paper dataset into two dimensions and labels: the first, Dim1, "supervised methods, institutional dataset, and predictive and classificatory purposes" (correlation 0.97–8.18 × 10−1; p-value 7.67 × 10−55–1.28 × 10−22), and the second, Dim2, "not-supervised methods; project, simulation, literature, text data; monitoring, decision-making processes; machinery and environment" (corr. 0.84–0.47; p-value 5.79 × 10−25–3.59 × 10−6). We answered the research question regarding which method, among supervised, unsupervised, or other, is most suitable for application to data in the construction industry. Conclusions: The meta-analysis provided an overall estimate of the better effectiveness of supervised methods (Odds Ratio = 0.71, Confidence Interval 0.53–0.96) compared to not-supervised methods.


Introduction
The activities attributable to the construction sector, according to the International Labour Organisation (ILO) classification, are as follows: (i) building, including excavation and the construction, structural alteration, renovation, repair, maintenance (including cleaning and painting), and demolition of all types of buildings or structures; (ii) civil engineering, including excavation and the construction, structural alteration, repair, maintenance, and demolition of structures such as airports, docks, harbors, inland waterways, dams, river, avalanche, and sea defense works, roads and highways, railways, bridges, tunnels, viaducts, and works related to the provision of services such as communications, drainage, sewerage, water, and energy supplies; and (iii) the erection and dismantling of prefabricated buildings and structures, as well as the manufacturing of prefabricated elements on construction sites [1].
Construction safety research is abundant and motivated by the alarming rates of accidents and fatalities, focusing on two perspectives: management and technology [2]. In general, workplace safety management is based on organizational and technological strategies. Construction safety standards and accident reduction are achieved through information and worker training, aiming to enhance the level of risk perception associated with the production process. However, the impact of traditional accident prevention strategies has been limited due to their reactive and regulatory nature [2,3]. A relevant aspect is the increased risk associated with the organization and production goals of construction companies.
According to Razi et al. [4], Artificial Intelligence (AI) is a broad field of computer science concerned with developing intelligent robots capable of performing tasks that traditionally require human intellect. In a more in-depth analysis, the same authors list the most common sub-areas of AI applicable in the construction sector, such as machine learning, computer vision, automated planning and scheduling, robotics, knowledge-based systems, natural language processing, and optimization, listing their advantages and disadvantages. AI plays a crucial role in assisting construction supervisors in minimizing accidents, supporting project efficiency, and significantly improving operational safety. Alongside the advancement of information and communication technology, various innovative technologies have been investigated to aid and improve existing management-driven safety practices. Besides the aid of technologies, new injury prevention strategies have been developed for the construction industry. The risk analysis method is one of them, used in safety programs to improve safety performance. A relevant factor is the relationship between the type of construction project and the type of accident.
Data mining methods are applicable in various fields dealing with different types of data and objectives. Studies focusing on DM techniques applied to construction safety date back only to 2014 [5]. Our study was developed as part of Inail's 2022–2024 research program, under the objective "Study of the effectiveness and efficiency indices related to innovative technologies aimed at preventing the risk of injury in highly variable work environments". Considering the articles found to be eligible for review (Appendix A), we first focused on data mining methods (Appendix B), categorizing them into three types: supervised, unsupervised, and other (not supervised). Subsequently, through cluster analysis, principal component analysis, and meta-analysis, we identified statistical associations between the two types of methods and the study objectives, types, and sources of data. The review protocol follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) [6]. Despite its limitations, the review has enabled us to determine the most effective method between supervised and other methods for different survey purposes, sources, and data types. It also provides a reference for those who have to choose and apply a DM method on the basis of certain fundamental inputs, such as the type of data available and the objectives to be achieved.
Section 1 of the article introduces the background and objectives of the investigation. Section 2 describes the materials and methods, while Section 3 presents the results obtained from applying cluster analysis, PCA, and meta-analysis. Section 4 offers an extensive discussion of the results in light of the current state of the art and our future goals. Finally, Section 5 summarizes the salient results achieved in this review.

Materials and Methods
The set of articles published from 2014 to September 2023 that were useful for the purposes of this review was extracted from Scopus [7,8] and ProQuest [9]. Authoritative sites on conferences in the fields of computer science, DM, and management in construction were queried; however, only Web of Conference provided an eligible contribution for the purposes of our review.

Selection and Inclusion Criteria
All searches were conducted using a combination of subject headings and free-text terms. The search criteria applied in the PRISMA methodology were obtained by successive reiterations using different arguments and different Boolean AND and OR operators. Of these reiterations, the final one is given in Appendix A. We focused exclusively on peer-reviewed articles, conference papers, and book chapters. The topics included were "machine learning" AND construction AND work OR safety, across the following subject areas: (i) Engineering, (ii) Social and Environmental Sciences, and (iii) Computer Sciences. The criteria applied in the search strategy are defined in Table 1. The final search strategy was developed through several preliminary searches, including (i) articles, (ii) conference papers, and (iii) book chapters (Appendix A). Figure 1 summarizes the result of the PRISMA document selection process. The collected dataset includes information on authors, title, year of publication, source title, volume, issue, number of pages, citation count, DOI, affiliations, author information, abstract, keywords, type of publication, and further information. Three authors (AP, AB, and DB) independently reviewed the titles and abstracts to assess the eligibility of all studies. We applied PRISMA procedures and checklists [6] to identify topic datasets and keywords and to filter content according to the abstract, assessing the eligibility of publications within the research scope. Further insights into the selected articles were gained by conducting full-text reviews and analyzing the content for the purposes of the search (see Figure 1) [10]. Disagreements were resolved by a fourth evaluator (ML) until a consensus was reached among the authors. Only studies that met the eligibility criteria were included.


Risk of Bias for Selected Studies
The risk in non-randomized studies was assessed based on the following biases: (1) due to confounding, (2) in the selection of the types of data in the study, (3) in the classification of the study objective, (4) due to missing data, (5) in the measurement of outcomes, (6) in the evaluation metrics, and (7) in the selection of the reported outcome. Each individual study included was assessed as having a low, moderate, severe, or critical risk of bias. If critical information was missing for the assessment of the risk of bias, the study was considered devoid of information.

Data Quality and Items
The titles and abstracts of the identified studies were independently checked at two different points in time. Eligibility and inclusion criteria were initially assessed on a subset of 30 studies before searching all databases. Decisions were made by examining both the abstracts and the full texts. Only studies that were complete and met all inclusion criteria were included in the qualitative and quantitative synthesis. The information and data included in the papers obtained through the PRISMA method were then included in the review.

Study Design
The scientific articles meeting the PRISMA eligibility criteria were pre-processed to extract information suitable for the purposes of the review. The 63 papers included in the review were categorized by 31 source titles and publication year. The bibliometric analysis involved a review of the global literature and geographic mapping worldwide. Hierarchical cluster analysis (HC) was used to find the best aggregations between groups. Using the silhouette index, the best degree of aggregation was identified and represented in a cluster plot based on correlations and variances. Principal component analysis (PCA) was useful for finding the correlation classes between the various parameters of the dataset and for simplifying the reading of the results. Through PCA, we reduced the items and obtained the extent of correlation between variables, methods, and components. The meta-analysis of these classes was useful in estimating the reliability of the HC and PCA results and the odds ratios (ORs) and confidence intervals (CIs) of groups of items. Spatial data collection, analysis, classification, and bibliometric analysis were performed with VOSviewer [11], R (https://www.r-project.org/, accessed on 13 June 2024), and QGIS 3.18.3 'Zürich' software.
Articles that met the PRISMA eligibility criteria were classified according to their country/region of origin (corresponding author) and placed in one of the classes depicted in Figure 2, with class size marked by a colour (brown, green, purple, red, blue).
In Figure 2, the brown class comprises the 18 countries/regions with 1 article each: Austria, Brazil, Cyprus, India, Iran, Iraq, Italy, Japan, Jordan, New Zealand, Poland, Republic of Korea, Saudi Arabia, Singapore, Spain, Sweden, Taiwan, and the United Arab Emirates. The green class comprises the 6 countries/regions with 2 articles each: Australia, Hong Kong, Pakistan, South Korea, Turkey, and the UK. The purple class comprises Malaysia with 6 articles, and the blue class the USA with 8 articles. Finally, the red class, the most numerous, comprises China with 19 articles. The details of classes 1, 2, 6, 8, and 19 in Figure 2 and their articles are specified in Appendix B.

Study Selection and Bibliometric Analysis
The search strategy based on the PRISMA eligibility criteria yielded 63 papers that were included in the review and categorized by 31 source titles (Table 2) and publication year. Regarding the latter, there was an increasing trend in publication from 2014 to September 2023: 1 paper each in 2014 and 2015, 3 papers in 2016 and 2017, 4 papers in 2019, 6 papers in 2020, 3 papers in 2021, 17 papers in 2022, and 23 papers in 2023.


Classes of Data and DM Methods
The information extracted from the individual articles was grouped into six homogeneous classes: DM method, study objective, field, data type, DM type, and resource type. In a separate dataset, we compiled the study objective, the type of data under investigation, the DM methods applied, the DM type applied (supervised, unsupervised, and other), the validation metrics (if available), the DM method found to be most effective, and the number of rows and columns in the dataset used by the authors (if available). The set of classes was reduced to 20 features, which are summed up in Appendices D and F. The 63 selected articles provided 206 observations, 89 DM methods (50 of which were considered the best method), 4 survey purposes, 3 fields, 7 data types, 3 DM types, and 3 resource types (as detailed in Table 3 and Appendices D and F).
DM method: for each method, and for each method found to be the most effective (the best method among those applied by the authors), absolute frequencies were reported. This feature consists of the method(s) used by the authors (from 1 method to more than 10).
Study objective: the purpose for which the authors applied one or more methods in their article (from 1 up to 4). As survey objectives, we obtained X1 classifying (18%), X2 decision making (15%), X3 monitoring (16%), and X4 predicting (51%).
Field: the source from which the data came, in terms of construction process data, accident data, and health and safety risk management data. As fields, we obtained X5 construction process (38%), X6 occupational accident (34%), and X7 health and safety risk management process (28%).
Data type: the format in which the information is represented and made available to the authors for research purposes. The types of data investigated were X8 construction project (5%), X9 institutional dataset (70%), X10 interview report (2%), X11 literature data (3%), X12 narrative text (6%), X13 signal (10%), and X14 simulation (4%).
DM type: a grouping of the feature DM method into three classes. The need for this additional class is linked to the fact that some DM methods can be used as both supervised and unsupervised. As DM types, we found X15 supervised method (58%), X16 unsupervised method (24%), and X17 other method (18%).
Resource type: this feature was necessary to specify the field to which the authors' results referred, as in the case of accident data used to predict the outcome of a production process. As resource types, we found X18 process (63%), X19 environment resource (15%), and X20 plant and machinery resource (22%).
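As an illustration of this coding step, the following minimal R sketch shows how per-article observations of this kind could be rolled up into a method-by-feature frequency matrix of the type analyzed below; the records and column names are invented placeholders, not the actual review dataset.

```r
# Minimal sketch: aggregating per-article observations into a
# method-by-feature frequency matrix. All records below are hypothetical.
obs <- data.frame(
  method  = c("rf", "rf", "svm", "lstm", "nlp"),
  feature = c("predicting", "institutional_data", "predicting",
              "monitoring", "narrative_text")
)

# Contingency table of absolute frequencies: rows correspond to DM methods
# (89 in the paper), columns to the features X1..X20 (20 in the paper).
freq <- as.matrix(xtabs(~ method + feature, data = obs))
freq
```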

Cluster Analysis
Clustering is a significant approach in DM that aims to identify groups within datasets. In real-world applications, both numeric and categorical features are often used to define the data. Clustering analysis seeks to find the nature of groupings, or clusters, of data objects within an attribute space [72–74]. For an exploratory approach, we applied clustering analysis to the dataset in Appendix D. With this unsupervised ML approach, the algorithm processes input data and generates a sequence of clusters based on relational similarities with surrounding data points. The questions to answer with this DM method are "when do we stop combining clusters?" and "how do we represent clusters?". By applying hierarchical clustering (HC) and the appropriate indexes, we identified the optimal number of clusters in our data.
According to Chang and Mirkin, the silhouette index provided the best determination of the cluster number; the highest average silhouette width indicates the optimal number of clusters. The concept of silhouette width involves the difference between within-cluster tightness and separation from the rest. Specifically, the silhouette width s_i for entity i ∈ I is defined as

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$

where a_i is the average distance between i and all other entities in the cluster to which i belongs, and b_i is the minimum of the average distances between i and all entities in every other cluster. Silhouette width values range from −1 to 1. If the silhouette width of an entity is approximately zero, the entity could also be assigned to another cluster. If the silhouette width is close to −1, the entity has been incorrectly classified. If all silhouette width values are close to 1, the set I is well clustered [75]. As shown in Figure 3, the best aggregation of the dataset in Appendix D consists of two clusters, with a silhouette index of more than 0.7. We created the item groupings through an iterative hierarchical process of aggregating pairs of "most similar" groups of methods by calculating their dissimilarity (a "distance" satisfying the triangular inequality). We thus obtained the dendrogram, in which the Euclidean distance between the elements, the similarity, and the shape of the clusters are represented. Figure 4 shows the results of the hierarchical clustering (HC) for the DM methods included in Appendices C and D. The abscissa shows the DM methods, and the ordinate the Euclidean distances between them. The two red squares mark the two large clusters into which the DM methods have been aggregated according to their Euclidean distances. Specifically, the first box contains the first cluster, comprising RF (random forest), DT (decision tree), KNN (k-nearest neighbour), and SVM (support vector machine). The second, larger box contains the grouping of the remaining methodologies.
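As a minimal sketch of this step in R, assuming a numeric method-by-feature matrix like the one outlined above (replaced here by random placeholder data):

```r
library(cluster)  # provides silhouette()

set.seed(42)
X <- matrix(rnorm(89 * 20), nrow = 89)  # placeholder for the 89 x 20 matrix

d  <- dist(X, method = "euclidean")     # pairwise Euclidean distances
hc <- hclust(d, method = "ward.D2")     # agglomerative hierarchical clustering

# Average silhouette width for k = 2..8 clusters; the maximum indicates
# the optimal number of clusters (two, in the paper's data).
avg_sil <- sapply(2:8, function(k) {
  cl <- cutree(hc, k = k)
  mean(silhouette(cl, d)[, "sil_width"])
})
names(avg_sil) <- 2:8
avg_sil

plot(hc, main = "Dendrogram of DM methods")  # cf. Figure 4
```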

Principal Component Analysis (PCA)
The objective of PCA is to identify suitable linear transformations Y of the observed variables that are easily interpretable and capable of highlighting and synthesizing the information inherent in the initial matrix X. This tool is particularly useful when dealing with a considerable number of variables from which one wants to extract as much information as possible while working with a smaller set of variables [73,74]. The analysis was carried out on the data matrix that contains 89 individuals, corresponding to DM methods, and 22 quantitative variables (Appendix D).

Given a matrix X containing n features, it is possible to obtain a matrix of new data Y, consisting of p interrelated variables, which are linear combinations of the original ones. Each principal component can be expressed as follows:

$$Y_i = l_{i1}X_1 + l_{i2}X_2 + \cdots + l_{ip}X_p, \quad i = 1, 2, \ldots, p$$

The generic coefficient l_ij is the weight that the variable X_j has in determining the principal component Y_i [11,38]. The larger l_ij is (in absolute value), the greater the weight that the values of X_j (j = 1, 2, …, p) have in determining a given principal component [74]. The data extracted from the articles included in the review were organized into subclasses (Table 3) and grouped according to Appendix D. The linear correlation coefficients l_ij between each pair of standardized variables included in Appendix E are the ratio of the covariance to the product of the standard deviations of x_i and x_j (l_ij = σ_ij/(σ_i σ_j)). The Pearson correlation coefficient l_ij provides the intensity and direction of the linear relationship between the variables. The bold numbers express the significance of the correlation, given by p-values below 0.05 (Appendix E). Before conducting the PCA, we checked the linear relationships, the correlations between all quantitative variables, and the absence of outliers [74]. The correlation matrix suggested that the features in Appendix D be grouped for a more effective PCA. Proceeding with successive reiterations over different aggregations of features, we obtained the corresponding PCA performances. The tables and images in this section refer to the solution found to be the most concise and consistent with the results of the cluster analysis.
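A minimal R sketch of this step, assuming FactoMineR (whose conventions the inertia and dimension descriptions below follow) and the same placeholder matrix as above:

```r
library(FactoMineR)  # provides PCA()

set.seed(42)
X <- as.data.frame(matrix(rnorm(89 * 20), nrow = 89))  # placeholder data

# Standardized PCA: scale.unit = TRUE works on the correlation matrix,
# consistent with the use of Pearson correlations above.
res.pca <- PCA(X, scale.unit = TRUE, ncp = 5, graph = FALSE)

round(res.pca$var$cor[, 1:2], 2)  # variable-dimension correlations (loadings)
```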

Inertia Distribution
The dataset contains 89 individuals, corresponding to DM methods, and 20 features. Analysis of the graphs reveals no outliers. The inertia of the first dimensions shows whether there are strong relationships between variables and suggests the number of dimensions that should be studied. The first two dimensions of the analysis express 69.71% of the total dataset inertia; that is, 69.71% of the total variability of the cloud of individuals (or variables) is explained by the plane. This percentage indicates that the first plane effectively represents the data's variability. The first factor is the main one: it expresses 57.36% of the variability of the data (Figure 5).
In this case, the variability relating to the other components may be less significant despite the high percentage. The first axis carries an amount of inertia greater than the 0.95 quantile of random distributions. This observation suggests that only two axes carry information. Consequently, the description will stick to these axes.
The criteria for selecting the dimensions of the final model are threefold: the Kaiser rule, whereby eigenvalues must be greater than 1 (Table 4); the proportion of variance explained by the components, at least equal to 60–80% of the overall variability (Table 4); and the Cattell rule, according to which the right number of components corresponds to the elbow, or change in slope, in the component-eigenvalue graph (Figure 5). From these observations, it could also be appropriate to interpret dimensions greater than or equal to the second. The above criteria allowed us to assign a "label" to each component.
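Continuing the PCA sketch above, the three selection rules can be checked directly on the eigenvalue table returned by FactoMineR (a sketch on placeholder data, not the paper's actual output):

```r
# res.pca as fitted in the PCA sketch above.
eig <- res.pca$eig  # columns: eigenvalue, % of variance, cumulative %

eig[eig[, "eigenvalue"] > 1, , drop = FALSE]  # Kaiser rule: eigenvalues > 1
which(eig[, 3] >= 60)[1]                      # first dimension reaching 60% cumulative variance
barplot(eig[, 2], names.arg = seq_len(nrow(eig)),
        xlab = "Dimension", ylab = "% of variance")  # scree plot for Cattell's elbow
```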
Dimension 1 opposes individuals such as dt (32), knn (49), svm (81), and rf (69), located to the right of the graph and characterized by a strongly positive coordinate on the axis, to individuals such as MCDA C (58), characterized by a strongly negative coordinate on the axis (to the left of the graph).
Dimension 2 opposes individuals such as lstm (54), word2vec (88), nlp (63), and BIM (16), which are located at the top of the graph and characterized by a low positive coordinate on the axis, with individuals such as ann (8), adaboost (3), which have low negative coordinates on the axis and are located at the bottom of the graph (Figure 6).
Along Dim1, group 1 (dt, knn, svm, and rf) shares high values for the variables "predicting", "supervised", "monitoring", "frequency", "institutional data", "project-simulation-signal data", "classifying", "best method", and "interview-literature-text" (variables sorted from strongest to weakest). Group 2, characterized by a negative coordinate on the axis and containing the individual MCDA C (58), shares low values for the variables "interview-literature-text", "classifying", "frequency", "institutional data", "monitoring", "predicting", "supervised", "project-simulation-signal", "best method", and "other methods" (variables sorted from weakest to strongest). The variables "supervised" and "frequency" are highly correlated with this dimension (correlations of 0.94 and 0.98, respectively) and can therefore be summarized by dimension 1. Along Dim2, group 1 shares high values for the variables "not supervised" and "decision making", while group 2 does the same for "monitoring", "machinery", and "environment" (Tables 5 and 6). According to the correlations between the method variables and the axes, the x-axis (Dim1) can be renamed "Supervised methods (dt, knn, svm, and rf) applied to institutional data to classify and make inferences (predicting)". The y-axis (Dim2) can instead be renamed "Not-supervised methods (lstm, word2vec, nlp, and BIM) applied to project, simulation, signal, interview, literature, or textual data to make decisions and classify".
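The automatic description of the axes that supports this relabeling can be reproduced with FactoMineR's dimdesc() on the sketch objects above; the output format is FactoMineR's, while the values here would come from placeholder data:

```r
# Continues from the PCA sketch above (res.pca).
desc <- dimdesc(res.pca, axes = 1:2, proba = 0.05)

# Quantitative variables significantly correlated with each dimension,
# with correlation coefficients and p-values (cf. Tables 5 and 6).
desc$Dim.1$quanti
desc$Dim.2$quanti
```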

Meta-Analysis
The data from the complete collection of studies selected according to the PRISMA method, aggregated according to the classes defined in Table 1, allowed us to derive a single conclusive result that answered our research question. Through a meta-analysis, we assessed whether supervised methods were more effective than not-supervised ones across the various classes. The forest plot summarizes the results of the meta-analysis, which include the ORs with their CIs, the sample size weights, the heterogeneity of the data, and a quantitative, whole-data assessment of the effectiveness of treating the data with supervised methods (Figure 7).
The heterogeneity is null (the sets under study are compatible). The analysis of the groups shows that the CIs intercept the "no effect" line and lose significance when taken individually; however, they consistently overlap and are similar to each other. Figure 7 shows a generally positive trend toward data treatment with supervised methods (to the left of the "no-effect" line), summarized by an OR of 0.71 and a CI of 0.53–0.96.
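A minimal R sketch of this pooling step with the metafor package; the per-class 2 × 2 counts below are invented placeholders, not the review's actual data, so the numbers will not reproduce the reported OR:

```r
library(metafor)

# Hypothetical per-class 2x2 counts: "events" (e.g., the method judged best)
# among observations treated with supervised vs. not-supervised methods.
dat <- data.frame(
  class = c("classifying", "decision making", "monitoring", "predicting"),
  ai = c(12,  8,  9, 30),  bi = c(10, 9, 8, 20),  # supervised: events / non-events
  ci = c( 9, 10, 10, 18),  di = c( 6, 6, 5,  9)   # not supervised: events / non-events
)

dat <- escalc(measure = "OR", ai = ai, bi = bi, ci = ci, di = di, data = dat)
res <- rma(yi, vi, data = dat, method = "FE")  # fixed-effect pooling (null heterogeneity)

predict(res, transf = exp)                     # pooled OR and confidence interval
forest(res, atransf = exp, slab = dat$class)   # forest plot, cf. Figure 7
```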

Discussion and Future Directions
Studies focusing on DM techniques applied to the construction industry are recent, with the earliest dating back to 2014; therefore, the review covered studies from 2014 to September 2023. The number of articles in this sector increased from 1 in 2014 to 23 in 2023. Similarly, the total number of applied DM techniques increased from 5 between 2014 and 2016 to approximately 60 in 2023 (data not yet complete at the time of the survey).
In the construction process field, 20 out of 63 papers concerned the construction of buildings, dams, roads, and tunnels. Within this field, 60 out of the 206 observations covered topics such as construction delays [38]; crane, drilling, and excavation tasks [13,21,24,41,44,45,50]; geological conditions [55]; scaffolding collapse [51]; transport delays [56]; tunneling [19,26,27,37,42,58,70]; and worker and machinery location [43,71]. According to Erzaij et al., project suspensions are among the most persistent challenges facing the construction sector, due to the difficulty of the industry and the essential interdependence between the bases of delay risk. The influence of delays can lead to increased time, costs, disputes, litigation, and overall rejection. Their study aimed to develop a data prediction tool to examine and learn the sources of delay based on previous data from construction projects, using decision tree and naïve Bayesian classification algorithms. Among the prediction models developed by the authors and applied to 97 projects, the decision tree showed the highest accuracy [38]. Kumari et al. [71] investigated a machine learning architecture for excavator position detection using the Global Positioning System (GPS), which can guarantee an excavator and driver position remarkably close to the real one. Wang J. et al. [27] used the principal component analysis (PCA) approach to select input factors for the prediction of tunnel-boring machine (TBM) performance, particularly the travel speed. Liu et al. [15] developed a model capable of predicting tunnel-boring machine disc replacements based on a Gaussian-kernel support vector binary classification of cutting performance. After being trained on a period of historical data, the proposed model can predict whether cutter disc replacement is necessary, thus reducing the time required for periodic inspections. Lin et al. [43] investigated the feasibility of a real-time location service system using the Wi-Fi fingerprinting algorithm for the safety risk assessment of tunnel workers. A location algorithm based on received signal strength (RSS) and an artificial neural network (ann) were used for location analysis and risk assessment. Wei [51] developed wind speed prediction models based on various deep learning and machine learning techniques, in particular deep neural networks, long short-term memory neural networks, support vector regressions, random forests, and k-nearest neighbors. Subsequently, the author analyzed the wind force on the scaffold and assessed the probability of the scaffold collapsing under the action of the wind.
In the occupational accident field, 16 out of 63 papers dealt with data on accidents and injuries at work from 2014 to 2023. In the class "occupational accidents", 89 out of 206 observations covered the following topics: reporting of accidents [3,12,14,16,17,22,23,25,26,28,29,40,46,52,59] and days away from work. On this topic, Yelda et al. analyzed textual narratives to predict injury outcomes and days off work in a mining operation. For this purpose, they used decision trees, random forests, and ANNs, and the performance of these models was compared with that of logistic regression [47]. Lee et al. [16] proposed an optimized data preprocessing method to minimize the variables and main elements in diverse and complex work accident data and built an ML prediction model to achieve this. Specifically, they analyzed the correlations using a flood flow diagram and applied clustering and principal component analysis (PCA) to analyze the relationships between the main variables and draw broader conclusions. However, accidents are unevenly recorded in narrative form. Construction accident reports hold a wealth of empirical knowledge that could be used to better understand, predict, and prevent the occurrence of accidents in the construction sector. Large construction companies and federal agencies, such as the Occupational Safety and Health Administration (OSHA), hold these reports in the form of huge digital databases [17]. Zhang J. et al. [17] utilized accident narrative data obtained from the official OSHA website, presenting a new unified architecture with a bi-directional long short-term memory model (BiLSTM) and a convolutional layer for the classification of construction accident causes. Tixier et al. [22] and Zhan F. et al. [25] proved how safety attributes and outcomes can be automatically and accurately extracted from unstructured accident reports using natural language processing (NLP).
In the risk management field, 25 out of 63 papers dealt with data on the risk management process. In this class, 54 observations out of 206 concerned the following topics: awkward working postures [20,30]; compliance with health and safety standards [31,32,64]; risk assessment [2,4,48,57,60,65,68]; safety climate [33]; slope instability [18,63]; teaching-training tasks [34,49,62]; unsafe behaviors [35,36]; worker fatigue and heat stress [24,39,69]; and site images [66,67]. Antwi-Afari et al. [20] used deep learning networks to automatically extract relevant features with spatial-temporal dependence acquired by a wearable insole pressure system. The aim was to use deep learning-based networks and sensor data from wearable insoles to automatically recognize and classify types of awkward working postures in construction workers. To this end, they adopted recurrent neural networks (RNNs) and deep learning models to train on time series of plantar pressure data acquired from a wearable insole pressure sensor. Wang F. et al. [60] provided a strategic view of the relationships between different organizational objectives and the technical risks that may arise during the construction of a tunnel. They created a systems-based model integrating System Dynamics (SD), Bayesian Belief Networks (BBNs), and Smooth Relevance Vector Machines (sRVMs), called the Organizational Risk Dynamics Observer (ORDO). The model was applied to an urban metro project built in Wuhan, China, and was used to provide guidance on effective accident prevention strategies. Mostofi et al. [3] explored the predictive ability of a multilayer GCN algorithm that learns the connection between construction accidents and project types, arguing that richer information from existing safety and construction accident datasets, organized by project type, would provide better learning for the predictive model adopted and would supply more information for predicting the severity of accident consequences. The authors proved the effectiveness of the network representation of construction accidents in improving the learning capability of the ML model by using a feedforward reference network (FFN) algorithm with parameters similar to those used in the GCN algorithm to predict severity outcomes. The use of prefabrication is attracting increasing interest in the construction industry due to its sustainability, product quality, high production efficiency, and cost-effectiveness. Dealing with this topic, Zhu and Liu [68] developed a prediction and risk assessment model for the supply chain management of precast buildings, in which a BP neural network is used to predict the risk of the prefabrication supply chain.
Soil instability and landslides are major problems in the construction sector that can lead to safety risks for workers and the public, as well as considerable economic damage due to work stoppages. In this regard, Bay et al. [18] evaluated 102 cases of slopes with arc-shaped failure modes using eight machine learning regression methods. The slope safety factor prediction models were set up by performing cross-validation and hyper-parameter tuning. Furthermore, based on objective weighting and TOPSIS methods, a model was developed to evaluate the performance of the machine learning models and find the best factor-of-safety (FOS) prediction model. Sadeghi et al. [48] developed an Ensemble Predictive Safety Risk Assessment Model (EPSRAM) to assess the health and safety risks of workers on construction sites based on the integration of neural networks and fuzzy inference systems. The model introduces innovation in countries/regions such as Malaysia, where the construction industry is growing continuously but where studies on the OHS assessment of workers involved in construction activities are lacking. Such circumstances may expose construction workers to the risk of developing fatigue. If workers continue to work under fatigued conditions, they are prone to developing work-related musculoskeletal disorders (MSDs). Yu et al. [24] and Yan et al. [69] developed a combination of computer vision technology and biomechanical analysis for the non-intrusive whole-body fatigue monitoring of construction workers, using 3D model data from a motion capture algorithm and biomechanical analysis.
Zhao et al. [62] conducted a study on efficient, parallel, large-scale distributed DM and machine learning methods and algorithms and proposed an experiential teaching model focused on cultivating the independent learning ability and the subjective initiative of individual learners. The article, which could have been excluded from the review, was nevertheless kept as it combined the importance and technical challenges of the algorithms themselves with the practical application needs of the field. As an innovative teaching model, the experiential teaching model described in it focuses, among other things, on cultivating individual learners' independent learning ability and subjective initiative, which was found to effectively enliven the atmosphere of the working class/environment and improve the teaching effect. It was included as one of the articles that innovatively deal with the risk management process, including health and safety training in the workplace. Other studies, not included in the review, report analyses based on the effectiveness of combinations of Smart Construction Safety Technologies (SCSTs), potentially able to generate information useful for DM, and the measurement of the effectiveness of those same technologies, both alone and in combination [76]. Zerman et al. used machine learning to create a predictive model to help detect the factors most likely to affect fatal accidents due to falls from heights in the construction industry in Malaysia. To this end, the authors used institutional data from the Malaysian Department of Occupational Safety and Health records of occupational accidents and applied different machine learning models, such as random forest (rf), gradient boosting (gbdt), logistic regression (lr), naïve Bayes (nbb), multilayer perceptron (mlp), and knn. The model obtained from the random forest application was the best [61].
Regarding the type of data used in DM, 39 out of 63 papers dealt with institutional datasets (2016–2023), 8 used signal and video data (2014–2023), 4 used narrative texts (2016–2022), 5 used construction projects (2016–2023), 4 used literature data (2020–2023), 3 used simulations (2015–2023), and 2 out of 63 used interview reports (2023). The data used may have different characteristics with reference to specific aspects of an occupational injury, such as, for example, the body parts affected and the expected probability. Other studies focus on the observation of environmental and meteorological precursors of accidents, e.g., those associated with the collapse of scaffolding [19] and slope instability [18,63]. Liu et al. analyzed data from sophisticated and technologically innovative machine monitoring, capable of returning and processing geological and fault data and predicting TBM progress and maintenance, avoiding downtime and inspections [26]. According to Schindler et al. [70] and Leng et al. [42], the use of satellite data has proved to be a winning strategy compared to ground surveys. Data collected by sensors were used to assess the state of effort associated with the awkward working postures of workers while performing work on the construction site [20], or workers' physical fatigue and heat stress [69]. Another interesting use of data involves interviews with construction practitioners, through which process and occupational risk information is integrated [4].
According to Gondia et al., the factor that most determines delays in the construction sector is the late payment of the contractor. The authors used naïve Bayes and decision tree algorithms to predict project downtime. The decision tree showed an accuracy of approximately 89%, better than that obtained with naïve Bayes due to the type of data string. The model proposed by the authors was applied to approximately 97 projects and was found to have the potential to reduce delays and, consequently, costs [53]. Shirazi and Toosi used literature, interview, and project data. According to Shirazi et al., delays in construction are among the most important challenges in the sector, especially in infrastructure, where serious socio-economic consequences may occur. The authors identified 65 risk factors associated with delays through data derived from a comprehensive review of the literature and interviews, and applied principal component analysis. The resulting dataset was used to develop a deep multilayer perceptron neural network (mlp-nn) model to predict project delays. The use of a deep-nn (dnn) model showed that the addition of characteristic project data to the training dataset significantly improved the prediction performance of the deep-mlp [54].
By focusing on health and safety aspects, quality, in terms of the homogeneity and standardization of the various sources of institutional accident data included in the review, can be affected by the different methods of acquisition, from one institution to another and from one country/region to another. It can also be assumed that the data produced by the technologies and machines used in the processes have a higher degree of homogeneity and standardization than the former. Liu et al. underlined the significance of employing innovative and efficient safety management technologies, along with new management approaches and automated methods based on artificial intelligence, to promptly detect and eliminate risks. According to the authors, these innovative technologies would mitigate any deficiencies in site management, significantly improve site safety management, and eliminate risks at the source [26]. An increasingly widespread orientation towards the automated management of the site, or parts of it, would lead not only to an improvement in the health and safety of the processes but also to a significant improvement in the quality of the data coming from the construction field. It can be assumed that, soon, accident data collection techniques will not be able to function without innovative technologies capable of automatically acquiring information on near misses, accidents, and injuries in the construction sector.
Intelligent technologies can generate a range of data that pertain both to the individual (e.g., the worker) and to the interaction and connection between different technologies. The Internet of Things (IoT) is gradually spreading in the construction sector, thus making an important contribution to the production of new data. Robots and collaborative robots play a significant role in technological innovation and data extraction, as they can deliver quality in terms of productivity, product quality, and the standardization of production processes. Furthermore, these technologies have the potential to produce high-quality data, which could play a significant role in the pre-processing of data required for the use of DM techniques. The use of these technologies on construction sites is still limited due to unresolved difficulties attributable to the high variability of environmental conditions and the need to protect the secrecy of processes and the privacy of workers. Moreover, to accompany change, workers and enterprises need vocational and management training [1].
Regarding the "construction processes", "accidents", and "risk management" fields, the results of the PCA are consistent with the literature analysis included in the review.The main component, D1, associates supervised methods such as dt, knn, svm, and rf with the prediction and classification of data without giving indications on the type of data and resource.The D2 main component instead associates not-supervised methods (unsupervised and others) with monitoring and decision making.D2 specifies not only the methods and objectives to be achieved but also the type and source of the data.Therefore, unsupervised methods like lstm, word2vec, nlp, and bim are associated with projects, simulations, signals, interviews, scientific literature, and texts.In addition, D2 combines these methods with data on work equipment (e.g., machines and installations) and the working environment (wind, temperature, geological stability, etc.).The two main components, D1 and D2 (Tables 5 and 6), have the potential to guide the actors involved in the management of the data relative to yards.The latter concerns both construction processes and accidents and the management of health and safety risks on construction sites.Such evidence has been integrated with the results of the meta-analysis where better adaptability of the supervised methods is valued over those not supervised.The proposed approach has been useful in the association between methods (supervised and not) and data types, classified by type, source, field, and resources of the process from which the same data is derived.
The study shows, among other things, part of the technological development present in construction yards that has been captured by the scientific literature. However, the study has limitations due to the very origin of the information analyzed: differences determined by the object of investigation characterizing the individual papers included in the review. Another limitation is linked to the small size of the paper sample available for the survey, which could affect the significance of the results obtained. Any loss of significance of the data and results obtained could also be attributed to the absence of standardized production protocols and to the various levels of technology available on construction sites around the world in the reference time frame. In addition to these aspects, we also note the possible risks of bias due to confounding, the classification of the type and source of the data and of the objectives (predictive, monitoring, decision making, and classification), missing data, and the extent of the results.
Reducing risks at the source is the most effective measure for managing health and safety at work. Unfortunately, it is not always possible to reduce risks at the source; therefore, safety standards and the reduction in occupational accidents are achieved through prevention and protection measures (ILOSHA). An important aspect typical of construction companies is the increase in risks associated with organizational and production objectives. In this context, the use of advanced statistical and technological tools based on data mining can help, both in prevention measures and in measures to protect against occupational risks. The use of predictive techniques such as dt, knn, svm, and rf can be decisive in risk assessment, the key measure for risk prevention. Similarly, techniques such as lstm, word2vec, nlp, and BIM can aid in monitoring and decision making on site, for example, in the integrated management of safety and construction processes. Decision making can be applied to the choice of individual and collective prevention devices, the key measure of protection against occupational risks.
Future development of the study should focus on a larger and more homogeneous sample of sources, where the results are based on standardized and repeatable parameters resulting from data mining techniques. We believe that, despite the limitations of our work, the results obtained add value to the complex problem of data mining in this sector.

Conclusions
Cluster analysis and PCA were applied to data from articles that met the PRISMA eligibility criteria and were included in the review. The study indicates an association between the types of methods used and the objectives, scope, type of data, and resources under investigation. This association, based on correlation, was synthesized onto a single xy-plane (Dim1 and Dim2). The results of the PCA were consistent with those of the cluster analysis. Each of the two axes was assigned a label summarizing the significance of the entire review. The x-axis (Dim1) was labeled "Supervised methods (dt, knn, svm, and rf) applied to institutional data for classification and inference". The y-axis (Dim2) was labeled "Not-supervised methods (lstm, word2vec, nlp, and BIM) applied to projects, simulations, signals, interviews, the literature, or textual data to classify and make decisions". The meta-analysis, with an odds ratio (OR) of 0.71 and a confidence interval (CI) of 0.53 to 0.96, provides an overall estimate of the superior effectiveness of supervised methods compared to not-supervised ones.

Figure 1. PRISMA criteria for the selection of documents and eligibility flowchart. Source: Scopus and ProQuest data. Years: 2014 to September 2023.



Figure 2. Map of papers included in the review by country/region. Years: 2014 to September 2023. Source: Authors' processing from Scopus and ProQuest data. QGIS 3.18.3.


Figure 3. Silhouette test. Representation of the optimal number of clusters. Years: 2014 to September 2023. Source: Authors' processing from Scopus and ProQuest data.

Figure 5. Decomposition of the total inertia by axes. Dimension vs. percentage of variance.



Figure 7. Forest plot. Odds ratio (OR) and relative confidence interval (95% CI) for the total number of data mining methods analyzed, by class. The relative weight of each estimate in the analysis is marked with a box. The diamond represents the meta-analytical OR. Years: 2014 to September 2023. Source: Authors' processing from Scopus and ProQuest data. R.

Table 1. Query input for document search inclusion criteria. Source: Scopus and ProQuest data.

Table 2. Papers included in the review by source. Years: 2014 to September 2023. Source: Scopus and ProQuest data.

Table 3. Classification of content included in the 63 articles selected by the PRISMA method. Years: 2014 to September 2023. Source: Authors' processing from Scopus and ProQuest data.


Table 4. PCA. Eigenvalues, percentage of variance, and cumulative percentage of variance. Years: 2014 to September 2023. Source: Authors' processing from Scopus and ProQuest data. R.

Table 6. PCA. Axes descriptions and correlations between methods and axes. Years: 2014 to September 2023. Source: Authors' processing from Scopus and ProQuest data. R.

Table A3. DM methods included in the review. Source: Authors' processing from Scopus and ProQuest archives, R.

Table A5. Correlation matrix (l_ij) and analysis of significance. Bold numbers express the significance of correlations with p-values < 0.05. x1–x4 Study objective (classifying, decision making, monitoring, and predicting); x5–x7 Field (construction processes, occupational accidents, and risk management); x8–x14 Data type (project, institutional data, interview, literature, text, signal and video, and simulation); x15–x17 DM type (supervised, unsupervised, and other); x18–x20 Resource type (construction and H&S processes, environment, plant and machinery). Source: Authors' processing from Scopus and ProQuest archives and R.