Large Occupational Accidents Data Analysis with a Coupled Unsupervised Algorithm: The S.O.M. K-Means Method. An Application to the Wood Industry

Data on occupational accidents are usually stored in large databases by worker compensation authorities, and by the safety and prevention teams of companies. An analysis of these databases can play an important role in the prevention of accidents and the reduction of risks, but it can be a complex procedure because of the dimensions and complexity of such databases. The SKM (SOM K-Means) method, a two-level clustering system, made up of SOM (Self Organizing Map) and K-Means clustering, has obtained positive results in identifying the dynamics of critical accidents by referring to a database of 1200 occupational accidents that had occurred in the wood industry. The present research has been conducted to validate the recently presented SKM methodology through the analysis of a larger data set of more than 4000 occupational accidents that occurred in Piedmont (Italy), between 2006 and 2013. This work has partitioned the accidents into groups of different accident dynamics families and has quantified the severity and frequency of occurrence of these accidents. The obtained information may be of help to Company Managers and National Authorities to better address preventive measures and policies concerning the clusters that have been identified as being the most critical within a risk-based decision-making framework.


Introduction
Occupational accidents have an important effect on the economies of the whole world, as pointed out by Hamalainen et al. [1].
Reporting and analyzing occupational accidents in order to improve the data available for prevention purposes have been safety management requirements since 1923, when the First International Conference of Labor Statisticians first defined standards for accident classification. Since 1989, the EU has promoted various policies to reduce the frequency of occupational accidents. The Treaty on the Functioning of the European Union (article 153) in fact states: '[...] the Union shall support and complement the activities of the Member States in the following fields: (a) improvement in particular of the working environment to protect workers' health and safety; [...]'. In January 1990, the European Union launched a European Statistics study on Accidents at Work (ESAW), based on the International Labor Organization (ILO) standards. As a result of this project, the 'European Statistics on Accidents at Work-Methodology' was published by Eurostat in 2001 and a revised edition was released in 2013 [2].
ESAW describes each occupational accident by means of several parameters, and provides information about the dynamics, time, place, working situation and workers involved.
This large amount of information is analyzed by the EU National Health and Safety Authorities by means of traditional statistical methods, according to Regulation 1338/2008 and Regulation 349/2001 on Community statistics pertaining to public health and health and safety at work.
The results of this approach are published regularly in official reports by National Health and Safety Authorities, and they highlight useful general information on the trends of occupational accidents, such as the classes of workers most exposed to accidents, gender effects, the role of the educational level, the age of the injured and various other parameters. In addition, ESAW data have also been analyzed, with reference to a specific field of activity, through a statistical approach to investigate the cause-effect mechanism [3], and information about accident trends and "typical" accidents has been reported in the recent works by Dzwiarek et al. [4] and Kogler et al. [5]. However, these kinds of analyses are only useful to a certain extent to enhance the prevention of accidents in the work environment, as observed by Palamara et al. [6] and Comberti et al. [7], because they do not produce a risk assessment outcome [8].
In addition, the statistical analysis of data characterized by non-numerical variables, such as ESAW data, is very difficult, and it requires many a-priori assumptions and tests on the nature of the data distribution (e.g., a CHI-coefficient test). An alternative approach, which avoids relying on classical statistics, is to resort to data mining methods [9,10], which include several different data analysis techniques. Some interesting results, related to ESAW data, have in fact been obtained with Multiple Correspondence Analysis (MCA) [8] and Pattern Identification [11], which have allowed the most important accident scenarios to be identified, together with a quantification of the frequency of accidents, but they have not produced a quantification of the associated risk.
A powerful method that has been used in different analysis fields to support risk assessments is the SOM (Self Organizing Map), an unsupervised learning algorithm that generates topology-preserving transformations from a high-dimensional data vector space to a low-dimensional map space. In other words, with SOM, it is possible to view a set of multi-dimensional data in a 2-dimensional space, which facilitates data analysis.
SOM algorithms have been applied to different risk-classification problems. Gevrey used SOM to estimate the risk of the establishment of invasive species [12], Liang [13] proposed SOM to classify pipeline sections with the same risk level into different risk patterns, and Asgary [14] used SOM to classify and assess the risk levels of structural fire accidents.
Palamara et al. [6] proposed combining SOM with a clustering algorithm, as previously proposed by Vesanto and Alhoniemi [15], to identify the most critical groups of occupational accidents from ESAW data. This work produced promising results, but suffered from several numerical stability problems: the results fluctuated strongly when the analysis was repeated.
In 2015, these limitations were addressed by Comberti et al. [7], who published a sensitivity analysis and set up a revised method named "SKM" (SOM K-Means method). SKM also allows a quantification of the risk, made on the basis of the clustering partition, to be associated with the qualitative figures that are represented by SOM maps, and allows the results to be used as a decision-making support for prevention purposes, as suggested by Demichela et al. [16], and adopted by Murè et al. [17] and Comberti et al. [18].
This paper describes a research project that has focused on the application of SKM to a large database of occupational accidents that have occurred in the wood industry. The aims of the work have been to test the effectiveness of SKM with a larger data set than in the previous works and to identify occupational accident families, together with a quantification of their frequency of occurrence and severity. As discussed in Top et al. [19], the wood industry is mainly characterized by small and medium-sized enterprises (SMEs), whose operators are exposed to multiple hazard factors.
The analysis of the dynamics of accidents that have occurred can help occupational risk managers identify which hazards have led to the most occupational accidents, and which factors have contributed to the different dynamics, thus guiding prevention actions. Accident-dynamics data are in fact crucial for risk assessments and risk-based decision making, as discussed in Leva et al. [20] and Demichela et al. [16] for high voltage equipment; in Darabnia and Demichela [21,22] for the analysis of human and organizational factors pertaining to maintenance optimization; in Gerbec et al. [23,24] for the design of critical operations, or more in general, for a total safety management, as dealt with in Leva et al. [25,26].
A description of the methodology is given in Section 2. Its application to the wood industry data and the relevant results are shown in Section 3. A discussion and conclusions complete the paper.

The SKM Method
SOM is applied in SKM to coded data obtained from an occupational accident database. SOM can represent the occupational data set in a two-dimensional map. This process reflects the data similarity within occupational databases: accidents with similar descriptive parameters are projected into neighboring units and very different accidents are projected into distant units.
SKM has here been implemented in Matlab® 7.0 (MathWorks, Natick, MA, USA) with an interface designed in Excel® 2013 (Microsoft, Redmond, WA, USA). SKM has been structured in three phases:

1. A pre-processing procedure that pre-treats available data for the subsequent numerical processing;
2. SOM elaboration, which returns a visual map of the occupational accident domain;
3. K-Means calculation, which leads to the final clusters and accident partition.
The SKM structure is shown in Figure 1.

Pre-Processing Phase
The data set used in this study was taken from the INAIL (Italian institution for insurance against accidents at work) database, where accidents are reported according to the ESAW taxonomy.
Each accident is described by more than 20 variables, that is: Geographical location of the accident, time of occurrence, details about the injured party (activity, age . . .), dynamics of the accident (deviation from normal procedures, contact and mode of injury) and circumstances of the accident (workstation, working environment).
The combination of the number of elements and the large number of descriptive variables requires a great calculation effort. Furthermore, most of the variables are categorical, whereas the algorithms for SOM and K-Means calculation require numerical ones.
The method therefore requires a pre-processing phase to adapt the data from the occupational accident database to the requirements of the algorithms. The pre-processing phase overcomes these two drawbacks by means of a two-step coding procedure.
The first step is focused on the construction of an Accident Matrix (AM). The AM contains the occupational accidents that have to be processed; this matrix has a dimension D, which is obtained from:
$$D = n \times m \tag{1}$$
where n is the number of accidents, and m is the number of variables selected from among those available in the ESAW classification to describe each accident. Each variable can assume different values but, to limit the computational effort, these values are limited with respect to the hierarchical structure of the ESAW classification. Table 1 shows part of the ESAW taxonomy for the "Activity" variable: according to the coding procedure, the labels from 41 to 49, pertaining to "handling of objects", will be replaced by the upper level label 40, while the labels from 61 to 69, pertaining to "movement", will be replaced by label 60.
The second step involves numerical coding; each accident is coded from a sequence of categorical information to a sequence of numbers. As reported in Palamara et al. [6], each parameter is coded in a numerical vector that contains a sequence of zeros and a single 1. The union of the vectors that describe the variables used for the analysis leads to the complete coding of each accident. The resulting vector will have as many 1s as the variables and as many 0s as the total number of categories for all the variables, less the number of variables.
The "Input matrix" (IM) contains all the accidents coded into numerical vectors; its dimension ($D_{input}$) is obtained from:
$$D_{input} = n \times p \tag{2}$$
where n is the number of accidents and p is obtained from the number of variables multiplied by the number of categories used to describe them. For example, if an accident is described by 4 variables and each variable can have 5 possible different categories, the parameter p will have a value of 20.
This coding procedure is run automatically through the use of conversion tables that establish a one-to-one correspondence between categorical values and numerical vectors, as shown in Table 2.
At the end of the pre-processing phase, the AM that originally contained a group of selected occupational accidents is coded into the IM that contains an equivalent number of numerical vectors.
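The two coding steps can be illustrated with a minimal Python sketch (the method itself was implemented in Matlab). The conversion table `CATEGORIES`, the helper names and the sample records are hypothetical; only the collapse of labels 41-49 to 40 and 61-69 to 60 comes from the text, and the integer division by ten is an assumed way of obtaining the upper-level label.

```python
# Hypothetical conversion table: variable name -> ordered upper-level labels.
# Only labels 40 and 60 of "Activity" are taken from the text; the rest are placeholders.
CATEGORIES = {
    "Activity": [10, 20, 30, 40, 50, 60, 70],
    "Contact": [10, 20, 30, 40, 50],
}


def collapse(label):
    """Step 1: replace a detailed label (e.g. 41..49) with its upper-level label (40)."""
    return (label // 10) * 10


def encode_accident(accident):
    """Step 2: concatenate one one-hot vector per variable into a single coded row."""
    row = []
    for variable, levels in CATEGORIES.items():
        one_hot = [0] * len(levels)
        one_hot[levels.index(collapse(accident[variable]))] = 1
        row.extend(one_hot)
    return row


# Toy Accident Matrix (n = 2 accidents, m = 2 variables); values are illustrative only.
accident_matrix = [
    {"Activity": 43, "Contact": 51},
    {"Activity": 61, "Contact": 12},
]

# Input Matrix: n rows, p columns, where p is the total number of categories.
input_matrix = [encode_accident(a) for a in accident_matrix]
```

Each coded row contains exactly one 1 per variable, so here every row has two 1s and ten 0s.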

SOM Elaboration
With reference to Figure 1, the first level of SKM contains the Self Organizing Map (SOM) algorithm, which allows multidimensional vectors to be represented in a two-dimensional space, while preserving the topology of the multidimensional space.
SOM is based on a neural network scheme that is formed by two layers: The first layer is made up of the input vectors; the second layer is a map that is characterized by several units that are set by the user.
There are several ways of calculating SOM; SKM is configured with the "batch SOM" approach [27], which guarantees faster and more efficient performances for complex data sets than the traditional approach.
This approach uses an iterative calculation of matrices and it depends on the initial condition, as will be discussed later on.
The input data are fed as a single block, that is, "batch" [27], and the algorithm assigns a random vector of equal size as the input data, called "weight", to each unit during the initialization phase.
In the training phase, the algorithm calculates the Hamming distance [28] between IM elements and all the unit weights.
This is an iterative process in which, at each iteration, the input data set is presented as a batch to the SOM, and the algorithm calculates the distance between each input vector and each unit weight vector. As in a competitive learning algorithm, the units in the map layer compete to represent the input data and, for each input vector, the unit whose weight vector is closest to it wins the competition. This unit is called the 'Best Matching Unit' (BMU).
The weight vector values of the winning units are updated, at each iteration, together with those of the surrounding units, in order to make each output unit representative of a particular kind of input [29]. The magnitude of this update depends on the distance between the winning unit in the network and the other units, according to the Gaussian neighborhood function.
The value of the neighborhood function decreases with the distance from the winning unit. In this way, the weight of the units around the winner is modified, while it remains almost unaltered for distant units.
This ensures that the data projected into neighboring units are similar.
The process ends when each input data is coupled with a BMU.
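The training loop described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the Matlab implementation used in SKM: the rectangular grid, the linear shrinking of the neighborhood radius `sigma`, the default parameter values and all function names are assumptions. The distance is the sum of absolute differences, which coincides with the Hamming distance when both vectors are binary.

```python
import math
import random


def l1_distance(a, b):
    # Sum of absolute differences; equals the Hamming distance for binary vectors.
    return sum(abs(x - y) for x, y in zip(a, b))


def best_matching_unit(x, weights):
    # Index of the unit whose weight vector is closest to the input vector x.
    return min(range(len(weights)), key=lambda j: l1_distance(x, weights[j]))


def train_batch_som(data, rows, cols, iterations=20, sigma0=1.5, seed=0):
    rng = random.Random(seed)  # the seed fixes the random initialization
    p = len(data[0])
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    weights = [[rng.random() for _ in range(p)] for _ in grid]
    for t in range(iterations):
        # Assumed schedule: the neighborhood radius shrinks linearly over time.
        sigma = sigma0 * (1.0 - t / iterations) + 0.3
        # The whole data set is presented at once (batch) to find every BMU.
        bmus = [best_matching_unit(x, weights) for x in data]
        new_weights = []
        for j, (rj, cj) in enumerate(grid):
            numerator = [0.0] * p
            denominator = 0.0
            for x, b in zip(data, bmus):
                rb, cb = grid[b]
                d2 = (rj - rb) ** 2 + (cj - cb) ** 2
                h = math.exp(-d2 / (2.0 * sigma ** 2))  # Gaussian neighborhood
                denominator += h
                for k in range(p):
                    numerator[k] += h * x[k]
            if denominator > 0:
                new_weights.append([v / denominator for v in numerator])
            else:
                new_weights.append(weights[j])
        weights = new_weights
    return weights, grid


# Toy coded accidents (two variables with two categories each), illustrative only.
toy_data = [[1, 0, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [0, 1, 1, 0]]
weights, grid = train_batch_som(toy_data, rows=2, cols=2)
```

Because each new weight is a neighborhood-weighted average of the inputs, units close to a winner move strongly towards the data it represents, while distant units are barely affected.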
As mentioned above, this iterative process depends on the initial condition; in order to deal with this dependency, the SKM allows several independent initializations, named seeds, to be made, and these produce several different rough maps.
SKM evaluates, for each map, the topology preservation accuracy that describes how well the data, which are close in the input space, are projected to close units in the SOM.
The topology preservation accuracy is measured by the topographic error (TE), which is given by the following equation:
$$TE = \frac{1}{N} \sum_{i=1}^{N} u(x_i) \tag{3}$$
where N is the number of data, $x_i$ is the ith input vector and $u(x_i)$ is equal to 1 if the first and the second best matching units of $x_i$ are not adjacent units, otherwise it is zero.
The topographic error minimization leads to the identification of the best map among all those generated.
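A minimal Python sketch of the topographic error is given below. The adjacency rule (two units of a rectangular grid are adjacent when their row and column indices differ by at most one) and the use of the sum of absolute differences as distance are assumptions; the toy weights and grid are hypothetical.

```python
def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def topographic_error(data, weights, grid):
    """Fraction of inputs whose first and second BMUs are not adjacent on the map."""
    errors = 0
    for x in data:
        ranked = sorted(range(len(weights)), key=lambda j: l1_distance(x, weights[j]))
        (r1, c1), (r2, c2) = grid[ranked[0]], grid[ranked[1]]
        adjacent = max(abs(r1 - r2), abs(c1 - c2)) <= 1  # assumed adjacency rule
        if not adjacent:
            errors += 1
    return errors / len(data)


# Hypothetical 1 x 3 map: units 0 and 2 are not adjacent.
grid = [(0, 0), (0, 1), (0, 2)]
weights = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.2]]
data = [[0, 0], [1, 1]]
te = topographic_error(data, weights, grid)
```

In this toy map the first input has units 0 and 2 as its two best matches, which are not adjacent, while the second input has the adjacent units 1 and 2, so half of the inputs contribute an error. When several seeds are tested, the map with the smallest value returned by `topographic_error` would be retained.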
At the end of the training process, the map has organized itself by mapping input data into SOM units and, in particular, by connecting similar input data to neighboring units.
The number of units has to be chosen by the user. There is no objective criterion to set it and, as discussed in Comberti et al. [7], a rule of thumb is to set it to a lower value than the number of analysed occupational accidents.
The output of the training process is a bi-dimensional map and a numerical output that is represented by a matrix called SMap.
SMap contains the numerical code of the map, and the dimension of this matrix is obtained from:
$$D_{SMap} = U \times p \tag{4}$$
where U is the number of units of the map and p is the same as for Equation (2). Each element is characterized by a sequence of real numbers that represent the weights of each unit, which is also called the prototype vector [15]. The weights are basically proportional to the number and type of data that are projected into the corresponding unit; consequently, all the units without projected data are characterized by a similar prototype vector.
SKM defines a new matrix, called Clustering Matrix (CM), from SMap. CM contains a number of elements that is equal to the number of IM elements, and the prototype vector of the corresponding activated unit defines each element.
The CM matrix and the Cluster number, evaluated from the SOM map interpretation, are the input data for the second level of the method.
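The construction of the CM can be sketched in Python as follows: each coded accident is replaced by the prototype vector of its best matching unit. The function name, the distance (sum of absolute differences) and the toy matrices are assumptions made for illustration.

```python
def clustering_matrix(input_matrix, smap):
    """Return one prototype vector per input row (the prototype of its BMU)."""
    cm = []
    for x in input_matrix:
        bmu = min(
            range(len(smap)),
            key=lambda j: sum(abs(a - b) for a, b in zip(x, smap[j])),
        )
        cm.append(list(smap[bmu]))
    return cm


# Hypothetical SMap with U = 2 units and p = 2, and a toy IM with n = 3 rows.
smap = [[0.9, 0.1], [0.1, 0.9]]
toy_im = [[1, 0], [0, 1], [1, 0]]
cm = clustering_matrix(toy_im, smap)
```

The resulting CM has the same number of rows as the IM, but its rows take only as many distinct values as there are activated units.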

K-Means Elaboration
As mentioned in the introduction, the second level of clustering is based on a K-Means algorithm. K-Means is based on the concept of cluster centers, which are called 'centroids'. A centroid is a point in the data space that represents a cluster. The algorithm finds the positions of the cluster centroids in the input space, and minimizes an objective function E, the 'square-error distortion'. Starting from K randomly positioned centroids, each data item is first assigned to its nearest centroid.
After each data item has been assigned, the centroid of each cluster has clearly changed, on the basis of the positions of the data in the space and of the random initial position of the centroid. Therefore, a new cluster centroid is calculated in such a way that the sum of the squared distances is minimized.
The process continues with the calculation of the new distances between each input vector and each centroid, and with the re-assignment of the data to the nearest centroid. This process is repeated until no more changes occur. In other words, the algorithm ends when all the data have been assigned to their nearest centroids.
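The iteration described above corresponds to the standard K-Means procedure, which can be sketched in Python as follows. The squared Euclidean distance, the initialization from randomly sampled data points, the function names and the toy data are assumptions; the source does not specify these details for SKM.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(data, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Assumed initialization: k distinct data points chosen at random.
    centroids = [list(x) for x in rng.sample(data, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each item goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(x, centroids[c]))
            for x in data
        ]
        if new_labels == labels:  # no more changes: the partition is stable
            break
        labels = new_labels
        # Update step: the mean of the members minimizes the sum of squared distances.
        for c in range(k):
            members = [x for x, label in zip(data, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Square-error distortion E of the final partition.
    error = sum(squared_distance(x, centroids[l]) for x, l in zip(data, labels))
    return labels, centroids, error


# Toy data with two well-separated groups, illustrative only.
toy_points = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, centroids, error = k_means(toy_points, k=2)
```

Since the result depends on the random initial centroids, different seeds can produce different partitions on less clearly separated data.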
The K-Means algorithm requires three user-specified parameters: the number of clusters K, the cluster initialization and a distance metric.
The most critical choice is K. Although no perfect mathematical criterion exists, several heuristic criteria [30] are available to choose K.
The value of K in SKM is obtained from a SOM map visual evaluation. The CM matrix constitutes the input data for the K-Means algorithm.
The clustering phase provides a data partition that is summarized in a chart, where each occupational accident is attributed to a specific cluster, and a graphical output, dedicated to clustering visualization, is drawn, as shown in Figure 2.
The graph shows the distribution of activated units in the SOM map domain. Each unit is described by different colors, depending on the membership cluster. Each unit is marked by its own number (see the green circle in Figure 2), the number of projected elements (blue circle), and the cluster to which the unit belongs (red circle).
This graphical elaboration makes the comparison between several partitions easier, thus the evaluation of clustering accuracy becomes more immediate and intuitive.
With this visualization, it is also possible to carry out a comparison with the corresponding SOM map.

Case Study
This work has focused on the analysis of the occupational accident domain of the wood manufacturing industry in the north of Italy (the Piedmont Region).
The occupational accident data set was provided by INAIL (Italian National Compensation Authority) and was made up of more than 6000 elements.
Unfortunately, some reports were inaccurate as a great deal of information was missing, and this required a preliminary check of all the available data.
The analysis of the accident database related to the wood manufacturing sector was carried out according to the following criteria:

1. The scope of the study was linked to the accident dynamics analysis in order to define preventive measures and, as a result, six descriptive variables were selected: "Activity", "Deviation", "Deviation Material", "Contact", "Injured Body Part" and "Age of worker". The first five variables were selected because they are closely linked to the accident event; the "Age of worker" was selected to investigate whether there was a possible correlation between the worker's age and the dynamics of the accident.
2. In order to be selected for the AM matrix definition, it was necessary for the first four variables to all be populated at the same time in the accident record.
On the basis of these two criteria, the original data set provided by INAIL led to an AM matrix of 4600 acceptable events.

Coding
The second step involves the transition from the AM to the IM matrix through the coding phase. According to the criteria described in Section 2.1.1, 9 possible values were assumed for each variable and they were coded in a numerical sequence, as shown in Table 2; the whole coding table is reported in Appendix A (Table A1).
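The dimension of the IM matrix, according to Equation (2), with n = 4600 accidents and six variables of 9 values each, is:
$$D_{input} = n \times p = 4600 \times (6 \times 9) = 4600 \times 54$$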

SOM Elaboration and Analysis
The SOM was generated according to a strategy aimed at maximizing the map accuracy, as summarized hereafter:

1. The number of map units was set lower than the number of IM elements;
2. Several initialization seeds were tested, and the map was selected on the basis of a topographic error minimization criterion;
3. A balance between the elaboration time and accuracy was considered, according to the analyst's experience.
The SOM obtained for the case study with 25,000 seeds and 10,000 map units is shown in Figure 3. The visual analysis suggests the presence of at least 18 groups of similar occupational accidents. This value was used to set the K value required for the K-Means algorithm.

K-Means Clustering and Cluster Identification
As discussed above, numerical clustering is an iterative process, and it was here started from the K value that was obtained from the SOM visual analysis.
The final result is a chart of all the accidents clustered into groups on the basis of their numerical similarity; furthermore, a graphic view of the partition is obtained, as shown in Figure 2.
Several independent repetitions of clustering can provide results with a level of variability in the accident cluster attribution that generally involves 8-15% of the data.
In order to manage this numerical variability, two indices can be adopted, as defined in Comberti et al. [7]: "Sequence stability" (Ss) and "sequence membership" (Sm).
The Sm index is calculated for each element. It represents the cluster attribution sequence of that element related to multiple repetitions.
The Ss index represents the number of elements that have the same Sm index. Table 3 shows an example of the calculation of the Sm and Ss indices for a five-element cluster.

Table 3. Example of the calculation of the Sm and Ss indices (columns: Record; Clustering Repetition).
The Sm index for record n. 5 is AAAAAAA, while the Sm for record n. 2 is AAAABAC. All the elements that have an Sm without any changes in attribution are represented by an Ss level of 100%. In other words, all the elements that are denoted by a stable sequence of clustering have an Ss of 100%.
An Ss level equal to 85% corresponds to the number of elements that have an Sm with at least one variation in the cluster attribution.
A total of 85% of the examined data with a stable attribution had an Ss index level of 100%; the amount of stable attribution reached a coverage of 93% of the data for an Ss level equal to 85%.
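The two indices can be illustrated with a short Python sketch. The formal definitions are those of Comberti et al. [7]; here the stability level of an element is assumed to be the share of repetitions that agree with its most frequent cluster label, and the cluster labels are assumed to be already matched across repetitions. The function names are hypothetical; the two Sm sequences are those quoted above.

```python
from collections import Counter


def stability_level(sm):
    """Assumed reading of the Ss level: share of repetitions agreeing with the modal label."""
    most_common_count = Counter(sm).most_common(1)[0][1]
    return most_common_count / len(sm)


def coverage(sm_list, threshold):
    """Fraction of elements whose stability level reaches the given threshold."""
    stable = sum(1 for sm in sm_list if stability_level(sm) >= threshold)
    return stable / len(sm_list)


# Sm sequences of records n. 5 and n. 2 from the example above (seven repetitions).
sm_record_5 = "AAAAAAA"
sm_record_2 = "AAAABAC"

level_5 = stability_level(sm_record_5)  # all repetitions agree
level_2 = stability_level(sm_record_2)  # five of seven repetitions agree
share_stable = coverage([sm_record_5, sm_record_2], threshold=0.85)
```

Under this reading, record n. 5 is fully stable, whereas record n. 2 falls below the 85% threshold because two of its seven attributions differ from the dominant cluster.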
The use of these indexes allows the clustering stability to be quantified and helps the analyst in the clustering identification. This process leads to a new definition of the clusters as a "group of elements with an assigned sequence stability".
Considering the AM matrix of 4600 occupational accidents in the wood manufacturing sector, and the SOM map obtained that suggested 18 clusters, the K-Means algorithm phase run on three repetitions led to a cluster identification of 21 groups on the basis of an Ss index of 85%, which is represented in Figure 4.
A total of 93% of the data were automatically included in the identified clusters. Most of the remaining 7% was allocated by the SKM user to the different groups, depending on their level of similarity (272 elements), and 78 elements, which were characterized by a very unstable attribution, were all included in a specific cluster called "Other".

Results
The application of SKM to the described data set led to the identification of 21 clusters. It was possible to describe all of the clusters according to the level of homogeneity of the data contained within each cluster. For example, Cluster 3 (CL3), which is summarized in Table 4, contains 486 accidents, 94% of which are characterized by "Working with hand tools" as their "Activity". A total of 91% of the "Deviation" values correspond to "Losing control" and 73% of the "Deviation Material" values correspond to "Hand Tools". A total of 83% of the "Contact" values correspond to "Contact with Cutting Tool" and 82% of the "Injured Body Part" values are "Hands".
Tables that show the clustering descriptions with a measure of their homogeneity are reported in the annex (Appendix B, Tables A2-A7): The most frequent values of the six descriptive variables selected in the problem definition phase are shown for each cluster.
Some other results could be found by analyzing the number of events of each cluster and the related average days of prognosis.
Figure 5 shows the number of occupational accidents allocated to each cluster. This parameter ranges from a minimum of 40 for cluster 1-1 to a maximum of 486 for cluster 3. It can be used to estimate the relative frequency of the accident dynamics pertaining to each cluster.
The "Other" label contains a set of heterogeneous accidents that were not assigned to any of the defined clusters.
Figure 6 shows the average number of days of prognosis calculated for each cluster. This parameter showed a variability that ranged from 14.8 days/event for the "CL1-1" cluster to 54.5 days/event for the "CL17" cluster. The average days of prognosis may be used to express the severity of the accidents associated with each cluster, while the frequency of accidents and severity may be used to address preventive measures and policies for those clusters that are characterized by a higher risk.
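The way frequency and severity could be combined per cluster is sketched below in Python. The records and cluster names are hypothetical toy data, not values from the case study, and the product of frequency and average days of prognosis is only an assumed risk proxy; the source does not prescribe a specific risk formula.

```python
from collections import defaultdict

# Hypothetical records: (cluster label, days of prognosis). Not case-study data.
records = [("CL_a", 10), ("CL_a", 20), ("CL_b", 60), ("CL_a", 30), ("CL_b", 40)]


def cluster_summary(records):
    """Frequency, average severity and an assumed risk proxy for each cluster."""
    days = defaultdict(list)
    for cluster, prognosis_days in records:
        days[cluster].append(prognosis_days)
    summary = {}
    for cluster, values in days.items():
        frequency = len(values)              # number of accidents in the cluster
        severity = sum(values) / frequency   # average days of prognosis
        summary[cluster] = {
            "frequency": frequency,
            "severity": severity,
            "risk_proxy": frequency * severity,  # assumed: frequency x severity
        }
    return summary


summary = cluster_summary(records)
# Clusters ordered from the highest to the lowest assumed risk proxy.
ranking = sorted(summary, key=lambda c: summary[c]["risk_proxy"], reverse=True)
```

In this toy example the less frequent cluster ranks first because its accidents are, on average, much more severe, which shows why frequency alone is not sufficient to prioritize preventive measures.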

Figure 5. Number of events per cluster.
Figure 7 shows the average age of the workers in each cluster, which allows the accident types to be associated with the age of the workers. Company managers could thus focus on preventive measures (such as training) or protective measures (such as personal protective devices) according to the average age of the workers and the most relevant accident dynamics characterizing each cluster.

Opportunities for Prevention: SKM Data Clustering
The results reported in the previous sections highlighted useful information about the ability of SKM to group occupational accidents into clusters.
As far as the cluster descriptions are concerned, Tables A2-A7 show that most of the 21 clusters can easily be characterized by one or two values of three of the six descriptive parameters, namely the values that occur most frequently among the cluster elements.

The "Activity", "Deviation" and "Contact" variables are generally polarized on a single value, which in some cases covers more than 90% of the cluster elements; for example, in the "CL10" cluster, 99% of the occupational accidents showed the "Handling" label for the "Activity" variable. "CL11" is less polarized: the "Working with machinery" label covers 69% of the occupational accidents, while the "Manual transport" label covers 23%.
A more even distribution was observed for the "Deviation material", "Age" and "Injured body part" variables.
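The polarization percentages quoted above can be read as the share of a cluster's accidents that carry the most frequent label of a given variable. The following is a minimal sketch of that calculation; the function name `dominant_shares`, the record layout and the toy records are assumptions made for illustration and are not taken from the Piedmont database.

```python
from collections import Counter, defaultdict


def dominant_shares(records, variable):
    """Return {cluster: (most frequent value, its share)} for one variable."""
    by_cluster = defaultdict(list)
    for rec in records:
        by_cluster[rec["cluster"]].append(rec[variable])
    result = {}
    for cluster, values in by_cluster.items():
        value, count = Counter(values).most_common(1)[0]
        result[cluster] = (value, count / len(values))
    return result


# Hypothetical toy records (illustration only, not the real data set).
records = (
    [{"cluster": "A", "Activity": "Handling"}] * 9
    + [{"cluster": "A", "Activity": "Motion"}] * 1
    + [{"cluster": "B", "Activity": "Working with machinery"}] * 3
    + [{"cluster": "B", "Activity": "Manual transport"}] * 1
)
shares = dominant_shares(records, "Activity")
for cluster, (value, share) in sorted(shares.items()):
    print(f"{cluster}: {value} covers {share:.0%} of the accidents")
```

A share close to 100% indicates a strongly polarized variable, whereas a low share indicates that the variable contributes little to the identity of the cluster.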
The results reported in the tables in Appendix B suggest that SKM may be used to identify families of occupational accidents that differ in their accident dynamics, even though they share the same "Activity". For example, clusters 4, 5 and 16 had the same "Activity" value: "Motion".
"CL4" grouped accidents characterized by "Stress movements" as the main "Deviation" and "Physical effort" as "Contact". "CL5" grouped accidents characterized by "Fall" as the main "Deviation" and "Crushing" as "Contact", while "CL16" identified an accident dynamic similar to that of "CL5", but with a "Contact" value polarized toward "Contact with cutting tool".
The provided clustering description can easily be compared with additional information calculated for each cluster, with reference to the specific phenomenology of the wood industry.
Figure 5 shows the number of elements in each cluster. "CL3", "CL14" and "CL15" are characterized by the highest number of accidents.
This parameter can be taken as an estimate of the frequency of accidents and, consequently, can be used to decide on the resources and measures necessary for those clusters identified as the most critical. Another useful piece of information that can support Safety Managers is the average days of prognosis, as summarized in Figure 6.
Taking the above-described "CL4", "CL5" and "CL16" clusters as an example, the average days of prognosis ranged from 36.8 days/event ("CL4", stress movements due to physical effort) to 39.5 days/event ("CL5", falls), and reached 52.8 days/event, the highest value of the three, for "CL16", that is, occupational accidents due to contact with cutting tools. On the other hand, the "CL4" and "CL16" clusters are only moderately populated, while "CL5" is one of the most populated; thus the dynamics therein are among the most frequent in the wood industry. Moreover, with reference to Figure 7, the accidents resulting from contact with tools appear to involve older operators, while those related to movement involve younger workers; prevention and protective measures may therefore also be targeted according to age.
According to these results, the SKM method is able to distinguish groups of occupational accidents characterized by different dynamics, and to associate a different quantification of accident frequency and seriousness with each group.
As a consequence, a Risk index was calculated according to the following equation:

$$R = F \times S \qquad (5)$$

where R is the risk, F is the frequency of occurrence, calculated as the number of occupational accidents per day, and S is the seriousness, calculated as the average days of prognosis. Table 5 summarizes the Risk estimation for all the identified clusters. Risk shows a wide range of variation, from 1.6 for "CL1-1" to 12 for "CL3". SKM has thus been able to identify clusters of accidents in the wood industry and to classify them in terms of lower or higher risk levels. For example, the most critical clusters were "CL3" and "CL5", which are related to manual work with hand tools ("CL3") and to falls during manual transport or movements ("CL5"). Associating a Risk assessment with each cluster may in fact support any decision-making process focused on the planning of preventive measures.
For example, the high risk of "CL5" suggests there is a need to review the design of the workplace organization in order to optimize the workers' movements inside the working area.
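The risk ranking described above can be illustrated with a short sketch. It assumes that the Risk index is the product of frequency and seriousness, with the frequency expressed as accidents per day over an observation period; the function name `risk_index`, the cluster names and all the numbers are hypothetical and do not reproduce Table 5.

```python
def risk_index(n_accidents, observation_days, mean_prognosis_days):
    """Assumed form R = F * S: F in accidents/day, S in prognosis days/accident."""
    if observation_days <= 0:
        raise ValueError("observation_days must be positive")
    frequency = n_accidents / observation_days
    return frequency * mean_prognosis_days


# Hypothetical clusters: (accidents, observation days, mean days of prognosis).
clusters = {
    "X": (100, 1000, 20.0),
    "Y": (400, 1000, 30.0),
    "Z": (50, 1000, 50.0),
}
risks = {name: risk_index(*values) for name, values in clusters.items()}

# Rank the clusters from the most to the least critical.
ranking = sorted(risks, key=risks.get, reverse=True)
for name in ranking:
    print(f"{name}: R = {risks[name]:.2f}")
```

In this toy case the most severe cluster ("Z") is not the most critical one, because its low frequency outweighs its long prognosis; this is the kind of trade-off that a combined index makes visible.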

Opportunities for Prevention: Traditional Data Analysis
Economic and technical resources can be allocated to the prevention of occupational accidents on the basis of the information obtained with the SKM method.
This result cannot be achieved directly with a traditional statistical approach, as mentioned in the introduction. In fact, a statistical analysis performed on an occupational accident database pertaining to the wood industry [31] provided many diagrams and graphical views of the distribution of all the variables used in an ESAW classification. However, this large amount of information did not lead to the identification of occupational accident clusters, nor was it intended to produce a risk quantification, as SKM did. An example is shown in Figure 8, which reports the distribution of three variables that affected the accident dynamics.
Compared to other ESAW data mining techniques, such as MCA [8], the use of SKM offers two main advantages:
1. The SKM analysis performed here was based on six parameters (as described in Section 2.2), but all the other accident details included in the database remained linked to each single accident and could be used to describe the identified clusters. In the proposed study, this was done with the "days of prognosis" parameter, and it led to a risk assessment classification, but it could also be done with any of the other connected parameters, such as "number of workers employed", "time of accident occurrence", and so on, thus making it possible to conduct several quantified analyses.
2. SKM is a user-friendly method, as it does not require any specific expertise in statistics or data analysis. In fact, once the data set has been coded automatically into the format required by SKM, the user simply has to set the number of "SOM units", the number of iteration cycles, and the number of clusters into which the data set should be divided on the basis of the SOM map. This makes the SKM method easier to apply to ESAW data than other more complex data mining techniques.
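The three user settings mentioned above (number of SOM units, number of training cycles, number of clusters) can be made concrete with a minimal two-level clustering sketch. It is not the authors' implementation: it assumes numerically coded accident vectors, a small rectangular SOM trained online, and a plain Lloyd's K-Means applied to the SOM units with a fixed number of clusters. The selection of the number of clusters through the Sm and Ss indices is not reproduced, and all function names and data are hypothetical.

```python
import math
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, centers):
    """Index of the center closest to vec."""
    return min(range(len(centers)), key=lambda i: sq_dist(vec, centers[i]))


def train_som(data, rows, cols, n_iter, seed=0):
    """Online SOM on a rows x cols grid; returns one codebook vector per unit."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    codebook = [list(rng.choice(data)) for _ in grid]
    radius0 = max(rows, cols) / 2.0
    for t in range(n_iter):
        frac = t / n_iter
        lr = 0.5 * (1.0 - frac)                     # decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.5)   # shrinking neighbourhood
        x = rng.choice(data)
        bmu = nearest(x, codebook)                  # best matching unit
        for u, pos in enumerate(grid):
            h = math.exp(-sq_dist(pos, grid[bmu]) / (2.0 * radius ** 2))
            codebook[u] = [w + lr * h * (xi - w) for w, xi in zip(codebook[u], x)]
    return codebook


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's K-Means; returns the k centroids."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        for i, g in enumerate(groups):
            if g:  # keep the previous centroid if a group is empty
                centers[i] = [sum(col) / len(g) for col in zip(*g)]
    return centers


def skm(data, rows, cols, som_iter, k, seed=0):
    """Two-level clustering: SOM units first, then K-Means on the units."""
    codebook = train_som(data, rows, cols, som_iter, seed)
    centers = kmeans(codebook, k, seed=seed)
    unit_cluster = [nearest(w, centers) for w in codebook]
    # Each accident inherits the cluster of its best matching unit.
    return [unit_cluster[nearest(x, codebook)] for x in data]


# Hypothetical coded accidents: two groups of identical binary vectors.
data = [[0.0, 0.0]] * 20 + [[1.0, 1.0]] * 20
labels = skm(data, rows=3, cols=3, som_iter=500, k=2)
```

Because each accident keeps its index in `data`, any other field stored with the accident, such as the days of prognosis, can afterwards be aggregated per cluster label.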

Conclusions
This paper has focused on the validation of a numerical methodology for analyzing an occupational accident database (DB), with the aims of improving the data analysis, reducing risks and supporting the definition of preventive measures.
A data set of more than 4000 occupational accidents that had occurred in the wood industry was selected as a case study and analyzed with the SKM method. SKM was able to successfully identify a set of 21 clusters of accidents based on six variables related to the occurrence dynamics, the injured body part and the age of the involved workers.
The variable distribution of each cluster highlighted that the partition was steered by the four dynamics-related variables, while the distributions of the age of the workers and of the injured body part were more scattered. Other parameters, related to the consequences of each accident (number of days of prognosis) and to the number of events (number of accidents), were calculated and associated with each cluster, and this allowed a risk assessment to be made.
The two most critical clusters, according to the risk assessment, were related to "manual activity with hand tools" and to "free movements/manual transport" in the working area. This information suggests, for example, the need to design a different work organization in order to reduce the workers' movements inside the workplace.

The results highlight that the proposed methodology represents an advancement in the analysis of occupational accident DBs, since it makes it possible not only to identify the distribution of single parameters, as conventional statistics do, but also to rank families of accident dynamics according to such relevant parameters as severity or risk.
More generally, SKM can help Company Management and the National Authorities to direct preventive measures and policies toward those clusters that have been identified as the most critical on the basis of the risk quantification. This additional information can be used to support risk-based decision-making processes, because it provides a quantification of the risk linked to the defined occupational accident groups.

Appendix B
This appendix reports the descriptions of the clusters.

Safety 2018, 4 ,
x FOR PEER REVIEW 4 of 23

Figure 3. Self Organizing Map (SOM) of the Accident Matrix (AM) based on 10,000 units.

Figure 4. Cluster identification on the basis of the Sm and Ss indices.



Figure 7. Average age of the workers of each cluster.

Figure 8. Distribution of the dynamics variables of the wood industry for the Veneto occupational accident database.

Table 1. European Statistics on Accidents at Work (ESAW) hierarchical classification: the upper and lower levels.

Table 2. Coding table for the ESAW "contact" variable.