A Comparative Study on the Prediction of Occupational Diseases in China with Hybrid Algorithm Combing Models

Occupational disease is a major problem in China, and many workers are at risk. Accurate forecasting of occupational disease incidence can provide critical information for prevention and control. In this study, five hybrid algorithm combing models were therefore assessed for their effectiveness and applicability in predicting the incidence of occupational diseases in China. The hybrid models combine five grey models (EGM, ODGM, EDGM, DGM, and Verhulst) with five state-of-the-art machine learning models (KNN, SVM, RF, GBM, and ANN). Model quality was assessed by prediction accuracy, measured by the mean absolute percentage error (MAPE) and root-mean-squared error (RMSE). The GM-ANN model provided the most precise prediction among all the models, with the lowest MAPE (3.49%) and RMSE (1076.60). The GM-ANN model can therefore be used for precise prediction of occupational diseases in China, which may provide valuable information for their prevention and control in the future.


Introduction
Occupational diseases are any health conditions that are primarily due to exposure to risk factors arising from work-related activities [1]. According to the WHO, the occupational population currently accounts for around 50.0% of the global population [2]. The ILO reports that 2.34 million deaths result from work-related accidents or diseases worldwide each year, of which 2.02 million are from work-related diseases. In addition, 160 million people suffer from nonfatal work-related diseases. Occupational diseases have become the leading cause of death among workers [3]. The economic losses caused by work-related diseases and accidents account for 4.0%-6.0% of the gross domestic product of the countries and regions concerned [4]. In 2017, China's total population was 1.39 billion. With the largest labor force in the world, its occupational population was 776 million (55.8%), including 286 million migrant workers (20.6%) [5]. According to incomplete statistics, about 200 million workers in China are exposed to various occupational hazards. Among them, more than 16 million work in toxic and harmful enterprises, spanning more than 30 types of industries [6]. They are exposed to various occupational hazards in the course of their work, which cause occupational health damage and even occupational disease-related death. However, occupational diseases are latent and easily neglected. The number of new occupational disease cases in China increased roughly 2.5-fold from 2003 to 2016, rising from 12,511 in 2003 to 31,789 in 2016 [7,8]. In the United States, occupational disease also accounts for an estimated 50,000 to 70,000 deaths and 350,000 new cases of illness each year [9]. The occupational health problem is worldwide, but relatively more workers are at risk in China because of the relatively larger proportion of the occupational population. The best way to prevent and control disease is to predict it ahead of time.
In contrast to the field of medicine, where prediction research is well established [10][11][12][13], prediction research is relatively new in the field of occupational health [14][15][16]. Accurate forecasting of occupational diseases can be achieved by analyzing sufficient historical data. However, data collected by current public health surveillance systems do not cover detailed essential information, which is often difficult to obtain in China and in most other developing countries. Limited data affect the establishment of prediction models and result in large prediction bias. Therefore, building an accurate predictive model with limited data is very challenging in practice.
A solution for prediction with limited data was proposed by Deng in 1982. He established the grey systems theory, which shows great capability for studying uncertainty problems with poor information, small sample sizes, uncertain systems, and a lack of data. This theory focuses on poor-information systems with partially unknown information [17]. It has been widely and successfully applied in many fields, such as social, scientific, industrial, managerial, agricultural, technological, geological, and medical systems [18][19][20][21][22][23][24][25][26][27][28][29][30][31][32][33][34][35][36], but it is rarely used in occupational health, especially in the prediction of occupational diseases.
Prediction accuracy comes from appropriate model selection with relevant features. At present, most good prediction models come from data mining methods. Data mining is a popular interdisciplinary research field that draws on mathematics, statistics, computer science, and related disciplines, including statistical sampling, estimation, hypothesis testing, artificial intelligence, machine learning, pattern recognition, modeling technology, model optimization, and visualization technology. It involves tasks such as classification, estimation, prediction, association, and clustering. It also requires enough features to build models.
Therefore, modeling and forecasting with limited data is a challenging task, as in the case of occupational diseases.
In this study, we combined the grey systems theory and machine learning methods to address this issue. Five GM models were used: the even grey model (EGM), original difference grey model (ODGM), even difference grey model (EDGM), discrete grey model (DGM), and Verhulst model.
The fitted values that the GM models produced from the occupational disease data were used as training data for the machine learning models. Five state-of-the-art machine learning models were used in this study: K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Random Forest (RF), Gradient Boosting Machine (GBM), and Artificial Neural Network (ANN). To the best of our knowledge, this is the first time that these five hybrid algorithm combing models have been used to predict occupational diseases. The effectiveness and applicability of the models were assessed based on their ability to predict the incidence of occupational diseases in China.

Methodology
2.1. Data. Cases of occupational diseases from 2005 to 2017 were obtained from the National Health Commission of the People's Republic of China.

Data Normalization. The incidence of occupational disease for 2006 was the statistical summary of 29 provinces nationwide, whereas the cases from 2015 to 2017 were the summary of 31 provinces. The other years were the statistics of 30 provinces. To improve prediction accuracy, we standardized the data by dividing each year's incidence by the number of provinces reporting in that year, so that the numbers of occupational diseases in different years during 2005-2017 were comparable. Figure 1 illustrates the raw data; the X-axis represents the year, and the Y-axis represents the number of occupational diseases. We split the data into two parts: the first 10 years (2005 to 2014) were used as the training set, and the remaining 3 years (2015 to 2017) were used as the testing set.
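The normalization and split can be illustrated with a short sketch. It is written in Python purely for illustration (the study itself reports using R), the function names are hypothetical, and the case counts are placeholders rather than the study's data; only the province counts and the year ranges come from the text.

```python
def provinces_in(year: int) -> int:
    """Reporting provinces per year, as stated in the text."""
    if year == 2006:
        return 29
    if 2015 <= year <= 2017:
        return 31
    return 30


def normalize(cases_by_year):
    """Divide each year's case count by the number of reporting provinces."""
    return {year: cases / provinces_in(year) for year, cases in cases_by_year.items()}


def split_by_year(series, last_train_year=2014):
    """Years up to last_train_year form the training set, later years the testing set."""
    train = {y: v for y, v in series.items() if y <= last_train_year}
    test = {y: v for y, v in series.items() if y > last_train_year}
    return train, test


# Placeholder counts only (NOT the study's data): 1000 cases per reporting province.
raw_cases = {year: 1000 * provinces_in(year) for year in range(2005, 2018)}
per_province = normalize(raw_cases)
train, test = split_by_year(per_province)
print(len(train), len(test))  # 10 3
```

With 13 yearly observations, this split leaves 10 points for training and 3 for testing, which is why small-sample methods are needed.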

Methods.
The proposed method is based on the grey systems theory and five state-of-the-art machine learning models, i.e., K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Random Forest (RF), Gradient Boosting Machine (GBM), and Artificial Neural Network (ANN). All the models were run in the R programming language (version 3.6.1). Table 1 lists the models, programming languages, libraries, and parameter adjustments used in this study. The steps of the hybrid algorithm combing models are as follows.
Step 1. Training the GM models: to obtain the training set for the KNN, SVM, RF, GBM, and ANN models, the five GM models, i.e., even grey model (EGM), original difference grey model (ODGM), even difference grey model (EDGM), discrete grey model (DGM), and Verhulst, were fitted to the training set of China occupational disease data from 2005 to 2014. Their fitted values serve as the input of the five hybrid algorithm combing models.
Step 2. Training the five hybrid algorithm models: the KNN, SVM, RF, GBM, and ANN models were trained, with different parameter settings, on the fitted values obtained in Step 1. The five models were then validated with the testing set of China occupational disease data from 2015 to 2017.
Step 3. Model validation and selection: we compared the models using the mean absolute percentage error (MAPE) and root-mean-squared error (RMSE) as key performance indicators (KPIs). The flowchart of the method is shown in Figure 2.
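To make Step 1 concrete, the following is a minimal sketch of a basic GM(1,1) fit, the textbook form of the even grey model. It is an illustration in Python, not the authors' R implementation; the function names `fit_gm11` and `gm11_values` are hypothetical, the input series is synthetic, and the other four grey models and the machine learning stage are not shown.

```python
import math


def fit_gm11(x0):
    """Fit a basic GM(1,1) model; return (a, b) of x0(k) + a * z1(k) = b."""
    # 1. Accumulated generating operation (1-AGO): running sum of the series.
    x1 = []
    total = 0.0
    for value in x0:
        total += value
        x1.append(total)
    # 2. Background values: mean of consecutive accumulated values.
    z = [0.5 * (x1[k] + x1[k - 1]) for k in range(1, len(x1))]
    y = list(x0[1:])
    # 3. Ordinary least squares for y = -a * z + b (closed form, one regressor).
    n = len(z)
    mean_z = sum(z) / n
    mean_y = sum(y) / n
    szz = sum((zi - mean_z) ** 2 for zi in z)
    szy = sum((zi - mean_z) * (yi - mean_y) for zi, yi in zip(z, y))
    slope = szy / szz
    return -slope, mean_y - slope * mean_z


def gm11_values(x0, a, b, horizon=0):
    """Fitted values for the sample plus `horizon` forecasts (assumes a != 0)."""
    steps = len(x0) + horizon
    # Time response of the whitening equation on the accumulated scale.
    x1_hat = [(x0[0] - b / a) * math.exp(-a * k) + b / a for k in range(steps)]
    # Inverse AGO: difference back to the original scale.
    return [x0[0]] + [x1_hat[k] - x1_hat[k - 1] for k in range(1, steps)]


# Synthetic series growing by 20% per step (NOT the study's data).
series = [10.0, 12.0, 14.4, 17.28, 20.736]
a, b = fit_gm11(series)
values = gm11_values(series, a, b, horizon=2)  # 5 fitted values + 2 forecasts
```

In the hybrid scheme described above, the fitted values from each grey model would become one feature column for the machine learning models, with the observed counts as the target.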

Metrics.
We compared different models using the mean absolute percentage error (MAPE) and root-mean-squared error (RMSE) as key performance indicators (KPIs):

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left( A_t - F_t \right)^2},$$

where $A_t$ is the actual value, $F_t$ is the forecasted value, and $n$ is the number of observations.

Table 2 presents the number of occupational diseases from 2005 to 2017 and the fitted values from the five GM models. Figure 3 presents the real values and the fitted values of the GM models. Compared with the real values, the fitted values of the GM models are not accurate enough, although they capture the general trend. Table 3 shows the prediction accuracy of the GM models. The MAPE values of all GM models are very similar; therefore, to keep all information from the original dataset, we adopted all the fitted values as the training set to train the KNN, SVM, RF, GBM, and ANN models.
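Both indicators can be computed directly from paired actual values (A_t) and forecasted values (F_t). The sketch below uses the standard definitions of MAPE and RMSE; it is in Python for illustration only, and the example numbers are arbitrary rather than taken from the study.

```python
import math


def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root-mean-squared error, in the units of the data."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


# Arbitrary illustrative values (NOT the study's data).
actual = [100.0, 200.0]
forecast = [110.0, 180.0]
print(mape(actual, forecast))  # each forecast is off by 10%, so MAPE is about 10
print(rmse(actual, forecast))  # square root of (100 + 400) / 2
```

MAPE is scale-free, while RMSE is in the units of the series and penalizes large errors more heavily, which is why the two are reported together.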

GM Models.
To verify whether model selection should be based on the MAPE and RMSE of the GM models, we first selected as training data the fitted values from the GM models with the lowest MAPE and RMSE. However, after checking the permutations and combinations of the GM models, we found that the best model was the one using all the fitted values from the GM models, regardless of their MAPE and RMSE values.
This process can be tested with the Occupational Diseases Prediction Online Analysis Platform (http://predict.xjyg.net:666).
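One way to organize such a search over combinations of GM models is to enumerate every non-empty subset and score each one. The paper does not describe how the search was implemented, so the sketch below is an assumption: the helper names are hypothetical, and `score_fn` stands in for whatever routine trains a model on a subset and returns its error (for example, the MAPE on the testing set).

```python
from itertools import combinations

GM_MODELS = ["EGM", "ODGM", "EDGM", "DGM", "Verhulst"]


def candidate_feature_sets(models=GM_MODELS):
    """Every non-empty subset of the GM models, smallest subsets first."""
    return [
        subset
        for size in range(1, len(models) + 1)
        for subset in combinations(models, size)
    ]


def select_best(score_fn, models=GM_MODELS):
    """Return the (subset, score) pair with the lowest score."""
    scored = ((subset, score_fn(subset)) for subset in candidate_feature_sets(models))
    return min(scored, key=lambda pair: pair[1])


# Toy scoring rule for demonstration only: more GM inputs give a lower "error".
best_subset, best_score = select_best(lambda subset: 1.0 / len(subset))
print(len(candidate_feature_sets()))  # 31
print(best_subset)
```

Five grey models give only 31 candidate subsets, so an exhaustive search of this kind is cheap even when each candidate requires retraining a model.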

GM-KNN Models.
We built models with both the conventional KNN method and the weighted KNN method. In the conventional KNN method, cross-validation selected k = 2 as the most suitable parameter. In the weighted KNN method, we chose inverse weighting and k = 2 through cross-validation and grid search. Figure 4 presents the real values and the fitted values of the two GM-KNN models. Compared with the real values, the fitted values of the GM-KNN models follow the general trend well for the training set but are not accurate enough for the testing set, so the GM-KNN models were not considered further.
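The difference between the conventional and weighted variants can be seen in a small sketch of k-nearest-neighbor regression with k = 2. This is a Python illustration, not the R code used in the study; the function name is hypothetical, the weighting is assumed to be inverse-distance weighting, and the one-dimensional data are synthetic. In the hybrid setting, each training row would hold the GM fitted values for one year and the target would be the observed count.

```python
import math


def knn_predict(train_x, train_y, query, k=2, weighted=False):
    """Predict one target value from the k nearest training rows."""
    nearest = sorted(
        (math.dist(row, query), target) for row, target in zip(train_x, train_y)
    )[:k]
    if not weighted:
        # Conventional KNN: plain average of the neighbors' targets.
        return sum(target for _, target in nearest) / len(nearest)
    # Weighted KNN (assumed inverse-distance weights); an exact match wins outright.
    for distance, target in nearest:
        if distance == 0.0:
            return target
    weights = [1.0 / distance for distance, _ in nearest]
    return sum(w * t for w, (_, t) in zip(weights, nearest)) / sum(weights)


# Synthetic one-feature data (NOT the study's data).
train_x = [(1.0,), (2.0,), (4.0,)]
train_y = [10.0, 20.0, 40.0]
plain = knn_predict(train_x, train_y, (2.25,))                 # mean of 20 and 10
inverse = knn_predict(train_x, train_y, (2.25,), weighted=True)  # pulled toward 20
```

Because KNN averages observed targets, its predictions cannot leave the range of the training data, which is one plausible reason it tracks the training set well but falls short on later years of a rising series.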

GM-SVM Models.
We built four SVM models with linear, polynomial, radial, and sigmoid kernels, respectively, and cross-validation was also applied. Figure 5 presents the real values and the fitted values of the four GM-SVM models. As shown in Figure 5, the GM-SVM (radial) model fitted the training set better, but it predicted much lower values than the real values for the testing set. Among the GM-SVM models, the predicted values of the GM-SVM (polynomial) model were the closest to the real values, but they were still far from accurate. Therefore, the GM-SVM models were not considered good models for prediction.

GM-RF, GM-GBM, and GM-ANN Models.
We built the GM-RF model with the optimum parameters of mtry = 1 and ntree = 30 after selecting from 500 trees, the GM-GBM model with α = 0.1 and c = 0.5 by the resampling method, and the GM-ANN model with an error accuracy of 1 × 10^-8, 10,000 learning iterations, and 5 neurons. Figure 6 presents the fitted values of the GM-RF, GM-GBM, and GM-ANN models. Compared with GM-RF and GM-GBM, the GM-ANN model has the best fit, with its fitted and forecast values being closest to the real values. In addition, the GM-ANN model achieved the lowest MAPE (3.49%) and the lowest RMSE (1076.60) among all the models (Table 4).
Although the GM-RF and GM-GBM models achieved relatively low MAPE (6.99% and 8.45%) and RMSE (2090.13 and 2661.27), and their forecasts followed the general trend and were close to the real values, the fitted values of these two models were not accurate enough when compared with the real values. GBM is a machine learning technique widely used for regression and classification problems. It produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. Similar to other boosting methods, it builds the model in a stagewise fashion and generalizes them by allowing optimization of an arbitrary differentiable loss function. Both GBM and RF models perform well in big data mining, but they need enough training data to achieve good predictions. In our study, we used only 10 years of data as the training set; therefore, the models may have been underfitted, which may be the main reason for the inaccurate prediction of the testing set by these two models.
ANN is one of the main tools used in machine learning. It is composed of input and output layers, as well as a hidden layer consisting of units that transform the input into information that the output layer can use. Similar to the synapses in a biological brain, ANN is based on a collection of connected units or nodes called artificial neurons that can transmit signals from one artificial neuron to another. Although ANNs are excellent tools for finding highly complex patterns, the main issue is that neural networks are "black boxes", in which the user feeds in data and receives answers without understanding, or having access to, the exact decision-making process. This problem remains an open direction of research.
Compared with infectious diseases, occupational diseases have different pathogenesis, relatively few cases, and no obvious seasonal or periodic time series attributes. During disease monitoring, data on occupational diseases generally do not cover detailed essential information beyond the number of cases. It is difficult to build predictive models such as time series models and machine learning models with such limited information. Therefore, the grey model is the best choice for prediction with poor information, small sample sizes, uncertain systems, and a lack of data, as in the case of occupational diseases. However, in this study, the grey model did not show significant predictive power: it deviated considerably from the actual incidence, although it could simulate the general trend. Therefore, it can be concluded that a single grey model cannot predict occupational diseases accurately. To make up for this shortcoming, we used the simulation results of the grey models as the training data for the five state-of-the-art machine learning models (KNN, SVM, RF, GBM, and ANN). By comparing with the actual situation, we found that the hybrid algorithm combing models performed much better than the single grey model, with the GM-ANN model performing best and achieving the lowest MAPE (3.49%) and RMSE (1076.60).
In the field of occupational disease, there is no effective predictive method at present. The establishment of hybrid algorithm combing models provides an efficient way to predict occupational diseases appropriately. Most importantly, it provides a scientific basis for the prevention and control of occupational diseases and a theoretical basis for administrative decision making. It is a scientific method that can be adopted and applied in practical work in the future. It also provides research ideas for other related disciplines.

Conclusions
In this study, five hybrid algorithm combing models were applied to predict occupational diseases in China. The effectiveness and applicability of the models were assessed based on their ability to predict the incidence trend of occupational diseases in China. To the best of our knowledge, this is the first time that these five hybrid algorithm combing models have been used to predict occupational diseases. Through model validation and selection, we found that the GM-ANN model had the best performance and achieved the lowest mean absolute percentage error (MAPE) of 3.49% and root-mean-squared error (RMSE) of 1076.60.
Therefore, precise prediction of occupational diseases with the GM-ANN model may provide valuable information for the prevention and control of occupational diseases in China. However, further studies and validation with more data are needed before this prediction method for occupational diseases can be put into practical use.

Data Availability
The data used to support the findings of this study were obtained from the National Health Commission of the People's Republic of China and are included within the article.

Conflicts of Interest
The authors declare no conflicts of interest.

Supplementary Materials
grey model (EGM), original difference grey model (ODGM), even difference grey model (EDGM), discrete grey model (DGM), and Verhulst, whose fitted values from the occupational disease data were used as training data to train the machine learning models. They include five state-of-the-art models: K-Nearest Neighbor (KNN), Support Vector Regression (SVM), Random Forest (RF), Gradient Boosting Machine (GBM), and Artificial Neural Network (ANN). To the best of our knowledge, this is the first time that these five hybrid algorithm combing models have been used to predict occupational diseases. The effectiveness and applicability of the models were assessed based on their ability to predict the incidence of occupational diseases in China.