Good times bad times: Automated forecasting of seasonal cryptosporidiosis in Ontario using machine learning

Background: The rise of big data and related predictive modelling based on machine learning algorithms over the last two decades has provided new opportunities for disease surveillance and public health preparedness. Big data come with the promise of faster generation of and access to more precise information, potentially facilitating predictive precision in public health (“precision public health”). As an example, we considered forecasting of the future course of the monthly cryptosporidiosis incidence in Ontario.
Methods: The traditional statistical approach to forecasting is the seasonal autoregressive integrated moving-average (SARIMA) model. We applied SARIMA and an artificial neural network (ANN) approach, specifically a feed-forward neural network, to predict monthly cryptosporidiosis incidence in Ontario in 2017 using 2005–2016 data as a training set. Both forecasting approaches were automated to make them relevant in a disease surveillance context. We compared the resulting forecasts using the root mean squared error (RMSE) and mean absolute error (MAE) as measures of predictive accuracy.
Results: Cryptosporidiosis is a seasonal disease, which peaks in Ontario in late summer. In this study, the SARIMA model and ANN forecasting approaches captured the seasonal pattern of cryptosporidiosis well. Contrary to similar studies reported in the literature, the ANN forecasts of cryptosporidiosis were slightly less accurate than the SARIMA model forecasts.
Conclusion: The ANN and SARIMA approaches are suitable for automated forecasting of public health time series data from surveillance systems. Future studies should employ additional algorithms (e.g. random forests) and assess accuracy by using alternative diseases for case studies and conducting rigorous simulation studies. Differences between the forecasts from the machine learning algorithm (the ANN) and the statistical learning model (the SARIMA) should be considered with respect to philosophical differences between the two approaches.


Introduction
Cryptosporidiosis is a potentially lethal diarrheal disease that affects humans and animals. It is caused by the protozoan parasite Cryptosporidium spp. (1). Some 20 of the known 26 species have been associated with human infections (2). The majority of human infections are caused by C. hominis and C. parvum, which are mostly related to anthropogenic and zoonotic transmissions, respectively (3). The main infection route for humans is through consumption (including while swimming) of water contaminated with the parasites' oocysts.
Cryptosporidiosis is often asymptomatic but can result in mild-to-severe gastrointestinal disease and even mortality. Human infection prevalence in North America ranges between 1% and 4% annually, but can be up to 20% elsewhere (4). While cryptosporidiosis is likely underreported, it is known to occur more frequently in children and immunocompromised people. No prophylactic treatment is available, making public health preparedness based on surveillance an important preventive option.
New opportunities for statistics, epidemiology and disease surveillance in public health have emerged over the last two decades since the advent of big data (5,6). Eysenbach introduced the term "infodemiology" for the use of big data (and specifically social media use and behaviour data) in health surveillance (7). A prominent example of infodemiology is the Google Flu Trends project, which predicted regional outbreaks of influenza 7 to 10 days ahead of conventional surveillance methods by the Centers for Disease Control and Prevention (CDC) but grossly overestimated influenza prevalence (8). That project illustrates both the opportunities and the risks of big data; the latter have been termed "big data hubris" (8,9).
Big data are often characterized by the five V's: volume, variety, velocity, veracity and value (9). Big data hubris refers to the veracity or truthfulness of the data. The promise of big data is that vast amounts of data (volume) of different types and from different sources (variety) provide a more complete and precise representation of reality, hence leading to "precision public health" (10). However, when the data are not representative of the population of interest, predictive inferences are biased.
Disease surveillance results in a big data situation due to data velocity and volume: data are constantly updated and growing in size. The dynamic nature of disease surveillance data requires an automated approach to analysis and forecasting. The traditional statistical time series modelling approach is the seasonal autoregressive integrated moving-average model (SARIMA) proposed by Box and Jenkins (11). A widely used machine learning algorithm for time series forecasting is the (feed-forward) artificial neural network (ANN) (12). We applied both forecasting approaches to predict monthly cryptosporidiosis incidence in Ontario in 2017 using 2005-2016 data as a training set. We compared these forecasting approaches using the 2017 incidence as test data, with the root mean squared prediction error (RMSE) and the mean absolute prediction error (MAE) as measures of accuracy.
Similar comparisons have been reported in the literature. Zhang and Qi (13) compared SARIMA and ANN using simulations and showed that the ANN is consistently better at forecasting than the SARIMA model, when data are appropriately preprocessed. Kane et al. (14) compared forecasts of avian influenza H5N1 outbreaks by the SARIMA model to those from the random forest algorithm and concluded that machine learning provides enhanced predictive ability over the time series modelling. Similarly, in a study of typhoid fever incidence in China, Zhang et al. compared SARIMA modelling to three different ANN architectures; the researchers concluded that all three neural network algorithms outperform the statistical model (15).
The goal of this study was to compare the two approaches to automating forecasting of monthly incidence rates of cryptosporidiosis in Ontario for the year 2017. The specific objectives were (1) to compare the accuracy of forecasts using the RMSE; (2) to compare forecasts using the MAE; and (3) to visually compare the forecasted incidence rates to the observed time series.

Methods
The data we used were monthly incidence counts of cryptosporidiosis in Ontario for the years 2005 to 2017 as reported to Public Health Ontario and available from the agency's website (16). For analysis, we split the dataset into training data (monthly incidences in 2005 to 2016) and test data (monthly incidences in 2017).
For exploration purposes, we reported ranges of annual and monthly mean incidence in the training data and inspected the data with the seasonal and trend decomposition using Loess (STL) method (17). The seasonal component was assumed to be time invariant or periodic, while the trend component was found using a moving window of length 73 months, or six years plus one month.
A SARIMA model (11) is a data-generating model that includes seasonal and trend components. It is used to describe autocorrelations within a time series and to predict future values. It is described by the order of filters applied to remove seasonal and trend components as well as by the order of lagged correlations in the filtered series. The filtered series is assumed to be stationary and Gaussian. In brief, the model is written as SARIMA(p,d,q)(P,D,Q)_S, where S denotes the length of the season (here 12 months), and d and D denote the orders of the nonseasonal and seasonal difference filters that remove trend and seasonal components, respectively. Furthermore, p and P are the orders of the nonseasonal and seasonal autoregressive parameters, respectively. Finally, q and Q denote the nonseasonal and seasonal orders of the moving-average parameters. The SARIMA modelling approach was automated by using maximum likelihood estimation and stepwise backward model selection with the Bayesian information criterion (BIC). The SARIMA model fit to the 2005-2016 training data was then used to forecast monthly incidences for the 2017 test data.
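For reference, the order notation above corresponds to the standard multiplicative SARIMA equation, written here with the backshift operator B (so that B y_t = y_{t-1}). The symbols y_t, ε_t, φ, Φ, θ and Θ are notation assumed for this sketch, since the text gives only the orders, and a possible constant term is omitted.

```latex
\phi_p(B)\,\Phi_P(B^{S})\,(1-B)^{d}\,(1-B^{S})^{D}\,y_t
  = \theta_q(B)\,\Theta_Q(B^{S})\,\varepsilon_t,
\qquad \varepsilon_t \sim \mathcal{N}(0,\sigma^{2})
```

Here φ_p and θ_q are polynomials in B of degree p and q, while Φ_P and Θ_Q are polynomials in B^S of degree P and Q. For monthly data with S = 12, a single seasonal difference (D = 1) replaces y_t with y_t − y_{t−12}.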
The ANN is a data-driven and automated algorithm for forecasting time series data. While a variety of ANN architectures exist (18,19), we applied the staple feed-forward multilayer neural network with a single hidden layer in this study (12). More specifically, the ANN is described as ANN(p,P,k)_S, where p, P and S have the same meanings as in the SARIMA model, and k denotes the number of nodes in the hidden layer. Automatic selection of the ANN's order values was as follows: S=12 is known; k was the rounded value of (p+P+1)/2, where P was set to P=1 to accommodate linear seasonality; and p was selected as the optimal order of an autoregressive model fit to the remainder term of the STL-decomposed series.
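The order rule above can be sketched in a few lines of Python. The function name `ann_order` is hypothetical, and p is taken as given rather than selected from an autoregressive fit. The text does not say how ties are rounded; Python's built-in `round` rounds halves to the nearest even integer, which is an assumption of this sketch.

```python
def ann_order(p, P=1, S=12):
    """Return (input lags, hidden nodes k) for an ANN(p, P, k)_S.

    Illustrative helper only; p is supplied by the caller here.
    """
    # Nonseasonal inputs: the last p observations (lags 1..p).
    nonseasonal = list(range(1, p + 1))
    # Seasonal inputs: lags S, 2S, ..., P*S.
    seasonal = [S * j for j in range(1, P + 1)]
    lags = sorted(set(nonseasonal + seasonal))
    # Hidden-layer size: rounded value of (p + P + 1) / 2.
    # Assumption: ties are rounded half-to-even (Python's round()).
    k = round((p + P + 1) / 2)
    return lags, k


lags, k = ann_order(p=11)
print(lags)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
print(k)     # 6
```

With p = 11 and P = 1, the rule gives (11 + 1 + 1)/2 = 6.5, which rounds to 6 under this tie rule.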
We applied the ANN algorithm as follows: linear combinations of input data were subjected to the nonlinear sigmoid activation function 1/(1+exp(−z)) as output from a hidden layer, and the output from the hidden layer was then aggregated in the form of linear combinations, which resulted in the final output. The ANN was trained using 100 repetitions, that is, 100 different random starting values for the weight parameters of the linear combinations between the input and hidden layers as well as the hidden and output layers. Furthermore, the input series (i.e. the 2005-2016 data) was preprocessed using an automatic selection of the Box-Cox transformation parameter (by the Guerrero method (12)) followed by studentizing (i.e. centring and scaling). For each repetition, the algorithm was trained by an iterative process of optimizing a loss function. The resulting set of forecasts, or ensemble, was averaged over all 100 repetitions.
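The forward pass and the ensemble averaging described above can be illustrated with a minimal sketch. It assumes one bias term per node and no direct input-to-output connections. The random weights in `random_network` are hypothetical stand-ins for fitted weights: no training is performed here, and all function names are illustrative.

```python
import math
import random


def sigmoid(z):
    """Logistic activation 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One pass through a single-hidden-layer feed-forward network."""
    # Hidden layer: sigmoid of a linear combination of the inputs.
    hidden = [
        sigmoid(bias + sum(w * x for w, x in zip(weights, inputs)))
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # Output: linear combination of the hidden-layer outputs.
    return output_bias + sum(v * h for v, h in zip(output_weights, hidden))


def random_network(n_inputs, n_hidden, rng):
    """Hypothetical stand-in for one repetition: random, untrained weights."""
    hidden_weights = [
        [rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)
    ]
    hidden_biases = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    output_weights = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    output_bias = rng.uniform(-1, 1)
    return hidden_weights, hidden_biases, output_weights, output_bias


def ensemble_forecast(inputs, networks):
    """Average the one-step forecasts of all networks in the ensemble."""
    return sum(forward(inputs, *net) for net in networks) / len(networks)


rng = random.Random(1)
networks = [random_network(12, 6, rng) for _ in range(100)]  # 100 repetitions
inputs = [rng.gauss(0.0, 1.0) for _ in range(12)]            # made-up scaled lags
print(ensemble_forecast(inputs, networks))
```

In a real application each network would first be trained on the preprocessed series, and the averaged output would be mapped back to the original scale by reversing the scaling and the Box-Cox transformation.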
Both forecasting approaches provide prediction intervals. The SARIMA prediction interval was based on the estimated model parameters. The ANN prediction interval was based on 1,000 bootstrapped sample paths (12), that is, using resampled past residuals. In addition, both forecasting approaches were compared by their accuracy measures (RMSE and MAE) for the monthly forecasts and the observed test data of the year 2017.
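The two accuracy measures can be computed as in the following sketch. The three observed and forecast values are made-up numbers for illustration only, not the study data.

```python
import math


def rmse(observed, forecast):
    """Root mean squared error between observations and forecasts."""
    if len(observed) != len(forecast):
        raise ValueError("series must have the same length")
    squared = [(o - f) ** 2 for o, f in zip(observed, forecast)]
    return math.sqrt(sum(squared) / len(squared))


def mae(observed, forecast):
    """Mean absolute error between observations and forecasts."""
    if len(observed) != len(forecast):
        raise ValueError("series must have the same length")
    absolute = [abs(o - f) for o, f in zip(observed, forecast)]
    return sum(absolute) / len(absolute)


# Hypothetical monthly counts (not from the article).
observed = [10, 12, 20]
forecast = [12, 12, 17]
print(round(rmse(observed, forecast), 3))  # 2.082
print(round(mae(observed, forecast), 3))   # 1.667
```

Both measures are expressed in the units of the series (here, cases per month). The RMSE penalizes large individual errors more heavily than the MAE and is never smaller than it.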

Results
The automatically selected ANN is of order ANN(11,1,6)_12, that is, the last 11 observations plus the first seasonal observation are linearly combined into six nodes of a single hidden layer. The input series was Box-Cox transformed with an automatically chosen parameter λ=−0.21. The forecasts from the ANN are visualized together with 80% and 95% prediction intervals in Figure 4.
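To make the reported transformation concrete, the sketch below applies the standard Box-Cox formula and its inverse with λ = −0.21. It assumes strictly positive inputs, and the function names are illustrative. The automatic (Guerrero) selection of λ is not reproduced here, and the value 25 is a made-up count.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a strictly positive value y."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0 in this sketch")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def inv_box_cox(z, lam):
    """Back-transform a Box-Cox value z to the original scale."""
    if lam == 0:
        return math.exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)


lam = -0.21
y = 25.0                      # hypothetical monthly count
z = box_cox(y, lam)           # value on the transformed scale
y_back = inv_box_cox(z, lam)  # recovers y up to floating-point error
print(abs(y_back - y) < 1e-9)  # True
```

A negative λ compresses large counts more strongly than a logarithm would, which dampens the seasonal peaks before the series is centred and scaled.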
The observed monthly incidences and rounded forecasts are presented in Table 1 and Figure 5 for both models. Table 2 shows the summaries of the RMSE and MAE from the 2017 forecasts for both approaches.

Discussion
The monthly cryptosporidiosis incidence in Ontario is characterized by a dominant seasonal pattern that generally peaks in August. The short peak in incidence may support the concept of human behaviour as a main driver for infection since environmental conditions (e.g. ambient temperature) do not vary in a pattern similar to the incidence. No increasing trend was identified, meaning that the incidence is not emerging.
Neither the machine learning algorithm (i.e. the ANN) nor the statistical learning method (i.e. SARIMA) was found to have a superior performance in predicting monthly cryptosporidiosis incidence. While the ANN forecasts were closer to the observations for six months, the SARIMA performed better for a different group of five months; both methods were tied for the month of September 2017 (see Table 1). However, the accuracy measures RMSE and MAE indicate a slight advantage for the SARIMA forecasts: the ANN's RMSE and MAE were higher by 0.9 and 0.7 units, respectively (see Table 2).
This slight advantage for the SARIMA is interpreted as follows: the SARIMA forecasts are, on average, almost one case per month more accurate than the ANN forecasts. Although this result is unexpected with respect to the cited reports (13–15), it is in line with a systematic review (22) that found no evidence for more accurate predictions from machine learning alternatives to statistical logistic regression modelling. However, it should be noted that this is a case study and results are specific to this example. While the SARIMA model assumes white noise residuals and an additive seasonal component, these assumptions were not checked here under the automated modelling approach. Similarly, the ANN is optimized using backpropagation, which is known to have difficulties finding the optimal parameter estimates (19). Therefore, the ANN employs ensemble forecasting to guard against individual erroneous forecasts.
Proper data preprocessing is important for machine learning algorithms (23). This means a time series needs to be scaled and centred (i.e. studentized or normalized) prior to analysis. Data preprocessing is a natural part of the autoregressive integrated moving-average modelling approach, as trend and seasonal effects are filtered out before the model is fitted to the time series. In our study, stepwise model selection led to filtering out a seasonal effect, but a trend effect was neither identified nor removed. The input series for the ANN was preprocessed by a Box-Cox transformation, followed by centring and scaling.
Big data analysis is often presented together with machine learning algorithms for inference, that is, predictive modelling. The reason for doing so might originate from the impression that traditional statistical methods are inappropriate for the challenges of big data. For example, the variety of data expressed by the number of covariates could render traditional statistical inference less attractive and impractical. On the other hand, machine learning algorithms are designed around modern statistical methods for dimension reduction and regularization (e.g. Lasso regression). The training of machine learning algorithms is what is otherwise known as parameter estimation in statistical modelling and is no different from statistical learning methods, being based on cross-validation and bootstrapping.
In summary, to a certain degree statistical learning and machine learning do not differ. However, in public health, applications of big data analysis, namely predictive modelling including time series forecasting, differ from traditional biostatistical data analysis in terms of risk factor identification and assessment. Breiman distinguished this as "the two cultures" of statistical modelling: the data modelling culture and the algorithmic modelling culture (24). He argued that statistical theory is irrelevant if modelling assumptions are not met in real-data situations. However, he also admitted that machine learning algorithms are often based on little theory, and modelling assumptions are replaced by properties of the algorithms, that is, whether these converge and deliver good predictions.
From a philosophical point of view, machine learning is based on a "black box" that is not open to interpretations or explanations. In the current example, the ANN(11,1,6)_12 algorithm included a nonlinear combination of the time series data and 85 parameters (23). On the other hand, the SARIMA model describes how past observations affect the future course of a process; this characteristic might suggest causal hypotheses (25). Therefore, it is not entirely correct to simply compare the forecasting methods by their predicted values or accuracy measures as the approaches are philosophically different and not entirely comparable: the ANN is a predictive algorithm, while the SARIMA is a descriptive and predictive model.
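The figure of 85 parameters can be reproduced with a short count. The sketch assumes one bias term per hidden and output node and no direct input-to-output connections; the function name is illustrative.

```python
def count_parameters(n_inputs, n_hidden, n_outputs=1):
    """Number of weights and biases in a single-hidden-layer network."""
    # Each hidden node has one weight per input plus a bias.
    hidden_layer = (n_inputs + 1) * n_hidden
    # Each output node has one weight per hidden node plus a bias.
    output_layer = (n_hidden + 1) * n_outputs
    return hidden_layer + output_layer


# ANN(11,1,6)_12: 11 nonseasonal lags + 1 seasonal lag = 12 inputs, 6 hidden nodes.
print(count_parameters(12, 6))  # 85
```

Under these assumptions, the count is (12 + 1) × 6 = 78 parameters into the hidden layer plus 6 + 1 = 7 into the output, giving 85 in total.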

Limitations
A limitation of this study is the lack of adjustment for the population at risk. Indeed, the Ontario population is steadily increasing, but at an annual rate below 0.5%, which is negligible in this context, where underreporting is of greater concern. No trend in the monthly cryptosporidiosis incidence rates was indicated by either the SARIMA or the ANN approach.

Conclusion
Cryptosporidiosis is a strongly seasonal disease, leading to good times and bad times of varying caseloads for public health. Machine learning methods suitable for forecasting of public health time series data from surveillance systems are becoming more popular; they have been demonstrated to be more accurate than traditional statistical methods. However, in this particular case study, the SARIMA model resulted in slightly lower RMSE and MAE and thus greater accuracy than the ANN. Both forecasting approaches captured the seasonal pattern of cryptosporidiosis well.

IMPLEMENTATION SCIENCE
Future studies should employ additional algorithms (e.g. random forests) and assess accuracy in different settings, either by using alternative diseases for case studies or by employing a more systematic approach and conducting simulation studies.