Agglomerative Clustering of Enteric Infections and Weather Parameters to Identify Seasonal Outbreaks in Cold Climates

The utility of agglomerative clustering methods for understanding dynamic systems that do not have a well-defined periodic structure has not yet been explored. We propose using this approach to examine the association between disease and weather parameters, to compliment the traditional harmonic regression models, and to determine specific meteorological conditions favoring high disease incidence. We utilized daily records on reported salmonellosis and non-specific enteritis, and four meteorological parameters (ambient temperature, dew point, humidity, and barometric pressure) in Barnaul, Russia in 2004–2011, maintained by the CliWaDIn database. The data structure was examined using the t-distributed stochastic neighbor embedding (t-SNE) method. The optimal number of clusters was selected based on Ward distance using the silhouette metric. The selected clusters were assessed with respect to their density and homogeneity. We detected that a well-defined cluster with high counts of salmonellosis occurred during warm summer days and unseasonably warm days in spring. We also detected a cluster with high counts of non-specific enteritis that occurred during unusually “very warm” winter days. The main advantage offered by the proposed technique is its ability to create a composite of meteorological conditions—a rule of thumb—to detect days favoring infectious outbreaks for a given location. These findings have major implications for understanding potential health impacts of climate change.


Introduction
In health-related fields, including life sciences and epidemiology, the relationship between weather patterns and diseases represents a well-documented example of temporal interactions with highly anticipated predictive capacity [1,2]. Enteric infections caused by Salmonella, Cryptosporidium, Giardia, and many other pathogens are well known for their strong associations with ambient temperature and precipitation [3][4][5][6]. These infections are notorious for seasonal outbreaks which are difficult to track and control. The link between meteorological factors and water-and foodborne infections has been an important subject of biomedical research aiming to identify proxies to predict potential outbreaks and ultimately to develop strategies that minimize risks [7,8]. In temperate climates, bacterial enteric infections such as salmonellosis have been shown to peak in summer months due to an increased probability of exposure and transmission [6,9]. However, the temporal patterns of unspecified infections (i.e., infections for which the etiological cause is not determined) could vary substantially, making disease forecasting difficult. As we demonstrated in our previous work [3], the seasonal increases in unspecified infections matched to the peaks of well-established infections can provide insights into the etiology, risk factors, and transmission patterns of unspecified infections, and help recommend strategies to fine tune disease diagnostics and forecasting.
Harmonic regression models (HRMs) adapted to time-referenced data are commonly used to describe temporal patterns in epidemiology of infectious diseases, focusing on defining trends, seasonal peaks, effects of social events, and other cyclic behaviors. This class of models has also been applied to determine the associations between the incidence of infection and weather parameters, assuming a specific form of relationship is known or at least correctly specified [6,10,11]. HRMs show the best performance in explaining the relationship between health outcomes and weather parameters when the correlation structure among predictors is weak, which is rare in temperate climates. HRMs assume a well-defined periodic structure of the data, which may not be warranted for some diseases and weather parameters. With a continuous change in climatic conditions and extreme weather profiles, rapid and unusual aperiodic swings in the weather have been observed worldwide [12,13]. Climate change affects local weather, influences the dynamics of infectious pathogen transmission, and ultimately triggers changes in infectious diseases morbidity and mortality. In these dynamic situations, the descriptive, explanatory and predictive capacities of traditional HRMs are limited and alternative solutions need to be explored.
Methods involving clustering of time-series data are expanding in many scientific fields due to a potential versatility in creating a complete set of nested cluster solutions [14][15][16][17]. These methods aim to extract valuable information from complex datasets and to discover temporal patterns, adding new tools to an array of classic statistical approaches operating on time or frequency domains. Hierarchical clustering is gaining popularity in many fields of life sciences as an effective technique for identifying distinct homogeneous clusters [14,15]. These methods may reduce the burden of initial assumptions on the underlying data structure. Hierarchical clustering involves creating clusters that have a predetermined ordering from top to bottom, which makes them suitable for building taxonomies. In agglomerative clustering, each observation is assigned to its own cluster, and then the two most similar clusters are merged based on a similarity metric (e.g., distance) between each of the clusters. The process is repeated until there is only a single cluster left. Each of the clusters can be further explored to characterize their typical features, to build a classification system, and to discover anomalies. Clustering of records in a single time series illustrates characteristic disease patterns, like seasonal oscillations. Clustering of records in multiple time series (e.g., disease, temperature, and precipitation) may allow to determine groups or variables with similar temporal patterns. If the patterns are synchronized in some specific way, it might be possible to detect the features that have a capacity to predict high or low disease incidence. These features of hierarchical clustering make them suitable for exploring the association between disease and weather parameters, especially when studying disease dynamics affected by climate change.
In the presented study, we propose a hierarchical agglomerative approach for cluster identification in multiple time series. We demonstrate its potential to identify infectious outbreaks based on environmental predictors, specifically a relevant set of weather conditions. We considered enteric infections transmitted by food and water (salmonellosis and bacterial intestinal infection) and metrological factors, such as atmospheric temperature, barometric pressure, relative humidity, and dew point in Barnaul, Russia, in 2004-2011. The daily records for infections and weather parameters were maintained by the Climate-Water-Diseases-Infections (CliWaDIn 1.0) database [8,18]. The preliminary analysis and visualization of the data structure was performed using the t-distributed stochastic neighbor embedding (t-SNE) method. The optimal number of clusters was selected based on Ward distance using the silhouette metric. The selected clusters were further assessed with respect to their density and homogeneity. We clustered the time series data into groups of days with similar values of meteorological factors and disease counts. By presenting this goal as an unsupervised machine learning task, we identified time periods when meteorological conditions "favor" infections. The main advantage offered by the proposed technique is its ability to create a composite of meteorological conditions-a rule of thumb-to detect days with "unseasonable weather" favoring infectious outbreaks for a given location.

Data and Methods
The study utilizes meteorological records of the public weather server "The Weather of Russia" (http://meteo.infospace.ru/) and passive surveillance records curated and maintained by the CliWaDIn version 1.0 database [8]. The database contains records of enteric waterborne infections for six cities in Russia (Novosibirsk, Barnaul, Ekaterinburg, Chelyabinsk, Krasnoyarsk, Vladivostok). In this analysis, we used data from the city of Barnaul from 1 January 2004 to 31 December 2011. Barnaul, the administrative center of Altai Krai, Russia, is located on the left bank of the Ob River at the confluence with the Barnaulka River (Latitude: 53 • 21 38.02" N; Longitude: 83 • 45 49.00" E), in the forest steppe zone of the West Siberian Plain. As of the 2010 Census, the city of Barnaul had 612,401 residents.

Health Outcomes
Daily counts of infections, recorded as "other specified salmonella infections" and "Bacterial intestinal infection or unspecified bacterial enteritis" and defined by International Classification of Diseases (ICD-10) as A02.8 and A04.9, are referred thereafter as salmonellosis (1901 cases) and enteritis (21,019 cases), respectively. Figure 1 shows the original time series of counts, illustrating the temporal dynamics of infections, and the histograms, reflecting the distributional shape of the temporal process. meteorological conditions-a rule of thumb-to detect days with "unseasonable weather" favoring infectious outbreaks for a given location.

Data and Methods
The study utilizes meteorological records of the public weather server "The Weather of Russia" (http://meteo.infospace.ru/) and passive surveillance records curated and maintained by the CliWaDIn version 1.0 database [8]. The database contains records of enteric waterborne infections for six cities in Russia (Novosibirsk, Barnaul, Ekaterinburg, Chelyabinsk, Krasnoyarsk, Vladivostok). In this analysis, we used data from the city of Barnaul from 1 January 2004 to 31 December 2011. Barnaul, the administrative center of Altai Krai, Russia, is located on the left bank of the Ob River at the confluence with the Barnaulka River (Latitude: 53° 21' 38.02'' N; Longitude: 83° 45' 49.00'' E), in the forest steppe zone of the West Siberian Plain. As of the 2010 Census, the city of Barnaul had 612,401 residents.

Health Outcomes
Daily counts of infections, recorded as "other specified salmonella infections" and "Bacterial intestinal infection or unspecified bacterial enteritis" and defined by International Classification of Diseases (ICD-10) as A02.8 and A04.9, are referred thereafter as salmonellosis (1901 cases) and enteritis (21,019 cases), respectively. Figure 1 shows the original time series of counts, illustrating the temporal dynamics of infections, and the histograms, reflecting the distributional shape of the temporal process. Enteritis has a higher mean incidence rate (7.38 vs. 0.67 cases per day) as compared to salmonellosis. As shown in the histogram (Figure 1a), the first bar indicates a high proportion of days (58%) with no cases of salmonellosis. The plots and descriptive statistics show that salmonellosis has prolonged periods with no cases and periodic outbreaks of infection in summer months, whereas the daily counts of enteritis fluctuate around the mean with loosely defined seasonality and occasional spikes of high counts. The daily counts are presented in Supplemental Table S1.

Meteorological Records
The study region has a continental relatively dry climate (Köppen Dfb) with a wide range of temperature extremes from below −45°C (−49°F) in the winter (defined as three consecutive months: Enteritis has a higher mean incidence rate (7.38 vs. 0.67 cases per day) as compared to salmonellosis. As shown in the histogram (Figure 1a), the first bar indicates a high proportion of days (58%) with no cases of salmonellosis. The plots and descriptive statistics show that salmonellosis has prolonged periods with no cases and periodic outbreaks of infection in summer months, whereas the daily counts of enteritis fluctuate around the mean with loosely defined seasonality and occasional spikes of high counts. The daily counts are presented in Supplemental Table S1.

Meteorological Records
The study region has a continental relatively dry climate (Köppen Dfb) with a wide range of temperature extremes from below −45 • C (−49 • F) in the winter (defined as three consecutive months: December, January and February) to above +35 • C (95 • F) in the summer (defined as three consecutive months: June, July and August). We abstracted the average daily values of temperature ( • C), atmospheric pressure (hPa), relative humidity (%) and dew point ( • C) for 2848 days of the study period from 1 January 2004 to 15 November 2011 (Supplemental Table S1). The proportion of missing records is low (1% or 28 days out of 2849 days) with 2 weeks of missing data in late February and March of 2009. The time series of weather parameters have pronounced seasonal variations with no clearly marked trend. Table 1 contains summary statistics for health outcomes and meteorological data. The time series of daily averages for weather parameters and their distributions are shown in Figure 2.

Agglomerative Clustering
The construction of hierarchical clustering can be performed in a divisive or in an agglomerative manner. In divisive clustering, all the observations are assigned to a single cluster, which is then partitioned into two least-similar clusters. The process is repeated recursively on each cluster until there is one cluster for each observation. Agglomerative clustering offers a viable alternative to the k-means algorithm, another popular classification approach. In agglomerative clustering, clusters are created by linking each data point to its nearest neighbor using a distance measure, as opposed to creating a cluster based on centroids as for the k-means algorithm. The number of clusters in agglomerative clustering is determined by reiterating the algorithm with different linkage methods, so all the available points converge into a rapidly diminishing number of clusters until they all comprise a single group. While agglomerative clustering algorithms are more cumbersome and computationally demanding than the k-means methods, they create a complete range of nested cluster solutions. This feature is especially attractive when we need to classify finite series of events that occurred under specific conditions. For example, in epidemiologic investigations we often need to define conditions that are favorable for disease transmission or harmful exposures.
We denote the daily time series of disease counts, z 1t and z 2t , for salmonellosis and enteritis, respectively, and a set of time series of weather parameters, w mt , where m = {1, . . . ,4} indicate a parameter, t = {1, . . . ,T} is a study day, and T is the effective length of time series, T = 2848 days. As illustrated in Figures 1 and 2, z 1t and z 2t are assumed to be derived from a Poisson distribution with parameters λ l and λ 2 , respectively; and w mt , are assumed to be derived from a multivariate normal distribution, N(µ, Σ).
The task of clustering the time series of disease counts and weather parameters consisted of multiple steps. First, we consider a time series as a sample X T = {x 1 , . . . , x T }, where an object x i is a certain day with the centered daily values of disease counts and meteorological parameters, say x i = {z 1i , w 1i , w 2i , w 3i , w 4i }. To each day x i ∈X T , we assign a label, or a cluster identifier, of class y i . Next, to describe and visualize the multidimensional data, we examine the clustering structure using the t-Distributed Stochastic Neighbor Embedding (t-SNE) method [19], which relies on the conditional probabilities, as follow (Equation (1)): The conditional probability | conveys how close the values of day xj are to those of day xi, given a normal distribution centered on xi with variance σi 2 [19]. The variance σi 2 of each day was selected considering the perplexia estimate, such that the days with the most common values of disease counts and weather parameters, or in other words days with higher density, have a smaller variance [19]. For days with similar disease counts and meteorological parameters, | should be relatively high, whereas for days with very different counts and meteorological parameters, | should be almost infinitesimal.
To perform the agglomerative clustering, we split the segments of the time series into disjoint clusters starting with T clusters of size 1 and continue until all the days are included into one cluster with the size of T. In the first step of the algorithm, T-1 clusters are formed, one of size two and the remaining of size 1. The rule to join two days into one cluster is based on a Ward distance metric, R [20] aiming to minimize the R(C, H) value. Thus, at each step of the algorithm we combine clusters with the minimum distance value. The algorithm stops when all sample units are combined into a The conditional probability p j|i conveys how close the values of day x j are to those of day x i , given a normal distribution centered on x i with variance σ i 2 [19]. The variance σ i 2 of each day was selected considering the perplexia estimate, such that the days with the most common values of disease counts and weather parameters, or in other words days with higher density, have a smaller variance [19]. For days with similar disease counts and meteorological parameters, p j|i should be relatively high, whereas for days with very different counts and meteorological parameters, p j|i should be almost infinitesimal.
To perform the agglomerative clustering, we split the segments of the time series into disjoint clusters starting with T clusters of size 1 and continue until all the days are included into one cluster with the size of T. In the first step of the algorithm, T-1 clusters are formed, one of size two and the remaining of size 1. The rule to join two days into one cluster is based on a Ward distance metric, R [20] aiming to minimize the R(C, H) value. Thus, at each step of the algorithm we combine clusters with the minimum distance value. The algorithm stops when all sample units are combined into a single large cluster of size T. At each step, Ward distances between clusters were calculated, as follow (Equation (2)): where C and H are clusters; ρ 2 is the Euclidian distance between the geometric centers G C and G H of the clusters C and H; and |C| and |H| are the numbers of objects (or days) in the clusters C and H, respectively. Ward's distance was chosen because it has tensile properties and shows good results in preliminary experiments on clustering [20][21][22]. These properties were ensured by applying normalization and standardization to each variable before the clustering procedure. In addition, the standard procedure implemented in the adapted package used PCA to transform data (the forward transformation-before clustering and reverse transformation-after clustering) to minimize the correlation across the features. Next, to assess the quality of clustering and to choose the number of clusters, we maximized the silhouette metric [20] (Equation (3)): where ρ 2 is the Euclidian distance from a given object to objects within the same cluster, and ρ 2 is the Euclidian distance from a given object to objects from the nearest cluster (different from the own cluster). The silhouette metric is calculated separately for each object x i of the sample (i.e., each day of the time series). In the context of a daily time series, "distances from one day to another" can be interpreted as similarities among days based on disease counts and the averages of meteorological factors. Thus, the silhouette metric for time series data can be interpreted as a similarity profile of time series segments (clusters).
To determine the number of clusters, we used computational experiments to calculate the silhouette metric with the number of clusters ranging from 4 to 15. The lower limit of four clusters captures the four calendar seasons characteristic for cold temperate climates. The upper limit of 15 clusters was chosen to warrant at least two major seasons over the study period. For each experiment, we calculated the silhouette metric and chose the optimal number of clusters according to the maximum value of the silhouette metric.
Statistical characteristics (mean, standard deviation, interquartile range, and kurtosis) were calculated for all clusters. We used the values of cluster interquartile range (IR) and kurtosis as metrics of clustering quality, with an emphasis on clustering density and homogeneity. The interquartile range, a robust analogue of the variance, was used to indicate clustering density or diffusion (the higher the IR the more diffuse is the cluster). We used kurtosis as a marker of homogeneity. Thus, the clusters with smaller values of both metrics are preferred.

Log-Linear and Harmonic Regression Models
To ease the interpretation of clustering results we also performed the traditional negative binomial harmonic regression and estimated seasonal peak timing for each investigated health outcome [5,6]. The basic Model A was formulated as follows (Equation (4)): where z t is the daily counts for z 1t and z 2t , e.g., salmonellosis and enteritis, respectively; seasonality in disease cases was assessed based on the significance of the two harmonic terms with t as consecutive days of the study period, and ω = 1/365.25. Estimates obtained from fitting Model A can be used to compute phase shift φ based on join signs ofβ s andβ c : where k = 0 when bothβ s andβ c are positive, k = 2π whenβ s < 0 andβ c >0, and k = π otherwise. Peak timing is calculated by multiplying the phase shift by 365.25/2π. Using the δ-method [6], variance (Var) and standard deviation σ of theφ estimates can be estimated as: where σ 2 β s and σ 2 β c are the variances of the estimates of β s and β c , respectively, and σ β s β c is the covariance. We then explored the sensitivity of the peak timing estimates and repeated the modeling by sequentially adding the first order autocorrelation term (Model A1) and indicator variables for day of the week (Model A2).
We examined the potential form of associations among daily values of health outcomes and all predictors using loess smoother with the span parameter of 0.75 and estimated pair-wise Spearman correlation coefficients. We then fitted a standard generalized negative binomial regression to quantify the association between daily counts of the health outcome, z t , and weather parameters, w jt (Model B): where β m is the regression coefficient for the m-weather parameter. Since ambient temperature, pressure, and dew point values are highly correlated, we ran Model B with each weather parameter separately. Relative risks (RR) associated with a 10 unit of change in each weather parameter and their 95th confidence intervals (CI 95% ) are estimated as RR j = exp{10 β m } and CI 95% = exp{10(β m ± 1.96σ β m )}, respectively. The Model B was then further explored with respect to its stability by adding sequentially the first order autocorrelation term, seasonal harmonics, and indicator variables for day of the week.
The following software was used in the experiments: programming language Python 2.7 and Scikit-learn framework version 0.18 (http://scikit-learn.org/) to implement t-SNE and hierarchical clustering methods; R version 3.5.1 (R Core Team, Vienna, Austria) and R-Studio version 1.1.463 (RStudio, Boston, MA, USA) to run harmonic and log-linear modeling.

Results
Results of the t-SNE method showed the preliminary cluster structure ( Figure 3) in a multidimensional space projected on two dimensionless axes. These axes reflect the clustering distances, or normalized composites of all records from the multiple time series, forming a distance matrix that allows us to identify individual days that form a similar pattern based on the weather parameters and health outcome values. Colors represent health outcome counts from 0 to 12 for salmonellosis and 0 to 27 for enteritis. We considered the results of t-SNE visualization as a supporting tool for preliminary assessment of clustering and of the number of clusters in the original time series. Visually, the time series of salmonellosis counts may contain from 8 to 12 clusters. The clusters of days with nonzero counts of salmonellosis are concentrated in the upper left part of the panel; and clusters of days with no cases (shown in black) can be clearly distinguished. For enteritis, there appear to be four major clusters, which are likely to refer to four climatic seasons, governed by the weather characteristics, but not by disease outcomes. Days with higher counts (shown in light orange and yellow) are concentrated in distinct areas in each of those four clusters. Yet, the clustering is not as apparent and requires further partitioning and quantification, as presented in the next section.
To determine the optimal number of clusters describing the original data, we used the silhouette metrics calculated for each sample. The silhouette metric values for agglomerate clustering with the Ward distance are shown in Figure 4. The maximum value of the metric for salmonellosis was 0.224 and indicated the presence of six clusters, and for enteritis it was 0.192 and indicated the presence of eight clusters. In addition, a local maximum of the silhouette value indicated the presence of eleven clusters for both infections. These results justify additional analysis to characterize the clusters corresponding to the global and local maximum values of the silhouette metric associated with two sets of clusters.  The results of further analysis described below focus on salmonellosis and enteritis infections by presenting the clustering as a sequence of two sets, based on the global and local maximum values of the silhouette metric. For salmonellosis, there were six clusters (Set 1) in the global maximum and 11 clusters (Set 2) in the local maximum ( Table 2). The description of selected clusters consists of the average values of daily morbidity with their 95% confidence intervals, the average values of meteorological characteristics, and the number of days in the clusters belonging to each of four seasons, defined by three consecutive months (starting with December-February as winter). Cluster 5 in Set 1 has high disease counts (greater than 3 cases per day, representing the 98 th percentile) mainly during summer days with a relatively high average temperature of 16.09 °C, low average humidity of 65.82%, and low pressure of 989.65 hPa. This cluster also picks 50 days and 12 days that were unseasonably warm for spring and fall, respectively. Set 2 consisted of eleven clusters for salmonellosis ( Table 2). The two clusters with high counts of salmonellosis (on average 3 or more cases per day) were Cluster 4 and Cluster 10. Cluster 4 mainly contains days of the summer season (similarly to Cluster 5 in Set 1), while Cluster 10 includes 66 days the spring, summer and fall seasons with low relative humidity values of 48.32%. Thus, these clusters of high disease counts occurred during unseasonably warm days in spring and fall, and during summer days with relatively high average temperature, low average humidity, and low pressure and (Table 2).  The results of further analysis described below focus on salmonellosis and enteritis infections by presenting the clustering as a sequence of two sets, based on the global and local maximum values of the silhouette metric. For salmonellosis, there were six clusters (Set 1) in the global maximum and 11 clusters (Set 2) in the local maximum ( Table 2). The description of selected clusters consists of the average values of daily morbidity with their 95% confidence intervals, the average values of meteorological characteristics, and the number of days in the clusters belonging to each of four seasons, defined by three consecutive months (starting with December-February as winter). Cluster 5 in Set 1 has high disease counts (greater than 3 cases per day, representing the 98 th percentile) mainly during summer days with a relatively high average temperature of 16.09 °C, low average humidity of 65.82%, and low pressure of 989.65 hPa. This cluster also picks 50 days and 12 days that were unseasonably warm for spring and fall, respectively. Set 2 consisted of eleven clusters for salmonellosis ( Table 2). The two clusters with high counts of salmonellosis (on average 3 or more cases per day) were Cluster 4 and Cluster 10. Cluster 4 mainly contains days of the summer season (similarly to Cluster 5 in Set 1), while Cluster 10 includes 66 days the spring, summer and fall seasons with low relative humidity values of 48.32%. Thus, these clusters of high disease counts occurred during unseasonably warm days in spring and fall, and during summer days with relatively high average temperature, low average humidity, and low pressure and (Table 2). Cluster 5 in Set 1 has high disease counts (greater than 3 cases per day, representing the 98th percentile) mainly during summer days with a relatively high average temperature of 16.09 • C, low average humidity of 65.82%, and low pressure of 989.65 hPa. This cluster also picks 50 days and 12 days that were unseasonably warm for spring and fall, respectively. Set 2 consisted of eleven clusters for salmonellosis ( Table 2). The two clusters with high counts of salmonellosis (on average 3 or more cases per day) were Cluster 4 and Cluster 10. Cluster 4 mainly contains days of the summer season (similarly to Cluster 5 in Set 1), while Cluster 10 includes 66 days the spring, summer and fall seasons with low relative humidity values of 48.32%. Thus, these clusters of high disease counts occurred during unseasonably warm days in spring and fall, and during summer days with relatively high average temperature, low average humidity, and low pressure and (Table 2). Colored rows refer to clusters with the high disease counts.
We compared the histograms of the clusters identified in Sets 1 and 2 ( Figure 5). The clusters with the highest number of cases (Cluster 5 in Set 1 and Cluster 4 in Set 2) showed very similar histograms (Figure 5a,b, respectively) and were identifiable with both the local and global maxima of the silhouette metric. Table 3 shows the interquartile range (IR) and coefficient of kurtosis of disease counts and meteorological characteristics for the clusters with high disease counts: Cluster 5 in Set 1 and Clusters 4 and 10 in Set 2. The IR values for disease counts show that in general the structure of clustering is similar, yet for weather parameters the IR values are smallest in Cluster 4, suggesting a dense and homogeneous arrangement of this cluster. The kurtosis values for Cluster 5 were consistently high, indicating a less defined or more diffused clustering. For Cluster 10, which was the smallest cluster by size, the kurtosis values were small, yet the IR values were high. This suggests that Cluster 10 was on average more diffuse with respect to temperature, dew point, and pressure parameters. We compared the histograms of the clusters identified in Sets 1 and 2 ( Figure 5). The clusters with the highest number of cases (Cluster 5 in Set 1 and Cluster 4 in Set 2) showed very similar histograms (Figure 5a and 5b, respectively) and were identifiable with both the local and global maxima of the silhouette metric. Table 3 shows the interquartile range (IR) and coefficient of kurtosis of disease counts and meteorological characteristics for the clusters with high disease counts: Cluster 5 in Set 1 and Clusters 4 and 10 in Set 2. The IR values for disease counts show that in general the structure of clustering is similar, yet for weather parameters the IR values are smallest in Cluster 4, suggesting a dense and homogeneous arrangement of this cluster. The kurtosis values for Cluster 5 were consistently high, indicating a less defined or more diffused clustering. For Cluster 10, which was the smallest cluster by size, the kurtosis values were small, yet the IR values were high. This suggests that Cluster 10 was on average more diffuse with respect to temperature, dew point, and pressure parameters. For enteritis, there were eight clusters (Set 1) in the global maximum and 11 clusters (Set 2) in the local maximum ( Table 4). The clustering results of Set 1 reveal two clusters, Cluster 2 and Cluster 8, with high disease counts, which were on average exceeding 10 cases per day (representing the 81st percentile). The histograms for these clusters are shown in Figures 6a and 6b. Cluster 2 includes primarily spring and winter days with an average temperature of −2.11 °C, relative humidity of  For enteritis, there were eight clusters (Set 1) in the global maximum and 11 clusters (Set 2) in the local maximum ( Table 4). The clustering results of Set 1 reveal two clusters, Cluster 2 and Cluster 8, with high disease counts, which were on average exceeding 10 cases per day (representing the 81st percentile). The histograms for these clusters are shown in Figure 6a,b. Cluster 2 includes primarily spring and winter days with an average temperature of −2.11 • C, relative humidity of 75.62%, and pressure of 993.98 hPa, which are locally considered as "warm winter" or "cold spring" days. Cluster 8 includes the days of the spring season (March and April), with high pressure values of 1004.68 hPa, low relative humidity of 56.44%, and relatively low temperature values of −4.6 • C. Clustering of enteritis in Set 2 revealed four clusters (Clusters 7, 8, 10, and 11) with high disease counts (on average above 10 cases per day). The histograms of the cluster features are shown in Figure 6c-f. Cluster 8 in Set 2 is identical to Cluster 8 in the Set 1 (Figure 6b,d). Clusters 7,8,and 11 includes days that are unseasonably warm for local winters with an average daily temperature of −4.52 • C, pressure of 995.25 hPa, and humidity of 76.84%, yet are also colder than typical spring weather. Cluster 11 includes "very warm" winter days with an average temperature of 3.87 • C, high mean pressure values of 1000.93 hPa and average humidity of 68.93%. Cluster 2 from Set 1 and Cluster 10 from Set 2 contain days representing all four seasons, which indicates potentially sporadic outbreaks, deserving further evaluation.
Int. J. Environ. Res. Public Health 2019, 16, x; doi: www.mdpi.com/journal/ijerph Clustering of enteritis in Set 2 revealed four clusters (Clusters 7, 8, 10, and 11) with high disease counts (on average above 10 cases per day). The histograms of the cluster features are shown in Figure  6c to 6f. Cluster 8 in Set 2 is identical to Cluster 8 in the Set 1 (Figure 6b and d). Clusters 7,8,and 11 includes days that are unseasonably warm for local winters with an average daily temperature of −4.52 °C, pressure of 995.25 hPa, and humidity of 76.84%, yet are also colder than typical spring weather. Cluster 11 includes "very warm" winter days with an average temperature of 3.87 °C, high mean pressure values of 1000.93 hPa and average humidity of 68.93%. Cluster 2 from Set 1 and Cluster 10 from Set 2 contain days representing all four seasons, which indicates potentially sporadic outbreaks, deserving further evaluation.  Table 5 offers a comparison of clusters with a high incidence of enteritis for both sets based on measures of interquartile range and kurtosis. Cluster 8 in Set 1 and Set 2 are identical, which is explained by the mechanism of hierarchical clustering. Clusters 7 and 11 have the smallest values of interquartile range for all weather parameters, indicating a smaller dispersion of these features for the days in those clusters. Overall, Cluster 7 is the most homogeneous cluster in terms of  Table 5 offers a comparison of clusters with a high incidence of enteritis for both sets based on measures of interquartile range and kurtosis. Cluster 8 in Set 1 and Set 2 are identical, which is explained by the mechanism of hierarchical clustering. Clusters 7 and 11 have the smallest values of interquartile range for all weather parameters, indicating a smaller dispersion of these features for the days in those clusters. Overall, Cluster 7 is the most homogeneous cluster in terms of meteorological factors and has the smallest values of kurtosis. Cluster 10 is the smallest cluster with only 47 days, yet it is a very noisy cluster that includes sporadic days with high disease counts. Using the clustering approach, we estimate the specific days and the boundaries of each selected cluster (Figures 7 and 8). Figure 7 illustrates the annual boundaries for the largest cluster constructed for salmonellosis, Cluster 5, along with the time series highlighting the days selected for the cluster. Cluster 5 was centered around early July (186th ± 21 days), supporting the marked 127 summer days of this cluster. Similarly, Figure 8 shows the annual boundaries for Cluster 8 constructed for non-specific enteritis with the annual centroids near late March and early April (95th ± 13 days). The peak timing estimated based on the results of the basic harmonic regression model (Model A, Table 6) confirms these findings. The seasonal increases for salmonellosis and enteritis were in early July (187th ± 7 days) and mid-April (103th ± 10 days), respectively. The sensitivity analysis demonstrates the stability of peak timing estimation (Table 6). The associations between ambient temperature or dewpoint and salmonellosis are relatively weak yet significant, which agrees with the salmonellosis summer peak timing ( Figure 9). As expected for temperate climates, the associations between ambient temperature or dewpoint and barometric pressure are quite strong and inverse. The results of the univariate log-linear regression Model B shown in Table 7 agree with the agglomerative clustering results and Spearman correlation. The relative risks of an increase in daily salmonella counts associated with 10 • C in ambient temperature and dew point were 27.8% (CI 95%:22.8-33.0) and 30.9% (CI 95%:25.2-37.0), respectively. For both health outcomes, low humidity and barometric temperature were associated with an increase in disease counts. However, after adjusting for autocorrelation, seasonality, and effects of the day of the week, these results no longer hold (data not shown).

Discussion
The presented study illustrated a hierarchical agglomerative approach for clustering the time series data as allocation of days with similar values of meteorological factors and disease counts. We detected a well-defined cluster for salmonellosis, which occurred primarily during summer months and unusually warm days in spring. For the counts of enteritis, or non-specific bacterial intestinal infection, we detected two clusters during unseasonably warm winter days. The proposed method offers several important insights for understanding temporal variations and meteorological conditions favoring the transmission of enteric infections. The main advantage offered by the proposed techniques is its ability to create a composite of meteorological conditions similar to a rule of thumb to suspect a day with potentially high disease counts for a given location. This method can also tolerate low periodicity and autocorrelation, and thus be useful for unstable weather periods, especially during transitional seasons like spring and fall in temperate climates. The ability to identify the environmental predictors of infectious outbreaks, specifically a relevant set of weather conditions is essential for building predictive models to inform public policy [9,23]. Conventional weather forecasts can be used to test whether the proposed method has sufficient predictive capacity, especially in situations when the weather changes abruptly. These findings have major implications for understanding potential health impacts of climate change and extreme weather effects.
In comparison to classic harmonic regression models that offer the estimates of seasonal characteristics for well-defined periodic processes, the clustering approach can help in fine tuning for unseasonable weather. As weather instability has become more prominent with climate change, the predictive capacity of typical harmonic prediction that assumes a well-defined cyclical behavior is likely to diminish. Using the clustering approach, we can estimate the specific days and the boundaries of each selected cluster with high disease counts (as shown in Figures 7 and 8). The boundaries of the cluster fluctuated depending on the number of days with the pre-specified weather pattern in each year. In fact, in the last year of the study period the upper boundary of the cluster was substantially shifted to the right, picking up the warm weather pattern during the unusual late summer and fall of 2011.

Discussion
The presented study illustrated a hierarchical agglomerative approach for clustering the time series data as allocation of days with similar values of meteorological factors and disease counts. We detected a well-defined cluster for salmonellosis, which occurred primarily during summer months and unusually warm days in spring. For the counts of enteritis, or non-specific bacterial intestinal infection, we detected two clusters during unseasonably warm winter days. The proposed method offers several important insights for understanding temporal variations and meteorological conditions favoring the transmission of enteric infections. The main advantage offered by the proposed techniques is its ability to create a composite of meteorological conditions similar to a rule of thumb to suspect a day with potentially high disease counts for a given location. This method can also tolerate low periodicity and autocorrelation, and thus be useful for unstable weather periods, especially during transitional seasons like spring and fall in temperate climates. The ability to identify the environmental predictors of infectious outbreaks, specifically a relevant set of weather conditions is essential for building predictive models to inform public policy [9,23]. Conventional weather forecasts can be used to test whether the proposed method has sufficient predictive capacity, especially in situations when the weather changes abruptly. These findings have major implications for understanding potential health impacts of climate change and extreme weather effects.
In comparison to classic harmonic regression models that offer the estimates of seasonal characteristics for well-defined periodic processes, the clustering approach can help in fine tuning for unseasonable weather. As weather instability has become more prominent with climate change, the predictive capacity of typical harmonic prediction that assumes a well-defined cyclical behavior is likely to diminish. Using the clustering approach, we can estimate the specific days and the boundaries of each selected cluster with high disease counts (as shown in Figures 7 and 8). The boundaries of the cluster fluctuated depending on the number of days with the pre-specified weather pattern in each year. In fact, in the last year of the study period the upper boundary of the cluster was substantially shifted to the right, picking up the warm weather pattern during the unusual late summer and fall of 2011.
As expected, the traditional harmonic regression is not sensitive enough to capture departures from stable oscillations. The observed summer peak is typical for well documented seasonal increase in salmonellosis in temperate climates [6]. While the single summer peak in salmonellosis and the major spring peak in enteritis were detected by both methods, the harmonic regression missed the irregular spikes of salmonellosis in spring and spikes of non-specific enteritis occurred during summer months. For enteritis, the annual centroids of Cluster 8 fluctuated near late March and early April (95th ± 13 day of the year). Similarly, the harmonic regression model predicts the peak on the 103rd day of the year. These similarities reassure the potential for applying both methods simultaneously to better characterize the detected increase: by estimating seasonal peak timing offered by harmonic regression and by defining the homogeneity of a seasonal increase offered by the agglomerative clustering method. The classic log-linear regression models to assess the association between daily counts of infection and weather characteristics confirmed the relationships presented by the agglomerative clustering but offered low predictive capacity, with Model B having a variability below 10%. Even after adjusting for first order autocorrelation and weekly cycle, the traditional model was still insufficient to capture temporal variations associated with subtle changes in weather variables.
We anticipated that a comparison of the seasonal clusters in two health outcomes, salmonellosis and non-specific enteritis, will give us some clues whether some cases of non-specific enteritis were likely to be untested bacterial infections. We were able to make such assertions in our early study of medical claims [7]. We found that most nonspecific gastrointestinal diseases peak concurrently with viral enteritis, suggesting a lack of diagnostic testing for viruses, which may adversely affect the efficiency of prevention, surveillance, and treatment efforts. It is also possible that the seasonal increases in non-specific infections could be indicative of emerging strains that are rarely tested in routine settings. The detected partial overlap in two infections indicate that the suggested clustering approach could complement existing methodology for outbreak detection and characterization [23][24][25], and might have the potential to estimate the rate of under-reporting and provide recommendations to test for infections known to peak in certain times.
The suggested approach also offers characteristics to measure cluster compactness and homogeneity. We used the interquartile range and the kurtosis coefficient to estimate the features distribution of the obtained clusters. The interquartile range, which is a robust analogue of the variance, allows us to estimate the spread of the features of the objects in a cluster. Kurtosis allows us to estimate the sharpness of the distribution peak and the presence of "popular values" of features as well as the homogeneity or the flatness of the distribution. As the results show, splitting the data into smaller clusters can yield more compact subsets of data describing morbidity and weather factors. In fact, for salmonellosis Cluster 4 (Set 2) turned out to be more compact in comparison with Cluster 5 (Set 1) for weather factors: the values of interquartile range are smaller, while the excess value for pressure and relative humidity is higher. A similar situation is observed for bacterial infection, where Clusters 7 and 11 (Set 2) had lower interquartile range values as compared to the largest Cluster 2 (Set 1).
In studying the temporal variations in infectious disease, the outliers and missing data points often complicate the classical time series analysis. Typically, if there are missing values in the time series, then perhaps standard methods for imputing or recovering missing values, most suitable for time series, should be used at the development and application stages [26]. While in the presented example the proportion of missing data was low, it is possible that potential underreporting had resulted in overall reduced counts. Both HRMS and agglomerative clustering can tolerate a small fraction of values missing at random. However structural or systematic missingness might affect cluster detection, and the extent of such effect should be further investigated.
For infectious diseases characterized by periods of outbreaks alternating with periods of low incidence, deciding whether a spike in fact is an outlier is difficult. It is unlikely that a true process governing the observed distribution is known, so we often rely on common assumption of Poisson or negative binomial distributions typical for time series of counts. In our example, only one value (12 cases of salmonellosis) warrants suspicion and when we tested its impact this single outlier did not affect the result of clustering. We argue that the understanding of how a substantial fraction of outliers might influence the clustering process, such repeating the analysis with and without outliers [27], or by applying robust methods of clustering [28], should be a standard practice.
The proposed example tackled the time series of infections only in one location and it would be valuable to know if the detected clusters will hold in cities with similar population structures and similar climates. The CliWaDIn database contains data for six Russian cities and this research will be further extended to test whether the number and characteristics of clusters vary geographically and also whether it is possible to build a "universal model" [29]. In addition to meteorological characteristics, any future analysis should consider water quality parameters, demographic structure, and social factors such as holidays, typical vacation times, workdays, and weekends. In our current model, we did not consider the incubation period or lag effects of a disease and included only weather characteristics for the day of disease counts. We envision that at the next stage an investigation of the lag structure and delayed effects is essential for revealing the patterns that might precede an outbreak [11]. The poor performance of traditional log-linear models could stem from potential non-linearity, auto-correlation, and presence of lagged effects [30]. The inclusion of additional adjustments to capture the delayed effects of weather variables on the health outcome could potentially improve the model fit. In fact, in 2008 we published a method that allowed us to quantify the time-distributed effect of exposure on a spread of infection with a known incubation period [11]. We modeled the health effect of high temperature on reported cryptosporidiosis caused by a pathogen with~7-day incubation period and demonstrated the model's ability to recover the effect with a nonlinear distributed lag structure. Future studies should determine a structure of the delay and consider various reasons for the delay (including an incubation period, a differential delay due to disease severity, an administrative delay due to testing and reporting procedures), and compare the delayed effects recovered by both log-linear model and machine learning methods with a broad range of lags and permutations for weather parameters.
Understanding of temporal patterns along with accurate estimation of seasonal peak timing are essential for building reliable predictive models. Foodborne and waterborne enteric infections caused by pathogens sensitive to environmental conditions are difficult to track and control. While their seasonal peaks could be well synchronized, outbreaks triggered by high temperature may exhibit peak delays [7]. Improvements in surveillance system detection and reporting, implementation of public health policies along with environmental and climatic changes may result in shifts of seasonal peaks and amplitudes [10]. An investment into routine monitoring, tracking and forecasting of infections sensitive to environmental and climate conditions is reasonable and cost effective considering the damage cost of climate change [12,13,23]. While the proposed model is not designed to be continually updated with real time meteorological forecast because of peculiarities of the hierarchical clustering mechanism, future studies should explore opportunities for developing methods for near-term forecasting and early outbreak detection based on emerging methodology and data sources.

Conclusions
With the growing availability of routine disease monitoring, identification of environmental conditions that are predictive of disease occurrence requires novel ways to explore time-referenced data. Presenting this problem as an unsupervised machine learning task allowed us to detect and describe the features indicative of time periods with high counts of infections. We demonstrated the process of cluster identification and results of clustering by using two enteric infections that were likely transmitted by food and water, and were affected by metrological factors such as atmospheric temperature, barometric pressure, relative humidity, and dew point. We identified time periods with unseasonable meteorological conditions that "favor" infections. These findings had major implications for understanding the potential health impacts of climate change. The main research objective in identifying environmental predictors of infections should shift to designing robust algorithms to predict health outcomes based on meteorological, environmental, demographic parameters that govern the underlying dynamic system.  ,  Table S1. The daily counts of health outcomes and average daily values of temperature ( • C), atmospheric pressure (hPa), relative humidity (%) and dew point ( • C) from 1 January 2004 to 15 November 2011.
Author Contributions: E.N.N., I.N.Y. and P.S.S. designed the data analysis protocol; E.N.N. and I.N.Y. oversaw data abstraction; P.S.S. and T.M.A.F. conducted data analysis; P.S.S. and E.N.N. drafted the manuscript with input from T.M.A.F. and I.N.Y. All authors have reviewed and approved the manuscript.
Funding: E.N.N. was supported by the grant on "Ethical Dimensions of Data Access, Sharing, and Communication" from the Jonathan M. Tisch College of Civic Life at Tufts University.