Assessing Seasonality Variation with Harmonic Regression: Accommodations for Sharp Peaks

The use of the harmonic regression model is well accepted in the epidemiological and biostatistical communities as a standard procedure to examine seasonal patterns in disease occurrence. While these models may provide good fit to periodic patterns with relatively symmetric rises and falls, for some diseases the incidence fluctuates in a more complex manner. We propose a two-step harmonic regression approach to improve the model fit for data exhibiting sharp seasonal peaks. To capture such specific behavior, we first build a basic model and estimate the seasonal peak. At the second step, we apply an extended model using sine and cosine transform functions. These newly proposed functions mimic a quadratic term in the harmonic regression models and thus allow us to better fit the seasonal spikes. We illustrate the proposed method using actual and simulated data and recommend the new approach to assess seasonality in a broad spectrum of diseases manifesting sharp seasonal peaks.


Introduction
Understanding temporal changes in disease occurrence in human populations is one of priorities in epidemiology, public health, and life science related disciplines. This lofty goal implies ability to describe, quantify, and examine temporal patterns, which include increasing or declining trends, seasonal patterns, unusual spikes associated with outbreaks or disappearance of periodic episodes that mark disease eradication. These temporal characteristics are often explored in order to detect emerging trends in populations of concern, determine the success of intervention programs on a population level, and to forecast the future temporal behaviors [1][2][3]. The temporal analyses are also performed to better understand the effects of contributing factors to changes over time.
We define 'seasonality' as systematic, repetitive, periodic fluctuations in disease incidence over the course of one year. Disease seasonality is characterized by the magnitude, timing, and duration of of the Christian Medical College and Hospital, in Vellore, India over a decade, and monthly pneumonia and influenza death counts in the U.S. for 11 years, from 1968 to 1978 [18]. We also used simulated data with predefined trend and seasonal pattern to illustrate the proposed two-step approach.

The Base Model
The conceptual framework to describe periodic oscillations is expressed as Z t = µ + γ cos(2πωt + ϕ) + ε t (1) where, Z t is a time series of an outcome of interests measured at time t, t = 1, 2,...., N with N-An effective length of a time series (number of observations); µ is the constant reflecting the general baseline of Z t ; the periodic component has a frequency of ω, an amplitude of γ, and a phase angle of ϕ; and ε t , are independently and identically distributed normal random variables with E[ε t ] = 0 and Var[ε t ] = σ 2 . This model describes seasonal behavior by a cosine function with symmetric rise and fall over a period of a full year. The locations of two points, the seasonal curve peak and nadir (lowest point), can be determined using a shift, or phase angle parameter, which reflects the timing of the peak relative to the origin. The shift parameter is expressed in the time units of a time series and can be used for seasonality comparison. The amplitude of fluctuations between two extreme points is controlled via parameter γ. If γ = 0, there is no seasonal increase. We assume that a period of oscillation or a cycle is known; thus, the frequency, the reciprocal of the period in t units, is a fixed number. Therefore, the model has three parameters-the constant, amplitude and phase. To ease the estimation, the model can be re-formulated as [17] where, β S = −γ sin ϕ and β C = γ cos ϕ, the model parameters or beta coefficients. The temporal resolution of actual data can be reflected by ω = 1/M, where M depends on the unit of analysis, and is 4 for quarterly data, 12 for monthly data, 52.25 for weekly data, and 365.25 for daily data. A general framework of a regression model is sufficient to estimate the model parameters µ, β S , and β C . Furthermore, by using the δ-method, the estimates of peak timing and amplitude can be supplemented by the uncertainty measures [17]. This simple harmonic regression model can be applied to variety of scenarios and satisfy various forms of actual data in practical settings. For example, to model monthly counts, the model can be written as where, Z ti is the count in the tth month of ith year; t values range from 1 to 12; i values range from 1 to L, where L is number of years under observation. In the context of the model ω reflects the period within every year, and for the monthly data, ω = 1/12. Thus, the above equation can be rewritten as A linear combination of sine and cosine functions fits the seasonal variation in the outcome as a regular wave with a single, equally spaced peak and over the calendar year, with the actual position of the peak and trough guided by the data. The model parameters or beta coefficients β 0 , β S , β C can be used to estimate peak timing and amplitude.
The model can be extended to capture the trend of time series data, while adjusting for seasonality with the sine/cosine pair. For the example described above, including 'i' into the model can help to capture a long-term linear trend. Now, the harmonic regression model is rewritten as where, β Year is the model parameter or beta coefficient of the trend variable. We adapted the above equation (5) for the Poisson-distributed outcome to form the base model which we applied to our examples to fit monthly counts and estimate the peak time to enable the model extension.
We estimated the peak timing, θ and the amplitude, α using the δ-method [17,25] using the following transformations: θ = M {arctan( β S / β C ) + k}/2π and α ={β c 2 + β s 2 } 1/2 , where β C and β S were obtained from fitting Model A. The estimate of θ depends on join signs of β C and β S ; so k = 0, when both β C and β S are positive, k = 2π, when β C < 0 and β S > 0, and k = π, otherwise. Furthermore, standard deviations for amplitude α and peak timing θ can be also estimated.

Model Extensions
We extend the basic Model A to improve the fit by replacing cosine and sine functions with two wave functions: 2{1-cos(u)}/u 2 and sin(u)/u, which are the Fourier transform/characteristic function of the symmetric triangular density and the uniform density, respectively. The advantage of using these functions is that their maximum value is 1 at a predefined time. Similarly to using a linear and quadratic terms in a simple regression model, the proposed cosine Fourier transform function 2{1-cos(u)}/u 2 can be treated as a squared term of the sin(u/2)/(u/2). This quadratic form can be interpreted as acceleration to fit a sharper peak than the ordinary sine function. The derivation part is 2{1-cos(u)}/u 2 = 2(2 sin 2 (u/2)/u 2 ) = sin 2 (u/2)/(u/2) 2 = {sin(u/2)/(u/2)} 2 .
Based on this property, we transform time t i in Model A with u i = 2πω(t i −θ) = 2π(t i −θ)/M, where θ is peak timing. For simplicity, the value θ can be estimated from the actual time series, as described in Equation (7) using the basic Model A.
The proposed Models B, C, and D captures the variations in year, seasonal variation, and might be better tuned-up to capture the variations in seasonal amplitudes. The introduced terms based on Fourier transform functions can accommodate patterns with the sharp increase to reach the peak as shown in Figure 1. These models are likely to better describe the actual data.

Data
To illustrate the ability of the proposed models to capture trends and seasonal patterns we are using four examples based on the three actual datasets and one simulated dataset representing various infections occurred in specific populations. The datasets are presented in Supplemental Table S1. Below we provide a general description of infection's etiology, epidemiology, and an applied data set.

Example 1: Hospitalizations Due to Salmonellosis in U.S. Elderly
Every year, Salmonella infection is estimated to cause over one million foodborne illnesses in the United States, with 19,000 hospitalizations and 380 deaths annually. Majority of infected with Salmonella develop diarrhea, fever, and abdominal cramps 12 to 72 hours after infection. The illness usually lasts 4 to 7 days, and most persons recover without treatment. In the frail elderly, however, the infection may be so severe that the patient needs to be hospitalized. The 25,367 hospital records of salmonellosis (ICD-9-CM 003.X) were extracted from the U.S. Centers of Medicare and Medicaid Services (CMS) database from 1991 to 2002. Each individual record contained age, admission date, and diagnosis codes [2]. In order to conduct time series analysis, records were organized as monthly counts observed among patients aged 65 and older. The aggregation of records into monthly time series of counts was based on patient admission date.

Example 2: Laboratory-Confirmed Cases of Shigellosis in Christian Medical College and Hospital, India
Shigellosis is an infectious disease caused by a group of bacteria called Shigella (shih-GEHL-uh) with a common fecal-oral transmission route via contaminated food or water. Most of the people who are infected with Shigella develop diarrhea, fever, abdominal pain, and dysentery (stools with blood and mucus) starting a day or two after they are exposed to the bacteria. Shigellosis usually resolves in 5 to 7 days. There may be asymptomatic carriers of the bacteria who are a source of infection to others. Effective and frequent handwashing, provision of safe drinking water and hygienic methods of food handling can stop transmission of shigellosis. The Department of Microbiology at CMC, Vellore receives stool samples that are sent for culture of common enteric pathogens. Stool samples of patients attending the emergency or outpatient departments or admitted to the hospital with a history of passing loose, frequent stools were collected and registered for culture. The diagnosis of shigellosis is made by successfully isolating the organism by conventional culture methods and identifying using specific antisera and appropriate biochemicals. 1242 records of positive cultures for Shigella were extracted from laboratory records between January 2003 and December 2013 and organized as monthly time series.

Example 3: Monthly Records of Pneumonia and Influenza Death in US
Influenza (flu) is a highly contagious viral infection which is one of the most severe illnesses of the winter season. Influenza is spread easily, when an infected person coughs or sneezes. Pneumonia is a serious infection or inflammation of the lungs, which can lead to death. Influenza is a common cause of pneumonia, especially among younger children, pregnant women, individuals with certain chronic health conditions, and frail elderly. While, in healthy individuals, flu rarely leads to pneumonia, those that do tend to be more severe and deadly. In fact, flu and pneumonia were the eighth leading cause of death in the United States in 2014. Monthly records of pneumonia and influenza death per 10,000 population in U.S. for 11 years between 1968 and 1978, representing 3,855 death events were abstracted from the public source [18]. The monthly rates were converted as per 1,000,000 population for computational convenience. ). The set of ((p, d, q), (P, D, Q), and S) defines the properties of a simulated sequence, where S is the time span of repeating seasonal pattern, thus for a monthly time series S = 12. To obtain a sequence of values with an increasing trend and apparent seasonality, the AR and SAR parameters are taken to be 0.6 and 0 respectively. The MA and SMA parameters control the past error, which are taken as 0.6 and 0 respectively; the I and SI parameters are taken as 0 and 1 respectively. Data were generated under a Poisson distribution assumption to simulate counts.

Results
The inference based on case studies depends on size, prevalence, and seasonality of the specific diseases. Thus the case studies or data driven evidence of new model is unlikely to be robust. Therefore with the simulation we have introduced seasonality, trend using auto regression (AR) and moving average (MA) parameter values and the performance of the new model was compared with commonly used model.
In general the number of cases reported with Salmonelosis at the hospital has been declining. The rate of decline per year was by 7.4 counts, and Shigellosis infection is increasing over years at the rate of 0.6 counts per year. Flu data shows a declining trend with a rate of 0.9 counts per year. The dataset simulated showing an increasing trend with the rate of 7.8 counts per year.
The summary statistics for monthly values representing four examples are shown in Table 1. In addition to typical statistics, such as minimum, maximum, mean, standard deviations, first and third quartiles for monthly values and overall, we provide the estimates of coefficients of skewness and kurtosis ( Table 2).  Table 3 shows the results of root-mean-square error (RMSE), mean absolute deviance (MAD), Bayesian information criterion (BIC), and the regression coefficients for annual trend, sine and cosine terms for Models A, B, C, and D for four examples. Overall, all models provide relatively good fit to the data, yet the three applied statistics demonstrate potential model preference. While models are performing equally well in terms of RMSE and MAD, we consider BIC as a better measure for comparing between models. Root-mean-square error (RMSE); mean absolute deviance (MAD); Bayesian information criterion (BIC). Table 2 also contains the estimates of peak timing using the results of Model A. High values for skewness and kurtosis indicate the presence of sharp peaks in the studied time series of counts. Figure 2 shows the time series of monthly counts for salmonellosis for 12 years of the study period.  Table 3, Model B offers the best fit with RMSE (24.28) and BIC (1451.72) as compared to other models. Figure 3 shows the time series of monthly counts for Shigella-related infections for 12-year study period. Counts of shigellosis have been slowly increasing from 2003 through 2013. The range of observed values increased over time with more observations ranging from 5 (first quartile) to 12 (third quartile) cases per month. Counts reached their maximum value of 33 in June 2010 and kept minimum value of 1 case in many months. The time series exhibit a clear seasonality with high fluctuations in June (SD = 8.47) and low fluctuations in November months (SD = 2.05). On average the peak occurs in June and the dip is in October. The results of Model A indicate that on average the counts peaked at 6.29 month, e.g., early-mid June. While visually the trend is not apparent and seasonality is hard to depict, all models had detected a modest but significant upward trend and significant seasonal component. As shown in Table 3 all models have overall low values for MAD. Model C had lowest RMSE (4.85) and BIC (842.96) values as compared to other models.    1.89). In general, the peak occurs in January and the dip was observed in August. The results of Model A indicate that on average death counts peaked at 1.47 month, e.g. mid-January. The time series plot of predicted values shows a downward trend and well-defined seasonal behavior, yet with somewhat irregular peaks, fluctuating between December and March. All models detected downward trend and seasonal patterns. Again, Model C had the best fit with lowest RMSE (7.11), MAD (4.15), and BIC (851.78) as compared to other models.

Discussion
To capture strong seasonal behavior with sharp peaks, we offered a two-step process when we first build a basic model and estimate the seasonal peak. We then apply an extended model using sine and cosine transform functions. These newly proposed functions mimic a quadratic term in the harmonic regression models and thus allow us to better fit the seasonal spikes. We illustrated the proposed method using actual and simulated data and can recommend the new approach to assess seasonality in a broad spectrum of diseases manifesting sharp seasonal peaks.
In epidemiological and medical research, the regression methods are broadly used for the analysis of the time series data. The adaption of harmonic regression methodology is now well accepted to explore the trend and seasonality of diseases. Though three types of distributions: Gaussian, Poisson, and negative binomial assumed, most often Poisson harmonic regression is applied to accommodate the skewed nature of counts. The main drawback of harmonic regression is that it assumes a symmetrical nature of a harmonic process with the same rate of an increase and decrease in disease incidence from nadir to peak and vice versa [26]. Thus, by assuming a symmetric well-defined periodic structure, traditional harmonic models may not be ideal to capture the departures from stable oscillations [27]. The proposed approach could mitigate this problem.
There are difficulties involved in understanding and examining the concept of seasonality and its patterns as this need data to be collected over long time and over large spatial units. In the absence of such qualities, the evaluation may be affected by time-dependent and space-dependent confounders, which could possibly be improved by using a systematic approach to evaluation of seasonal curves [28], including parametric and non-parametric procedures of modeling [6], and non-linear methods developed with periodic functions in biology and climatology [25]. The proposed approach could further characterize disease seasonality.
For example, the seasonal pattern of influenza infection is not completely understood due to the heterogeneity of infection transmission and manifestation. Few ideas of modeling complex influenza dynamics was explored, including the pyramid structure of disease burden with respect to severity of disease [14]. Wenger et al., analyzed 13 influenza seasons by developing methods to measure seasonality characteristics and to quantify the uncertainty of relevant parameters [5]. Wenger et al., also noted the seasonal peaks in varying heights which could be because of variation between individual years and detected a positive correlation between peak timing and amplitude, meaning that the early flu season arrival was typically high in intensity. While the uncertainty of seasonality parameters was assessed with delta-methods in [5,22], the bootstrap method was applied to find the confidence intervals [7]. In Eilers et al., to account for the varying annual seasonality in the disease counts, the coefficients of harmonics (sine and cosines) were allowed to vary smoothly over the age and time plane in the modelling in analyzing monthly deaths of respiratory diseases, in US female for years 1958-1998 and for ages 51-100. The over-dispersion encountered during tuning the smoothness parameter, handled with selective weighting and by using quasi-likelihood instead of Poisson [29]. Negative binomial regression model with harmonic terms could be preferred over Poisson model due to the overdispersion confirmed by a statistical test. Chui et al., introduced graphical tools, so-called multi-panel graphs, to visualize simultaneously the population structure and temporal trend and to link the graphs to models like harmonic regression to ease the interpretation of models results. This graphical approach was applied for four datasets: Influenza and Salmonella associated hospitalizations, confirmed salmonellosis cases, asthma-related hospital visits in USA [30]. The spatiotemporal patterns of influenza associated hospitalizations were also analyzed using harmonic regression with an additional squared time term to account for a quadratic trend component along with the linear trend component [31]. The limitation of this study was that the simulation should have been done with various sample sizes and means. However, this study indicated that harmonic regression was meant for sharp peaks based on corroborative evidences from the case studies In the proposed study, we had illustrated that by adapting a two-step approach to commonly accepted models that can be easily implemented with existing statistical open-source software, the fit of the models can be further improved. The proposed approach can be broadly adapted to a wide range of scenarios when researchers are looking to statistical tools to formally compare the time series in different populations or across different time periods. In comparing time series, characterization of trend and seasonality are the key components of the analysis. For example, the outcome of interest might be the time series of disease incidence in a specific location and the task is to determine whether there is an overall decline in incidence in presence of strong seasonal variations with a likely complex form. In fact, we know that monthly cholera occurrences observed over 11 years exhibit a decreasing trend and a strong seasonality with high incidence in July and August [4,[32][33][34][35][36]. Our method would allow to find a more refined fit to existing data and offer better interpretation of the obtained results as compared to the traditional approaches.
The suggested two-step approach can be further improved by exploring harmonic terms with additional sine/cosine pairs at shorter or longer wavelengths, which should be able to accommodate a more complex temporal behavior. The typical periodic oscillations well defined in epidemiology occurred on a weekly, monthly, or quarterly basis and therefore these cycles are observed within a year. We recognize the limitation in using monthly values, which offer a coarser estimate than could be offered by refined time units, like days or weeks [26]. In order to improve the descriptive power of the regression models adapted to time series of counts to detect these cycles, it would be valuable to examine how the degree of temporal aggregation affect the accuracy and precision. It is likely that the proposed two-step procedure will improve with increasing temporal resolution, e.g., by replacing monthly counts with weekly or daily time series of counts. In our study, we use counts as an outcome without adjusting for population (except for pneumonia and flu deaths rates). We assumed the population is unlikely to meaningfully change over the study periods and affect the seasonality estimates. Further studies could explore the factors that affect trends, including changes in exposed population.
The suggested two-step approach provides a solution for fitting sharp peaks with simple transforms when peak timing is unknown. In general, peak timing can be roughly estimated by superimposing monthly counts over years of the time series data for the study period. One can also assume a discrete probability model for θ, so the probability masses can be estimated as the frequency ratios of θ t , the corresponding model would represent a harmonic mixture regression model.
Another direction for further adaptation of the proposed model is to expand the model by allowing additional factors that may help to explain trend and seasonal variation or account for confounding. As we limited our focus in introducing a model building strategy with new Fourier terms, we did not explore potential exposure variables. That could possibly be explored further. Thus, the interest of a study might be to identify factors that influenced a trend of a disease. Such factors could be represented by environmental, clinical, or demographic variables [7][8][9][10]26,27,31,37,38]. For example, in investigating the factors associated with seasonal peaks of cholera various environmental, climatic, and meteorological factors demonstrated potential links and provide an insight to underlying mechanisms [4]. We also assumed that the strong autocorrelation in the outcome time series is not of major concern after controlling for seasonality and trend patterns [39]. Yet, such an assumption should be further tested along with the assumption of a true annual period [26].
Applications of time series analysis is gaining momentum and the growing interest of public health professionals, epidemiologists, and clinicians for a better understanding and quantification of temporal variations in disease incidence require proper tools to conduct such research with increased accuracy.

Conclusions
Though numerous well proved complex models are available for time series data analysis, researchers prefer and use regression-based methods because of their diversity and flexibility in adopting amendments during model building procedures. While using these methods for convenience, unintentionally many inherent qualities of time series data are neglected in modeling. We already have well accepted harmonic regression model that provide good fit for trend and stable periodic patterns with relatively symmetric rises and falls. The new proposed model is compared and evaluated with the existing model using real time and simulated datasets, based on the fit statistics and other values. The newly developed model can handle time series data with sharp sporadic peaks and prolong periods of low incidence and could offer an advantage over the traditional harmonic regression model. Table S1: Monthly counts of salmonellosis, shigellosis, pneumonia, and influenza infections and simulated data. The following are available online at http://www.mdpi.com/1660-4601/17/4/1318/s1.