Data modelling recipes for SARS-CoV-2 wastewater-based epidemiology

Wastewater based epidemiology is recognized as one of the monitoring pillars, providing essential information for pandemic management. Central in the methodology are data modelling concepts for both communicating the monitoring results but also for analysis of the signal. It is due to the fast development of the field that a range of modelling concepts are used but without a coherent framework. This paper provides for such a framework, focusing on robust and simple concepts readily applicable, rather than applying latest findings from e.g., machine learning. It is demonstrated that data preprocessing, most important normalization by means of biomarkers and equal temporal spacing of the scattered data, is crucial. In terms of the latter, downsampling to a weekly spaced series is sufficient. Also, data smoothing turned out to be essential, not only for communication of the signal dynamics but likewise for regressions, nowcasting and forecasting. Correlation of the signal with epidemic indicators requires multivariate regression as the signal alone cannot explain the dynamics but – for this case study – multiple linear regression proofed to be a suitable tool when the focus is on understanding and interpretation. It was also demonstrated that short term prediction (7 days) is accurate with simple models (exponential smoothing or autoregressive models) but forecast accuracy deteriorates fast for longer periods.


•
Data preprocessing is an essential step in a modeling framework for wastewaterbased epidemiology

•
Equal spacing of the scatted data is essential but downsampling is sufficient Since early in the appearance of the SARS-CoV-2 pandemic in the beginning of 2020, wastewater-based epidemiology (WBE) was investigated as possible tool for early warning of Covid-19 outbreaks (Medema et al., 2020).It has been demonstrated that SARS-CoV-2 ribonucleic acid (RNA) is found in wastewater at the influent of treatment plants with the genetic material being introduced into the sewer systems by viral shedding of Covid-19 patients (e.g., Wölfel et al., 2020).Based on those references, WBE has been applied in pandemic management early on, see e.g., Ahmed et al., 2020;Medema et al., 2020.In the meantime, the information from WBE is displayed in (national) dashboards (Naughton et al., 2021) and used for subsequent assessment of pandemic development (e.g., Robotto et al., 2022;Lastra et al., 2022).
Despite the successful application, there is still remarkable uncertainty in analysis and prediction of the viral signal.Key problems are the inherent errors and uncertainties in the raw virus signal as measured at the treatment plant.Other issues of concern are the link of the virus signal to epidemic indicators such as incidence values and the prediction of pandemic dynamics, which proved to be not straightforward but highly dynamic during different phases.
While a stringent methodology regarding data modelling cannot explain the inherent uncertainties of WBE, it avoids to introduce new and additional error sources.Accordingly, this paper aims to establish a modelling framework for WBE signal analysis, interpretation and prediction.Key emphasis is given to test and demonstrate mathematical methods of various complexity.
As data source we will use the case study Vienna, Austria where genetic material in the wastewater have been sampled at the main wastewater treatment plant for appr.18 months.The timeseries thus encompasses the main phases of the Covid-19 pandemic.However, no emphasis will be put here on the details and uncertainties arising from sampling and laboratory processes (PCR test deriving genetic material).This is not to disregard the issue but the presented investigation starts with the timeline of the virus signal derived therefrom.

Wastewater based epidemiology for SARS-Cov-2
WBE aims to derive conclusions on substances occurring in the watershed of a sewer system by sampling -usually at the influent of a treatment plant.The concept has been originally developed for the detection of drug use (Daughton, 2001) but is in the meantime likewise applied for spreading of diseases (Sim and Kasprzyk-Hordern, 2020).Adapting the basic formulation for Sars-Cov-2, the timeseries from WBE is usually expressed as (measured) virus load related to the population drained with the sewer system: where Lvirus is gene copies/P/d; Q = flow volume in L/d; cvirus = virus concentration in the sample in gene copies/L and P = number of persons in the watershed.While not directly comparable to an epidemic indicator, the virus load serves as quantitative information on pandemic dynamics in the watershed.
Assuming that each infected person is shedding a certain load of genetic material per time (Lshed in Gene copies/Pinf/d) into the sewer system as well as neglecting uncertainties, distortions from variations of shedding behaviours and losses in the transport phase we get the relation: with Lshed in gene copies/Pinf/d; Pinf = infected persons in the watershed and finf = Pinf/P = fraction of infected persons.Note that finf is actually expressing prevalence (defined as ratio of infected persons to total population) -one of the key pandemic parameters.

Wastewater surveillance -the Vienna case study
Like many other countries, Austria has initiated wastewater-based epidemiology already early in the pandemic i.e., in late spring 2020.While the program started initially as research program, results from initially bi-weekly surveillance (three-weekly since November 2020) have subsequently been taken up in pandemic management.For this paper we will utilize the timeseries of SARS-Cov-2 measurements of one specific treatment plant as case study, i.e., of Vienna (1.9 Mio inhabitants) as the capital city of Austria.The length of the time series is app.
18 months, including three pandemic waves, several lockdown periods and the bulk of the vaccination campaign.
Figure 1: Timeline of SARS-Cov-2 measurement (raw data in copies/ml) and daily documented new infections.

Modeling framework
Key aspect in modelling epidemic dynamics is the understanding of the inherent information and statical properties of the signal.The virus signal -derived from WBE -is a proxy for the overall prevalence (with prevalence here defined as the fraction of infected persons in the total population).But while prevalence is a valuable information in pandemic management, the quantitative determination of the value requires huge monitoring effort (Rippinger et al., 2021) and is thus rarely done.Incidence, on the other hand, is the common pandemic indicator based on documented infection cases by human/individual tests (see Figure ).The value is usually given as sum of documented infections in a 7-day period per 100 000 Persons, thus applying a gliding mean smoothing filter to the time series.However, note that incidence deviates from the actual infection status (i.e., prevalence) as the huge number of undocumented cases are not accounted for.Moreover, the incidence values are subject to test numbers and strategies and thus contain a significant uncertainty both in assessment and also in interpretation.From the above we can derive two typical tasks for time series analysis in relation to WBE, that is first correlating the signal to pandemic indicators and second forecasting the pandemic: • Correlating the signal to pandemic indicators relates to the standard statistical modelling problem of regression analysis, i.e., estimating the relation of the dependent variable y (e.g., incidence values) to the independent variable x (WBE signal).
Applying a linear relation and fitting the parameters by the least-squares method is denoted univariate linear regression.Adding more independent variables xi (e.g., the time series of daily individual tests) leads to multivariate regression.However, the more independent variables are added the more complex the interpretation of results.
Next to the linear estimation there are a huge number of other methods availablemost of them from the field of machine learning.Aberi et al., (2021) compare 8 different methods in the context of WBE regression and find 3 rd order polynomial regression as outperforming.
• Predicting epidemic development is key in pandemic management, in order to plan strategic measures such as restrictions and public interventions.WBE can hardly give a full picture of infection trends under multiple influences but the forecast of the signal is adding to the general trend prediction.Hence, for prediction modelling we concern ourself with the virus signal itself.As for the above-mentioned regression problem, there are a plethora of methods and algorithms available.Parmezan et al. (2019) categorize the models into parametric ones, i.e., exponential smoothing, autoregression (AR) and moving average (MA) and non-parametric ones, i.e., machine learning approaches such as artificial neural networks and support vector machine.

Model performance metrics
Time series analysis and data modeling as discussed require metrics and methods for performance assessment.Basic issues concern the assessment of the fit of a data model, similarity of series expressed as correlation, cross-validation and model comparison.The metrics used and their background are described in detail in Appendix A.

Data Preprocessing
It is obvious from the complex processes and degrees of freedom associated with WBE that the raw signal contains both insufficient information and inherent variation, thus rendering it unsuited for further analysis as is.In the following, we propose a 4-step procedure for data pre-processing -adapted from Rauch et al, 2021, Markt et al., 2021and Arabzadeh et al, 2021: 3.1.1Normalization For subsequent analysis with respect to epidemic indicators the accurate estimation of the population in the catchment is essential.While the relative fluctuation of persons in the watershed is usually small in large cities -despite commuters and tourists -, the population is changing widely in smaller settlements and touristic regions.The use of population biomarkers is thus standard practice to account for such variations (Been et al., 2014).While there is a range of biomarkers possible to compensate for dilutions in case of stormwater events or fluctuations in the shedding population (Choi et al., 2018), the choice is usually limited to routinely monitored water quality parameters such as COD, NH4-N and Ntot applying a standard load of 120g COD, 8g NH4-N or 11 g N respectively per PE (person equivalent) per day for estimation.Rauch et al, 2021 andArabzadeh et al, 2021 advocate the use of NH4-N as more reliable parameter in this context and derive the normalized signal as in gene copies/PE/day where cNH4 is the concentration of NH4-N in gN/L and fNH4 = specific NH4-N load in gN/PE/d.(Use of COD and Ntot is similar).However, normalisation is hardly ever straightforward in a long-term monitoring campaign, as specific data are occasionally missing, plain wrong or heavily influenced by sewer flushing, differences in industrial wastewater share and alike (Amoah et al., 2022).One obvious recipe is -for those instances -to either switch automatically from one water quality parameter to another (e.g., from NH4-N to COD) or to use the mean of the (other) available parameters instead.But before applying such an algorithm note that the above mention standard load values are assuming standard wastewater composition and should (in theory) result in similar population numbers when used for a specific sample, with where, cbm is the concentration of biomarker in g/L and fbm = specific biomarker load in g/PE/d.Thus, prior to interchangeable use of the standard loads it is advised to check if that relation holds -and in case not -derive better load estimates.E.g., for the timeseries of Vienna the deviation of the mean of each parameter against the averaged mean of all 3 quality parameters is COD = 1.10;NH4-N = 0.939 and Ntot = 0.962 (when the standard load estimates are applied as indicated above).For switching parameters in the normalization process, such factors are to be considered.

Outlier detection
Outliers in the signal timeline (that clearly distort the subsequent information and need to be removed) can derive from various sources and not all of those errors can be tracked by means of a coherent strategy.A manual inspection of the data thus remains an essential step.One potential source of erratic data stems from extreme discharge conditions as given by heavy rain events in combined sewer system.In a pragmatic approach Rauch et al, 2021 andMarkt et al.,2021 concluded that samples taken while the inflow is higher than the 90percentile of the recorded inflow data are potentially erroneous and to be treated as outliers.The shortcoming is that the determination of the threshold value (90percentile) requires at least one full year of flow data prior to the SARS-Cov-2 monitoring.
Intuitively one would apply a standard outlier detection method -such as a simple box plot (Tukey, 1970) -consecutively for the normalized signal as derived from above normalization.
Once a data point is labeled as outlier (i.e., being outside the 1.5 interquartile range), alternatives would apply, such as replacing the signal value with the mean of neighboring values or appropriate quartiles.Alas, this simple approach does not work in an algorithmic setting: in the rising limb of a pandemic wave, each signal is higher as the previous one -thus all values would be labeled as outliers.
However, the strategy can be used for the biomarker value instead for the signal itself.I.e., if a (NH4-N) measurement is identified as outlier, the use of other biomarkers (e.g., COD, Ntot) is advocated and -if that fails too -plausible values are to be used instead.In the latter case e.g., the 95 th percentile concentration from the data series could be used as surrogate value, if the concentration values are excessively high.For the Vienna case study, the 95 th percentiles would be 859 mg COD/l, 63.1 mg N/l and 45.3 mg NH4-N/l.But note, the above only works for biomarker measurements not for outlier in the virus data itself.

Resampling and Interpolation of scattered data
Most standard analysis methods require equally spaced samples in the time series.In WBE on the other hand samples are taken in different frequency (usually between daily to weekly) according to pandemic necessity, thus resulting in irregular data intervals (a so-called scattered series).The reality of irregular sampling requires to apply resampling routines from an equally spaced data grid (Benedict et al., 2000;Broersen andBos, 2006, Rehfeld et al., 2011).Two simple approaches can be applied in the context of WBE: • Downsampling is to condense the data into larger time segments as being sampled, here into a weekly series.An intuitive yet robust approach is applying block-wise averaging, i.e., to average all measurements in a time slot (here one week) to a mean value.The shortcoming of this approach is the loss of information due to the simple averaging, but on the other hand, the averaging can easily be done manually (even if it is tedious).The application of more complex methods such as finite impulse response (FIR) filters is not necessary as the WBE signal lacks higher frequencies that could cause aliasing.
• The second approach is to apply a common interpolation procedure in order to derive an equidistance set of values (here WBE signals).Note that interpolation can in fact be used both for upsampling (i.e., deriving daily values) and downsampling (weekly signals as above).Lepot et al., 2017, give an exhaustive overview on interpolation methods.A basic approach is linear interpolation which is easily done as being implemented as function in spreadsheets.Given a high frequency of measurements (as in the Vienna series) the method exhibits nearly identical results as applying a more complex method such as the Shepard filter (Shepard, 1969) for interpolation of the scattered data (see Figure 2).(Shepard Filter) to derive an equally spaced weekly time series for the Vienna case study.Right: Upsampling by means of Interpolation (Shepard Filter) towards a daily series.
We will discuss the influence of the grid spacing (daily versus weekly series) to further data analysis (regression and prediction) below.At this point it is to be noted that both methods (Downsampling and Interpolation) imply some loss of information but are not to be confused with data filtering (smoothing) methods as the aim is different.

Data smoothing
Even after normalization, outlier detection and grid spacing the resulting time series contains a significant amount of noise, thus blurring the actual information.In order to differentiate noise from the actual information in the signal, data filtering (smoothing) is frequently applied in science and engineering (see Arabzadeh et al. 2021 for a detailed investigation in relation to WBE).Regression and prediction methods inherently cope with noise but there are two issues that make data smoothing necessary: First, for communicating the result of WBE to pandemic management and public, as the display of raw signal adds to confusion.And second, the correlation of the signal to pandemic indicators that are filtered in itself as e.g., incidence values, requires likewise to smooth the signal prior to the regression (Arabzadeh et al. 2021).
The smoothing problem is specified for a series of n observations as: where y are the actual observations, x is the vector of covariates, e the random error with zero mean (µ(e) = 0) and b the sought smoothing function.This is in fact standard regression with the independent values x being derived from the observations y.
For data smoothing two different types of methods need to be distinguished: First, methods that require an equally spaced series (as discussed above) and second methods that can cope with irregular spaced datasets and thus provide both interpolation and smoothing features.An example of the first group is the simple moving average (SMA) while locally estimated scatterplot smoothing (LOESS) is a standard procedure for the latter (Cleveland and Devlin, 1988).To emphasize again, when applying LOESS (and alike) the resampling of the scattered dataset is not necessary.
Smoothing methods can be further classified in parametric and non-parametric ones, where for the first type the performance of the smoothing -i.e., the shape of the fitted curve -is a function of one or more parameter values.SMA is the most popular example of a parametric smoother which is calculated centralized for the estimation yi as: ",(+.*)/1 23".(+.*)/1 with x = input signal, y = estimation or output signal and k = number of points in average.
The parameter in SMA is k (also denoted as window size), with the smoothing effect increasing with increasing size of k.Similar LOESS is a parametric method with q specifying the fraction of the data used in a local estimate (kL = q*n is the number of neighboring data points taken into account).
Non-parametric smoothing methods have the benefit that parameter values are estimated inherently, thus providing optimal curve fitting from the point of the estimator.Typical examples are Friedman's super smoother and Generalized additive models (see e.g, Arabzadeh et al. 2021).In Figure 3 we compare 2 methods of significantly diverging complexity: The first approach is to use the block-averaged weekly data (see Figure 2) and apply a 3-point moving average filter (k=3).The method is intuitive, robust and requires the use of spreadsheets at most.The second approach is the use of LOESS for daily data points, using 11 nearest neighbors for each local estimation (kL=11).Also, LOESS is not overly complex as algorithm but definitely requires coding or the use of standard statistical routines as implemented in R or Python.
(Note that there is a plethora of other and more sophisticated smoothing methods available, however the result will not differ significantly - Arabzadeh et al., 2021.)In the case study, the optimal parameter for SMA has been determined by means of crossvalidation, i.e., by minimizing LOOCV (see Appendix A) for different values of k.The parameter kL for LOESS has then been determined for highest similarity and correlation with the SMA result.Two points are to be made based on the above example: First, both methods use a kernel for deriving the estimate, which is problematic for the smoothing of the latest data (end of series).
If such estimate is necessary, other methods (leaning towards prediction) must be used.And second, simple averaging methods can easily do the job -when understanding both background and restrictions of the approach.Using more complex methods does not harm, but is neither necessary nor needed for more representative outcomes (see also Smith, 1997).

Regression
Fundamental to mathematical regressions is the choice of the data in terms of interpolation, smoothing and variants (number of independent variables).For estimation of a suitable method, we will test 3 different datasets.Type 1 is raw data without smoothing -accordingly we hereby use the daily number of documented infections (Figure 1) as dependent variable instead of incidence values (sum of new infections over a 7-day period per 100 000 P). Type 2 are consistently smoothed values in upsampled daily grid spacing and Type 3 applies the smoothed but weekly downsampled data set.For each type the following 3 independent variables are available: first, the normalized signal itself (Lvirus), second, the number of daily tests and third, the dominance of the specific virus variant B.1.1.7 alpha (Carcereny et al., 2022) in the WBE signal in percent.The last information is given by whole genome sequencing of the WBE samples (Amman et al., 2022).Exemplarily Figure 4 shows the whole dataset for Type 3 (weekly downsampled).One more step is necessary before investigation regression methods, that is the investigation of cross-correlation between the information series (WBE signal and incidence) in order to determine a possible time lag.Among others, Aberi et al., 2021 found that the incidence signal lags 2 -7 days behind the wastewater signal which is essential to be considered.The two dominant reasons for that behavior are (Olesen et al., 2021): first, an infected person starts virus shedding before the onset of symptoms and thus before being urged to test.
Second, incidence is calculated as gliding sum over 7 days which inherently delays the signal.
Table 2 plots Pearson's correlation coefficient r for the 3 data sets (all are significant in terms of p < 0.05).But note that r is fairly constant in the range 1-8 days and thus also the time lag is a range rather than a specific value.As regression methods we compare multivariate linear regression and 3 rd order polynomial regression by plotting the coefficient of determination (R 2 ).From the results summarized in Table 3 we can derive the following: first, it is due to the high variation of the raw signals (Type 1) that smoothing makes for a substantial improvement of correlation, second, spacing of the series (daily versus weekly data) does not make a substantial difference and third, polynomial regression is superior to linear regression (thus corroborating Aberi, et al., 2021).An important consideration in regression is understanding the cause-effect relation, especially in the case of multiple variants.Despite the fact that polynomial regression (and the same applies for other machine-learning based methods) gives a better correlation with the dependent variable (incidence value), the simple linear regression method is likely the superior option for analysis as being both robust and easy to interpret.We will thus use the linear regression method for further analysis of the Type 3 dataset (weekly -smoothed).
For testing robustness, we apply cross-validation and find here close values for LOOCV as compared with the RMSE value for the whole series (see Table 3).This indicats that the model predicts missing data points quite well -the deviation of the root means residuals is app.6%.Another point is to test if the amount of information used (multiple variants) is actually necessary.A standard t-test for each of the regression coefficients (relating to xi) indicates that all 3 independent variables are essential for the regression.Last, we can additionally use the 5% -95% confidence interval for displaying the result (Figure 5).One last point: In this example we aimed for correlating the whole data series, i.e., including the occurrence of virus mutants and highly varying number of tests -which proofed to be difficult and requiring multivariate regression for a suitable result.However, regression is substantially easier to achieve when investigating shorter periods e.g., each of the 3 pandemic waves individually.In this case even simple univariate linear regression might easily do the job.

Forecast
Forecast in time series analysis is a well understood topic (De Gooijer and Hyndman, 2006) and has received already much attention in relation to the SARS-CoV-2 pandemic (see e.g., Cao and Francis, 2021).In the following we outline the procedure for the WBE signal (Lvirus), thus aiming to univariate short-term forecast of appr.one to two weeks ahead.While many complex methods and models are described in the literature (Parmezan et al., 2019) in our approach we will focus on two simple yet robust linear prediction models, that is first, exponential smoothing and second, autoregressive modelling (AR).Both provide understanding of the problem at hand and are easily implemented in WBE surveillance routines.
Simple exponential smoothing (SES) computes -for a time series with N observations y -the forecast  3 from previous observations and forecasts as where a (and of minor importance the starting value  3 5 ) are the parameters.Note that the above describes a one-step-ahead forecast.Forecasting over several periods (e.g., 7 days ahead in a daily time step series) is here provided by iteration, i.e., the forecasted values are taken as observations for the desired number of forecast periods (Marcellino et al., 2006).
Autoregressive modelling assumes the forecast to be linear dependent on the previous observations: where p is the order of the model, written as AR(p), f are the parameters, c a constant and e the white noise.The estimation of the parameter values is straightforward by non-linear optimization, however to determine p is not direct but e.g., based on autocorrelation of the series (Box and Jenkins, 1970).Multistep prediction is accomplished as above for SES by iteration.
Prior to the application of linear prediction models, we need to check stationarity in the data.Box and Jenkins (1970) propose the following method: Stationarity regarding the variances (i.e., normality) is conveniently determined visually by means of a Q-Q plot (Ben and Yohai, 2004) -easily performed in any standard spreadsheet.As the Vienna data series (we investigate the both the daily and weekly smoothed series in the following) does not exhibit normality, we thus need to apply a Box-Cox transformation (Box and Cox, 1964): where l is to be derived from maximizing the following log-likelihood function aiming for normal distribution:  Next step is to check for stationarity in the mean of the time series, i.e., if the series exhibits trend and/or seasonality.We apply here a unit root test for stationarity, namely the augmented Dickey-Fuller (ADF) test (Fuller, 1976) in the following form: with ∆ 9 =  9 −  9.* being the differencing operator.The above equation relates to a bivariate regression problem and the null-hypothesis (i.e., the series is stationary) is given if b = 0. We apply the ADF test for the transformed time series (both daily and weekly samples) and find that b is significant for the t-test and thus both series are non-stationary.In order to stabilize the mean of a time series and therefore eliminating (reducing) trend and seasonality we apply additionally differencing (see above operator).The transformed series are backtransformed after forecasting.
While the Box-Jenkins method towards stationarity is widely applied, it has also received criticism and the usefullness of the transformation is debated (Chatfield and Prothero, 1973).
Hence, we use a post-sample approach (Makridakis and Hibon, 1997) and check for the • Forecast with the daily raw data series resulted in substantially worse performance as compared with the smoothed series and was thus not further analyzed.
• Data transformation is necessary but SES works best with differencing alone, while AR is optimal for BoxCox transformation plus differencing • Optimal parameter values are dependent on spacing of series: SES a is 0.7 for daily and 0.1 for weekly series, while AR p is 3 for daily and 5 for weekly series • AR gives better results for all situations but SES is applicable for 7-day forecasts • Both SES and AR forecasts deteriorate significantly for 14 days prediction • For 7-day forecasts both daily and weekly series give similar results (see Figure 7)

Recommendation
For quantitative assessment of SARS-CoV-2 wastewater surveillance data the following sequence of steps is recommended: First -and irrespective of the final aim -data preprocessing is necessary in the following order: a) normalization of the signal -including the choice of an appropriate water quality parameter as biomarker that is available as metadata, b) check for outlier values c) equal spacing of the scattered data -with preference for weekly spaced data and d) smoothing of the data series.As outlined, some smoothing methods can be applied directly to scattered data, making step c redundant.Note that smoothing (filtering) of the data series aims to capture the important patterns in the series which is not only helpful for communication but essential for subsequent data analysis.
Regression analysis aims to understand the relation of the wastewater signal with specific indicators used in pandemic management, e.g., incidence.All data series used in the statistical analysis need -prior -to be equally spaced and filtered.Next, the cross-correlation of the wastewater signal with the indicator values allows to estimate the time-lag between the data sets, which is to be taken into account.For actual regression it is recommended to apply linear regression as long as performance is sufficient in terms of the metrics AIC or coefficient of determination.Subsequent cross-validation builds confidence in the model.Last, it is important to determine the amount of additional information (additional series of independent variables) needed to explain the phenomena.Parsimony is key and the significance of each series added is to be verified (e.g., by Student's t test).
As for regression, forecasting requires an equally spaced and filtered time series of the wastewater-based signal.The use of an autoregressive model is suggested but -if a dedicated statistical software is not available -also exponential smoothing works as method.Data transformation in terms of differencing is required for both methods but AR models additional have to apply a BoxCox transformation.And last, do not forecast for more than 7 days with the simple univariate models discussed in this paper.
In the present study, we systematically investigated data modelling in the framework of SARS-CoV-2 wastewater-based epidemiology.Key emphasis was put on suitable recipes and methods necessary to process the signal for subsequent display in dashboards as well as correlating to pandemic indicators and for prediction.The analysis was made for the case study of an 18month data series from Vienna, Austria, providing the WBE signal itself as well as wastewater metadata and pandemic indicator values.Key findings are as follows: • Careful data preprocessing is essential.This implies not only normalization (by means of biomarkers) but also to provide for equal spacing of the scattered data and appropriate smoothing methods.The latter proofed to be essential for regression and forecast.
• Equal spacing of the scattered data is key but there is little to gain from detailed upsampling to e.g., daily values.Simple downsampling towards a weekly series gives nearly similar result and is substantially easier to handle.
• Correlation of the WBE signal with the pandemic indicator "incidence" required multivariate regression when using the whole time series.Albeit regression by polynomial is superior, simple linear regression is likely the better option in terms of understanding and interpretation.
• Univariate signal forecasting works well for short-term (7 day) prediction but deteriorates fast for longer time spans.Both AR and SES models can be used but AR revealed consistently better results.
• Last, standard WBE data modelling hardly ever requires the use of complex and specialized software -and most can be done in spreadsheets.As usual, it is more important to understand the underlying concepts and reasoning than applying sophisticated algorithms.

Figure 2 :
Figure 2: Left: Downsampling (block-wise averaging) and Interpolation(Shepard Filter)  to derive an equally spaced weekly time series for the Vienna case study.Right: Upsampling by means of Interpolation(Shepard Filter)  towards a daily series.

Figure 3 :
Figure 3: Smoothing of the Vienna time series.Comparison of SMA (weekly) and LOESS (daily) with measurements (Sample).

Figure 4 :
Figure 4: Weekly spaced -smoothed dataset (Type 3) for regression analysis of the Vienna case study.

Figure 5 :
Figure 5: Regression analysis of the Vienna case study -estimating incidence (y) with multivariate linear regression (ye).

I
With l being determined by grid search optimization as 0.117 (for both the daily and weekly series), the transformed data series are (close to) linear in the Q-Q plot (Figure6), thus demonstrating normality.

Figure 6 :
Figure 6: Left: Box-Cox transformed Lvirus data series (daily values) and Right: Q-Q plot of the transformed series.
optimal procedure in terms of methods (SES versus AR), data transformation (none, differencing, Box-Cox plus differencing), parameter a (SES) and order p (AR) respectively, daily or weekly timeseries and length of the forecast (7 versus 14 days).For the investigation we go through the total time series (N entries) and consecutively compute the difference between the forecast and the observation for any point i [pmax -(N-fp)] where fp = forecast periods and pmax = maximum order of the AR model (here 10).The term post-sample relates to the forecasts, where the observations (samples) are only used for determining the error but not for training the forecast model itself.As metric we use AIC (Akaike information criterion -see Appendix A) but the display of results (see detailed data in Appendix B) is done for RSME instead -as the actual values are comparable.The result of the post-sample analysis can be summarized as follows:

Figure 7 :
Figure 7: Post-sample analysis of autoregressive models.The consecutive 7-day forecasts are plotted against the observation for both the daily and weekly spaced series.

Table 1 :
Key parameters of the 2 smoothing methods for the Vienna case study

Table 3 :
Polynomial and linear regression applied to 3 different datasets (Type 1-3) of the Vienna case study