Prediction of hospitalisations based on wastewater-based SARS-CoV-2 epidemiology

Wastewater-based epidemiology has been widely applied in Austria since April 2020 to monitor the SARS-CoV-2 pandemic. With a steadily increasing number of monitored wastewater facilities, 123 plants covering roughly 70 % of the 9 million population were monitored as of August 2022. In this study, the SARS-CoV-2 viral concentrations in raw sewage were analysed to infer short-term hospitalisation occupancy. The temporal lead of wastewater-based epidemiological time series over hospitalisation occupancy levels facilitates the construction of forecast models. Data pre-processing techniques are presented, including the approach of comparing multiple decentralised wastewater signals with aggregated and centralised clinical data. Time-lead quantification was performed using cross-correlation analysis and coefficient of determination optimisation approaches. Multivariate regression models were successfully applied to infer hospitalisation bed occupancy. The results show a predictive potential of viral loads in sewage towards Covid-19 hospitalisation occupancy, with an average lead time towards ICU and non-ICU bed occupancy of 14.8–17.7 days and 8.6–11.6 days, respectively. The presented procedure provides access to the trend and tipping point behaviour of pandemic dynamics and allows the prediction of short-term demand for public health services. The results showed an increase in forecast accuracy with an increase in the number of monitored wastewater treatment plants. Trained models are sensitive to changing variant types and require recalibration of model parameters, likely because of immunity acquired by vaccination and/or infection. The utilised approach is a practical and rapidly implementable application of wastewater-based epidemiology to infer hospitalisation occupancy.


Introduction
The worldwide pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) emerged in late 2019 in Wuhan, China (Zhu et al., 2021; Wölfel et al., 2020). The disease is highly contagious, often causes fever and respiratory difficulties, and may lead to long-lasting effects (Tene et al., 2022). Coronavirus disease 2019 (Covid-19) is primarily transmitted through airborne droplets in proximity to infected individuals (Mostaghimi et al., 2022). According to the World Health Organization, as of February 2023, over 662 million confirmed cases and over 6 million deaths had been reported (WHO, Dashboard, Online).
Preventing a steady increase in Covid-19-related hospitalisations and an expanding demand for public health infrastructure required the introduction of SARS-CoV-2 control measures comprising non-pharmaceutical interventions such as mobility restrictions and public health and social interventions. A globally widespread decline in adherence to non-pharmaceutical interventions has been observed as a consequence of pandemic fatigue (Petherick et al., 2021). Pandemic mitigation strategies based on effective vaccines allow progressive reduction of interventions (Oliu-Barton et al., 2021).
Strategies to monitor the ongoing health crisis rely on epidemiological surveillance, with the aim of revealing cases and clusters before the further escalation of infections overwhelms the public health care system. The main pillar worldwide for tracking cases and monitoring pandemic development is clinical testing. However, clinical testing suffers from limitations such as reporting bias, testing fatigue, and difficulty in detecting asymptomatic virus carriers (Thompson et al., 2020; Daughton, 2020). As a second pillar, wastewater-based surveillance provides a non-invasive tool for tracking pandemic development. Some of the first major contributions in the field were made by Medema et al. (2020) and Ahmed et al. (2020), among others (La Rosa et al., 2020; Lodder and de Roda Husman, 2020), providing a proof-of-concept by using established methods of wastewater-based epidemiology (WBE) in the context of SARS-CoV-2 detection. Since then, the international research community has steadily gained interest in new applications of WBE (Lundy et al., 2021).
The detection of coronavirus in wastewater is due to the presence of viruses in the feces and sputum of infected individuals (Cevik et al., 2021; Wyllie et al., 2020; Kashi et al., 2020). As the viral load is introduced into the sewer, the ribonucleic acid of SARS-CoV-2 is transported in conduit pipes and subsequently detected in the inlet sewage of wastewater treatment plants (WWTP) (Crank et al., 2022). The shedding dynamics of infected individuals vary widely. A meta-analysis by Cevik et al. (2021), comprising 79 studies observing SARS-CoV-2 infections, inferred a mean duration of 17 days (±0.8, CI 95 %) of viral respiratory and fecal shedding. The viral load is shed before symptom onset (He et al., 2020). However, the literature does not provide a consensus regarding the peak timing of viral load shedding, with estimates ranging from days before symptom onset to weeks after (Cevik et al., 2021).
WBE provides several benefits as a surveillance tool. Firstly, WBE provides a cost-effective SARS-CoV-2 surveillance framework in a population. The measurement approach is analogous to polymerase chain reaction analysis in clinical swab tests but is applied to a community-wide wastewater composite sample. Secondly, WBE can confirm viral infections even if no evidence of infection is obtained by clinical testing (Kitajima et al., 2020). Furthermore, wastewater surveillance provides an estimate of the degree of viral circulation independent of clinical diagnosis. Newly emerging variants of concern and their dominance over existing variants can be detected and quantified (Wurtzer et al., 2021; Amman et al., 2022). The limiting factors are the lack of standardised procedures for viral quantification and for trend representation (Li et al., 2021; Hart and Halden, 2020). Despite these limitations, WBE is successfully applied as a complementary surveillance strategy.
One of the main driving forces of WBE is the potential to detect case prevalence before clinical testing. This temporal lead in our reference system may be caused by the occurrence of a viral load in excrement before symptom onset in infected individuals (Peccia et al., 2020). A second factor that explains the temporal lead is the weakness of individual testing in local communities (Greenwald et al., 2021). A well-established and rapid infrastructure for WBE quantification and a result-reporting framework is critical for exploiting the time lead over clinical data (Bibby et al., 2021). The average time frame of data accessibility for this study was between three and four days after measurement. Aberi et al. (2021) observed a time lead of 2–7 days of viral load in wastewater over the incidence signals, depending on the time period and site. Vaughan et al. (2023) identified challenges with machine learning approaches for time-series forecasting using wastewater-based epidemiological data. The factors influencing forecast quality are complex and interrelated. Limiting factors for WBE forecast models are low sampling frequency and low (or zero) values for the target concentrations (Vaughan et al., 2023).
This study aimed to present data processing techniques for WBE SARS-CoV-2 measurements to infer short-term hospitalisation occupancy. As proposed by Galani et al. (2022), hospitalised clinical cases can be predicted from SARS-CoV-2 wastewater surveillance data by regression analysis. Similar approaches are employed herein, where the novelty lies in predicting hospitalisation occupancy at a federal state level by combining multiple time series from different WWTPs. This includes data preparation to resolve the mismatch between a multitude of viral load time series and aggregated hospitalisation time series within each region of Austria. In this work, multivariate regression models were applied together with regularisation methods, outlining the qualitative resemblance of the wastewater signal with hospitalised cases. Regularised regression penalises model coefficients to prevent overfitting. This allows the use of multiple regressor time series and leads to better prediction results with unseen validation data. In addition to the lead-time quantification analysis and the construction of the forecast models, the forecast accuracy as a function of the abundance of wastewater data was investigated.

Materials and methods
In this section, the quantity and type of data used in the investigation are presented, as well as the methods and analysis procedures applied. Data pre-processing techniques are adopted from Rauch et al. (2022) and explained in the following subsections. As this study builds on a nationwide Austrian WBE endeavour, approximately 70 % of Austria's population is monitored. The abundance of epidemiological data from wastewater expresses the quantity of viral genetic material circulating at local catchment sites. The investigation of this work starts with a data evaluation of the viral concentrations obtained from the laboratories. Therefore, the scientific focus is not on the methodical detection procedures and sampling details but on subsequent data treatment for hospitalisation forecasting purposes.
The underlying sampling and quantification procedures are outlined in Markt et al. (2021, 2022) and Daleiden et al. (2022).

Austrian WBE surveillance data
In Austria, wastewater surveillance in the context of the Covid-19 pandemic started in April 2020 and has been extended to over 120 WWTPs, covering approximately 70 % of the population. Fig. 1 shows the map of WBE monitoring locations in Austria, as well as a histogram outlining the size of the WWTPs by means of their design capacity. The plant in Vienna has the highest design capacity in Austria and is able to serve 4 million people. The median design capacity of the plants used in this work is 44,000 population equivalents. The monitored plants were selected by a governmental panel based on their importance for national surveillance.
Composite samples are collected with a measurement frequency of 2 to 3 times per week. Samples were concentrated and purified using polyethylene glycol precipitation or a direct capture method (Promega, Madison, WI, USA). Reverse transcription real-time polymerase chain reaction assays were used to determine viral concentrations in the processed wastewater samples (Daleiden et al., 2022). Furthermore, complementary pollutant concentrations from wastewater surveillance are used as population size markers (chemical oxygen demand (COD), ammonium NH4-N, and total nitrogen N) (Arabzadeh et al., 2021). The influent volume flow rate Q is required for outlier detection. Wastewater facilities in all nine federal states of Austria are represented. Vienna is an exception because a single large-scale WWTP serves its population of 1.9 million inhabitants.
The abundance of data processed in this study is the result of several monitoring programs with contributions and funding from different entities. In this work a total of 15,656 data points are utilised, gathered in the period between April 2020 and August 2022. The following list provides an overview of the participating programs and the corresponding key data:

Data processing
The quantity of SARS-CoV-2 genetic material in wastewater is subject to a variety of uncertain processes, such as viral shedding dynamics of infected individuals, in-sewer degradation, sampling biases, and laboratory quantification. To compensate for inherent measurement noise, data preprocessing is required. The laboratory results of the real-time polymerase chain reaction analysis provided the SARS-CoV-2 concentration in genome copies per millilitre of wastewater. Extreme discharge conditions following periods of high precipitation in combined sewer systems result in outliers and dilution errors. Rauch et al. (2021) suggest a simple and practical approach: measurements are regarded as outliers and excluded if Q exceeds the 90th percentile of the long-term recorded inflow data. This approach requires at least one year of data recording for a sound estimation of quantiles and is used in this study.
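The flow-based exclusion rule can be illustrated with a minimal sketch. It is not the study's implementation: the helper names, the linear-interpolation percentile and all numbers are assumptions chosen for illustration.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def drop_high_flow_samples(samples, flow_history, q=90.0):
    """Keep only samples whose inflow Q does not exceed the q-th percentile
    of the long-term inflow record."""
    threshold = percentile(flow_history, q)
    return [s for s in samples if s["Q"] <= threshold]


# Hypothetical long-term inflow record and three wastewater samples.
flow_history = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 400]
samples = [
    {"day": 1, "Q": 120, "c_v": 35.0},
    {"day": 2, "Q": 395, "c_v": 4.0},   # storm event, diluted sample
    {"day": 3, "Q": 150, "c_v": 41.0},
]
kept = drop_high_flow_samples(samples, flow_history)
print([s["day"] for s in kept])
```

In practice the inflow record would span at least one year, as the text requires for a sound quantile estimate.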
Population-size marker normalisation is applied to compensate for population fluctuations within a catchment area by referencing the SARS-CoV-2 virus titer c_v to a measured surrogate concentration c_m (Arabzadeh et al., 2021). The population-normalised load L_v is computed as:
$$L_v = \frac{c_v}{c_m} \, f_m$$
where f_m is the standard per-capita load of the respective marker. A variety of population size markers are applicable to WBE (Picó and Barceló, 2021). The population-size marker compounds investigated in this work are NH4-N, COD, and N_tot. Standard loads of f_NH4-N = 8, f_COD = 120 and f_Ntot = 12 with the unit g/(Pe d) (grams per person equivalent per day) were used for population referencing. As recommended by Arabzadeh et al. (2021), NH4-N is the most reliable parameter for SARS-CoV-2 signal normalisation and is prioritised over COD and N_tot. Only if there are insufficient data, because a particular WWTP does not measure or report the compound, is the population-size marker normalisation switched to N_tot and COD. Similar to the viral load itself, the population size marker also shows a degree of irregularity. Occasionally, surrogate compounds are scarcely registered or show an implausible quantity. These irregularities are compensated for using the following two procedures. First, a 10-percentile filter was applied, cutting the highest and lowest 5 % of population-size marker values. As for viral concentrations, this procedure is only recommended for long time series with sufficient data. Second, the population-size marker outlier values were replaced with the median of the respective population-size marker concentration. This pragmatic approach is necessary given the large amount of epidemiological data processed in this study, which does not allow manual correction.
In contrast to the high spatial measurement resolution provided by the 123 WWTPs, severe Covid-19 cases were hospitalised in central public health facilities.
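The marker cleaning and normalisation steps can be sketched as follows. The sketch assumes that the load is formed as the virus-to-marker concentration ratio multiplied by the standard per-capita marker load, that the marker concentration is reported in mg/L, and that a factor of 1e6 reconciles the units; function names and sample values are hypothetical.

```python
from statistics import median


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def clean_marker(values, lower_q=5.0, upper_q=95.0):
    """Replace marker values outside the 5th-95th percentile band
    with the median of the marker series."""
    lo, hi = percentile(values, lower_q), percentile(values, upper_q)
    med = median(values)
    return [v if lo <= v <= hi else med for v in values]


def normalised_load(c_v, c_m, f_m):
    """Population-normalised viral load in gene copies per person and day.

    c_v: virus concentration in gene copies per mL
    c_m: marker concentration in mg/L (assumed unit)
    f_m: standard marker load in g per person equivalent and day
    The factor 1e6 converts mL to L and g to mg.
    """
    return c_v / c_m * f_m * 1e6


# Hypothetical NH4-N series (mg/L) with one high and one low outlier.
nh4 = [38, 40, 42, 41, 39, 400, 40, 1, 43, 40, 41]
cleaned = clean_marker(nh4)
print(cleaned)
print(normalised_load(c_v=50.0, c_m=40.0, f_m=8.0))
```

If NH4-N is unavailable for a plant, the same function can be called with the N_tot or COD concentration and the corresponding standard load.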
The time series provided by the Austrian Federal Ministry of Social Affairs, Health, Care, and Consumer Protection are the total hospitalisation occupancy within each federal state. This creates a comparability problem, with aggregated time series of hospitalised cases at the federal state level on the one hand and locally distributed wastewater signals on the other. To allow a comparison with the aggregated hospitalisation time series, all wastewater measurements within a federal state must be aggregated and averaged. Therefore, the daily weighted averages of viral load levels in a federal state were computed. The weights correspond to the design capacity of the respective WWTPs, prioritising large plants over smaller ones. The design capacity of each WWTP is an available parameter that represents the size of the catchment area. In principle, the preferred weighting factor is the catchment population. However, this information is unavailable. This regional pooling procedure provides a solution to the comparability problem between the aggregated hospitalisation occupancy in a federal state and the multiple monitored WWTPs within it.
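A minimal sketch of the regional pooling step is given below. It assumes that plants without a measurement on a given day are simply left out of that day's average; plant names, capacities and loads are invented for illustration.

```python
def regional_load(loads, capacities):
    """Capacity-weighted mean of the plant loads available on one day.

    loads:      dict plant -> normalised load, or None if not sampled that day
    capacities: dict plant -> design capacity (population equivalents)
    Returns None when no plant in the region reported a value.
    """
    pairs = [(loads[p], capacities[p]) for p in loads if loads[p] is not None]
    if not pairs:
        return None
    total_weight = sum(w for _, w in pairs)
    return sum(v * w for v, w in pairs) / total_weight


# Hypothetical federal state with three plants; plant C was not sampled.
capacities = {"A": 200_000, "B": 100_000, "C": 20_000}
loads_today = {"A": 3.0e6, "B": 6.0e6, "C": None}
print(regional_load(loads_today, capacities))
```

Because the weights are renormalised per day, the pooled value stays comparable across days with different sets of sampled plants.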
The regression models trained in this study require equally spaced samples in the time series. L_v,weight is a scattered time series resulting from WBE measurements that are not gapless on a daily basis. To obtain equally spaced data, both upsampling (i.e. deriving daily values) and downsampling (i.e. deriving weekly values) techniques can be used (Rauch et al., 2022). In this study, upsampling to daily values is performed by linearly interpolating gaps before applying data smoothing.
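The daily upsampling can be sketched as below, assuming that sampling days are given as sorted, strictly increasing integer day indices. The function name and values are illustrative only.

```python
def upsample_daily(days, values):
    """Linearly interpolate a scattered series onto every integer day
    between the first and the last sampling day.

    days:   sorted, strictly increasing integer day indices
    values: measurements taken on those days
    Returns a dict day -> interpolated value.
    """
    daily = {}
    for i in range(len(days) - 1):
        d0, d1 = days[i], days[i + 1]
        v0, v1 = values[i], values[i + 1]
        for d in range(d0, d1):
            daily[d] = v0 + (v1 - v0) * (d - d0) / (d1 - d0)
    daily[days[-1]] = values[-1]
    return daily


# Hypothetical pooled loads measured on days 0, 2 and 5.
daily = upsample_daily([0, 2, 5], [10.0, 14.0, 8.0])
print([daily[d] for d in range(6)])
```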
After outlier detection, population-size marker normalisation, and regional pooling, the time series was still characterised by a substantial amount of noise. Therefore, data filtering techniques are applied to reduce the signal noise and to recover the underlying information of the signal. According to Arabzadeh et al. (2021), the Friedman-Super-Smoother and Spline techniques are best suited for WBE measurements of SARS-CoV-2. Spline smoothing is a common and easily applied curve-fitting tool that can be manually tuned to optimally fit the data. Friedman-Super-Smoother is a non-parametric curve-fitting estimator that is best operated on time series with substantial length. For shorter time series (<50 data points), Spline provides more flexibility towards high-variance data, where the Friedman-Super-Smoother risks overfitting the data. The benefit of data smoothing is the reduction of signal noise. In this special case, the ability to influence volatility characteristics is suitable for forecasting hospitalisation occupancy. A potential disadvantage of data smoothing is the loss of information, which may cause significant data points to be undervalued. As an example, Fig. 2 shows the time-line of the daily weighted average viral loads in the federal state of Vorarlberg, as well as the signal smoothed with Friedman-Super-Smoother, combining the information of the six WWTPs. The other federal states show similar dynamics.

Lead time analysis
In this work, the lead time of the SARS-CoV-2 wastewater signal over hospitalisation occupancy is exploited to infer short-term demand for public health services. Hospitalisation occupancy refers to the number of cases in intensive care units (ICU) and non-ICUs at a given point in time. Lead time quantification was performed on historical data on three different clinical time-lines: occupancy of ICU, non-ICU, and the sum of both, the hospitalisations. Clinical data are available on a daily basis and spatially aggregated in the nine federal states of Austria (Bundesministerium für Soziales, 2022). Algorithmically, the lead time τ at which the signals are best aligned is determined by the argument of the maximum, as in
$$\tau = \underset{t}{\arg\max} \left( f_t(L_{v,weight}) \odot C_o \right)$$
where C_o denotes the clinical occupancy time series, f_t denotes the shift operator displacing the time series and ⊙ denotes the operation of choice, e.g. the cross-correlation function (CCF). The lead time of the wastewater signal relative to the ICU, non-ICU, and total hospitalisation occupancy was determined by two methods. First, the CCF-analysis determines the optimal lead time based on Pearson's correlation coefficient. Second, the coefficient of determination R² is optimised in an OLS framework to determine the optimal lead time τ. A comparison between these two metrics indicated the optimal alignment of the signals. Two separate analyses with respect to the prevailing Covid-19 variant type were performed: first, from June to December 2021, when the delta variant was dominant, and second, from January to July 2022, when the omicron variant was dominant (Amman et al., 2022). The lead time of WBE measurements allows the construction of forecast models. When comparing epidemiological hospitalisation time series over the course of the SARS-CoV-2 pandemic, it can be seen that the relationship between viral load levels in wastewater and hospitalisation occupancy shifts over time.
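The CCF-based lead time search described above can be sketched as follows. This is a simplified illustration rather than the study's code: it scans only non-negative integer shifts, uses synthetic bell-shaped signals, and all function names are hypothetical.

```python
from math import exp, sqrt


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def best_lead_time(wastewater, clinical, max_lag):
    """Return (tau, r): the shift in days that maximises Pearson's r
    between wastewater[t] and clinical[t + tau]."""
    best = None
    for tau in range(max_lag + 1):
        x = wastewater[: len(wastewater) - tau]
        y = clinical[tau:]
        r = pearson(x, y)
        if best is None or r > best[1]:
            best = (tau, r)
    return best


# Synthetic example: the clinical curve follows the wastewater curve by 5 days.
wastewater = [exp(-((t - 20) / 6.0) ** 2) for t in range(60)]
clinical = [3.0 * exp(-((t - 25) / 6.0) ** 2) for t in range(60)]
tau, r = best_lead_time(wastewater, clinical, max_lag=21)
print(tau, round(r, 3))
```

The R² variant would replace `pearson` with the coefficient of determination of a regression fitted at each candidate shift.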
Hospitalisation rates are trending downward because of effective vaccination and immunity acquired through infection. Viral load levels in wastewater are affected by in-sewer temperature and variant types, which change the shedding behaviour of infected people unpredictably (Hart and Halden, 2020). Although a qualitative resemblance between hospitalisation occupancy and wastewater viral load is evident, a systematic deviation over time can be recognised. In this study, hospitalisation occupancy was predicted using multivariate regression models, aiming to optimally capture the transient process. Multivariate regression models were trained on the following predictor variables: WBE viral load signal, vaccinations, Covid-19 deaths, positivity rate, and public health and social intervention stringency index (Bundesministerium für Soziales, 2022; Hale et al., 2021). Vaccination data, Covid-19 deaths and the positivity rate are, like hospitalisation occupancy, available aggregated at the federal state level. The vaccination time series was used in the cumulative form of total vaccinations up to a particular point in time. The public health and social intervention stringency index uses data from the Oxford Covid-19 Government Response Tracker for the comparison of government policies at a national level. Incidence data were not utilised in the regression models because they depend on local test strategies and are spatially and temporally inconsistent.
Significance tests were performed in an ordinary least squares (OLS) framework to evaluate the p-values of the predictor variables in reference to the hospitalisation time series. Varying significance levels were detected, depending on the combination and number of predictor variables used. The majority of models detected a significant relationship (p < 0.05) between the response variable and the predictors WBE viral load signal, Covid-19 deaths and positivity rate. Non-significant predictor variables indicate a dissimilarity with the response variable but improve the regression models by compensating for the systematic deviation between the viral load signal in wastewater and hospitalisation occupancy.
The forecast prediction is constructed by taking advantage of the time lead of WBE data over the hospitalisation time series. By shifting the viral load signal by exactly this time, the differential results can be extended into the forecast period. This technique deviates from traditional forecasting methods, where predictions are based on past and present data without explicitly transposing the data in time. Therefore, an extensive effort of this study is to quantify the lead time, as explained in the previous section. Regression analysis estimates the relationship between the predictor variables encompassed in the design matrix X and the response vector Y. The regression techniques applied in this study were OLS and support vector regression (SVR). OLS determines the regression parameters by minimising the sum of the squared residuals, where the residuals are the difference between the estimated linear function and the observed response variable (Holland and Welsch, 1977). Linear regression can be formalised using
$$Y = X\beta + \varepsilon$$
where β is the coefficient vector of the model and ε is the error term. The least-squares coefficients can be determined by
$$\hat{\beta} = (X^{\top}X)^{-1}X^{\top}Y$$
SVR, on the other hand, minimises the L2-norm of the coefficients, where the error term ε constrains the model within a margin (Rivas-Perea et al., 2013). The objective function of SVR is given by:
$$\min_{\beta} \; \tfrac{1}{2}\lVert\beta\rVert^{2} \quad \text{subject to} \quad \lvert y_i - x_i^{\top}\beta \rvert \le \varepsilon \;\; \text{for all } i$$
where y_i and x_i denote the i-th entry of Y and the i-th row of X. To guard against overfitting, two regularisation methods, ridge and lasso, were implemented. These regularisation methods penalise regression coefficients to avoid overfitting and simplify the model (Boser et al., 1992; Tibshirani, 1996). Ridge regression penalises all coefficients using the factor λ, which is determined via cross-validation. Lasso regularisation penalises the absolute values of the coefficients and thereby shrinks them. These regression and regularisation methods provide a total of six regression models, comprising the two models OLS and SVR without regularisation and an additional four from the combination with ridge (ri) and lasso (la) regularisation. Five-fold cross-validation was employed to ensure the credibility of the trained models and to test the fitness and resilience of the data partitioning. Additionally, cross-validation determines the degree of penalisation in both regularisation methods.
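For orientation, the two penalties can be written in their standard textbook form for a least-squares loss. This is a generic formulation, not necessarily the exact objective used in the study; the penalty weight λ ≥ 0 is assumed to be the quantity chosen by cross-validation.

```latex
% Ridge: squared L2 penalty, shrinks all coefficients towards zero
\hat{\beta}_{\mathrm{ridge}} = \underset{\beta}{\arg\min}\;
    \lVert Y - X\beta \rVert_2^{2} + \lambda \lVert \beta \rVert_2^{2}

% Lasso: L1 penalty, can set individual coefficients exactly to zero
\hat{\beta}_{\mathrm{lasso}} = \underset{\beta}{\arg\min}\;
    \lVert Y - X\beta \rVert_2^{2} + \lambda \lVert \beta \rVert_1
```

With λ = 0 both expressions reduce to the ordinary least-squares problem, which is why the unregularised models serve as a natural baseline.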
To evaluate the most appropriate forecast model, the validation procedure assessed the prediction accuracy in the forecast timeframe. The rolling origin procedure was employed to increase the robustness of the model selection (Tashman, 2000). With this technique, the origin of the regression timeframe remained constant, whereas the length of the training period increased progressively. Fig. 3 outlines the rolling origin forecast procedure performed in this study.
For each model, 30 rolling progression steps were computed, starting at a fixed origin and with a training period of at least 90 time-steps. This ensures a good representation of the model performance over a range of different forecast circumstances, comprising an increase in hospitalisation cases, tipping point, and levelling of cases. The rolling origin procedure is particularly important for estimating the general performance of models. The model results varied depending on the state of the pandemic. Determining the forecast errors with rolling origin cross-validation ensures that a variety of pandemic circumstances with varying forecast difficulties is covered. In addition to the absolute and relative forecast errors, multiple performance indicators are used to evaluate model fitness, such as the root mean square error (RMSE), the coefficient of determination R² and its multivariate adjustment R²_adj, as well as the Akaike and Bayesian information criteria (AIC & BIC). The formulas for the metrics are presented in Appendix A.
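The evaluation scheme can be sketched as follows. The sketch assumes a step size of one day between consecutive training windows and a 14-day forecast horizon, neither of which is specified here; the function names are hypothetical.

```python
from math import sqrt


def rolling_origin_splits(n_obs, min_train=90, n_steps=30, horizon=14):
    """Yield (train, test) index ranges with a fixed start at index 0 and a
    training window that grows by one time-step per progression step."""
    for step in range(n_steps):
        train_end = min_train + step
        test_end = train_end + horizon
        if test_end > n_obs:
            break
        yield range(0, train_end), range(train_end, test_end)


def rmse(observed, predicted):
    """Root mean square error of two equally long sequences."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def relative_error(observed, predicted):
    """Relative forecast error; undefined (None) for zero occupancy."""
    if observed == 0:
        return None
    return abs(predicted - observed) / observed


splits = list(rolling_origin_splits(n_obs=150))
print(len(splits), len(splits[0][0]), len(splits[-1][0]))
print(relative_error(20, 23), relative_error(0, 2))
```

Returning `None` for zero occupancy makes explicit why relative errors become unusable when occupancy levels are very low, in which case absolute errors are the more informative measure.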

Results and discussion
In this study, hospitalisation occupancy was predicted using multivariate regression models based on epidemiological data. Viral load levels in wastewater and hospitalisation occupancy data are closely related to Covid-19 prevalence and are highly correlated.

Lead time results
The lead time τ between the viral load in the wastewater and the hospitalised SARS-CoV-2 cases was determined using the methods discussed in Section 2.3. Fig. 4 shows the optimal lead time τ between the wastewater signal and the three clinical time series, ICU, non-ICU, and the sum of both (Hosp). For each federal state, the CCF-analysis and the R² optimisation routine were applied. The Austrian mean is displayed at the bottom right of the table. The CCF-analysis tends to compute lower lead-time estimates than the R² optimisation routine. A high variance in lead times among federal states can be observed. This behaviour may be caused by differences in the case reporting procedure and size differences of the regions. Likewise, it is possible that patients are transferred between the ICU and non-ICU, distorting the lead-time quantification. Regardless of these limitations, the quality of the constructed forecasts was not compromised because the forecasting period was not disproportionately long.
The results in Fig. 4 show that the average lead time of ICU occupancy is higher than that of non-ICU occupancy. As the sum of both signals, Hosp positions itself between the other two, as expected. This result can be explained by the mean difference in length of stay (LOS) between ICU and non-ICU patients in the hospital. Chiam et al. (2021) derive a mean LOS of 12.3 days for ICU patients and 5.7 days for non-ICU SARS-CoV-2 patients. Similar results were obtained by Vekaria et al. (2021) with a mean LOS of 12.9 days for ICU patients and 8.5 days for non-ICU patients. In general, the higher the LOS, the larger the lead time between the WBE signal and hospitalisation. From a conceptual perspective, predicting hospitalisation admission rather than occupancy is preferred; however, admission data are not publicly available to train the models. Nonetheless, an estimate for hospitalisation admission from occupancy can be formalised as
$$C_a = f_t(C_o)$$
where C_a and C_o denote the clinical admission and occupancy time series and f_t is the shift operator, transposing the argument in time by t = -LOS/2. This procedure is a simple estimate that disregards the statistical distribution of LOS for patients, as well as dynamic effects. Assuming that the occupancy signal lags the admission signal by half of the respective LOS, the mean lead time of the SARS-CoV-2 wastewater signal over clinical admissions in the delta variant time frame was 11.4 days for ICU and 8.1 days for non-ICU cases. For the omicron variant time frame, a mean lead time of 8.5 days for ICU and 5.1 days for non-ICU admission was observed. This result resembles the estimates of Galani et al. (2022), who predicted a time lead of 8 days towards hospital admissions and 9 days towards ICU admissions. It is important to note that although a clear time delay can be detected in a retrospective analysis, the concurrent lead time over clinical data is controversial in the literature (Bibby et al., 2021; Olesen et al., 2021).
A rapid and well-established WBE sampling and analysis infrastructure is key to exploiting the temporal lead. By evaluating the time differentials between the delta and omicron variant time frames, it can be seen that, on average, a smaller lead time occurs between the signals in the omicron time frame. Based on the Austrian mean, the lead time declined by 2.9 days for ICU occupancy and by 3.0 days for non-ICU occupancy. The unavailable entries in Fig. 4 are caused by a lack of significant correlations between the two time series of ICU occupancy and viral load levels in the omicron variant time frame. This is likely caused by the low number of ICU cases in the omicron time frame in combination with the integer-valued response variable. The relatively flat and discrete ICU curve does not resemble the features of the viral load signal, so that in some cases no significant resemblance between the signals is found.

Forecast results
The goodness-of-fit and performance indicators are presented in this section. Fig. 5 displays boxplots of the relative forecast error at the end of the forecast period between June and December 2021 (delta variant). Each boxplot comprises 270 values from nine federal states with 30 different forecast periods resulting from the rolling-origin procedure. It can be seen that the mean relative error of the majority of the models is between 0.15 and 0.25. This result demonstrates a consistent forecast of hospitalisation, ICU, and non-ICU occupancy from WBE regression models with a mean forecast error between 15 % and 25 % at a forecast horizon of approximately two weeks. SVR outperforms OLS regression by a small margin. No significant increase in model performance can be detected with ridge or lasso regularisation, indicating that overfitting is not a concern for the trained models. Fig. 5 shows relative forecast errors for the delta time frame only. The same approach for the omicron time frame produces infeasibly high relative errors, because low levels of hospitalisation occupancy lead to division by small numbers (occasionally by 0). However, in absolute numbers, the forecasts in the omicron time frame are of similar quality.
Tables 1 and 2 display the performance indicators of the trained models in the two respective periods. The spatial aggregation of the data is again performed on a regional scale, and the displayed values are the averages of the nine federal states. The AIC and BIC metrics differ in level between the three response variables, owing to the amplitude of the different time series. Therefore, these metrics can only be compared between different models that predict the same response variable. Lower values of AIC and BIC indicate a superior quality of fit. Consistently lower AIC and BIC values can be seen for ICU than for the non-ICU and hospitalisation analyses because the amplitude of the signal is an order of magnitude lower.
The overall lowest AIC and BIC values were attained by SVR. Regularisation has either no significant impact or a slightly negative impact on model performance. The coefficients of determination show similar behaviour. Discrete time series with a small amplitude, such as ICU occupancy, are prone to cause a higher relative model error. This is caused by the discrete characteristic and coarseness of the response variable on the one hand and the languid nature of the smoothed viral load signal on the other. This problem of granularity is more pronounced for small federal states, where periods of ICU occupancy are characterised by levels ranging between 0 and 5.
By comparing the results in Tables 1 and 2 it can be seen that the Akaike and Bayesian information criteria show similar performance. A decrease in R² is visible, especially for the ICU response variable. The poor R² performance in Table 2 can be explained by the comparatively low ICU occupancy levels in the omicron timeframe, which prevent a high model fitness with real-valued predictions. Integer time series with small amplitudes are intrinsically more difficult to predict using real-valued models because of the higher relative influence of random errors in the training data. Although the relative errors are high when forecasting ICU occupancy with small amplitudes, the absolute error for practical forecasting purposes is not higher than that for signals with a high amplitude.

Fig. 6 shows two forecast examples for Vienna and Vorarlberg. The bars indicate non-ICU occupancy, where the dark colour indicates the training period and the light grey bars indicate the forecast period. The black dotted line indicates the model prediction. A close agreement between the model prediction and actual hospitalisation occupancy can be seen. Owing to the size difference between the two federal states, the non-ICU patient levels differ, and the forecast for Vienna is slightly better because of the previously mentioned coarseness of the signal in small regions such as Vorarlberg.

3.3. Wastewater data abundance and forecast accuracy
As an additional analysis of the hospitalisation forecasts, the forecast accuracy was examined as a function of the available wastewater data. With 123 actively monitored wastewater treatment plants, Austria monitors extensively (average of 15 WWTPs per federal state, excluding Vienna). Multiple monitored plants in each federal state provide considerable data to train the regression models. The trained regression models in this work utilise all available wastewater data.
In this analysis, forecasts are constructed with varying amounts of wastewater data for each federal state to determine the resilience of the applied method to less available WBE data. Fig. 7 shows the average forecast accuracy of the six trained regression models on 16 randomly selected days as a function of the utilised wastewater data. The scattered data display the relative forecast error, the grey lines outline the regression trends over the utilised wastewater data, and the dotted black line indicates the trend. Forecast accuracy tends to improve when more wastewater data are used, which supports the findings of Vaughan et al. (2023). This result justifies the inclusion of more plants in the surveillance system and suggests an increase in model fitness when more data are included.
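A subsampling experiment of this kind can be sketched as follows. This is a simplified illustration and not the procedure used in this work: it assumes that the selected plant signals are averaged into a single regressor, that an ordinary least squares model with an intercept is fitted, that the lead time is a fixed number of days, and that accuracy is measured as the relative error on the final day. The function name and all parameters are hypothetical.

```python
import numpy as np


def error_vs_plant_count(plant_signals, occupancy, lead, rng, n_draws=20):
    """Mean relative end-of-forecast error for each number of plants used.

    plant_signals: array (n_days, n_plants) of viral load signals.
    occupancy:     array (n_days,) of bed occupancy.
    lead:          assumed fixed lead time in days (>= 1).
    rng:           numpy.random.Generator used to draw plant subsets.
    """
    signals = np.asarray(plant_signals, dtype=float)
    target = np.asarray(occupancy, dtype=float)
    n_days, n_plants = signals.shape

    # Pair the wastewater signal on day t with the occupancy on day t + lead.
    x_lagged = signals[: n_days - lead]
    y_lagged = target[lead:]

    errors = {}
    for k in range(1, n_plants + 1):
        draws = []
        for _ in range(n_draws):
            cols = rng.choice(n_plants, size=k, replace=False)
            # Assumption: aggregate the chosen plants by a plain average.
            agg = x_lagged[:, cols].mean(axis=1)
            design = np.column_stack([np.ones_like(agg), agg])
            # Train only on pairs whose occupancy is known at the forecast origin.
            coef, *_ = np.linalg.lstsq(design[:-lead], y_lagged[:-lead], rcond=None)
            predicted = design[-1] @ coef
            actual = y_lagged[-1]
            if actual != 0:  # skip undefined relative errors
                draws.append(abs(predicted - actual) / actual)
        errors[k] = float(np.mean(draws)) if draws else float("nan")
    return errors
```

The returned dictionary maps the number of plants used to the mean relative error over the random draws, which can then be plotted against the amount of utilised wastewater data.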

Conclusion
In this study, we show the similarity of viral load signals obtained by WBE with hospitalisation occupancy by means of multivariate regression analysis. Hospitalisation occupancy at the federal state level was predicted based on several monitored wastewater treatment plants within one region. The aggregation of dispersed WBE measurements within a region increases the correlation quality of the signals and the performance of the regression analysis. The approach of utilising a multitude of WWTPs within a federal state to forecast hospitalisation demand at a regional level was applied successfully. Forecast predictions were constructed by exploiting the temporal lead of wastewater data over the clinical hospitalisation time series. Consistent forecasts with a mean relative error between 15 % and 25 % over a forecast time frame of approximately two weeks are feasible. Regression analysis was carried out for all nine regions of Austria, and the influence of the size of the federal states was discussed. The challenges encountered stem from changing Covid-19 variant types, which cause systematic deviations in the viral concentration in relation to hospitalisations. Newly emerging variants likely cause a change in the shedding dynamics of infected people and, therefore, influence the epidemiological time series. Low numbers in the ICU time series and a high degree of granularity impede optimal model construction during the dominance of the omicron variant.
The results support a strong qualitative resemblance between wastewater-based epidemiologically derived signals and SARS-CoV-2 hospitalisation occupancy. Support vector regression and ordinary least squares regression models were successfully trained based on viral load levels in raw sewage and epidemiological time series, such as vaccination data, Covid-19 deaths, positivity rate, and the public health and social intervention stringency index. Ridge and lasso regularisation had no benefit on the forecast performance. The models were examined with various performance indicators, such as the root mean square error, Akaike and Bayesian information criteria, and the standard and adjusted coefficient of determination. The time lead of the SARS-CoV-2 viral load signal in wastewater over the clinical hospitalisation time series facilitates the direct prediction of hospitalisation occupancy. The time lead quantification analysis suggests a feasible mean forecast horizon of 10.1 days for non-ICU occupancy and 16.2 days for ICU occupancy. These results demonstrate the short-term predictive potential of WBE SARS-CoV-2 surveillance and allow for direct application of WBE.

Data availability
Data will be made available on request.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
The information criteria AIC and BIC are tools for model selection, designed according to the principle of Occam's razor. The criteria were defined as follows:

\[ \mathrm{AIC} = 2p - 2\ln(\hat{L}), \tag{A.5} \]

\[ \mathrm{BIC} = \ln(n)\,p - 2\ln(\hat{L}), \tag{A.6} \]

where $\ln(\hat{L})$ denotes the maximum logarithmic likelihood function under the assumption of normally distributed errors.
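As a minimal sketch, the two criteria can be evaluated from the residuals of a fitted model. The sketch assumes that p is the number of estimated model parameters, that n is the number of observations, and that the error variance is replaced by its maximum likelihood estimate RSS/n, which gives ln(L) = -(n/2)(ln(2πσ²) + 1). These are the standard conventions but are not stated explicitly here, and the function names are illustrative.

```python
import numpy as np


def gaussian_log_likelihood(residuals):
    """Maximised log-likelihood for i.i.d. normal errors.

    Uses the maximum likelihood variance estimate sigma^2 = RSS / n.
    """
    r = np.asarray(residuals, dtype=float)
    n = r.size
    sigma2 = np.sum(r ** 2) / n
    return -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)


def aic(residuals, p):
    """Akaike information criterion: 2p - 2 ln(L)."""
    return 2.0 * p - 2.0 * gaussian_log_likelihood(residuals)


def bic(residuals, p):
    """Bayesian information criterion: ln(n) p - 2 ln(L)."""
    n = np.asarray(residuals).size
    return np.log(n) * p - 2.0 * gaussian_log_likelihood(residuals)
```

Under these assumptions, multiplying all residuals by a factor of 10 raises both criteria by 2n ln(10) regardless of model quality, which is why the values are only comparable between models that predict the same response variable.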