Forecasting hospital discharges for respiratory conditions in Costa Rica using climate and pollution data

Respiratory diseases represent one of the most significant economic burdens on healthcare systems worldwide. The variation in the increasing number of cases depends greatly on climatic seasonal effects, socioeconomic factors, and pollution. Therefore, understanding these variations and obtaining precise forecasts allows health authorities to make correct decisions regarding the allocation of limited economic and human resources. This study aims to model and forecast weekly hospitalizations due to respiratory conditions in seven regional hospitals in Costa Rica using four statistical learning techniques (Random Forest, XGboost, Facebook's Prophet forecasting model, and an ensemble method combining the above methods), along with 22 climate change indices and aerosol optical depth as an indicator of pollution. Models are trained using data from 2000 to 2018 and are evaluated using data from 2019 as testing data. Reliable predictions are obtained for each of the seven regional hospitals


Introduction
Respiratory diseases are conditions that affect the organs and tissues within the lungs and airway systems, leading to difficulty in breathing.These illnesses are currently the leading contributors to the global burden of diseases when assessed through disability-adjusted life-years.Moreover, the rising healthcare costs associated with these diseases are placing a growing strain on the economies of nations worldwide, specifically the five major conditions, known as the "big five", which are chronic obstructive pulmonary disease (COPD), asthma, acute respiratory infections, tuberculosis, and lung cancer [11].To maintain a balanced health economy, it is crucial to understand the distribution and the incidence of these diseases to optimize healthcare systems, ensure efficient resource allocation, and enhance healthcare access and quality.
Extensive research indicates that climate influences respiratory health, resulting in higher rates of hospitalization and mortality in varying seasons [see e.g., 6,24,32,39,40].Furthermore, it is well known that human activities significantly impact the short-term and long-term climate by releasing greenhouse gases and other pollutants into the atmosphere, leading to more extreme weather events in recent years.As a consequence, researchers worldwide have been studying the effects of pollution and climatological factors on human health in different regions of the world [15,19,25,26,28].
Specifically with respiratory diseases, climate factors are found to be associated with extreme weather [27,39,40], and pollution, such as particular matter (PM) [16,32].Airborne particulate matter (PM) is a complex mixture of different chemical species originating from numerous sources.These particles can be a product of combustion, suspension of soil materials, suspension of substances from the sea, and can also be formed by chemical reactions in the atmosphere [7].Furthermore, elevated concentrations of PM not only substantially heighten health risks but also contribute significantly to the challenges associated with climate change.Therefore, the monitoring of PM is essential for comprehending and mitigating air pollution, aiming to enhance public health.Nevertheless, the measurement of PM presents challenges in developing countries due to the constrained monitoring infrastructure, resulting in gaps in data analysis both temporally and spatially.
On the other hand, an aerosol is defined as a stable suspension of solid and liquid particles in gas.It is possible to obtain related measurements with the facilitation of remote sensing techniques.Aerosol optical depth (AOD), a satellitederived metric that quantifies the presence of aerosols across the entire atmospheric column, is a good option.Several studies around the world have found that the driving factors of AOD are related to urban and economic development, agricultural activities, industrialization, landscape aggregation, and regional transportation [see e.g.34,42].Moreover, AOD is found to be associated with socio-economic factors, such as GDP, industry, and vehicle density [23].
Costa Rica, a country with an area of 51,179 Km 2 , is located in Central America with an estimated population of approximately 5,003,402 inhabitants in 2018.It is a stable democratic country with a socioeconomic development model based on the opening and economic liberalization, upholding human rights and ensuring universal basic access to essential goods and services, such as education and health system while preserving extensive natural resources.[31].
Although Costa Rica is renowned for its biodiversity, with most environmental policies focusing on the protection and conservation of natural resources, unregulated land use has been observed, resulting in significant environmental impacts.This issue is not limited to the Great Metropolitan Area, the country's most urban and densely populated region, but is also prevalent in other parts of the national territory.Consequently, this form of human development can intensify vulnerability to natural disasters and extreme weather conditions [31].
Regarding to its health system, since the 1940s Costa Rica prompted its national health system and social investment by creating the Costa Rican Social Security Fund (CCSS, from its Spanish acronym Caja Costarricense de Seguro Social).The basis of the health system in Costa Rica is universal and public, which covers 95% of the Costa Rican population [1,31].The CCSS carries all health care functions in the country and divides the territory into 7 Health Regions (región sanitaria in Spanish), and at the same time, it is divided into 104 Health Areas (AS, from its Spanish acronym áreas de salud), and then, 1045 Basic Team for Comprehensive Health Care (EBAIS, from its Spanish acronym Equipos Básicos de Atención Integral de Salud).Each AS caters to a population ranging from 15,000 to 40,000 individuals in rural areas and 30,000 to 60,000 individuals in urban areas.Subsequently, these ASs are subdivided into 1,045 segments, each overseen by an EBAIS.Each segment, in turn, serves approximately 4,000 people.The health service in Costa Rica operates on three distinct levels of attention, interconnected through referral and counter-referral mechanisms.The initial level of service serves as the primary entry point to the healthcare system, providing essential services within each EBAIS at the community level.Additionally, there are two possibilities in which a person can access the health system: 1) an insured person can be attended by a private doctor who is registered by the CCSS and access reference to more specialized attention, and 2) an insured person is attended by a doctor hired by the company.The second level aims to provide support to the first level through a network of 10 main clinics, 13 peripheral hospitals, and 7 regional hospitals, through the provision of specialized outpatient services, outpatient and inpatient interventions, hospitalization, and medical-surgical treatment of medical specialties, and some others of high population demand, such as dermatology, urology and ophthalmology, all of low complexity.Finally, the third level includes outpatient and inpatient services of greater complexity and specialization that require high technology and level of specialization.It is provided through 3 national hospitals and 6 specialized hospitals; the area of influence of this level covers the territory of several provinces.
Due to this health administrative flow, the accurate diagnostics of respiratory conditions that require hospitalization are registered in hospital discharge events (second level), instead of hospital entrance (first level).A a consequence, the limitation of this data lies in the presence of a lagged and unknown effect, necessitating consideration in the analysis.Beyond the delayed impact associated with hospital admissions and discharges, there is also a temporal lag effect pertaining to climate and pollution variables.
In existing literature, most studies are linked to statistical association between climate and hospitalizations [see e.g., 6,24,32,39,40].Additionally, some of them concentrate on forecasting hospitalizations specifically within emergency departments and respiratory diseases [see e.g.33,10,36].Our main goal is to predict hospital discharges due to respiratory diseases at the secondary healthcare level of the system, in the hope of enhancing the country's healthcare economy.Specifically, we analyze and predict hospital discharges in 7 regions to enable health authorities to make accurate predictions and informed budget decisions in the near future.To the best of our knowledge, this approach is the first predictive study undertaken in the region with these distinctive characteristics.This paper is divided as follows: Section 2 describes the data and the implemented methodologies.Section 3 presents the analysis and forecasting results.Finally, section 4 discusses the prediction performance, limitations, and future work.
2 Materials and methods

Data
This section provides an overview of the primary datasets used in our study, focusing on respiratory hospitalization and input data comprising climate, pollution, and lagged hospital discharge.

Respiratory hospitalization
Weekly data on hospital discharges due to all respiratory diseases (J00-J99 using ICD10 [18]) from 2000 to 2019 for seven regions (Brunca, Central Norte, Central Sur, Chorotega, Huetar Atlántica, Huetar Norte and Pacífico Central), were obtained from the Health Statistics Area of the CCSS.The choice of using 'discharge' instead of 'hospital entrance' is based on the accurate registration of diagnostics at the second level of the health system in Costa Rica.According to the official statistics from the CCSS, the average hospitalization duration in regional hospitals in 2019 due to respiratory diseases was 5.97 days [3].For a visual representation of the geographic distribution of these regions, refer to Figure 1.

Input data: Climate, pollution, and lagged hospital discharge
Daily precipitation estimates from 2000 to 2019, sourced from the Climate Hazards Group InfraRed Precipitation with Station data (CHIRPs) [12], were used to measure the land surface rainfall.Simultaneously, the Geophysical Research Center (CIGEFI) of the University of Costa Rica supplied maximum and minimum daily temperatures across the country, employing the same estimation procedure as the Climate Hazards Center Infrared Temperature with Stations (CHIRTS) [13], with an extension of the time period until 2019.Both data sources offer high spatial resolution at 5km by 5km.
After reviewing the 27 core indices1 related to climate extremes, as recommended by the Climate and Ocean: Variability, Predictability and Change (CLIVAR) 2 on Climate Change Detection [20], we adapted them for tropical countries and computed 22 climate change indices for each region related to cumulative rainfall estimates and temperature for each region i = 1, ..., 7 and week t = 1, ..., T (See Table 1).

Tmax_mean
Average (across the region) of maximum weekly temperature.

Tmax_min
Minimum (across the region) of maximum weekly temperature.

n_Tmax_Q3
Average (across the region) of the number of days in which the maximum temperature is higher than the percentile 75 of maximum temperature.

Tmin_min
Minimum (across the region) of minimum weekly temperature.

Tmin_mean
Average (across the region) of minimum weekly temperature.

Tmin_max
Maximum (across the region) of minimum weekly temperature.

n_Tmin_Q1
Average (across the region) of the number of days in which the minimum temperature is lower than the percentile 25 of minimum temperature.

amplitude_max_max
Maximum of the maximum daily amplitude of temperature (difference of daily maximum and minimum temperature).

amplitude_max_mean
Average of the maximum daily amplitude of temperature.

amplitude_max_min
Minimum of the maximum daily amplitude of temperature.12. amplitude_min_max Maximum of the minimum daily amplitude of temperature.

amplitude_min_mean
Average of the minimum daily amplitude of temperature.

amplitude_min_min
Minimum of the minimum daily amplitude of temperature).

n_amplitude_Q3
Average of the number of days in which the amplitude of temperature is higher than the percentile 75 of daily amplitude of temperature.

n_amplitude_P90
Average of the number of days in which the amplitude of temperature is lower than the percentile 25 of daily amplitude of temperature.17. precip_max_max Maximum (across the region) of the maximum daily precipitation.

precip_max_mean
Average (across the region) of the maximum daily precipitation.

precip_max_min
Minimum (across the region) of the maximum daily precipitation.

precip_mean_mean
Average (across the region) of the average daily precipitation Average (across the region) of the average daily precipitation.

n_precip_max_Q3
Average (across the region) of the number of days in which the maximum precipitation is higher than the percentile 75 of daily maximum precipitation.

n_precip_max_P90
Average (across the region) of the number of days in which the maximum precipitation is higher than the percentile 90 of daily maximum precipitation.
In relation to pollution data, daily measurements of AOD from 2000 to 2019 were obtained from the MODIS Atmosphere L3 Daily Product (MOD08_D3) [29] from the National Aeronautics and Space Administration (NASA).We obtained daily data points at 12 locations across the country using a spatial resolution of 1 × 1 degree grid.Subsequently, we calculated the weighted average based on the Euclidean distance between the 12 data points and the centroid of each region i.Finally, weekly AOD values were derived by averaging the daily information.
All precipitation indices were logarithmically transformed due to their asymmetry.Furthermore, we computed the sample cross-correlation function between all covariates and hospital discharges.We observed that the highest correlation between all covariates and hospital discharges occurs in the same week (i.e., at 0 lag), except for AOD, where the highest sample cross-correlation is detected with a lag of 10 weeks, namely AOD_10, thus we incorporate this lagged covariate instead of the same-week AOD in the model.Finally, hospital discharges from both the previous week and the week before are used in the model.

Statistical Methods
We employed a supervised statistical learning approach to forecast hospital discharges based on the environmental information in each region.The prediction model for hospital discharges at time t (HD t ) in a fixed region i is defined as follows: where f (•) represents the applied machine learning technique, X t encompasses all climate covariates at time t, and AOD t−10 denotes the aerosol optical depth at time t − 10.
For f (•), we apply four different machine learning methods: Random Forest (RF) [2,17] is a bootstrapped ensemble method that combines results of regression trees to improve the prediction.
XGBoost (XGBOOST) [4] is a gradient-boosting algorithm that employs bagging and trains multiple decision trees in order to produce forecasts.
Facebook's Prophet Forecasting Model (PROPHET) [38] is an additive-based model that fits non-linear trends with multiple seasonality plus holiday factors.
Ensemble method (ENSEMBLE) is a combination of the predictions of the three aforementioned models with equal weights.
To train the model, we split the data into training (2001-2018) and testing sets (2019).The training process involves a systematic approach, where the training set is divided into sliding windows.Each sliding window comprises a 2-year analysis set and a 1-year assessment set, accounting for the temporal annual seasonality inherent in the data.After that, we perform model tuning using the grid search method and minimize the root mean squared error (RMSE).Once we obtain the optimized hyperparameters (see Table 4 in the Supplementary section), we predict hospital discharges and their prediction interval using conformal inference [22] in 2019 to compare with the testing set.The use of conformal inference ensures that prediction intervals are free from any assumptions about the distribution of our outcome, and it also allows for achieving interesting coverage properties even for small samples (see [22]).
Finally, we compute three metrics to compare the predictive performance of the applied methods for a fixed region.The first one, the Mean Squared Error (M SE) is defined as follows: where m is the number of weeks in the testing period, HD t is the observed hospital discharge at week t, and HD t is the estimated hospital discharge according to each model at time t.The Mean Absolute Error (M AE) is computed as follows: and finally, the Interval Score (IS) [14] is defined as where U t and L t are the upper and lower limits of the prediction interval of (1 − α)%, respectively.
M SE and M AE evaluate the point forecasts with the observed hospital discharges.On the other hand, IS α assesses a (1 − α)%-prediction interval by comparing its upper and lower limits against the observed values.Finally, to quantify the degree of overfitting in each of the models, we compare the M SE and the IS obtained in the training set versus those obtained in the testing period through the relative M SE and relative IS as follows: We also calculate the contribution of each covariate in predicting the dependent variable.In particular, we are interested in quantifying the global contribution through the aggregation of local contributions based on Shapley values [for more details, see 30,43].This allows us to compare which covariates have a greater impact on the prediction process for each specific region.

Results
In Table 2, we provide a comparison of the four employed methods using the metrics detailed in Section 2. It is crucial to highlight that our objective is to assess the predictive performance of the four models during the testing period.This assessment involves comparing the observed outcomes with the projected ones, utilizing the set of covariates observed within the same period.Note that due to differences in the behavior of the series in each region, it is confirmed that there is no method that is optimal for all cases.Likewise, the ensemble method is not optimal according to the calculated metrics.In order to choose the proposed models, we sought to have the smallest IS in the testing period, and we also ensured an IS rel ratio higher than 80%, with the goal of selecting alternatives with minimal overfitting.The other metrics were used to verify that the selected methods are also competitive compared to alternatives.In general terms, all selected methods successfully predict in the testing period across the regions and, furthermore, allow capturing both high and low-frequency characteristics in the original discharge series.
In Figure 2, we present these results.Note that in most regions, the 95% confidence intervals successfully capture the observed series as well.The regions that showed greater difficulty in their forecasts were the Huetar Atlantica and Norte regions, and partially the Pacifico Central region.
For these last regions, the PROPHET method constitutes an interesting alternative, since this method also allows incorporating a time index that can explain those non-stationary components not being explained by the covariates.Also, in Table 3, we show the Shapley values, all in percentage terms to facilitate interpretation.Note that for those regions where the RF and XGBOOST methods are the best, the covariates with the greatest contribution to the prediction of hospital discharges tend to be the lagged discharge variables (orders 1 and 2) along with the lagged AOD variable.On the other hand, depending on the region, there are environmental variables that contribute to predictions to varying degrees once the contribution of lagged variables has been taken into account.For example, it is noteworthy that for the Chorotega and Huetar Norte regions, the amplitude between maximum and minimum temperature is important, as  well as in the latter region, the regional minimum of minimum temperature and the regional maximums of maximum temperature.
In general terms, a considerable number of the covariates we used show significant contribution in the prediction models.

Discussion
To summarize, we used climatic and pollution data to model and predict weekly hospital discharges due to respiratory conditions, a leading contributor to the global healthcare burden in terms of costs, at regional hospital levels in Costa Rica.Four statistical learning approaches: RF, XGBOOST, PROPHET, and the ENSEMBLE method, were applied to each of the seven regions.Furthermore, we compared their predictive capacity in both the training set (2001-2018) and the testing set (2019) using MSE, MAE, and IS, as well as M SE rel and IS rel to assess overfitting.We conclude that the optimal model for each region depends on specific environmental and pollution variables, which do not align for all cases.However, at least for the majority of regions, the lagged variables of hospital discharges and the lagged aerosol variable are of the highest relative importance.In other cases, the non-stationary component is strong enough for a time-indexed model to be preferable.Despite all of the above, the selected model for each region demonstrates its ability to generate reliable forecasts.
As limitations, it is important to note that other significant factors, such as the socio-economic conditions of each region, could serve as potential predictors for hospitalizations due to respiratory conditions.However, many socio-economic factors, including GDP, population density, and healthcare facilities, among others, are not available on a weekly basis or remain constant over time.We argue that, given the established association in the literature between social factors and AOD, we indirectly incorporate this information into the model, recognizing it as a proxy for the anthropogenic effect on climate.
This study enables health authorities to anticipate and visualize the hospital requirements for serving the population with respiratory problems in Costa Rican Health Regions.Additionally, it empowers authorities to administer and allocate limited human and economic resources effectively and efficiently, fostering a balanced national health economy.

Figure 1 :
Figure 1: Health administrative division of Costa Rica.

Figure 2 :
Figure 2: Forecast comparison of hospital discharges, by region.

Table 1 :
Climate change indices related to rainfall and temperature.

Table 2 :
Metric comparison by region and method (all covariates).The best method for each region is shown in bold.

Table 3 :
Relative Importance factors for the best methods (in percentage) by means of the Shapley values.The date covariate corresponds to the time index, needed for the PROPHET algorithm.