Inference of COVID-19 epidemiological distributions from Brazilian hospital data

Knowing COVID-19 epidemiological distributions, such as the time from patient admission to death, is directly relevant to effective primary and secondary care planning, and moreover, the mathematical modelling of the pandemic generally. We determine epidemiological distributions for patients hospitalized with COVID-19 using a large dataset (N = 21 000 − 157 000) from the Brazilian Sistema de Informação de Vigilância Epidemiológica da Gripe database. A joint Bayesian subnational model with partial pooling is used to simultaneously describe the 26 states and one federal district of Brazil, and shows significant variation in the mean of the symptom-onset-to-death time, with ranges between 11.2 and 17.8 days across the different states, and a mean of 15.2 days for Brazil. We find strong evidence in favour of specific probability density function choices: for example, the gamma distribution gives the best fit for onset-to-death and the generalized lognormal for onset-to-hospital-admission. Our results show that epidemiological distributions have considerable geographical variation, and provide the first estimates of these distributions in a low and middle-income setting. At the subnational level, variation in COVID-19 outcome timings are found to be correlated with poverty, deprivation and segregation levels, and weaker correlation is observed for mean age, wealth and urbanicity.


Introduction
Surveillance of COVID-19 has progressed from initial reports on 31 December 2019 of pneumonia with unknown aetiology in Wuhan, China [1], to the confirmation of 9826 cases of SARS-CoV-2 across 20 countries one month later [2], to the current pandemic of greater than 28 million confirmed cases and 900 000 deaths globally to date at the time of writing [3]. Early estimates of epidemiological distributions provided critical input that enabled modelling to identify the severity and infectiousness of the disease. The onset-to-death distribution [4,5], characterizing the range of times observed between the onset of first symptoms in a patient and their death, proved crucial in early estimates of the infection fatality ratio (IFR) where it was used to estimate the cumulative number of deaths in the beginning of the epidemic in Wuhan [6]. Similarly, the onset-to-death distribution was used in recent approaches to modelling the transmission dynamics of SARS-CoV-2 to estimate the reproduction number R t and other important epidemiological quantities such as the serial interval distribution [7][8][9][10][11][12].
Initial estimates of COVID-19 epidemiological distributions necessarily relied on relatively few data points, with the events comprising these distributions occurring over a period of time that was short compared to the temporal pathologies of the disease progression, resulting in wide confidence or credible intervals and a sensitivity to time-series censoring effects [6]. Global surveillance of the disease over the past 197 days has provided more data to re-evaluate the time-delay distributions of the disease. In particular, public availability of a large number of patient-level hospital records-over 390 000 in total at the time of writing-from the SIVEP-Gripe (Sistema de Informação de Vigilância Epidemiológica da Gripe) database published by Brazil's Ministry of Health [13], provides an opportunity to make robust statistical estimates of the onset-to-death and other time-delay distributions such as onset-to-diagnosis, length of ICU stay, onset-to-hospital-admission, onset-to-hospital-discharge, onset-to-ICU-admission and hospital-admission-todeath. In this work, we fit and present an analysis of these epidemiological distributions, with the paper set out as follows. Section 2 describes the data used from the SIVEP-Gripe database [13], and the methodological approach applied to fit the distributions using a hierarchical Bayesian model with partial pooling. Section 3 provides a description of the results from this study from fitting epidemiological distributions at national and subnational level to a range of probability density functions (PDFs). The results are discussed in §4, including associations with socio-economic factors, such as education, segregation and poverty, and conclusions are given in §5.

Data
The SIVEP-Gripe database provides detailed patient-level records for all individuals hospitalized with severe acute respiratory illness, including all suspected or confirmed cases of severe COVID-19 reported by both private and public sector healthcare institutions, from small rural hospitals to large metropolitan academic centres [13][14][15][16][17]. The records include the date of admission, date of onset of symptoms, state where the patient lives, state where they are being treated, and date of outcome (death or discharge), among other diagnosis related variables. We extracted the data for confirmed COVID-19 records starting on 25 February 2020 and considered records in our analysis ending on 7 July 2020. The dataset was filtered to obtain rows for onset-to-death, hospitaladmission-to-death, length of ICU stay, onset-to-hospital-admission, onset-to-hospital-discharge, onset-to-ICU-admission and onset-todiagnosis. Onset-to-diagnosis data were split into the diagnosis confirmed by PCR and those confirmed by other methods, such as rapid antibody and antigen tests, called non-PCR throughout this manuscript. Entries resulting in distribution times greater than 133 days were considered a typing error and removed, as the first recorded COVID-19 case in Brazil was on 25 February [18].
Additional filtering of the data was applied for onset-to-ICUadmission, onset-to-hospital-admission and onset-to-death in order to eliminate bias introduced by potentially erroneous entries identified in the data for these distributions. We removed the rows where admission to the hospital or ICU or death happened on the same day as onset of symptoms, assuming that these were actually incorrectly inputted entries. The decision to test removing the first day is motivated firstly by the observation of a number of conspicuous data entry errors in the database, and secondly by anomalous spikes corresponding to same-day events observed in these distributions. An example of the anomalous spikes in the onset-to-death distribution is shown in appendix B, figure 5 for selected states.
Sensitivity analyses on data inclusion, regarding the removal of anomalous spikes in first-day data indicative of reporting errors (e.g. in onset-to-hospital-admission), and regarding the sensitivity of the dataset to time-series censoring effects, are set out in the results §3.3.
A summary of the data, including number and a range of samples per variable from the SIVEP-Gripe dataset is given in table 1. The age-sex structure of hospitalized patients in the database with confirmed COVID-19 diagnoses is presented in figure 1. A breakdown of the number of data samples per state is provided in appendix B, table 5.

Model fitting
Gamma, Weibull, lognormal, generalized lognormal [19] and generalized gamma [20] PDFs are fitted to several epidemiological  distributions, with the specific parameterizations provided in appendix B, table 4. The parameters of each distribution are fitted in a joint Bayesian hierarchical model with partial pooling, using data from the 26 states and one federal district of Brazil, extracted and filtered to identify specific epidemiological distributions such as onset-to-death, ICU-stay, and so on.
As an example consider fitting a gamma PDF for the onsetto-death distribution. The gamma distribution for the i th state is given by where shape and scale parameters are assumed to be positively constrained, normally distributed random variables The parameters α Brazil and β Brazil denote the national level estimates, and s 1 N þ (0, 1) and s 2 N þ (0, 1), (2:4) where N + ( · ) is a truncated normal distribution. In this case, parameters α Brazil and β Brazil are estimated by fitting a gamma PDF to the fully pooled data, that is including the observations for all states. Prior probabilities for the national level parameters for each of the considered PDFs are chosen to be N + (0, 1). The only exception was for the more complex generalized gamma distribution which used more informed priors to speed-up fitting. The priors for the generalized gamma distribution were chosen based on the previous fits to be: μ Brazil ∼ N + (2, 0.5), σ Brazil ∼ N + (0.5, 0.5) and s Brazil ∼ N + (1.5, 0.5). Additionally, for all fitted densities, the mean and variance parameters were constrained to be positive. Posterior samples of the parameters in the model are generated using Hamiltonian Monte Carlo (HMC) with Stan [21,22]. For each fit, we use four chains and 2000 iterations, with half of the iterations dedicated to warm-up.
The preference for one fitted model over another is characterized in terms of the Bayesian support, with the model evidence calculated to see how well a given model fits the data, and comparison between two models using Bayes Factors (BFs). BFs provide a principled fully Bayesian approach to select between models, incorporating the full posterior densities and thus also the uncertainty of each of the parameters instead of point estimates [23][24][25][26]. Moreover, BFs naturally balance the complexity and accuracy of the compared models, ensuring that the excessively complex models are not automatically favoured. Historically simpler methods have been favoured, as BFs can be costly to compute for complex models, however using recent efficient methods this is not an issue [27]. The details of how to estimate the model evidence and calculate the BF for each pair of models are given in appendix A.
Data cleaning and the analysis of the results was conducted using Python (v. 3.7.7) programming language [28]. The PyStan (v. 2.19.0.0) interface was used for running model fitting with Stan [29].

Brazilian epidemiological distributions
Five trial PDFs-gamma, Weibull, lognormal, generalized lognormal and generalized gamma-were fitted to the epidemiological data shown in figure 2.
All of the models' fits were tested by using the BFs based on the Laplace approximation and corrected using thermodynamic integration [27,30,31], as described in appendix A. The thermodynamic integration contribution was negligible suggesting the posterior distributions are satisfactorily approximated as multivariate normal. The conclusions on the preferred PDF were not sensitive to the choice of prior distributions, that is the preferred model was still the favoured one even when more informative prior distributions were applied for all PDFs. The BFs used for model selection are shown in appendix B, table 6.
The gamma PDF provided the best fit to the onsetto-death, hospital-admission-to-death and ICU-stay data. For the remaining distributions-onset-to-diagnosis (non-PCR), onset-to-diagnosis (PCR), onset-to-hospital-discharge, onset-to-hospital-admission and onset-to-ICU-admissionthe generalized lognormal distribution was the preferred model. The list of preferred PDFs for each distribution, together with the estimated mean, variance and PDFs' parameter values for the national fits are given in table 2. The 95% credible intervals (CrI) for parameters of each of the preferred PDFs was less than 0.1 wide, therefore in table 2 we show only point estimates.
Additionally, in figure 2, in each instance the cumulative probability distribution is given for the best model fit, revealing that out of patients for whom COVID-19 is terminal, almost 70% die within 20 days of symptom onset. Out of patients who die in the hospital, almost 60% die within the first 10 days since admission.
The estimated mean number of days for each distribution for Brazil is compared in table 3 with values found in the literature for China, USA and France. The majority of the data obtained through searching the literature pertained to the early stages of the epidemic in China, and no data were found for low-and middle-income countries. The mean onset-to-death time of 15.2 (95% CrI 15.1-15.3) days, from a best-fitting gamma PDF, is shorter than the 17.8 (95% CrI 16.9-19.2) days estimate from Verity et al. [6] and 20.2 (95% CrI 15.1-29.5) days estimate (14.5 days without truncation) from Linton et al. [12] In both cases, estimates were based on a small sample size from the beginning of the epidemic in China. The mean number of days for hospital-admissionto-death of 10.8 (95% CrI 10.7-10.9) for Brazil matches closely the 10 days estimated by Salje et al. [32]

Subnational Brazilian epidemiological distributions
The onset-to-death distribution, and other time-delay distributions such as onset-to-diagnosis, length of ICU stay, onset-to-hospital-admission, onset-to-hospital-discharge, onset-to-ICU-admission and hospital-admission-to-death, have been fitted in a joint model across the 26 states and one federal district of Brazil using partial pooling. The mean number of days, plotted in figure 3, shows substantial subnational variability-for example, the mean onset-to-hospital-admission for Amazonas state was estimated to be 9.9 days (95% CrI 9.7-10.1), whereas for Mato Grosso do Sul the estimate was 6.7 (95% CrI 6.4-7.1) days and Rio de Janeiro -7.2 days (95% CrI 7.1-7.3). Amazonas state had the longest average time from onset-to-hospital-and ICU-admission. The state with the shortest average onset-to-death time was Roraima. Santa Catarina state on the other hand had a longest average onset-to-death and hospital-admission-to-death time, royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 17: 20200596 as well as longest average ICU-stay. For a visualization of the uncertainty in our mean estimates for each state, see the posterior density plots in appendix B, figures 6 and 7. Additional national and state-level results for the onset-to-death gamma PDF, including the posterior plots for mean and variance, are shown in appendix B, figure 8.
We also observe discrepancies between the five geographical regions of Brazil, for example states belonging to the southern part of the country (Paraná, Rio Grande do Sul and Santa Catarina) had a longer average ICU-stay and hospital-admission-to-death time when compared with the states in the North region. Full results, including detailed estimates of mean, variance, and estimates for each of the distributions' parameters for Brazil and Brazilian states can be accessed at https://github.com/mrc-ide/Brazil_COVID19_ distributions/blob/master/results/results_full_table.csv.

Sensitivity analyses
In order to remove the potential bias towards shorter outcomes from left-and right-censoring, we tested the scenario in which the data to fit the models were truncated. For example, based on a 95% quantile of 35 days for the hospital-admission-to-death distribution, entries with the starting date (hospital admission) after 2 June 2020 and those with an end-date (death) before 1 April 2020 were truncated, and the models were refitted. With censored parts of the data removed, the mean time from start to outcome increased  Figure 2. Histograms for onset-to-death, hospital-admission-to-death, ICU-stay, onset-to-hospital-admission, onset-to-hospital-discharge, onset-to-ICU-admission onset-to-diagnosis (PCR) and onset-to-diagnosis (non-PCR) distributions show data for Brazil extracted from the SIVEP-Gripe database [13]. For each distribution, solid lines are for fitted PDFs and the dashed line shows the cumulative distribution function of the best-fitting PDF. The left-hand side y-axis gives the probability value for the PDFs and the right-hand side y-axis shows the value for the cumulative distribution function. All values on the x-axes are given in days. State-level fits are shown in figure 3 and  To test the impact of keeping or removing entries identified as potentially resulting from erroneous data transcription (see the methods §2), we fitted the PDFs to some of the distributions on a national level with and without those entries. For onset-to-hospital-admission, onset-to-ICU and onset-to-death we find that generalized gamma PDF was preferred when the first day of the distribution was included, and gamma (for onset-to-death) and generalized lognormal PDFs if the first day was removed. For hospitaladmission-to-death, a gamma distribution fitted most accurately when the first day was included, and Weibull when it was excluded. Removing the first day results in the mean values shifting to the right by approximately 1 day for both onset-to-hospital-and ICU-admission, and by 0.5 days for hospital-admission-to-death (see appendix B, figure 9).
Sensitivity analysis regarding the model selection approach is detailed in appendix A.

Discussion
We fitted multiple probability density functions to a number of epidemiological datasets, such as onset-to-death or onsetto-diagnosis, from the Brazilian SIVEP Gripe database [13], using Bayesian hierarchical models. Our findings provide the first reliable estimates of the various epidemiological distributions for the COVID-19 epidemic in Brazil and highlight a need to consider a wider set of specific parametric distributions. Instead of relying on the ubiquitous gamma or Table 2. For each COVID-19 distribution the preferred PDF with the largest Bayesian support is listed, along with the estimated mean, variance and other parameters of the PDF. Ninety-five per cent credible intervals are given in brackets for mean and variance. The parameters p 1 , p 2 and p 3 for the preferred PDFs gamma and generalized lognormal (GLN) are given in the form gamma(x|p 1 , p 2 ) = gamma(α, β) and GLN(x|p 1 , p 2 , p 3 ) = GLN(μ, σ, s), with the formulae of the PDFs given in appendix B,     [38], and other low-and middle-income countries [39], but we expect they are also highly relevant to the epidemics unfolding in other countries. royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 17: 20200596 Across Brazil, the epidemic has strong geographical heterogeneity, with some states such as Amazonas and Maranhão reported to be at advanced stages [40,41]. To describe the observed differences at subnational level accurately, using a mathematical model, it is essential to account for variation in model parameters by state. By making use of the state-level custom-fitted onset-to-death distributions reported here, we have estimated the number of active infections on 23 June 2020 across 10 states spanning the five regions of Brazil using a Bayesian hierarchical renewal-type model [7,38,42]. The relative change in the number of active infections from modelling the cases using heterogeneous state-specific onset-to-death distributions, compared to using a single common Brazil one, is shown in figure 4 to be substantial. Relative changes are observed of up to 18% more active infections, suggesting common assumptions of onset-to-death spatial homogeneity are unreliable and closer attention needs to be paid when fitting models of SARS-CoV-2 transmission dynamics in large countries.
Notably, large subnational variability was observed for all fitted distributions, with the mean onset-to-death ranging between 11.2 days in Roraima to 17.8 in Santa Catarina. Hospital-admission-to-death time showed substantial variation between the regions of Brazil, ranging between 8.1 and 11.3 in the North, and between 9.6 and 12.8 in the South. A plausible hypothesis is that the observed differences in outcome timings could be explained by greater difficulty accessing hospitals in the North, or limited access to equipment such as ventilators.
In order to explain the origin of the geographical variation of average distribution times across states, shown in figure 3, we present a basic exploratory analysis based on relevant high-level features. We examined the correlation between socio-economic factors, such as education, poverty, income, wealth, deprivation and segregation, using a number of socioeconomic state-level indicators obtained from Barrozo et al. [43] and additional datasets containing the mean age per state and percentage of people living in the urban areas (urbanicity) [44]. The Pearson correlation coefficients, shown in appendix B, table 8, suggest that poverty, income, segregation and deprivation elements were most strongly correlated with the analysed onset-time datasets. In particular, poverty was strongly negatively correlated with hospital-admission-to-death (−0.68), whereas income and segregation had a high positive correlation coefficient for the same distribution (+0.60, +0.62, respectively). The strongest correlation was observed for hospital-admissionto-death and deprivation indicator, which measures the access to sanitation, electricity and other material and non-material goods [43]. Interestingly, the indicators measuring economic situation were more correlated with average hospitalization times than mean age per state, which suggests that although the low-and middle-income countries typically have younger populations, their healthcare systems are more likely to struggle in response to the COVID-19 epidemic. Socio-economic factors have been also shown to correlate with the accessibility of the COVID-19 diagnosis in the Metropolitan Region of São Paulo, which emphasizes the impact of the spatial heterogeneity of the socio-economic status on the various aspects of the epidemic, from capturing the active cases to providing treatment for the patients [14]. More detailed analysis is necessary to fully appreciate the impact of the economic components on the COVID-19 epidemic response.
Spatial heterogeneity is not the only source of variability in the hospitalization times. Although in this study, we did not stratify the population according to age or other demographic features, other recent studies have used the SIVEP-Gripe database to characterize the COVID-19 epidemic in Brazil. Namely, they looked at the regional and ethnic distribution of the hospitalized patients [15,16], age-sex structure and clinical characteristics such as co-morbidities and symptoms [14,15]. Souza et al. [14] show that 65.5% of cases are patients over 50 years old. Moreover, they also find that 84% of the patients reported having at least one underlying condition. It is clear, that both age and co-morbidities are highly correlated with the adverse outcomes such as hospitalization or death, and to calibrate the epidemiological models of COVID-19 the time-onset distributions presented in this study could be refined further.
In the work presented, we acknowledge several limitations. The database from which distributions have been extracted, though extensive, contains transcription errors, and the degree to which these bias our estimates is largely unknown. Secondly, the PDFs fitted are based on observational hospital data, and therefore should be cautiously interpreted for other settings. Thirdly, though we have fitted PDFs at subnational as well as national level, this partition is largely arbitrary and further work is required to understand the likely substantial effect of age, sex, ethnic variation, co-morbidities and other factors.

Conclusion
We provide the first estimates of common epidemiological distributions for the COVID-19 epidemic in Brazil, based on the SIVEP-Gripe hospitalization data [13]. Extensive heterogeneity in the distributions between different states is reported. The differences are identified by comparing parametric forms, that have been fitted for each epidemiological distribution, and give a more informed and reliable basis for comparison than the empirical distributions. Quantifying the time-delay for COVID-19 onset and hospitalization data provides useful input parameters for many COVID-19 epidemiological models, especially those modelling the healthcare response in low-and middle-income countries. royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 17: 20200596 Data accessibility. Python, R and Stan code used to analyse the data and fit the distribution is available at https://github.com/mrc-ide/ Brazil_COVID19_distributions, along with estimated parameters for each state and PDFs considered at https://github.com/mrc-ide/ Brazil_COVID19_distributions/blob/master/results/results_full_  Acknowledgements. We thank Microsoft for providing Azure credits which were used to run the analysis.

Appendix A. Model selection
To characterize which model (gamma, lognormal etc.) best fits the data, the Bayesian model evidence z = z( y|M i ) is evaluated. Here and throughout this section, y denotes the data and M i denotes the i th model from the analysed model set. As determining the model evidence requires calculating an integral over the model parameters (θ) which is generally intractable, we approximate it with z 0 = z 0 ( y|M i ), which is based on a second-order Laplace approximation [45], q 0 = q 0 (θ|M i , y), to the true un-normalized posterior density q = q(θ|M i , y). The second-order approximated density is estimated as Here, q(û ) denotes the value of the un-normalized posterior evaluated using the mean estimates of the model's parametersû , andŜ the covariance matrix built from MCMC samples of the posterior distribution. From this expression, a second-order approximation to the model evidence, z 0 , is , where det( · ) denotes the determinant of the matrix.
For each model pair, BFs were computed from the marginal likelihoods. Considering two models M i and M j , the BF is where z( y|M i ) is the evidence of model M i given y. If B ij > 1, the evidence is in favour of model M i . Here, for readability, we will report the BFs as 2 log(B ij ) following Kass and Raftery notation [46]. The sensitivity of our model evidence is tested with respect to the choice of hyperprior distribution, and secondly with respect to the use of the approximate second-order density q 0 . In the latter instance, this is done by performing thermodynamic integration [27,30,31] between q 0 and the true density q in order to obtain an asymptotically exact estimate of the marginal model evidence, The right-hand term corrects the z 0 approximation to the exact Bayesian evidence by a path integral evaluated with respect to a sampling distribution that interpolates between the two densities as q(u; l) ¼ q (1Àl) q l 0 in terms of the auxiliary coordinate λ.