On the relationship between COVID-19 reported fatalities early in the pandemic and national socio-economic status predating the pandemic

This study investigates the relationship between socio-economic determinants predating the pandemic and the reported number of cases, deaths, and the ratio of deaths to cases in 199 countries/regions during the first months of the COVID-19 pandemic. The analysis is performed by means of machine learning methods. It involves a portfolio/ensemble of 32 interpretable models and considers both the case in which the outcome variables (number of cases, deaths, and their ratio) are independent and the case in which their dependence is weighted based on geographical proximity. We build two measures of variable importance, the Absolute Importance Index (AII) and the Signed Importance Index (SII), whose role is to identify the socio-economic factors that contribute most to the variability of the COVID-19 pandemic. Our results suggest that, together with the established influence of the level of mobility on cases and deaths, the specific features of the health care system (smart/poor allocation of resources), the economy of a country (equity/non-equity), and the society (religious/not religious or community-based vs not) might contribute to the number of COVID-19 cases and deaths heterogeneously across countries.


Linear regression with independent observations
Consider a data set of $n$ observations $\{x_i, y_i\}_{i=1}^{n} \in \mathbb{R}^p \times \mathbb{R}$. A linear regression model assumes that there is a linear relationship between the outcome variable $y$ and the input variables $x$, of the form

$$y_i = x_i^T \beta + \varepsilon_i, \qquad i = 1, \dots, n,$$

where $T$ denotes the transpose, and $\beta \in \mathbb{R}^p$ is a vector of coefficients. In matrix form, the relationship takes the form

$$y = X\beta + \varepsilon.$$

The assumptions of the model are that the errors $\varepsilon_i$ are independent and normally distributed with mean zero and common variance $\sigma^2$. The parameters of the model can be estimated by ordinary least squares, which yields the explicit formula

$$\hat{\beta}_{LS} = (X^T X)^{-1} X^T y \sim N(\beta, \sigma^2 [X^T X]^{-1}),$$

assuming $X^T X$ is invertible (namely, that the input variables are not linear combinations of one another), with $\ell_2$ error of the order of $n^{-1/2}$.
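As an illustration of the least-squares formula, the following minimal sketch estimates the coefficients and their covariance on synthetic data. The function name `ols_fit`, the simulated design, and the noise level are assumptions made for this example and are not taken from the study.

```python
import numpy as np

def ols_fit(X, y):
    """Return the least-squares coefficients and their estimated covariance."""
    n, p = X.shape
    XtX = X.T @ X
    # Solve the normal equations (X^T X) beta = X^T y; requires X^T X invertible.
    beta_hat = np.linalg.solve(XtX, X.T @ y)
    residuals = y - X @ beta_hat
    # Unbiased estimate of the error variance sigma^2.
    sigma2_hat = residuals @ residuals / (n - p)
    # Plug-in estimate of sigma^2 [X^T X]^{-1}.
    cov_beta = sigma2_hat * np.linalg.inv(XtX)
    return beta_hat, cov_beta

# Synthetic data (hypothetical): three inputs, Gaussian noise.
rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

beta_hat, cov_beta = ols_fit(X, y)
```

The square roots of the diagonal of `cov_beta` give standard errors, which shrink at the rate $n^{-1/2}$ as the sample grows.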

Linear regression with dependent observations
Several of the assumptions of standard linear regression models are too strong, for example the hypothesis of independence between the outcome variables $y$. It has been shown that if the dependencies are sufficiently weak, then both the coefficient vector $\beta$ and the strength of the dependencies among the response variables can be estimated with an error of the order of $n^{-1/2}$, the same rate that the Central Limit Theorem guarantees in the case of iid random variables [16]. Our approach to including geographical dependence is simplified with respect to the framework of [16], as we assume that $A$, the matrix of geographical relationships, is known and is not estimated from the variables $X$, $y$. The parameters of the model can then be estimated again by ordinary least squares, which yields a similar explicit formula for the coefficients, since by our assumptions $A$ is constant with respect to averages and variances taken over the distributions of $X$ and $y$.
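The sketch below shows one possible way to use a known proximity matrix. It is not the construction used in the study, whose exact formula is not reproduced in this section. It assumes, hypothetically, that $A$ holds row-normalised inverse-distance weights and that the geographically weighted outcome is $\tilde{y} = A y$. The function name `proximity_weights` and the planar coordinates are illustrative only.

```python
import numpy as np

def proximity_weights(coords):
    """Row-normalised inverse-distance weights (a hypothetical choice of A)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # no self-weight, since 1 / inf = 0
    W = 1.0 / dist
    return W / W.sum(axis=1, keepdims=True)

# Synthetic locations and data (hypothetical).
rng = np.random.default_rng(1)
n, p = 50, 2
coords = rng.uniform(size=(n, 2))
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0]) + rng.normal(scale=0.1, size=n)

A = proximity_weights(coords)   # treated as known and fixed, not estimated
y_tilde = A @ y                 # assumed form of the weighted outcome
# Because A is fixed, ordinary least squares applies unchanged to y_tilde.
beta_tilde = np.linalg.solve(X.T @ X, X.T @ y_tilde)
```

Treating `A` as fixed is what keeps the estimator in closed form; estimating the dependence strength jointly, as in [16], would require a different procedure.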

LASSO
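Suppose again that we have a sample of $n$ observations, $\{y_i, x_i\}_{i=1}^{n}$. Then, the Least Absolute Shrinkage and Selection Operator (LASSO) optimizes the following functional [21,22]:

$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - x_i^T \beta \right)^2 + \lambda \|\beta\|_1,$$

where $\lambda$ is a pre-specified regularization parameter. The LASSO estimator can be written in explicit form as $\hat{\beta}$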
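To make the variable-selection effect of the penalty concrete, here is a minimal sketch of cyclic coordinate descent with soft thresholding, one standard way to minimise an $\ell_1$-penalised least-squares objective. It assumes the objective is scaled as $\frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$. The function names, the synthetic data, and the value of `lam` are assumptions for illustration, not the solver or settings used in the study.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z towards zero by t, setting it to zero if |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/2n)||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with the contribution of feature j added back.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
    return beta

# Synthetic sparse problem (hypothetical): only features 0 and 3 matter.
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, 0.0, -3.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

beta_lasso = lasso_cd(X, y, lam=0.1)
```

Coefficients whose correlation with the partial residual stays below `lam` are set exactly to zero, which is why sweeping over $\lambda$ yields an ordering in which variables enter, and sometimes leave, the model.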

MICE
For completeness, we report here the main details of an imputation algorithm called Multiple Imputation by Chained Equations (MICE), as discussed in [17,18]. Let $X_j$, for $j = 1, \dots, p$, be one of the variables, with $X_j^{obs}$ the observed data and $X_j^{mis}$ the missing data. Suppose $X$ has been partially observed from the multivariate conditional distribution $P(X \mid \theta)$, with $\theta$ unknown and with its distribution to be determined. MICE samples iteratively through the distributions

$$P(X_1 \mid X_{-1}, \theta_1), \; \dots, \; P(X_p \mid X_{-p}, \theta_p),$$

where $X_{-j}$ is the vector of input variables with $X_j$ dropped. Starting from a simple draw from the marginals, the $t$-th iteration of the chained equations is a Gibbs sampler that draws iteratively

$$\theta_j^{*(t)} \sim P\left(\theta_j \mid X_j^{obs}, X_1^{(t)}, \dots, X_{j-1}^{(t)}, X_{j+1}^{(t-1)}, \dots, X_p^{(t-1)}\right),$$

$$X_j^{*(t)} \sim P\left(X_j \mid X_j^{obs}, X_1^{(t)}, \dots, X_{j-1}^{(t)}, X_{j+1}^{(t-1)}, \dots, X_p^{(t-1)}, \theta_j^{*(t)}\right),$$

for $j = 1, \dots, p$. Here $X_j^{(t)} = (X_j^{obs}, X_j^{*(t)})$ is the $j$-th imputed variable at iteration $t$. For more details, we refer to [17,18].
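The chained-equations loop can be sketched as follows. This is a deliberately simplified illustration, not the MICE implementation of [17,18]: each conditional model is assumed to be a Gaussian linear regression, the regression parameters are point estimates rather than posterior draws, and every column is assumed to have at least one observed value. The function name `chained_imputation` and the synthetic data are hypothetical.

```python
import numpy as np

def chained_imputation(X, n_iter=10, seed=0):
    """Return one imputed copy of X (NaN marks missing entries)."""
    rng = np.random.default_rng(seed)
    X = np.array(X, dtype=float)  # work on a copy
    missing = np.isnan(X)
    n, p = X.shape
    # Initialise each missing entry with a random draw from the observed
    # values of its own column (a simple draw from the marginals).
    for j in range(p):
        observed = X[~missing[:, j], j]
        X[missing[:, j], j] = rng.choice(observed, size=missing[:, j].sum())
    for _ in range(n_iter):
        for j in range(p):
            m = missing[:, j]
            if not m.any():
                continue
            # Regress X_j on the current values of X_{-j}, with intercept,
            # using only the rows where X_j was actually observed.
            Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            coef, *_ = np.linalg.lstsq(Z[~m], X[~m, j], rcond=None)
            resid = X[~m, j] - Z[~m] @ coef
            sigma = np.sqrt(np.mean(resid ** 2))
            # Draw imputations from the fitted predictive distribution.
            X[m, j] = Z[m] @ coef + rng.normal(scale=sigma, size=m.sum())
    return X

# Synthetic correlated data with about 20% of entries removed at random.
rng = np.random.default_rng(42)
n = 200
z = rng.normal(size=n)
X_full = np.column_stack([
    z,
    2 * z + rng.normal(scale=0.3, size=n),
    -z + rng.normal(scale=0.3, size=n),
])
mask = rng.uniform(size=X_full.shape) < 0.2
X_miss = X_full.copy()
X_miss[mask] = np.nan

X_imp = chained_imputation(X_miss, n_iter=10, seed=0)
```

Running the function with several different seeds produces several completed data sets, which is the "multiple" part of multiple imputation; observed entries are never modified.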

Appendix S3: Descriptive statistics
In this appendix, we collect the descriptive statistics of the socio-economic variables and of the epidemiological variables. The values in all the tables have been computed using the raw data (no imputation), which is the reason for the different number of countries per variable.

Appendix S4: Tables of the importance indices
This appendix contains the detailed tables of the importance indices, the Absolute Importance Index (AII) and the Signed Importance Index (SII), calculated across all our 32 × 2 models (geographically weighted + not geographically weighted). In tables with "Weighted" in the title, the reported values are a percentage of the total number of models for that category. For example, there are twice as many models with $\tilde{Y}_1$ as models with $Y_1$, so transforming the integer scores into percentages corrects for that imbalance.

Weighted Index of Importance of Socio-Economic Variables, Divided by Category

*Population aged 0-14 was the first variable identified in LASSO for the geographically weighted $\tilde{Y}_1$, but it dropped out with the addition of the second variable and never returned. Further, it was not identified as important in any other models. Therefore we think it is unlikely that it is a significant variable.

Environmental Health

Index of Importance of Socio-Economic Variables, Divided by Category
Ecological footprint (gha/person)  1  1  2  4
Air pollution (avg P.M. 2.5 exposure per year)  0  0  4  4

*Population aged 0-14 was the first variable identified in LASSO for the geographically weighted $\tilde{Y}_1$, but it dropped out with the addition of the second variable and never returned. Further, it was not identified as important in any other models. Therefore we think it is unlikely that it is a significant variable.
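The normalisation behind the "Weighted" tables described in Appendix S4 can be sketched as follows. The counts and model totals below are hypothetical and chosen only to show how percentages make categories with different numbers of models comparable. The dictionary keys and the helper `to_percent` are illustrative names.

```python
def to_percent(score, n_models):
    """Express an integer importance score as a percentage of the models."""
    return 100.0 * score / n_models

# Hypothetical example: twice as many models for the geographically
# weighted outcome as for the unweighted one.
n_models = {"Y1": 16, "Y1_tilde": 32}
scores = {"Y1": 4, "Y1_tilde": 8}

weighted = {name: to_percent(scores[name], n_models[name]) for name in scores}
```

In this made-up case the raw scores differ by a factor of two, while the percentages coincide, so the variable is equally prominent in both categories.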