The determinants of COVID-19 case reporting across Africa

Background According to a previous study on the under-estimation of COVID-19 cases in African countries, the average daily case reporting rate was only 5.37% in the initial phase of the outbreak, when there were few or no control measures. In this work, we aimed to identify the determinants of case reporting and to classify the African countries using the case reporting rates and the significant determinants.
Methods We used the COVID-19 daily case reporting rate estimated in the previous paper for 54 African countries as the response variable and 34 variables from the demographic, socioeconomic, religion, education, and public health categories as the predictors. We adopted a generalized additive model with cubic splines for continuous predictors and linear relationships for categorical predictors to identify the significant covariates. In addition, we performed Hierarchical Clustering on Principal Components (HCPC) analysis on the reporting rates and significant continuous covariates of all countries.
Results Twenty-one covariates were identified as significantly associated with COVID-19 case detection: total population, urban population, median age, life expectancy, GDP, democracy index, corruption, voice accountability, social media, internet filtering, air transport, human development index, literacy, Islam population, number of physicians, number of nurses, global health security, malaria incidence, diabetes incidence, and lower respiratory and cardiovascular disease prevalence. HCPC resulted in three major clusters for the 54 African countries, essentially northern, southern and central, with the northern cluster having the best early case detection, followed by the southern and then the central.
Conclusion Overall, northern and southern Africa had better early COVID-19 case identification than central Africa. A number of demographic, socioeconomic, and public health factors exhibited significant association with early case detection.


Introduction
The ongoing COVID-19 pandemic, triggered by the SARS-CoV-2 virus (1), has spread globally with noticeable variations in reported cases and deaths across different regions. As of May 5, 2022, the pandemic is affecting more than 220 countries and territories, with over 513 million cases and 6 million deaths worldwide (2). In particular, the spread of the virus and subsequent reporting of cases have shown a slower pace in African nations. Since the identification of the first case in Egypt on February 14, 2020 (3), the growth of new infections within African countries remained relatively modest, with the WHO African region reporting over 8 million cases and 170,000 deaths (4), a stark contrast to the more severe morbidity and mortality rates observed in other world regions (2).
Several factors contribute to the under-estimation and under-reporting of COVID-19 cases in Africa. The presence of asymptomatic infections, capable of transmitting the virus (5,6), and a tendency of fewer clinical symptoms among the younger demographic (7), potentially lead to a significant under-estimation of the true case count, especially in regions with younger populations like Africa. Moreover, limited testing and public health resources, inadequate public awareness, cultural stigmatization, self-medication practices, and initially unestablished monitoring practices on the continent might have further contributed to the under-reporting of cases (8). Similarly, in some nations, political motives might influence data adjustment, impacting transparency, with certain governments altering their figures to project a particular narrative (8).
The WHO estimates that only one in seven cases was being detected in Africa (8), highlighting a substantial gap in data accuracy and completeness. Our previous investigation into the early phase of the COVID-19 outbreak across 54 African nations also revealed a significant under-reporting trend (9). Specifically, an average of only about 5.37% of all COVID-19 cases was duly reported. Strikingly, these numbers showed vast differences across nations, with Libya reporting a notably high rate of 30.41%, in contrast to São Tomé and Príncipe's alarmingly low 0.02% (9).
These preceding observations highlight a pressing need to understand the underlying factors contributing to the reporting of COVID-19 cases in Africa. Our study aimed to explore the determinants influencing COVID-19 case detection and to classify African countries using these determinants. By identifying them, we aimed to provide insights that could be crucial for policymakers, health authorities, and lawmakers in crafting more adept public health strategies, refining resource allocation, and developing a more responsive and transparent system to confront the challenges presented by pandemics.

COVID-19 case reporting fractions
The fraction of overall COVID-19 cases reported was estimated for 54 African countries using a mathematical deterministic model with a Bayesian inference framework (9). In this study, we used these reporting rates as the response variable in the statistical analysis (Figure 1).

Factors
Our choice of variables was largely informed by past research and insights surrounding the under-reporting of COVID-19 cases in Africa. Factors like the occurrence of asymptomatic infections, which are capable of virus transmission (5,6), and a lower incidence of clinical symptoms among the younger demographic (7) have been pointed out, especially in regions with younger populations like Africa. Additionally, the constraints on testing and public health resources, lack of widespread public awareness, cultural stigma, self-medication practices, and initially unestablished monitoring practices on the continent are believed to have contributed to the under-reporting of cases (8). These considerations mirror challenges faced in other regions; for instance, in Brazil, demographic-related challenges led to misclassification of COVID-19 cases, particularly among younger individuals, those with lower education levels, and rural residents (10). Such challenges highlighted the significant issue of misidentifying COVID-19 cases as severe acute respiratory infections (SARI). Education has also been pointed out as being critical for Africa's health outcomes (11). Hence, we included the urban population, population literacy, female percentage, median age, percentage of Christians, and percentage of Muslims for each country to capture key demographic nuances.
Studies have suggested variations in reported data due to a nation's wealth, political climate, and inequalities (12). Notably, authoritarian governments were shown to skew data to appear competent and boost global reputation. Bureaucracies also modify data around elections to align with leadership's wishes (13). Furthermore, democracy-related indicators, like the democracy index and world press freedom index, suggest that countries with lower ranks, such as Turkey, China, Indonesia, and Iran, face greater COVID-19 under-reporting issues (14). To this end, we also examined the GINI index, GDP per capita, human development index, democracy index, press freedom index, political stability, government social media censorship, voice given to and accountability for citizens, corruption, ease of doing business, internet filtering, and air passenger volume.
Health-related factors, highlighted for their potential impact on reporting, were inspired by the results of a study (15), which states that the number of tests conducted, global health security index, and average body mass index significantly correlated with reported COVID-19 cases per million population. These encompassed the number of nurses, number of physicians, public health expenditure in GDP, public health expenditure in total expenditure, body mass index, prevalence of cardiovascular diseases, prevalence of diabetes, global health security, cancer, prevalence of cholesterol, prevalence of lower respiratory infections, and malaria incidence.
All factors considered in this study are summarized in Table 1.

Statistical analysis
There were five missing values for malaria incidence (Lesotho, Libya, Mauritius, Seychelles, and Tunisia), three for GINI index (Equatorial Guinea, Eritrea, and Libya), two for GDP (Eritrea and South Sudan), and one each for air transport (Mali), public health expenditure in GDP (Somalia), public health expenditure in total (Somalia), human development index and press freedom score (São Tomé and Príncipe). These missing values were imputed with the mean of the corresponding variable, which preserves the mean of each variable. The numerical variables were then standardized to enable direct comparison of effects.
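The preprocessing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column values are invented, the helper names `impute_mean` and `standardize` are hypothetical, and the population standard deviation is assumed because the text does not say which variant was used.

```python
from statistics import mean, pstdev

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def standardize(values):
    """Rescale to zero mean and unit standard deviation (population SD assumed)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Invented example column with two missing countries (not real data).
malaria = [120.0, None, 310.0, 45.0, None, 225.0]
imputed = impute_mean(malaria)   # missing entries become the observed mean
scaled = standardize(imputed)    # comparable scale across covariates
```

Note that mean imputation leaves the variable's mean unchanged but shrinks its variance, which is worth keeping in mind when interpreting standardized effects.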
To select the covariates to be used for the generalized additive model (GAM), we identified pairs of covariates with a Spearman correlation greater than 0.6 or less than −0.6. For a covariate whose Spearman correlation with more than one other covariate exceeded this threshold, we compared the correlations of these covariates with the response variable: if most of the other covariates were more strongly correlated with the response variable, the covariate in question was dropped. Through this process, 12 factors were placed on the drop list: population aged 65+, literacy, GINI index, press freedom, public health expenditure in GDP, public health expenditure in total, business, number of nurses, BMI, cancer prevalence, internet filtering and Christian population.
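A minimal sketch of this screening step follows, written in plain Python. It reflects one reading of the rule described above (a covariate with more than one highly correlated partner is dropped when most of those partners are more strongly correlated with the response); the function names, the toy columns `x1` to `x4`, and the handling of covariates with a single partner (left untouched here) are assumptions, not the authors' code.

```python
from itertools import combinations

def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def drop_list(columns, response, threshold=0.6):
    """Covariates to drop under the screening rule sketched in the lead-in."""
    partners = {name: [] for name in columns}
    for a, b in combinations(columns, 2):
        if abs(spearman(columns[a], columns[b])) > threshold:
            partners[a].append(b)
            partners[b].append(a)
    strength = {n: abs(spearman(c, response)) for n, c in columns.items()}
    dropped = []
    for name, others in partners.items():
        if len(others) > 1:
            stronger = sum(strength[o] > strength[name] for o in others)
            if stronger > len(others) / 2:
                dropped.append(name)
    return dropped

# Invented toy data: x1, x2, x3 are mutually correlated; x4 is not.
response = [1, 2, 3, 4, 5, 6]
columns = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 1, 3, 4, 6, 5],
    "x3": [2, 1, 4, 3, 6, 5],
    "x4": [3, 5, 1, 6, 2, 4],
}
to_drop = drop_list(columns, response)
```

In this toy example only `x3` is dropped, because both of its correlated partners track the response more closely than it does.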
We used a GAM to model the effects of the covariates on the COVID-19 daily case reporting rates. The general form of the model is
$$g\left(E(y_i)\right) = \beta_0 + \sum_{j=1}^{n} f_j(x_{ji}) + \epsilon_i,$$
where $y_i$ is the response variable, $x_{ji}$ $(j = 1, 2, \ldots, n)$ are the predictors, $\beta_0$ is the intercept, $\epsilon_i$ is identically and independently distributed as a normal random variable, $g$ is a monotonic link function and $f_j$ $(j = 1, 2, \ldots, n)$ are nonparametric smoothing functions. Here we adopted cubic spline functions for continuous covariates and linear functions for categorical covariates. Given the sample size, the GAM had room for three additional predictors after the 12 factors on the drop list were excluded. To obtain the best GAM, we added one predictor from the drop list at a time back to the GAM and kept the one that yielded the largest deviance explained. Three rounds were run, and internet filtering, number of nurses and literacy were added back to the GAM, one after each round of selection.
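The add-back procedure is a greedy forward selection, which can be sketched as follows. The `deviance_explained` argument is a hypothetical callback that would fit the GAM on a given predictor list and return its deviance explained; here it is replaced by a toy scoring function with invented gains so that the example runs without a GAM library. The predictor names are placeholders.

```python
def greedy_add_back(base, candidates, deviance_explained, rounds=3):
    """Add one candidate per round, keeping the one with the highest score."""
    selected = list(base)
    remaining = list(candidates)
    history = []
    for _ in range(min(rounds, len(remaining))):
        # Score every model obtained by adding a single remaining candidate.
        scored = [(deviance_explained(selected + [c]), c) for c in remaining]
        best_score, best = max(scored)
        selected.append(best)
        remaining.remove(best)
        history.append((best, best_score))
    return selected, history

# Toy stand-in for "fit the GAM and return deviance explained".
# The gains below are invented for illustration only.
TOY_GAIN = {"cand_a": 0.01, "cand_b": 0.05, "cand_c": 0.03, "cand_d": 0.04}

def toy_score(predictors):
    return 0.80 + sum(TOY_GAIN.get(p, 0.0) for p in predictors)

selected, history = greedy_add_back(
    base=["x1", "x2"],
    candidates=["cand_a", "cand_b", "cand_c", "cand_d"],
    deviance_explained=toy_score,
    rounds=3,
)
```

Because each round conditions on the predictors already kept, the result can differ from scoring every triple of candidates jointly; the greedy version simply mirrors the one-at-a-time description in the text.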

Hierarchical clustering of principal components analysis
Using the COVID-19 daily reporting rates and the significant continuous factors from the statistical analysis, we clustered the 54 African countries through Hierarchical Clustering of Principal Components Analysis (HCPC), performed with the R packages "FactoMineR" and "factoextra" in RStudio 2023.06.1+524. The algorithm proceeds in two major steps: first, it reduces the dimension of the data through principal component analysis (PCA); second, cluster analysis is conducted on the PCA results using Ward's criterion, which minimizes the total within-cluster variance.
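The two steps can be illustrated with a small NumPy sketch. It is not a reimplementation of FactoMineR's `HCPC` (which offers further options, such as consolidation of the partition, that are not reproduced here); the function names and the synthetic data are assumptions made for illustration, and the Ward step is a naive version suited only to small samples such as 54 countries.

```python
import numpy as np

def pca_scores(X, n_components):
    """Project centered data onto its leading principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def ward_clusters(points, n_clusters):
    """Naive agglomerative clustering using Ward's merge criterion."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pa, pb = points[clusters[a]], points[clusters[b]]
                gap = pa.mean(axis=0) - pb.mean(axis=0)
                # Increase in within-cluster sum of squares if a and b merge.
                cost = len(pa) * len(pb) / (len(pa) + len(pb)) * float(gap @ gap)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(len(points), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

# Synthetic data: three well-separated groups of five observations each.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [6, 6, 0, 0], [0, 6, 6, 0]], dtype=float)
X = np.vstack([c + 0.3 * rng.standard_normal((5, 4)) for c in centers])

scores = pca_scores(X, n_components=2)      # step 1: dimension reduction
labels = ward_clusters(scores, n_clusters=3)  # step 2: Ward clustering
```

On real data the inputs would be the standardized reporting rates and significant continuous covariates, and the number of clusters would be chosen from the dendrogram rather than fixed in advance.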

Results
The deviance explained by the best GAM with identity link function is 97.4% (adjusted R-squared: 0.897), indicating high explanatory power. Among the 34 covariates, the significant factors with p-values less than 0.05 are: population, urban population, median age, life expectancy, Islam population, literacy rates, global health security, number of nurses, number of physicians, malaria incidence, diabetes prevalence, lower respiratory infection prevalence, cardiovascular disease prevalence, GDP, human development index, democracy index, corruption, voice accountability, air transport, social media and internet filtering (Figure 2). [...] prevalence, human development index and air transport correlate with higher reporting rates.
HCPC analysis resulted in three major clusters, as shown in the dendrogram in Figure 3A, indicated by different colors. The mean Silhouette score for the clustering using the coordinates in the factor map is 0.485. Their geographical locations are shown in Figure 3B with corresponding colors for each resultant cluster. Cluster 1 (gray) includes the central African region, with the largest number of countries. Cluster 2 (red) includes the northern African region with Morocco, Tunisia, Algeria, Libya and Egypt (Gabon as an exception). Cluster 3 (blue) includes essentially the southern African region with Namibia, Botswana, South Africa (Ghana as an exception), plus Mauritius, Seychelles, Cabo Verde and São Tomé and Príncipe. Among all continuous covariates and the COVID-19 daily case reporting rate used for the clustering, the reporting rate (p = 0.048) and the covariates shown in Figure 4 are significantly different across clusters according to the ANOVA test. According to Figure 4, Cluster 2 (red) has the best COVID-19 reporting performance, followed by Cluster 3 (blue) and then Cluster 1 (gray). Among the three major clusters, Clusters 2 and 3 on average exhibit higher median age, life expectancy, urban population size, human development index, GDP, number of physicians, number of nurses and literacy rates, and lower malaria incidence. However, Clusters 2 and 3 on average also have higher corruption index, diabetes and cardiovascular disease prevalence. Cluster 1 shows intermediate average values, with Clusters 2 and 3 polarized, in democracy index, voice accountability, lower respiratory infection prevalence and Islamic population size.

Discussion
We identified four demographic factors (total population, urban population, median age and life expectancy); seven public health related factors (global health security, number of physicians, number of nurses, and malaria, diabetes, lower respiratory disease and cardiovascular disease prevalence); eight socioeconomic factors (GDP, human development index, democracy index, corruption index, voice accountability, air transport, social media and internet filtering); one religious factor (Islamic population); and one education factor (literacy rates) as significantly associated with COVID-19 case identification during the early stage of the pandemic in Africa. Based on these determinants and the estimated daily case reporting rate, the African countries were classified into three major clusters.
Beginning with demographics, we note that areas with a low urban population might see more contained and easier-to-track spread, while highly urbanized areas could boast better health infrastructure and reporting systems (Figure 2B). Moderately urbanized areas may face reporting challenges as rapidly growing populations outpace the development of healthcare facilities (49). Countries with a higher median age (Figure 2C) might experience a higher COVID-19 case fatality rate due to increased risk in older individuals (7, 50), significantly straining healthcare systems. Higher life expectancy (Figure 2D) could indicate superior overall healthcare systems (51,52), potentially leading to more efficient COVID-19 detection and reporting. The negative association with population size (Figure 2A) (53) suggests that larger populations face lower reporting rates due to the challenges of scaling up testing and reporting infrastructure (54,55) or the greater likelihood of under-detection in densely populated areas.
Socioeconomically, a higher GDP (Figure 2N) might correlate with better healthcare infrastructure (56, 57), thus enhancing testing and reporting capacity. The democracy index (Figure 2P) suggests that countries with very high (58) or very low (59) scores exhibit higher reporting rates, possibly reflecting transparency in democratic countries and international scrutiny or aid in less democratic countries (60). In general, during the COVID-19 pandemic, many countries experienced declines in their democracy scores, particularly those with authoritarian regimes (61, 62). Increased corruption (Figure 2Q) is linked to decreased reporting rates, suggesting that corruption may hinder accurate reporting (63) due to factors like mismanagement of information and resources. Countries where citizens have more voice and accountability (Figure 2R) may report better, likely due to greater transparency and public demand for accurate information. This could also be seen in social media censorship (Figure 2T), but beyond a certain level, the improvement in case reporting becomes less pronounced. The sharp decline at the lower end of the human development index (HDI) scale (Figure 2O) might suggest that countries with lower development levels have much lower COVID-19 reporting rates, possibly due to limited healthcare infrastructure and fewer resources for testing and reporting. As countries attain a moderate level of human development, the reporting rate's decrease slows, potentially indicating that beyond a certain threshold of development, improvements in reporting rates become less pronounced, possibly due to the establishment of basic reporting systems. Countries with moderate levels of air transport have the highest reporting rates (Figure 2S), which could be tied to better connectivity and infrastructure that also supports health reporting systems. Interestingly, a higher proportion of the Muslim population (Figure 2E) correlates with increased reporting rates, prompting consideration of the complex interplay between religious practices, community engagement, and public health policies. This finding warrants further investigation to understand the underlying social mechanisms at play. The inverse U-shaped curve observed for literacy rates (Figure 2F) suggests that moderate literacy aligns with the highest reporting rates, likely due to the intersection of disease prevalence and public health awareness. Countries with high literacy rates might have lower incidence due to preventative measures and thus less under-reporting, while low literacy might impede disease recognition and reporting.
Public health insights indicate varying impacts of healthcare resources on COVID-19 reporting rates. Higher malaria incidence (Figure 2J) (64) may indicate robust disease tracking systems (65). Interestingly, the number of nurses per capita (Figure 2H) has a negative association with reporting rates, suggesting that more nurses might not directly translate to higher reporting, potentially due to efficient disease management or other unaccounted factors. Conversely, a higher number of physicians (Figure 2I) correlates with increased reporting rates, possibly indicating better diagnostic and surveillance capacity. It could also reflect broader healthcare system quality, where more physicians per capita mean more comprehensive care, including chronic disease management, and a heightened awareness and ability to report infectious diseases like COVID-19. The global health security index's positive association (Figure 2G) implies that countries prepared for health crises report more efficiently (66). This result would imply that countries with higher scores in global health security, which encompasses factors like disease detection, response, health system quality, and the risk environment, are likely to report COVID-19 cases more efficiently and accurately. This is consistent with what would be expected, as a higher global health security index indicates a stronger healthcare infrastructure capable of dealing with pandemics. For diabetes prevalence (Figure 2K), a negative association would suggest that as the prevalence of diabetes in the population increases, the COVID-19 reporting rates decrease. This could be due to several potential factors. For example, countries with higher prevalence of diabetes may have healthcare systems that are more burdened by chronic disease management (67), possibly leading to less capacity for effective infectious disease surveillance and reporting. Alternatively, it could reflect socioeconomic factors (68), where high diabetes prevalence is associated with other factors that might impede effective reporting, such as limited access to healthcare services or lower health literacy regarding infectious diseases. The inverse U-shaped relationship between lower respiratory infection (LRI) rates and COVID-19 reporting (Figure 2L) suggests that countries with extremely high or low LRI rates tend to have lower COVID-19 reporting rates, while those with moderate LRI rates report more cases. High LRI rates may strain health resources, leading to under-reporting of COVID-19, whereas countries with low LRI rates might lack the necessary infrastructure or experience to detect and report COVID-19 effectively. Optimal reporting is observed in countries with moderate LRI prevalence, possibly due to balanced health system vigilance and capacity. Notable exceptions, such as Libya with low LRI but high COVID-19 reporting rates, indicate that other factors also significantly influence reporting. The positive association seen with cardiovascular disease death rates (Figure 2M) suggests that countries with a greater number of reported deaths from cardiovascular diseases have a greater COVID-19 reporting rate. As cardiovascular diseases are the leading cause of death in Africa (69), countries with a higher number of reported deaths from such diseases may also possess enhanced disease surveillance and reporting practices and thus have a higher COVID-19 reporting rate.
Studies have been conducted to find possible reasons for the low numbers of cases and deaths in Africa. In one study, the proportion of older adults (≥60 years old) was identified as the major factor explaining the low case number, and health system capacity was identified as responsible for the case under-estimation (70). In another study, international flights, testing capacity, population density, young population, Vitamin D levels, cross-immunity from other infections, temperature, UV light and humidity were listed as potential reasons for Africa's low case number (71). However, it is worth noting that our study is different, as its subject is the reporting/under-reporting rate rather than the overall case number. To our knowledge, there have not been investigations on the determinants of COVID-19 case reporting in Africa. The dependent variable, the COVID-19 daily case reporting fraction, was estimated using the same mechanistic mathematical model for all African countries and therefore provides a reliable and fair comparison among them (9). Our study also considered a wider range of potential factors than those in the existing literature (10, 12-15). Interestingly, some factors do not show the same relationship with total case number and case reporting ratio. For example, higher air transport rate and human development index are thought to be associated with a higher COVID-19 case number (71), but they demonstrate an inverse U-shaped rather than monotonic association with the reporting ratio (Figure 2). And while larger median age implies more cases, its relationship with the reporting ratio is negative (Figure 2C). That could provoke further thinking on how the factors affect the case reporting system. The clustering results from our HCPC analysis are in general agreement with other studies. For example, previous emerging infectious disease epidemics revealed the vulnerability of Western and Central Africa in facing both known and unknown pathogens due to a growing urban population with insufficient public health infrastructure (72), and moreover, Northern and Southern Africa show higher capacities in health systems (70). However, it should be noted that reporting practice varies as the outbreak progresses, and the case reporting rate used here is only the estimate for the initial phase of the pandemic. Therefore, these determinants could be interpreted as the preparedness in the face of a novel emerging communicable disease outbreak but cannot be extended to subsequent efforts made by the nations or the overall case identification performance over the entire course.

FIGURE 1 Daily COVID-19 reporting rates in percentage for 54 African countries in ascending order.

FIGURE 2 Partial effects of statistically significant predictors (A-U) on COVID-19 daily case reporting rate (%) with p-values from the generalized additive model (deviance explained = 97.4%). p-values < 0.05 were considered statistically significant.

FIGURE 3 (A) Dendrogram showing three major clusters from HCPC; (B) Geographical locations of the three clusters.

FIGURE 4 Significant covariates characterizing the three major clusters from HCPC.

TABLE 1
List of factors considered with descriptions and sources. *Except for Eritrea, for which data are from 2011. †Except for Djibouti and Somalia, for which data are from 2003 and 2001.
Voice and accountability: extent to which a country's citizens are able to participate in selecting their government, as well as freedom of expression, freedom of association, and a free media (standard normal distribution, ranging from approximately −2.5 to 2.5).