1 Introduction

Data Science provides individuals, institutions, and regulatory authorities with insight into the massive volumes of online data they collect. This potential is especially critical during a crisis, and the COVID-19 pandemic unleashed it. The problem is that big data tends to overwhelm the human capacity to understand the story being told within the data. However, with advances in easy-to-use ‘drag-and-drop’ software tools, such as Tableau™, and in the open-source languages Python and R, data dashboards and data visualizations are relatively easy to generate and are now prevalent. Patterns and trends, which might otherwise go undetected in tabular data or statistical measures, can be exposed more easily through data visualization technology. Dynamic visualizations of data have become effective and sought-after tools for revealing these trends.

Now that the COVID-19 pandemic appears to be in the rearview mirror, this paper acknowledges the contributions of Data Science to that emergency. The goal is to demonstrate that Data Science plays a significant role in informing humanity and aiding decision-makers whenever an emergency occurs. The paper also provides a roadmap of open-source scripts for others who wish to use these critical toolsets in future disasters, thereby distinguishing itself from other articles on the topic, such as those by Berinato, Saxena et al., and Leonelli. Using publicly available datasets from multiple reliable sources and employing a number of R toolkits, the author developed a series of data dashboards and visualizations for the COVID-19 story.

2 Literature Review

The literature is replete with articles attesting to the power of data dashboards and visualizations. Matzler and other authors also suggest that, despite advances in predictive analytics, forecasting during an emergency is still challenging. Attempts to predict the path of the COVID-19 outbreak serve as an example. Instead, Post, Nielson, and Bonneau, as well as Friedman, recommend using data visualization techniques to quickly assimilate a critical situation and then employing predictive models to statistically validate the visual trends. As a result, the focus has shifted toward ‘visualizations that really work’.

Even before the World Health Organization (WHO) declared the SARS-CoV-2 outbreak a pandemic, the world of Data Science was actively collecting COVID data. Articles attempting to digest the data flooded the media. Many more data sets and articles have appeared since then, but what is noteworthy is that nearly every presentation of the data made use of data dashboards and visualizations to communicate and diagnose the pandemic. Data Science exploded in prominence, and never has this technology been so significant to so many at the same time: researchers, health care professionals, policy-makers, academics, decision-makers, and ordinary citizens. It is most likely the first case in which visualizations were widely used in critical global decision-making. The Washington Post’s ‘corona simulator’ article in March 2020 was the most-viewed article ever published by that newspaper. John Burn-Murdoch, a Senior Data-Visualization Journalist for the Financial Times, has seen his social media following balloon thanks to his visualizations.

The ‘COVID-19 Dashboard’ provided by Johns Hopkins University & Medicine was unique and filled a void in international public health systems. It changed the way the public accessed real-time health information. Similar data dashboards were widely replicated by governments, enterprises, and media outlets, and these dashboards are expected to change how society coordinates its responses to future pandemics. As an example, Marivate and Combrink present a data dashboard to inform the public about the COVID outbreak in South Africa. The Centers for Disease Control and Prevention (CDC) in the United States provides a public dashboard to help communities assess the impact of COVID-19 and to help individual states take appropriate action. The WHO developed a dashboard to provide COVID information by country, and its Public Health and Social Measures (PHSM) program underscored the steps that countries, territories, and areas need to take to enforce rules or guidelines that limit the spread of COVID-19.

3 Methods

Two contributions drive the methods used in this paper: the sources of publicly available data and the open-source toolkits for assimilating these data. For the former, the volume of data that agencies and institutions were tracking on COVID-19 initially overwhelmed our capacity to quickly digest the impacts. The data covered not only the number of confirmed cases, but also related patient data, the virus’ impacts on these patients, and the effects on nations and their economies. It took time and the proper tools to understand how the virus was affecting humans and how to mitigate or remedy the ensuing problems. Big data in crises like this is so complex that traditional data-processing approaches and software tools are less effective at assimilating the situation than Data Science toolkits. The challenges include capture, storage, analysis, search, transfer, query, updates, sourcing, and privacy. Marr simplified these issues into the Five V’s: Volume, Velocity, Variety, Veracity, and Value. Adherence to the Five V’s also implies that the data must be ‘tidy’, a reference to the cleanliness of the data.

A number of open-source COVID data repositories were created by reputable organizations. Table 1 provides a comprehensive list of those data sources.

Table 1

COVID-19 public data sources.


| Source | URL |
| --- | --- |
| Johns Hopkins University | 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE |
| World Health Organization | https://covid19.who.int/ |
| European Centre for Disease Prevention and Control | https://www.ecdc.europa.eu/en/geographical-distribution-2019-ncov-cases |
| European Centre for Disease Prevention and Control | https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide |
| The Covid Tracking Project | https://covidtracking.com/ |
| U.S. Centers for Disease Control and Prevention (CDC) | https://www.cdc.gov/coronavirus/2019-ncov/covid-data/covidview/index.html |
| ESRI (Environmental Systems Research Institute) COVID-19 Resources | https://coronavirus-disasterresponse.hub.arcgis.com/ |
| COVDATA: a data package for R that collects and bundles datasets related to the COVID-19 pandemic from a variety of sources | https://kjhealy.github.io/covdata/index.html |
| BeOutBreakPrepared-nCoV2019 | https://github.com/beoutbreakprepared/nCoV2019/tree/master/latest_data |
| Kaggle COVID-19 Open Research Dataset Challenge (CORD-19) | https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge |
| Kaggle Novel Corona Virus 2019 Dataset | https://www.kaggle.com/sudalairaikumar/novel-corona-virus-2019-dataset |
| Towards Data Science: Fighting the Covid-19: All the datasets and data efforts in one place | https://towardsdatascience.com/fighting-the-covid-19-all-the-datasets-and-data-efforts-in-one-place-4d6aeb0157ab |
| GitHub Our World in Data | https://github.com/owid |
| GitHub New York Times COVID-19 Data | https://github.com/nytimes/covid-19-data |
| GitHub CSSE (Center for Systems Science and Engineering at Johns Hopkins) COVID-19 Daily Reports | https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports |
| GitHub MIDAS Network Coordination Center | https://github.com/midas-network/COVID-19 |
| University of Oxford Covid-19 Government Response Tracker | https://www.bsg.ox.ac.uk/research/research-projects/covid-19-government-response-tracker |

The data contain thousands of observations with a number of attributes, some of which usually are not significant to the analysis. Thus, the first step is to investigate the data structure. To help accomplish this task, the author employed ‘DataExplorer’ in R, which provides a visualization of the structure (Figure 1).

Figure 1 

DataExplorer visualization of the underlying data structure (author generated).
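The paper's own scripts are written in R and are not reproduced in this section. As a rough illustration of what a first look at the data structure involves, the following is a minimal sketch in Python (standard library only), not the DataExplorer implementation. It reports an inferred type and the share of missing values for each column. The sample extract, its column names, and its values are hypothetical.

```python
import csv
import io

# Hypothetical extract: column names and values are made up for illustration.
SAMPLE_CSV = """country,date,confirmed,deaths,recovered
Country A,2020-03-01,1000,30,80
Country A,2020-03-02,1300,45,
Country B,2020-03-01,90,0,2
Country B,2020-03-02,120,,2
"""


def infer_type(values):
    """Return 'integer', 'float', 'text', or 'empty' for a column's values."""
    present = [v for v in values if v != ""]
    if not present:
        return "empty"
    for caster, label in ((int, "integer"), (float, "float")):
        try:
            for v in present:
                caster(v)
            return label
        except ValueError:
            continue
    return "text"


def profile(csv_text):
    """Summarize each column: inferred type and missing-value share."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    report = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        missing = sum(1 for v in values if v == "")
        report[column] = {
            "type": infer_type(values),
            "missing": missing,
            "missing_share": missing / len(values),
        }
    return report


structure = profile(SAMPLE_CSV)
for name, info in structure.items():
    print(f"{name:<10} {info['type']:<8} missing={info['missing_share']:.0%}")
```

A summary of this kind helps decide which attributes to keep before any chart is drawn.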

In addition, raw data usually require some data ‘wrangling’: that is, handling missing values, obvious outliers due to transcription errors, duplicate records, and so on. All of these tasks can be carried out using pre-defined functions in R or Python. For example, R has an excellent set of data cleaning packages in the ‘Tidyverse’, which is worth investigating. For all of the COVID-19 data repositories used in this paper, the organizations that provided the data had already published them in ‘tidy’ form.
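The three wrangling tasks named above can be sketched in a few lines. The example below is a simplified illustration in Python (standard library only), not the Tidyverse functions and not the author's scripts. The records are hypothetical, the outlier threshold of ten times the median is an arbitrary assumption, and carrying the last value forward is only one of several possible imputation choices.

```python
from statistics import median

# Hypothetical daily records for one country: (country, date, new_cases).
# None marks a missing value.
raw = [
    ("A", "2020-03-01", 100),
    ("A", "2020-03-02", 120),
    ("A", "2020-03-02", 120),    # exact duplicate record
    ("A", "2020-03-03", None),   # missing value
    ("A", "2020-03-04", 150),
    ("A", "2020-03-05", 15000),  # likely transcription error
    ("A", "2020-03-06", 170),
]


def drop_duplicates(records):
    """Keep the first occurrence of each record, preserving order."""
    seen, unique = set(), []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


def flag_outliers(records, factor=10):
    """Set values above factor * median to None (illustrative threshold)."""
    typical = median(v for _, _, v in records if v is not None)
    return [
        (c, d, None if v is not None and v > factor * typical else v)
        for c, d, v in records
    ]


def fill_missing(records):
    """Carry the last observed value forward into missing entries."""
    filled, last = [], None
    for c, d, v in records:
        if v is None:
            v = last
        filled.append((c, d, v))
        last = v
    return filled


clean = fill_missing(flag_outliers(drop_duplicates(raw)))
for record in clean:
    print(record)
```

The order matters in this sketch: duplicates are removed first so that they do not distort the median used to flag outliers, and flagged outliers are then treated as missing values.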

Since the focus is on visualizations, of the many that are available, the question becomes: which visualizations are appropriate, and which visualizations tell the story that the data scientist wishes to analyse? Every visualization tells a different story, and every scientist interprets the visualization differently. Thus, there can be multiple ways to visualize data. As a guide, Lengler and Eppler () developed a systematic overview of 100 visualizations. Their efforts culminated in a structure for the visualizations that replicated the logic, look, and use of the periodic table of elements in chemistry.

4 Analysis of a Crisis

4.1 Static visualizations

The data confirmed the exponential growth of COVID cases that is to be expected with a pandemic. The first trickle of confirmed cases soon turned into a torrent. From the onset, the world fixated on this exponential growth, and the WHO began to underline the importance of ‘flattening the curve’. The result was what will undoubtedly become one of the most infamous data visualizations recorded in the public media for COVID. China was the first country to exhibit the curve and then attempt to flatten it (Figure 2).

Figure 2 

Number of deaths, confirmed cases, and cured cases in China (author generated).
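One common way to quantify the growth behind such a curve is the doubling time of cumulative cases. The following Python sketch (standard library only) estimates it from a least-squares fit of the logarithm of the counts against the day index. It illustrates the idea and is not the method used to produce the figures. Both series are synthetic, constructed to double every three and every six days respectively.

```python
import math


def growth_rate(counts):
    """Least-squares slope of ln(count) against the day index."""
    n = len(counts)
    days = list(range(n))
    logs = [math.log(c) for c in counts]
    mean_x = sum(days) / n
    mean_y = sum(logs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs))
    sxx = sum((x - mean_x) ** 2 for x in days)
    return sxy / sxx


def doubling_time(counts):
    """Days needed for the count to double at the fitted growth rate."""
    return math.log(2) / growth_rate(counts)


# Synthetic series, not real case counts.
early = [100 * 2 ** (day / 3) for day in range(10)]      # doubles every 3 days
flattened = [100 * 2 ** (day / 6) for day in range(10)]  # doubles every 6 days

print(round(doubling_time(early), 2), round(doubling_time(flattened), 2))
```

In these terms, ‘flattening the curve’ corresponds to a lengthening doubling time.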

After the WHO declared the pandemic, every other country followed China’s pattern (Figure 3).

Figure 3 

Total number of cases world-wide except China (author generated).

A different variation of this story can be told, not through traditional bar graphs, but through an adaptation of a ‘wind rose’, a hybrid of a bar chart and a pie chart, to visually depict the spread by country (Figure 4). In this adaptation, the number of confirmed cases plays the role of wind speed, so the United States, the country with the highest number of confirmed cases (highest speed), stands out.

Figure 4 

Global confirmed cases by country using a wind rose (author generated).

To depict location, the story can be told through another familiar and popular visualization: a geographic map (Figure 5).

Figure 5 

Global confirmed cases by country using a geographic map (author generated).

The velocity and the direction/location of the virus were only part of the story. Perhaps the most significant part was the medical and economic risk. Risk is the statistical story of variation. For that, a violin plot, which is a hybrid of a box plot and rotated kernel density plot, reveals both the density and spread of the virus. Figure 6 is a series of violin plots for the six leading countries in terms of confirmed cases. The United States leads the five other countries represented.

Figure 6 

Global confirmed cases for selected countries using a violin plot (author generated).

The story embedded in the violin plot of Figure 6 is not only the peak of the violin but also the width, or variation, in the number of cases for each country. Compare, for example, South Korea to the others, and compare the United States to the European countries.

Since the populations of these countries are very different, one begins to question how the rates compare per capita. Figure 7 is the same type of violin plot as Figure 6, except that it shows the number of confirmed cases per capita for the top 10 countries, and Figure 8 shows the number of deaths per capita. In Figure 7, Spain and the United States have the highest peaks. The United States fares somewhat better in Figure 8, which shows deaths.

Figure 7 

Confirmed cases per capita for selected countries using a violin plot (author generated).

Figure 8 

Confirmed deaths per capita for selected countries using a violin plot (author generated).
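The per-capita adjustment behind Figures 7 and 8 is a simple normalization. The Python sketch below shows how a ranking by raw counts can reverse once counts are expressed per 100,000 inhabitants. The two countries and all figures are hypothetical, and the choice of 100,000 as the base is an assumption, since the paper does not state the scale it used.

```python
def per_100k(count, population):
    """Express a raw count as a rate per 100,000 inhabitants."""
    return count * 100_000 / population


# Hypothetical countries and figures, for illustration only.
countries = {
    "Country A": {"confirmed": 500_000, "population": 50_000_000},
    "Country B": {"confirmed": 900_000, "population": 300_000_000},
}

rates = {
    name: per_100k(data["confirmed"], data["population"])
    for name, data in countries.items()
}

# Rank once by raw counts and once by per-capita rates.
ranked_raw = sorted(
    countries, key=lambda name: countries[name]["confirmed"], reverse=True
)
ranked_rate = sorted(rates, key=rates.get, reverse=True)

print("by raw count:", ranked_raw)
print("by rate per 100k:", ranked_rate)
```

Here Country B leads on raw counts, while Country A leads once population is taken into account.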

4.2 Interactive visualizations

While static charts can tell a compelling story at a point in time, researchers, medical professionals, decision-makers, and policy-makers often seek to investigate trends further. Interactive visualizations allow investigators to interact directly with the ‘live’ data. As an example, Figure 9 shows two kernel density plots of the number of confirmed cases across all countries. The graph on the left uses a high bandwidth for the density estimation, and it appears to show a roughly normal distribution of the number of cases. That appearance is misleading: because the data pool many countries, a relatively high bandwidth obscures much of the underlying data structure. When the bandwidth is reduced, as shown on the right, the plot reveals that the distribution is not at all normal. Rather, it reveals more than one wave. Since multiple countries are represented, the first wave is actually China, followed by Italy, and then the other countries. When the chart was first created in March 2020, the United States was a small wave. The United States has since surpassed the other countries, but the example demonstrates that interacting with the data dynamically reveals the underlying picture.

Figure 9 

Global confirmed cases by country using an interactive density chart (author generated).
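The effect of the bandwidth can be reproduced without any plotting. The sketch below, in Python with the standard library only, builds a Gaussian kernel density estimate and counts its local maxima on a grid for two bandwidths. It illustrates the principle and is not the code behind Figure 9. The sample is synthetic, with three deliberately separated clusters standing in for successive waves.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a Gaussian kernel density estimate of `sample` as a function."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample
        )

    return density


def count_modes(density, lo, hi, steps=400):
    """Count local maxima of `density` on an evenly spaced grid."""
    ys = [density(lo + (hi - lo) * i / steps) for i in range(steps + 1)]
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])


# Synthetic sample with three clusters; not real case data.
sample = [9, 10, 10, 11, 12, 29, 30, 30, 31, 32, 49, 50, 50, 51, 52]

smooth = gaussian_kde(sample, bandwidth=15.0)    # high bandwidth
detailed = gaussian_kde(sample, bandwidth=2.0)   # low bandwidth

print("modes at high bandwidth:", count_modes(smooth, 0, 60))
print("modes at low bandwidth:", count_modes(detailed, 0, 60))
```

With the wide bandwidth the estimate has a single mode and looks roughly bell-shaped. With the narrow bandwidth the three clusters reappear as separate peaks, which is the behavior an interactive bandwidth control lets an analyst discover.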

4.3 Networked visualizations

As illustrated above, there are many different ways to create a visualization, much as an artist has many ways to render a subject. In this era of rapid dissemination of information, however, communicating the story to a wide audience often involves the web. Shiny is an R package that makes it relatively easy to build interactive web apps. For example, Figure 10 is a horizontal bar chart of countries outside of China, arranged in decreasing order of confirmed deaths. The chart, produced in a Shiny dashboard, shows that the United States quickly surpassed Spain and Italy in the number of fatalities.

Figure 10 

Global confirmed deaths by country using a Shiny app horizontal bar chart (author generated).

The interactive app of Figure 11 shows the growth trajectories of different countries, plotting the number of confirmed cases against the number of days since the 100th confirmed case. The data analyst can select an area to zoom in on and/or ‘mouse over’ a line to see the name of the country, as illustrated in the figure. At the time the visualization was created, all of the countries were on a similar exponential growth trajectory. Unlike China, the United States had seeded the virus broadly, but it was also showing signs of slowing down, that is, of ‘flattening the curve’.

Figure 11 

Confirmed COVID-19 cases by country using an interactive line chart web app (author generated).
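Plotting against days since the 100th confirmed case requires re-indexing each country's series so that day 0 is the first day at or above that threshold. A minimal Python sketch of this alignment step follows. It is an illustration, not the author's Shiny code, and the country names, dates, and counts are hypothetical.

```python
def days_since_threshold(series, threshold=100):
    """Return cumulative counts starting at the first day >= threshold.

    `series` is a date-ordered list of (date, cumulative_count) pairs.
    The position in the returned list is the number of days since the
    threshold was reached. A series that never reaches it yields [].
    """
    for start, (_, count) in enumerate(series):
        if count >= threshold:
            return [c for _, c in series[start:]]
    return []


# Hypothetical cumulative counts; outbreaks start on different dates.
cumulative = {
    "Country A": [("2020-02-20", 20), ("2020-02-21", 80),
                  ("2020-02-22", 150), ("2020-02-23", 320)],
    "Country B": [("2020-03-05", 60), ("2020-03-06", 110),
                  ("2020-03-07", 240), ("2020-03-08", 500),
                  ("2020-03-09", 900)],
    "Country C": [("2020-03-01", 5), ("2020-03-02", 9)],
}

aligned = {
    name: days_since_threshold(series) for name, series in cumulative.items()
}

for name, counts in aligned.items():
    print(name, counts)
```

Once aligned this way, trajectories that began on different calendar dates can be drawn on a shared horizontal axis and compared directly.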

Figures 10 and 11 were created using the Shiny ‘dashboard’ command in R.

This web-based interactive visualization tool is a good example of how Data Science and visualizations can help address the volume, variety, and veracity of data. The visualizations help to mitigate data challenges by summarizing vast amounts of data (Volume), simplifying the different types of data (Variety), and identifying errors (Veracity). However, the biggest challenge is transforming the data into the fifth ‘V’, Value, especially when the investigation involves a pandemic. An interactive web visualization app can be a powerful tool for a media-hungry audience. In a recent Harvard Business Review article, Berinato maintains that the teams who develop charts have rapidly advanced the technical side of building them, but he argues that more focus is needed on presenting the data to public audiences. That is the role of data visualizations. In other words, interactive visualizations are the tools that allow an audience to better assimilate the complex data they are being given.

4.4 Data dashboards

Dashboards proved very helpful during the pandemic. Figure 12 is an author-generated re-imagination of a typical COVID dashboard, the COVID-19 Global Tracker Dashboard. The purpose of such dashboards is to track the pandemic in real time in both tabular and visual form.

Figure 12 

Global tracking dashboard (author generated).

The dashboard presents cases, recoveries, and deaths for 195 countries. It uses data from the Johns Hopkins CSSE repository (listed in Table 1). During the peak of the pandemic, the country with the highest cumulative number of confirmed cases was the United States.
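The tabular side of such a dashboard amounts to aggregating regional records into country totals and ranking them. The Python sketch below (standard library only) illustrates that step. It is not the code behind Figure 12. The row layout of country, region, confirmed, recovered, and deaths is an assumption made for this example, and every value is hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (country, region, confirmed, recovered, deaths).
rows = [
    ("Country A", "North", 1200, 300, 40),
    ("Country A", "South", 800, 250, 25),
    ("Country B", "", 1500, 900, 60),
    ("Country C", "", 90, 10, 1),
]


def summarize(rows):
    """Aggregate regional rows into per-country totals."""
    totals = defaultdict(lambda: {"confirmed": 0, "recovered": 0, "deaths": 0})
    for country, _region, confirmed, recovered, deaths in rows:
        totals[country]["confirmed"] += confirmed
        totals[country]["recovered"] += recovered
        totals[country]["deaths"] += deaths
    return dict(totals)


def render(totals):
    """Format totals as a text table sorted by confirmed cases, descending."""
    header = f"{'Country':<12}{'Confirmed':>10}{'Recovered':>10}{'Deaths':>8}"
    ordered = sorted(
        totals.items(), key=lambda item: item[1]["confirmed"], reverse=True
    )
    body = [
        f"{name:<12}{t['confirmed']:>10}{t['recovered']:>10}{t['deaths']:>8}"
        for name, t in ordered
    ]
    return "\n".join([header] + body)


totals = summarize(rows)
top_country = max(totals, key=lambda name: totals[name]["confirmed"])

print(render(totals))
print("Highest cumulative confirmed cases:", top_country)
```

A live dashboard would rerun this aggregation each time the underlying repository is updated.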

5 Conclusions

The coronavirus pandemic drew attention to the contributions of Data Science to a humanitarian crisis. Even before the WHO declared the outbreak a pandemic, numerous organizations began actively collecting and distributing data on the virus and its effects on people. As a result, data visualizations and dashboards became critical tools in educating the public and supporting healthcare professionals, researchers, and decision-makers. Properly generated visualizations and dashboards help mitigate the challenges of summarizing vast amounts of data (Volume), simplifying the different types of data (Variety), and identifying errors (Veracity), which are some of the ‘Five V’s’ that Marr associated with live data. The greatest challenge, however, is transforming the data into Value, especially when the timing of decisions is critical. Visualizations and dashboards provide value and make a significant contribution to our insight into crises. This paper reviews these contributions by demonstrating how the COVID-19 story unfolded through author-generated data visualizations and dashboards and by providing the community with open-source access to the scripts that generated these visualizations. Using publicly available datasets from multiple sources and employing R toolkits, the author demonstrates the role that Data Science can play in a pandemic, using methods that can be implemented by anyone with some basic knowledge of scripting languages like R.