nCov2019: an R package for studying the COVID-19 coronavirus pandemic

Background The global spread of the COVID-19 coronavirus is still a serious public health challenge. Although a large number of public resources provide statistical data, tools for retrieving historical data and for convenient visualization are still valuable. To provide convenient access to data and visualization on the pandemic, we developed an R package, nCov2019 (https://github.com/YuLab-SMU/nCov2019). Methods We collect stable and reliable data on COVID-19 cases from multiple authoritative and up-to-date sources, and aggregate the most recent and historical data for each country and, where available, each province. Medical progress information, including global vaccine development and therapeutic candidates, was also collected and can be accessed directly in our package. The nCov2019 package provides an R language interface with functions for data operation and presentation, a set of interfaces to fetch data subsets intuitively, visualization methods, and a dashboard that requires no extra coding for data exploration and interactive analysis. Results As of January 14, 2021, the global health crisis is still serious. The number of confirmed cases worldwide has reached 91,268,983. Following the USA, India has reached 10 million confirmed cases. Multiple peaks are observed in many countries. Thanks to the efforts of researchers, 51 vaccines and 54 drugs are under development and 14 of these vaccines are already in the pre-clinical phase. Discussion The nCov2019 package provides detailed statistical data, visualization functions and a Shiny web application, which allow researchers to keep abreast of the latest overview of the epidemic's spread.

The COVID-19 pandemic emerged at the end of 2019 in Wuhan, China [1,2]. The virus has raged globally for more than 12 months and its spread is still accelerating. It remains an extremely serious public health challenge, affecting more than 220 countries worldwide. After a year-long battle, researchers have gained more insight into the transmission route, molecular structure, sequence information and pathogenesis of the novel coronavirus [3-6]. Experts in epidemiology and data science also play a major role; they have collected data and developed many convenient tools to deliver information on the latest infection numbers, high-risk areas and so on. The World Health Organization (WHO) [7] and national Centers for Disease Control and Prevention (CDCs) publish authoritative data, but usually update only once a day, so the data often lag behind. In contrast, statistics reported by news and data aggregators, such as DXY [8], WorldoMeters [9], ourworldindata [10] and BNO News [11], are usually updated more frequently and published before those of the CDCs. The Johns Hopkins Center for Systems Science and Engineering (CSSE) integrates statistics from these data sources and provides an online visual dashboard [12]; it is the most popular and frequently updated platform we could find. Their efforts have made it easier for the public to get the latest information on the spread of the epidemic.
Data analysis and customized visualizations are essential to medical researchers and epidemiology experts for analyzing changes in spread, monitoring new outbreak trends, assessing current health measures and so on. Obtaining data is a prerequisite for data analysis. Over time, many websites and resources have come to provide COVID-19 data. Currently, several R packages are available that provide different types of data. For instance, covid19.analytics [13] provides virus sequence queries, as well as a dashboard and SIR models; the COVID19 [14] and coronavirus [15] packages provide detailed vaccine and case test data. Here, we provide an R package, nCov2019, which aims to access, visually explore and analyze epidemic-related statistics in R (Figure 1). The nCov2019 package is more comprehensive than other R packages. It provides more data types, including real-time and historical infection statistics as well as therapeutic and vaccine data. In addition, our tool provides several convenient and practical visualization functions and an interactive dashboard, built as a Shiny web application that requires no coding, to help users explore the data visually. The nCov2019 package has been under development since January 2020 and is one of the earliest R packages designed to query COVID-19 data in support of epidemiological modeling. It became available at a time when few data resources existed and has been cited in several academic articles [16-19].

The main purposes of developing nCov2019 are to lower the barriers to data acquisition and to provide essential visualization functions, including dynamic visualization, so that the spread of the virus can be monitored easily and concisely. To achieve these goals, the nCov2019 package was designed with four main parts: statistics data collection, statistics data query and operation, geographic map visualization and an interactive dashboard.

The statistics data comprise the latest status and historical data. For the real-time latest status, we chose WorldoMeters as our data source because of its high update frequency. Our historical data source is CSSE, which integrates data from multiple sources and is reliable and timely. In addition, after one year of continuous battle against the COVID-19 virus, researchers have developed some vaccines and doctors have tried different treatment options. These data are more important than a simple summary of confirmed, recovered and death cases. We therefore also collected data on vaccine development and drug therapeutics progress from two web pages on raps.org [20,21]; these data can also be accessed directly in our nCov2019 package.

For ease of use, we wrapped the data fetching process into a simple function, query(). It usually needs to be executed only once per session, after which users can obtain five datasets. The latest data contain detailed statistics at the time of the query, and the historical data contain daily statistics for each country, which are often useful for modeling epidemic growth. We also provide a global summary to help users understand the latest progress in the fight against the COVID-19 pandemic.

To facilitate downstream data analysis, we defined the `[` operator, which mimics the data selection API for data frames in R, so that data can be easily accessed by region. For example, let X be the historical data; then X[c("USA","UK")] will return the historical data table for the USA and UK only. These data are organized in long format and can be passed directly to ggplot2 for plotting in R.
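The idea of a region-selecting `[` operator can be sketched with R's S3 system as follows. This is only an illustration under stated assumptions: the class name `toy_history`, the column names and the numbers are invented here, and the package's actual implementation may differ.

```r
# Toy long-format table: one row per country per day (values are made up).
hist_df <- data.frame(
  country = rep(c("USA", "UK", "India"), each = 2),
  date    = rep(as.Date(c("2020-03-01", "2020-03-02")), times = 3),
  cases   = c(10, 15, 4, 6, 1, 3),
  stringsAsFactors = FALSE
)

# Wrap the table in a hypothetical S3 class so that `[` can be specialised.
X <- structure(list(table = hist_df), class = "toy_history")

# `[` method: keep only the rows whose country is in `i`
# and return them as an ordinary data frame.
`[.toy_history` <- function(x, i, ...) {
  tab <- x$table
  tab[tab$country %in% i, , drop = FALSE]
}

sub <- X[c("USA", "UK")]
print(sub)
```

Because the result is an ordinary long-format data frame, it can be handed to any plotting or modeling function that accepts data frames.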

The vaccine and drug therapeutics development status can be queried at the same time. The data contain the latest information about candidate medication class, mechanism of vaccine, trade name for drugs and their current trial phase. Both datasets include detailed information such as background, development aims and trial details. These data help users understand medical progress in epidemic prevention and control.

Geographic visualization is an effective way to observe the spatial patterns of virus spread. We provide built-in, convenient geographic map visualization functions within the nCov2019 package. The visualization functions are wrapped into a simple and easy-to-use command, plot(), so that users can plot the distribution of cases on world or national maps. For example, let X be the latest data in the query result; then plot(X) will draw a world map of global confirmed cases (Figure 2).

To review and analyze the spread of the COVID-19 epidemic, a more informative approach is to draw dynamic geographic maps at multiple time points. Users can easily obtain a dynamic map by specifying only the start and end dates in our tool. For instance, calling plot(X, from="2020-03-01", to="2020-08-01") generates an animation that reflects the transmission and spread dynamics of the COVID-19 outbreak during this period (Figure S1).
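Conceptually, a dynamic map needs one frame of data per date inside the requested window. The sketch below shows that step in base R on invented toy data; the data frame, its column names and its values are assumptions for illustration, and how the package builds its animation internally is not described in the text.

```r
# Toy long-format history: one row per country per day (values are made up).
hist_df <- data.frame(
  country = rep(c("USA", "UK"), each = 4),
  date    = rep(as.Date("2020-03-01") + 0:3, times = 2),
  cases   = c(10, 15, 22, 30, 4, 6, 9, 13)
)

# Requested time window (inclusive on both ends).
from <- as.Date("2020-03-02")
to   <- as.Date("2020-03-03")

# Keep only the rows inside the window ...
in_window <- hist_df[hist_df$date >= from & hist_df$date <= to, ]

# ... and split them into one data frame per date, i.e. one frame per map.
frames <- split(in_window, in_window$date)
print(names(frames))
```

Each element of `frames` could then be drawn as a single map, and the maps played in date order to form the animation.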

We also developed an interactive web dashboard to help users access and explore these datasets with mouse clicks. Built with the RStudio Shiny framework, the dashboard can be launched with the function dashboard(). It enables users to select their regions of interest and check both the historical and real-time data. The numbers of confirmed, death and recovered cases are displayed in the dashboard header, followed by a downloadable statistics data table alongside the corresponding cumulative curves (Figure 3).

Multiple charts are placed at the bottom of the dashboard. A chart of global statistics can display nine types of statistics on the world map, such as confirmed cases, active infected cases, number of detection tests and population for each country. A vaccine and therapeutics summary table is shown next to the global statistics chart. To make the dataset easier to explore, we provide an interactive plot that can be used to examine the relationship between any two of the twenty different statistics, such as whether there is an association between the number of confirmed cases and the total number of detection tests. Finally, we designed a curve plot that reflects growth intensity through the daily increase in cases for each country. This chart can be used to monitor outbreak strength over time. One of its practical applications is to determine whether a second wave of outbreaks is occurring.
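The daily increase that underlies such a growth curve can be derived from cumulative counts by first-order differencing. The following is a minimal base R sketch on invented toy values; the column names are assumptions, and the dashboard's own computation may differ.

```r
# Toy cumulative confirmed counts for one country (values are made up).
cum <- data.frame(
  date  = as.Date("2020-03-01") + 0:5,
  cases = c(100, 120, 150, 150, 190, 260)
)

# Daily new cases: difference between consecutive cumulative counts.
# The first day has no predecessor, so its increase is recorded as NA.
cum$daily <- c(NA, diff(cum$cases))

print(cum)
```

Plotting `daily` against `date` gives the growth-intensity curve; a renewed rise after a sustained decline is the pattern that would suggest a further wave.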

At present, the global epidemic crisis is still serious. According to data from the nCov2019 package, as of January 14, 2021, the number of confirmed cases worldwide is 91,268,983 (Figure 1) and the cumulative number of deaths has reached 1,951,790. Two countries have more than 10 million confirmed cases (23,616,515 confirmed cases, 236,631 per million population, in the United States and 10,512,831 confirmed cases, 7578 per million population, in India), which indicates that COVID-19 prevention and control remain under great pressure.

Detecting infected persons and cutting off transmission routes are the key measures for curbing the spread of the virus. Detection status can be visualized with the map function provided in our package (Figure 4). It can be seen at a glance that Asia and North America performed the highest numbers of detection tests for the virus, whereas Africa had the lowest. This may be due to a temporary lack or delay of reported data from African countries. Nevertheless, given the region's limited economic resources and large population, support from other countries is needed in the fight against the pandemic in Africa.

Over the past year, the epidemic has shown multiple peaks in many countries, such as Australia, Japan, Italy, Germany and China. The epidemic in some countries was under control by mid-2020, but confirmed cases then rebounded (Figure S2). While these countries are now at the tail end of the second wave of the epidemic, the possibility of a third wave in the future is a cause for alarm. Some countries, such as the United States, India, Russia and South Africa, are still in the midst of a severe and rapid increase.

Researchers are playing an active role in the fight against the virus. There are currently 51 vaccines in development or in use worldwide. Owing to the urgency of the situation, most vaccines and regimens are being evaluated in multiple clinical trials run simultaneously. Fourteen of these vaccines are already in the pre-clinical phase. As for therapeutics, there are currently 54 candidates, including HIV protease inhibitors, an IL-6 receptor agonist, an HIV-1 Rev protein inhibitor, autologous adipose-derived stem cells and monoclonal antibodies.

All of the above information is available directly in the R package, through either the command line or the dashboard (Figure 3). A detailed vignette covering the usage of the data acquisition and visualization functions can be found in the supplemental file, and the package is hosted on CRAN (https://cran.r-project.org/package=nCov2019).

We provide practical and concise tools for searching outbreak data, as well as vaccine and treatment-related data. The data are sourced from reliable platforms. As a highlight, our nCov2019 package is designed not only to help clinicians and other researchers to obtain statistics tables, but

Africa had the lowest number of detections. Although this may be due to a temporary lack or delay of reported data from African countries, it signals that more attention should be given to African countries.