Computational protocol to perform a spatiotemporal reconstruction of an epidemic

Summary Here, we present a computational protocol to perform a spatiotemporal reconstruction of an epidemic. We describe steps for using epidemiological data to depict how the epidemic changes over time and for employing clustering analysis to group geographical units that exhibit similar temporal epidemic progression. We then detail procedures for analyzing the temporal and spatial dynamics of the epidemic within each cluster. This protocol has been developed to be used on historical data but could also be applied to modern epidemiological data. For complete details on the use and execution of this protocol, please refer to Galli et al. (2023).1

2. RStudio is a user-friendly integrated development environment for R. 3 The installation of RStudio is not mandatory to follow the protocol.However, it makes viewing and interacting with files, packages, objects in the environment, tables, and graphs easy.RStudio can be found at https://www.rstudio.com/.
Optional: QGIS is a free and open-source geographic information system 4 that is used to create, edit, visualize, and analyze geospatial information.The latest version of QGIS can be found at https://www.qgis.org/en/site/.
Install R packages needed for this protocol Timing: 10 min 3.To run this protocol, it is required to previously install the R packages present in the ''key resources table'' as following: Format your dataset or download our dataset from GitHub Timing: 10 min-1 h You can follow the protocol using your data or our sample dataset available on GitHub.
To follow the protocol step-by-step, you must have the data formatted in a dataset in which each row contains a single case.If you have a dataset with cumulative numbers, see below on how to format it properly for the protocol.
Note: This repository includes all the data and code necessary to reproduce the protocol.In this dataset, each row contains the total number of ''cases'' (or deaths) for each cause of death, for each day (or other unit of time) and for each spatial location.
5. Format the table to obtain a single case in each row.
3. Create a time-series plot that shows the progression of the daily number of deaths for each cause of death (in our case Plague or not plague) (Figure 1).

Note:
The plot compares the progression of deaths caused by the disease of interest to deaths related to any other cause: it could show the presence of any kind of seasonality, and the completeness of the dataset.The absence of a signal in the period between 4 and 30 August is due to reasons not related to the analysis. 1ustering of parishes on the basis of their epidemic curves

Timing: 3-4 h
In this step, we are going to group cases/deaths based on the progression of the epidemic.In particular, we are going to analyze the specific cumulative epidemiological curves of each geographical unit.In the example dataset about the 1630 plague epidemic in Milan, the geographical units are the parishes in which the city was divided and where the person died (i.e., territorial entities comparable to modern city neighborhoods).Other geographical units can be streets, houses, villages, districts, etc.
4. Load the R packages in the current RStudio session.

5.
Load the table with one row per case, as shown in step 5 in the ''before you begin'' section (or skip if you have already loaded it in the previous step).6. Build the cumulative curves of the plague deaths for each parish (or other geographical unit).
a. filter for the cause of death or disease of interest.b. build a matrix in which each column is a parish (or geographical unit), and each row is a day (or temporal unit).
c. Add to the matrix the days in which there are no recorded plague deaths.
Note: Select your temporal period of interest.In the example dataset, the cases span the year 1630, thus our range is from the 1st of January 1630 to the 31st of December of the same year.
d. Order the table chronologically.
e. create a cumulative matrix.
f. Normalize the number of deaths by dividing the number of daily deaths in each parish by the total number of deaths in that parish (daily parish deaths/ total number of parish deaths in 1630).
g. Plot the cumulative relative frequency curves of the parishes plague deaths (Figure 2).Note: Some parishes may not have enough cases to build a reliable cumulative curve on the selected time range.We can choose a threshold under which the parishes will be excluded from the subsequent analyses.For example, we selected only the parishes with more than one plague death every two weeks during the epidemic.

Protocol
h. Calculate the duration of the epidemic.To do so, we have to observe the epidemic curve produced in step 3 to manually determine the starting and ending period of the epidemic.
Note: The epidemic period is not simply the time between the first and last case, in fact, as in our data, there may be some isolated cases before or after the epidemic.Thus, it is important to analyze the epidemic curve.In our data, the epidemic begins around the middle of March and continues throughout 1630.
i. Calculate the death threshold.To select parishes with more than 1 death every two weeks, we have to look for parishes with more than 21 deaths (week of the epidemics / 2), as explained in Galli et al., 2023. 1 j.Remove parishes with fewer cases than the threshold and plot only the selected cumulative curves (Figure 3).

Perform the clustering analysis on the basis of the cumulative curves.
Note: To cluster the parishes (or geographic units) in two or more groups, we are going to perform a k-means clustering on the result of the Principal Coordinates Analysis (PCoA) performed on the cumulative curves of plague deaths.In particular, we are going to compare the cumulative curves of the different parishes with each other to generate a distance matrix, which will be subjected to PCoA.Performing a PCoA on the dataset allows us to reduce the dimension of our data and to visualize them in two (or three) dimensions.Then we can apply k-means clustering, an unsupervised clustering algorithm that groups a dataset into a specific number of clusters (determined by silhouette analysis).The k-means algorithm will assign each observation (in our case each parish) to a cluster on the basis of their position on the PCoA space. 16,17 Calculate the Euclidean distance matrix between the cumulative curves of the different parishes.
> weeks_of_epidemic <-42 > colnames(peste_cum_norm_melt) <-c("Date", "Parish", "Count") Note: First, we must determine the best number of axes for the PCoA analysis on the basis of the eigenvalues.The plot shows that the eigenvalues drastically drop down for the first three dimensions.c.Determine the optimal number of clusters in which the parishes (or geographic units) can be divided (Figure 5).

Note:
We determined the optimal number of clusters in the dataset using a popular cluster validation index: the average silhouette width method. 18,19In our example, the average silhouette width is maximized at the ''number of clusters k'' equal to two.Thus, this is the optimal number of clusters estimated by this method.> fviz_nbclust(pcoa, kmeans, method = "silhouette", k.max=10)

d. Clustering
8. Use the PERMANOVA test to determine if the separation between the clusters is statistically significant.
9. Plot the results of the PCoA and the k-means clustering (Figure 6).

Timing: 3-4 h
The parishes, and therefore our cases, have been clustered on the basis of the temporal progression of the epidemic.Now we can analyze what are the differences and similarities between the clusters.
10. Load the necessary R packages in the current RStudio session.
11. Color the cumulative relative frequency curves of plague deaths for each parish on the basis of their clusters (Figure 7).Save a table with the information about the clusters and the corresponding parishes to be used later for further analysis on the clusters.
13. Compare specific parameters relative to the epidemiological curves of the parishes of the two clusters.

Note:
As an example, we are going to determine: the first plague case for each parish, the inflection points of the cumulative curves, and the date at which the parishes of the two clusters reached 25%, 50%, 75%, and 100% of their total plague deaths.
a. Calculate the dates on which the parishes of the two clusters reached the first plague death.c.Calculate the dates on which the parishes of the two clusters reached 50% of total plague deaths.
d. Calculate the dates on which the parishes of the two clusters reached 75% of total plague deaths.f.Calculate the dates on which the cumulative curves of the parishes of the two clusters change concavity, corresponding to the epidemic peak (also known as the inflection point).
g. Merge all the data in one table.
b. Produce a summary that integrates all the information for your dataset.
c. Plot the weekly number of plague deaths for each cluster (Figure 9).Analyzing the spatial dynamics of the epidemic in each cluster

Timing: 4 h
Generate a visualization of the distribution of the parishes of the different clusters on a map using the coordinates and the information about the parishes.16.Produce a summary table that integrates all the information for your dataset (see step 14b).17.Filter the dataset to remove all the parishes without geographic information.
18. Create a new gray cluster (Cluster ''0'') for the unassigned parishes (the parishes in this cluster are the one with less than 21 deaths, see step 6i for details).
19. Summarize the number of cases for each parish, cluster, death cause, and coordinates.> head(df2) 20.Clean the table removing non-plague-related deaths.
21. Load your map stored as a png image and transform it into a raster image graphical object.
22. Annotate the GPS coordinates of the four corners of the map image.
Note: we can retrieve our base map image from QGIS by cropping the area of interest and annotating the GPS coordinates of the margins of our crop.This is essential to plot the points in their exact location on the map: the map itself needs to be referenced to the real GPS coordinates so that the point can be plotted using the real GPS coordinates available for each parish.
# A tibble: 6 > df2$Cluster <-factor(df2$Cluster, levels = c( "0","1","2"),ordered = TRUE) > peste_clus_gps <-df2 %>% filter(Death_cause == "Plague") Note: With the aspect ratio of the map and a multiplicative factor (zoom variable), it is possible to save the final map image at different sizes maintaining good resolution.The reasonable value of the zoom variable depends greatly on the size (in pixels) and the shape of the initial crop of the map.
24. Plotting the parishes over the map according to the GPS coordinate.The size of the points is associated to the number of plague deaths and the color represents the cluster of the parishes.

Note:
The aspect ratio of the map must be maintained.To do so, we have to fix the width and height of the final plot file.RStudio may automatically plot the map with an incorrect aspect ratio.We strongly recommend saving the png file using the provided commands instead.25.Save the final plot as png and scale it using the zoom variable (Figure 10).
Parishes localization on the map of the city of Milan.In blue, parishes of cluster 1; in red, those of cluster 2; in gray, parishes with less than 21 total plague deaths (not used in the clustering analysis).
The size of the points represents the total number of deaths related to the plague experienced by the parish.

EXPECTED OUTCOMES
This protocol has been designed to reconstruct the epidemiological dynamics of an epidemic using the recorded deaths and their geographical location.
The first outcome consists of a time-series plot where it is possible to compare the temporal evolution of the epidemics against the incidence of deaths unrelated to the disease of interest (Figure 1).In blue, parishes of cluster 1; in red, those of cluster 2; in gray, parishes with less than 21 total plague deaths (not used in the clustering analysis).The size of the points represents the total number of deaths related to the plague experienced by the parish.
device="png", units = "cm", width = aspect_ratio*zoom, height = zoom) Then, the protocol performs a clustering analysis on the geographical units (i.e., parishes) on the basis of their cumulative relative frequency of plague deaths.The final outcome of this step is the Principal Coordinates Analysis (PCoA) plot, where the parishes have been colored on the basis of the clusters (Figure 6).
Once we find the clusters, we can start to analyze the temporal dynamics of the epidemic in each of them; in Figure 8, the boxplot and whiskers depict the different progression of the epidemics in the clusters.
Lastly, we analyze the spatial dynamics clusters by visualizing the geographical position of the parishes on a map: Figure 10 shows the position of the parishes over a historical map of the city of Milan.This map also shows the cluster of each parish and the number of plague-related deaths it experienced.

LIMITATIONS
The clustering analysis used in this protocol does not rely on any kind of geographic information.Although this approach is advantageous when we are dealing with historical data for which geographic information is rarer or imprecise, this approach may be limiting for the analysis of datasets in which this information is available and reliable.In this case, the clustering analysis should consider the implementation of geographic information.

TROUBLESHOOTING Problem 1
In step 5 of ''before you begin'' section, the data could be collected by more people also using different Operating Systems.In this case, it must be taken into account possible issues due to informatics compatibility, e.g., the use of different special characters to add a new line in a csv (commaseparated value) file.

Potential solution
Open the csv file with a spreadsheet application (e.g., Office Excel or OpenOffice Calc).
In this application, we can easily apply any adjustment and export the spreadsheet as an xlsx or a csv file.The file should be correctly formatted and ready to be imported in R.
For csv files, we can import the table in R using the same lines of code in step 5 of ''before you begin section''.

Problem 2
At step 5 in the ''before you begin'' section, the table is too large to be loaded and handled in a reasonable time.

Protocol
Potential solution When a table is too large (e.g., more than one million cells), the functions ''read.csv''(for ''csv files'') or ''read.delim''(for tab delimited files) may take too much computational time and RAM to maintain the table in the environment.In this case, the ''tibble'' class can help reduce time and RAM needed to handle the dataset.Thus, to upload a large table it is better to use another R library such as ''vroom''.
The following code must be substituted to step 5 of the ''before you begin'' section: Problem 3 At step 6, parishes name contains spelling errors.

Potential solution
It is possible that the dataset contains spelling errors, or the same parish written in different ways (e.g., capital letters, spaces, etc.).In these cases, R does not consider them as the same parish.
To find possible errors we can list all the parishes and manually check for errors.
Then we can correct them using R.As an example, consider a situation in which the parish of ''S.Bartolomeo'' is written in two different ways: ''S.Bartolomeo'' and ''S.bartolomeo''.
In this way we substituted all the cells containing "S. bartolomeo" with "S.Bartolomeo".

Problem 4
At step 7c, the silhouette analysis indicates an optimal number of clusters higher than two.

Potential solution
The protocol has been developed on the example dataset in which silhouette analysis finds that two clusters is the optimal number to classify the parishes.For this reason, this protocol can be directly applied only when the two clusters are found to be optimal.
Whenever the protocol finds that more than two clusters are needed to describe the dataset, the user should slightly modify the commands to be able to compare the clusters (e.g., in step 14c assign a color for each cluster to the function scale_fill_manual).In the section ''analyzing the spatial dynamics of the epidemic in each cluster'' (step 16), missing precise geographic location of geographical units.
Potential solution GPS information is not always available, particularly when dealing with old datasets that may refer to particular geographic locations (e.g., streets, parishes, neighborhoods) that changed names or disappeared over time.
The clustering approach applied in this protocol does not rely on any type of geographical information.Thus, you can follow the protocol from step 1 till step 14 without any specific geographic information (corresponding to the sections: ''tracing the temporal evolution of the epidemic'', ''clustering of parishes on the basis of their epidemic curves'', and ''analyzing the temporal dynamics of the epidemic in each cluster'').Conversely, it is not possible to continue with the section ''analyzing the spatial dynamics of the epidemic in each cluster'' (from step 15) when GPS coordinate information is missing.
In the presence of approximate GPS coordinate information for the geographical units, a possible solution can be to group the geographical units into larger ones.For example, if you have only approximate GPS information about the parishes but you have precise information about the district in which the parishes are located, you can group all the parishes in their respective district creating larger geographical units.In this case, you have to format the dataset to conform to the larger geographical unit in the ''before you begin'' section (e.g., aggregating the deaths of the parishes of each district).Then, the newly formatted dataset can be used to perform the protocol at the district level from step 1 to the end.

RESOURCE AVAILABILITY
Lead contact Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Riccardo Nodari (riccardo.nodari@unimi.it).

Materials availability
This study did not generate new unique reagents.

Figure 1 .
Figure 1.Daily number of deaths per cause of deaths The daily numbers of deaths are represented in separated colors depending on the cause of death: in red deaths related to plague, in gray deaths unrelated to plague.

Figure 2 .
Figure 2. Cumulative relative frequency curves of plague deaths for each parish Each curve represents the cumulative frequency of plague deaths over time in each selected geographic unit (in this case, the parishes).

Figure 3 .
Figure 3. Cumulative relative frequency curves of plague deaths for each parish with at least 21 deaths

Figure 5 .
Figure 5. Optimal number of clusters defined by the silhouette method

Figure 6 .
Figure 6.PCoA colored on the basis of the k-means clustering analysis PCoA using the Euclidean distance on the plague cumulative relative frequency curves.Each point represents a parish, and the colors represent the clusters: in blue parishes of cluster 1, in red parishes of cluster 2. The p value obtained from the PERMANOVA test is reported at the bottom-left of the figure.

Figure 7 .
Figure 7. Cumulative relative frequency curves of plague deaths for each parish with at least 21 deaths colored by cluster

Figure 8 .
Figure 8.Comparison between the temporal progression of the epidemic in the two clusters From left to right, boxplots of the dates in which the parishes of the two clusters experienced the first plague death, 25% of total plague deaths, 50% of total plague deaths, 75% of total plague deaths, 100% of total plague deaths and the inflection point of the curves.In each boxplot, the dates for the two clusters were compared using a Mann-Whitney U test ('*' p value < 0.05; '****' p value < 0.0001).

ProtocolNote:
Latitude and Longitude are the columns used to indicate geographic coordinates.

Figure 10 .
Figure 10.Geographical distribution of the parishes in the city of Milan Parishes localization on the map of the city of Milan.In blue, parishes of cluster 1; in red, those of cluster 2; in gray, parishes with less than 21 total plague deaths (not used in the clustering analysis).The size of the points represents the total number of deaths related to the plague experienced by the parish.