A quantitative workflow for modeling diversification in material culture

Questions about the evolution of material culture are widespread in the humanities and social sciences. Statistical modeling of long-term changes in material culture is less common due to a lack of appropriate frameworks. Our goal is to close this gap and provide robust statistical methods for examining changes in the diversity of material culture. We provide an open-source and quantitative workflow for estimating rates of origination, extinction, and preservation, as well as identifying key shift points in the diversification histories of material culture. We demonstrate our approach using two distinct kinds of data: age ranges for the production of American car models, and radiocarbon dates associated with archaeological cultures of the European Neolithic. Our approach improves on existing frameworks by disentangling the relative contributions of origination and extinction to diversification. Our method also permits rigorous statistical testing of competing hypotheses to explain changes in diversity. Finally, we stress the value of a flexible approach that can be applied to data in various forms; this flexibility allows scholars to explore commonalities between forms of material culture and ask questions about the general properties of cultural change.


Introduction
This supplemental information coincides with the preparation of data for the Plos One article titled A quantitative workflow for modeling diversification in material culture by Erik Gjesfjeld, Daniele Silvestro, Jonathan Chang, Bernard Koch, Jacob G. Foster and Michael E. Alfaro. As highlighted in the article, we demonstrate the analysis of two main data formats, range format (American automobiles) and occurrence format (radiocarbon dates). The automobile data from America and Europe represents data in the range format as it consists of only the first and last years of production. The radiocarbon data associated with the EuroEvol project represent occurrence format data as it consists of a series of radiocarbon dates assigned to different cultural phrases and periods.
The following supplemental information presents all of the necessary data and code in order to replicate the analysis as presented in the Plos One article, including figures. If any code is not working, please email Erik Gjesfjeld (erik.gjesfjeld@gmail.com).

Loading in necessary libraries and programs library(tidyverse) library(rcarbon) library(knitr)
In order to start it is necessary to download the most recent versions of PyRate and LiteRate. These are available on the GitHub page of Daniele Silvestro, along with a series of tutorials and example data sets.

Preparation of range data
We will start with the analysis of range data and will read in data on automobile production years. The data originally comes from the master vehicle list associated with the Ebay parts and motors database (https://pages.ebay.com/motors/compatibility/download.html). This list records a variety of information in order to list automotive parts for sale. For this research, we will use the fields for make, model, year and region. In total, the database contains 288,056 total entries from the years 1896-2019, updated as of July 2019.
The data was further cleaned to identify car models that had major lapses in production (over 3 years). For these models, a additional "G2" or "G3" was added to identify them as further generation car models. In addition, the nation and continent of the manufacturers headquarters were also added to the dataset.

American Automobile Data Cleaning and Organization
This dataset also includes commercial trucks that we decided should not be included in the analysis, so this are filtered out based a manually developed list of commercial truck manufacturers (please contact authors for list of commercial trucks).
The following code groups and arranges the data by car model, summarizes the first and last years of production (assuming continual production), filters only American manufacturers and then filters out the trucks by manufacturer, using the list of commercial truck manufacturers that was manually specified above.
#Removing commerical truck manufacuters cars_mvl <-mvl_rev %>% filter(!make %in% trucks) %>% droplevels() #Filtering just American manufacturers cars_mvl_america <-cars_mvl %>% filter(continent=="America") #Grouping by make / model and taking the first and last years of production (range data) cars_mvl_america_range<-cars_mvl_america %>% group_by(make_model_rev) %>% arrange ( After wrangling the data, it is now ready to be put into range format that will be analysed by LiteRate (see manuscript for details). The default date format for LiteRate is calendar dates (BC / AD), which is what the original data was in and therefore will be maintained.
The fields used will be clade, which are all classified as 1. If you wanted to include additional information such as region of manufacturer it is possible to recode the clade variable to reflect this. Species is simply a unique numeric identifier for each car model in our data. TS refers to the date of origination (time of speciation) and TE refers to the date of last production (time of extinction) or the most recent year of production for cars still being produced (2018 in this study).

Analysis of Automobile Data
The analysis of diversification rates as performed in the manuscript can then be performed by opening a terminal window, navigating to the folder with LiteRateForward.py and executing the following command.
The MCMC is run for 20 million (as opposed to the 10 million default) in order to achieve greater convergence. python LiteRateForward.py -d cars_2018_mvl_american.txt -model_BDI 0 -n 20000000 -s 5000

Calibration of Radiocarbon Dates
As the data are uncalibrated radiocarbon dates, it is necessary to calibrate the dates for them to be meaningful calendar dates. Here, we use the rcarbon (Bevan and Crema 2018) package in R for calibration of the 4,243 dates. One of the unique qualities of calibrated radiocarbon dates is the unique probability distribution for each date, which occurs when reflecting the Gaussian distribution of radiocarbon ages against the "wiggles" of the International Calibration curve (intcal13), see image below.

Sampling of Calibrated Radiocarbon Distribution
Due to the unique probability distribution for each radiocarbon date, we randomly sample from the probability distribution for each radiocarbon age. For each single radiocarbon date, we take 100 random samples at yearly intervals contingent on the probabilities of each year based on the probability distribution obtained after calibration. This information is then placed in a table in which ages can be extracted and evaluated in PyRate (see manuscript for additional details).

Extracting Ages
Once the text file of the sampled ages is created, we can use the tools associated with the PyRate program (details of PyRate can be found in the manuscript and associated citations). Here, we will use the pyrate_utilities function in R to extract the ages from our sampled C14 data and format it as occurrence data for analysis in PyRate. Results from this will be output as a .py file.
#Sourcing the necessary functions source("~/path_to_PyRate_folder/pyrate_utilities.r") #Calling the function to extract C14 ages extract.ages.14C("~/Desktop/sample_results_calib.txt") Once we have the data in the proper format, this is the file sample_results_calib_PyRate.py, we can use PyRate to produce estimates of origination, extinction and preservation rates.

Testing Between Preservation Models
Currently, PyRate has three different models of preservation rates including a Homogeneous Poisson Process (HPP), a non-Homogeneous Poisson Process (NHPP) and a Time-variable Poisson Process (TPP). Here, we will do a simple model test to determine which model best fits our data. As a note, the TPP requires us to construct a set of non-overlapping temporal periods in which preservation rates will be independently calculated for each temporal period. Here, we use a very broad classification scheme of Early (8000-6000 cal BP), Middle (6000-5500 cal BP) and Late Neolithic (5500-3800 cal BP). Of course, dates for these periods are different across regions of Europe, and should therefore be adjusted depending on the region of interest.

PyRate Analysis of Neolithic Europe Radiocarbon Dates
Note: A full PyRate analysis of this occurrence data will likely take anywhere between a few hours on a cluster or a few days on personal laptop or desktop computer. We strongly recommend this command be executed using a computing cluster with remote capabilities. python PyRate.py sample_results_calib_PyRate.py -A 4 -rescale 0.001 -mG -n 10000000 -s 1000 -pP 1 0.01

# -A 4 this uses the RJMCMC algorithm (recommended) # -mG this uses a gamma preservation model (as is recommended for the NHPP) # -n this is the total number of samples (default is 10 million) # -s this is the sampling frequence (default is every 1000 samples) # -pP adjusts the shape and rate of the preservation prior
We will also want to replicate this analysis many times in order to gain confidence in our results and have a robust understanding of the uncertainty around our parameter estimates. Similar to before, we ran 100 replicates -j ::: {1:100} using 10 -j 10 cores of a 16 core virtual machine using a bash shell script and the program parallel. This analysis took 32 hours to complete. With that, all of the analysis has been completed and we are able to visualize the results using the following commands executed in a bash shell.

Creating Figures from the Plos One Article
All the necessary files to recreate the figures produced in the article can be downloaded from the Figshare repository for the article, located here.
In order to recreate the figures, you must download the folder titled 'Files for Figures' to your desktop and execute the commands below. The folder will be titled '11117120' and must be unzipped prior to use.

Figure 5
Overall log richness of archaeological cultures (i.e. pottery types) (A) with three scenarios that would produce a similar pattern of rising diversity followed by a decline in diversity. This includes a A) a constant origination rate and increasing extinction rate; B) a declining origination rate and constant extinction rate; and C) a declining origination rate and increasing extinction rate.