Published April 1, 2022 | Version 1.0
Software Open

SPIN - Species by Proteome INvestigation: Code, databases, and example data

  • 1. Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Denmark
  • 2. GLOBE institute, University of Copenhagen, Denmark
  • 3. Institute of Conservation, Royal Danish Academy, Denmark
  • 4. Dept. of Archaeology, Museum Nordsjælland, Denmark
  • 5. Interdisciplinary Center of Archaeology and Evolution of Human Behavior, University of Algarve, Portugal; Dept. of Anthropology, University of New Mexico, Albuquerque, USA
  • 6. Interdisciplinary Center of Archaeology and Evolution of Human Behavior, University of Algarve, Portugal
  • 7. Dept. of Anthropology, University of West Bohemia, Czech Republic; Interdisciplinary Center of Archaeology and Evolution of Human Behavior, University of Algarve, Portugal
  • 8. The Laboratory of Biological Anthropology, Dept. of Forensic Medicine, University of Copenhagen, Denmark
  • 9. Dept. of Earth and Ocean Sciences, University of North Carolina Wilmington, USA; Interdisciplinary Center of Archaeology and Evolution of Human Behavior, University of Algarve, Portugal
  • 10. Dept. of Anthropology, University of Louisville, USA; Interdisciplinary Center of Archaeology and Evolution of Human Behavior, University of Algarve, Portugal

Description

Scripts and configuration files for species identification:
The scripts for species identification were designed to work with RStudio 1.3.1093 on a Windows 10 machine. Small adjustments will be necessary to migrate them to other operating systems or environments. Because the search engine output formats differ, there are two separate projects: one for DDA data analyzed with MaxQuant (1.6.0.17) and one for DIA data analyzed with Spectronaut (14.5.200813). The analysis is ideally, but not necessarily, done with the provided protein database.

For species determination based on DIA data, the raw files are searched in Spectronaut with both library-based DIA and DirectDIA. Output files can be generated with the Spectronaut export schemes provided in the “Configuration” folder. The raw files should be specified and labeled following the “Configuration/Experimental annotation.csv” example. If libraries other than the ones provided with the SPIN article are used, the respective species should be included in “Configuration/Library list.csv”. The SPIN protein databases are already in the “Databases” folder and can be extended with additional protein sequences by aligning them with the other sequences for the same gene. The Spectronaut output of the DirectDIA and library-based DIA searches needs to be placed in the respective “Spectronaut output” folders. Lastly, the scripts need to be executed from RStudio by opening “R-Project/R-Project.Rproj”, or from another program with adjusted working directories. The script “R-Project/scripts/main.R” executes the species identification pipeline by calling functions from the other scripts provided in the same folder. If executed successfully, the script produces a species identification table in .csv format along with a collection of consensus sequences of the analyzed samples.

Species identification based on DDA follows the same scheme with few changes. The data analysis needs to be done in MaxQuant using the provided gapless protein database. The output files “evidence.txt” and “summary.txt” need to be moved to “DDA-based/MQ output”. The procedure for running the species inference scripts is identical to DIA-based species identification.
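Before running the DDA project, it can help to confirm that the two MaxQuant output files are where the scripts expect them. The following is a minimal Python sketch of such a check, not part of the SPIN R scripts. The function name `missing_dda_inputs` and the `root` argument (the folder into which the archive was unpacked) are assumptions made for illustration; only the folder and file names come from the description above.

```python
from pathlib import Path

# Folder and file names as given in the description above.
MQ_OUTPUT_DIR = Path("DDA-based") / "MQ output"
REQUIRED_FILES = ("evidence.txt", "summary.txt")


def missing_dda_inputs(root):
    """Return the required MaxQuant output files that are absent under root.

    `root` is assumed to be the folder that contains "DDA-based".
    """
    out_dir = Path(root) / MQ_OUTPUT_DIR
    return [name for name in REQUIRED_FILES if not (out_dir / name).is_file()]
```

An empty list means both files are in place; otherwise the returned names indicate which MaxQuant outputs still need to be copied.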

Databases:
PR210107 Merged Top20 aligned.fasta
Aligned protein database used for species identification by SPIN. Sequences for each gene have been subjected to a multiple sequence alignment using Muscle and saved in .fasta format including the gaps. The database contains predicted and experimental protein sequences from Uniprot and NCBI spanning the 20 most common bone genes across all available mammalian species. When adding more sequences, they should be aligned within the respective gene group and named following the Uniprot “fasta header” format: “>NCBI|[protein ID]|[protein ID] [gene alias] [protein description] OS=[species name] OX=[species ID] GN=[gene name]”.

PR210107 Merged Top20 gapless.fasta
Gapless protein database used for species identification by SPIN. Generated by removing the gaps introduced by the multiple sequence alignment. This database is compatible with most search engines and can be configured with Uniprot file parsing rules.
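The relation between the aligned and the gapless database can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used to build the SPIN databases: it assumes gaps are written as “-”, and the function names (`read_fasta`, `strip_gaps`, `header_fields`) are hypothetical. The header helper only reads the OS=, OX= and GN= fields named in the header format above.

```python
import re

GAP_CHAR = "-"  # assumption: alignment gaps are written as '-'

# Matches the OS=, OX= and GN= fields of the header format described above.
FIELDS_RE = re.compile(
    r"OS=(?P<species>.+?)\s+OX=(?P<taxid>\S+)\s+GN=(?P<gene>\S+)"
)


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def strip_gaps(records):
    """Remove alignment gaps from every sequence, keeping headers unchanged."""
    for header, seq in records:
        yield header, seq.replace(GAP_CHAR, "")


def header_fields(header):
    """Return species, species ID and gene name, or None if a field is missing."""
    match = FIELDS_RE.search(header)
    return match.groupdict() if match else None
```

Passing the lines of an aligned file through `strip_gaps(read_fasta(...))` yields gapless records with unchanged headers, and `header_fields` can flag newly added sequences whose headers lack the species or gene annotation.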

PR200512 HumanCons.fasta
Contaminant protein database. The selection of contaminant protein sequences in this list was inspired by the “contaminants.fasta” provided with MaxQuant (Tyanova, Temu, & Cox, 2016). Contaminants that are only relevant for samples from cell culture, such as bovine serum albumin and collagen, were removed because they can lead to false contaminant annotations in the bone proteome context. The remaining contaminant sequences are mostly from human keratins and common proteases used in bottom-up proteomics. The annotation of protein sequences was updated to the current Uniprot format. This database should be used in conjunction with the main gapless database when setting up a database search for SPIN in MaxQuant or Spectronaut.


Species identification helper files
PR201105 Manual SpeciesFineStructure Peptides.csv
Fine grouping peptides. Collection of manually selected peptide sequences that are robust markers for distinguishing a species from closely related species that are otherwise hard to tell apart. Species are grouped in “clusters”, each of which describes a group of closely related species that can be distinguished using the selected peptides. Amino acid variants and peptide sequences are given for every species within each cluster. The “Site” column refers to the position in the global sequence alignment obtained by pasting all bone genes in alphabetical order, and the “Comment” column indicates the identification frequency in the reference samples. To expand the list, it is sufficient to provide the cluster, species, and peptide sequences for every species in the cluster.
Library list.csv
Library list. Simple list of all species with available spectral libraries. Closely related species without available species-specific libraries, such as the American bison or buffalo, were included as well. The list was used for merging library-based and DirectDIA results. For every DirectDIA species call that did not appear in this list, the library-DIA species was replaced with the DirectDIA species.
FASTAfix.csv
Missing gene annotations. Small helper file to add gene annotations that were missing or inconsistent in Uniprot.
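
The merging rule described for “Library list.csv” can be written down as a short Python sketch. This is an illustration of the stated rule only, not the code in the SPIN R scripts. The function name `merge_calls`, the per-sample dictionaries, and the exact string matching of species names are assumptions, as is keeping the library-based call when a sample has no DirectDIA call.

```python
def merge_calls(library_calls, directdia_calls, library_species):
    """Merge per-sample species calls from library-based DIA and DirectDIA.

    library_calls / directdia_calls: dicts mapping sample name -> species.
    library_species: set of species names listed in the library list.
    """
    merged = {}
    for sample, lib_call in library_calls.items():
        dd_call = directdia_calls.get(sample)
        # Rule from the description: a DirectDIA call that is not in the
        # library list replaces the library-based call.
        if dd_call is not None and dd_call not in library_species:
            merged[sample] = dd_call
        else:
            # Assumption: otherwise the library-based call is kept.
            merged[sample] = lib_call
    return merged
```

The library-based result is kept whenever DirectDIA points to a species that is covered by the library list, so DirectDIA only overrides it for species that the libraries could not have identified.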

Files

SPIN_species-inference_1.0.zip

Files (10.1 MB)

md5:6aec9f5d355a441c35294d780db23705
10.1 MB

Additional details

Funding

Collaborative Research: Inquiry into the Origins of Modern Human Distributions 1724997
National Science Foundation
Collaborative Research: Inquiry into the Origins of Modern Human Distributions 1725015
National Science Foundation
TEMPERA – Teaching Emerging Methods in Palaeoproteomics for the European Research Area 722606
European Commission
PROSPER – Hominin phyloproteomics for the Pleistocene: PalaeoPROteomics of Skeletal Parts for Evolutionary Research 948365
European Commission
Collaborative Research: Hominid Response To Environmental Change 1420453
National Science Foundation
Collaborative Research: Hominid Response To Environmental Change 1420299
National Science Foundation
HOPE – HOminin Proteomes in human Evolution 795569
European Commission