The magneto-microbiome: A dataset of the metagenomic distribution of magnetotactic bacteria

Magnetotactic bacteria (MTB) are diverse prokaryotes characterized by their ability to generate biogenic magnetic iron crystals. MTB are ubiquitous across aquatic environments, and growing evidence indicates that they may also occur in association with animal microbiomes. Unfortunately, they are difficult to culture in vitro, and more studies of their biogeographical distribution and ecological roles are needed. To provide data on the patterns of diversity and distribution of MTB, we screened the entire Sequence Read Archive (SRA) from the National Center for Biotechnology Information for DNA sequencing reads matching known MTB taxa. The dataset summarizes the count of reads assigned to MTB from more than 26 million SRA accessions comprising approximately 80 petabases (7.98 × 10^16 bases) of DNA. More than 396 million DNA sequencing reads were assigned to 214 MTB taxa in 691,086 (2.65 %) SRA accessions. Researchers can use the final dataset to focus their examination of the environmental and ecological roles of specific MTB or to identify potential host organisms. These data will be instrumental in further elucidating the importance and utility of these enigmatic bacteria.


Subject: Biological Sciences
Specific subject area: Microbiology: Microbiome; Environmental Genomics and Metagenomics
Type of data: Tables
How the data were acquired: Data were acquired by processing the entire Sequence Read Archive (SRA) from the National Center for Biotechnology Information (NCBI) for reads matching known magnetotactic bacteria using the Google BigQuery cloud environment. Known magnetotactic bacteria were queried according to their NCBI taxonomy accession, and counts of all next-generation sequencing reads in the entire publicly available SRA that could be unambiguously assigned to these taxa were obtained. The data include a markdown-formatted PDF file, Methods.pdf, containing all the computational code and associated annotations required to generate the dataset.
Data

Value of the Data
• These data are useful for understanding the spatiotemporal distribution and diversity of magnetotactic bacteria across environments and hosts.
• Researchers who study the biogeography and evolutionary ecology of microbial magnetotaxis and the role of the microbiome in animal sensory physiology will benefit from the availability of these data.
• These data can be used to identify and prioritize environments or host species for future studies of the ecological role of magnetotactic bacteria or their potential for contributing geomagnetic information to their host. These data can also be used for designing experiments that target the impacts of specific species of magnetotactic bacteria or their collective diversity on host physiology.

Objective
Magnetotactic bacteria (MTB) are a diverse group of prokaryotes that biomineralize iron to form nano-sized magnetic crystals that are stored in a unique organelle called the magnetosome [1, 2]. By generating chains of magnetic particles, MTB can passively align with magnetic fields (i.e., magnetotaxis). The magnetosomes and magnetic properties of MTB have made them especially useful in numerous biomedical, technological, and engineering applications such as drug delivery, magnetic resonance imaging contrast agents, printing toner, heavy metal recovery, robotics, and astrobiology [2, 3]. MTB are ubiquitous across aquatic environments and thrive at the oxic-anoxic interface, which makes them challenging to culture in the laboratory [1, 2]. Recently, however, evidence has accumulated that MTB are present in the microbiomes of many organisms, and even contribute to the ability of animals to perceive the geomagnetic field through symbiotic mechanisms [4-7]. To better understand the ecological roles of these MTB in both the environment and in association with animal microbiomes (i.e., the magneto-microbiome), metagenomic datasets of MTB presence and biogeographical distribution across both environments and hosts are needed.

Data Description
The Sequence Read Archive (SRA) held at the National Center for Biotechnology Information (NCBI) was first established in 2009 to publicly host the explosive growth in open-access next-generation DNA sequencing data [8]. The goals of the SRA are to i) make publicly funded research data findable, accessible, interoperable and reusable (i.e., FAIR) and ii) promote novel opportunities for scientific studies that utilize the massive scale of these genetic datasets [8]. The SRA has recently been made accessible for large-scale, interactive queries through multiple cloud environments, such as the Google Cloud Platform BigQuery, and now includes the taxonomic assignment of every DNA read submitted via the SRA Tax Analysis Tool [8, 9]. The dataset described below was produced by mining and summarizing the entire set of SRA accessions for the presence of specific MTB taxa; it does not necessarily indicate the presence of magnetotactic ability.
The SRA is constantly receiving new submissions, so the data described herein reflect the SRA contents as of June 15, 2023 [10]. At that time, 26,126,445 SRA accessions were queried, including 359 trillion reads totaling ∼80 petabases (7.98 × 10^16 bases) of DNA data. A total of 691,086 (2.65 %) SRA accessions contained at least one read taxonomically assigned to a known MTB, and 396,203,026 MTB-assigned reads were identified. The recovered MTB reads are reported by SRA accession and by taxon in the compressed, comma-separated value (i.e., csv) spreadsheet bq-results-20230615-225959-1686880934964.csv.gz. A list of known MTB taxa queried according to their NCBI Taxonomy accession numbers (TaxIds) is included in the spreadsheet file MTB_taxon_sheet.xlsx. See "Experimental Design, Materials and Methods" below for a complete description of the TaxIds. The complete dataset is available in the compressed csv spreadsheet SRA-metatable_6-17-23_with-MTB-counts-ordered-and-taxons.csv.gz. Because the complete dataset is quite large (>2 GB compressed), 26 additional subsets of the dataset are available, split according to common metazoan animal phyla (nine subsets) and classes (17 subsets), to facilitate future studies of the magneto-microbiome (Fig. 1).
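The summary figures above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch in Python that uses only the numbers quoted in the text; the variable and function names are illustrative and do not come from the dataset. The mean number of reads per positive accession is a derived quantity, not a value reported by the authors.

```python
# Summary figures quoted in the text (SRA contents as of June 15, 2023).
TOTAL_ACCESSIONS = 26_126_445
MTB_POSITIVE_ACCESSIONS = 691_086
MTB_ASSIGNED_READS = 396_203_026
TOTAL_BASES = 7.98e16


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Share of accessions with at least one MTB-assigned read.
positive_pct = percent(MTB_POSITIVE_ACCESSIONS, TOTAL_ACCESSIONS)

# One petabase is 1e15 bases.
petabases = TOTAL_BASES / 1e15

# Derived here for illustration only; not a figure reported in the text.
reads_per_positive = MTB_ASSIGNED_READS / MTB_POSITIVE_ACCESSIONS

print(f"{positive_pct} % of accessions contain MTB-assigned reads")
print(f"{petabases:.1f} petabases screened")
print(f"{reads_per_positive:.0f} MTB reads per positive accession (mean)")
```

The computed percentage agrees with the 2.65 % stated above. The mean is likely to be a poor summary of a typical accession, because read counts per accession are probably highly skewed.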
All the provided spreadsheets contain both the "total_count" and "self_count" for each MTB taxon, as well as the cumulative count of all MTB in the SRA accession. According to Katz et al. [8], the "total_count" is the sum of all reads assigned to the specific MTB taxon and all its descendant nodes in the taxonomic lineage, whereas the "self_count" is the count of reads assigned strictly to the taxon listed. The SRA accessions in each spreadsheet are ordered by decreasing cumulative MTB "self_count". Each spreadsheet also contains the full taxonomic lineage, or ranks, of the origin of each SRA accession, including environmental and metagenomic samples, to facilitate examination by sample origin rather than MTB content.
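The relationship between the two counts can be illustrated on a toy taxonomy. This is a minimal sketch of the definition given above, not the SRA Tax Analysis Tool itself; the three taxon identifiers and their read counts are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Optional


def total_counts(parent_of: Dict[int, Optional[int]],
                 self_count: Dict[int, int]) -> Dict[int, int]:
    """Compute total_count as a taxon's own reads plus those of all descendants.

    Each taxon's self_count is added to the taxon and to every ancestor
    reached by following parent_of up to the root (parent None).
    """
    totals: Dict[int, int] = defaultdict(int)
    for taxon, count in self_count.items():
        current: Optional[int] = taxon
        while current is not None:
            totals[current] += count
            current = parent_of.get(current)
    return dict(totals)


# Hypothetical taxonomy: taxon 1 is a genus, taxa 2 and 3 are its species.
toy_parent_of = {1: None, 2: 1, 3: 1}
# Reads assigned strictly to each taxon (hypothetical values).
toy_self_count = {1: 5, 2: 10, 3: 20}

toy_totals = total_counts(toy_parent_of, toy_self_count)
print(toy_totals)
```

In this toy case the genus has a "self_count" of 5 but a "total_count" of 35, which is why summing "total_count" across nested taxa would count the same reads more than once, whereas summing "self_count" does not.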
Additional files in the dataset include a text file README.txt that summarizes all files in the dataset, their formats, and a description of all column headers, a markdown-generated pdf document Methods.pdf that details all the computer code necessary to recreate the dataset, and a list of the MD5 checksums for each file, MD5.txt, for users to confirm that the files have been downloaded and copied correctly.
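A download can be checked against the checksum list along the following lines. This is a minimal sketch that assumes MD5.txt uses the common md5sum layout (a hexadecimal digest, whitespace, then the file name on each line); the actual layout of MD5.txt is not described here, so the parser may need adjusting. The function names are illustrative.

```python
import hashlib
from pathlib import Path
from typing import Dict


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text: str) -> Dict[str, str]:
    """Parse lines of the assumed form '<hex digest>  <file name>'."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            digest, name = parts
            # md5sum marks binary mode with a leading '*' on the file name.
            expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def verify(directory, checksum_file: str = "MD5.txt") -> Dict[str, bool]:
    """Map each listed file name to True if it exists and its digest matches."""
    directory = Path(directory)
    expected = parse_checksums((directory / checksum_file).read_text())
    results = {}
    for name, digest in expected.items():
        target = directory / name
        results[name] = target.exists() and md5_of_file(target) == digest
    return results
```

Calling `verify("path/to/download")` on the folder holding the dataset files and MD5.txt returns one entry per listed file; any `False` value points to a missing or corrupted file.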

Experimental Design, Materials and Methods
A list of known MTB taxa was obtained from existing literature [1, 2, 11]. Because MTB are diverse, polyphyletic, and often not culturable, this list of putative MTB is not exhaustive. Additionally, because magnetosomes can be gained and lost, the data only indicate the presence of the species listed and not necessarily the presence of species performing magnetotaxis. These taxa were subsequently identified in the NCBI Taxonomy database [12] to obtain accession numbers, hereafter TaxIds. The taxonomic lineages, or ranks, of these MTB TaxIds were curated manually to obtain genera, families, and orders that are specific to MTB. In other words, these higher-level taxonomic ranks solely contain lower-level TaxIds that are unique to known MTB. A total of 214 TaxIds were collected and are summarized in Table 1. Seventeen of the 214 TaxIds had no rank and represented primarily unclassified MTB from environmental samples.

Table 1
Count of the MTB-specific taxa collected from the NCBI Taxonomy database. The asterisk indicates that these 17 taxa were described as having "no rank" but were nevertheless unambiguously MTB.

Rank    Count
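The rule for MTB-specific higher ranks (a genus, family, or order qualifies only if every lower-level taxon beneath it is a known MTB) can be expressed as a small check. The authors curated the lineages manually, so the following is only a sketch of the stated rule, not their procedure; the taxon names are placeholders rather than real NCBI entries.

```python
from typing import Dict, List, Set


def mtb_specific_ranks(children_of: Dict[str, List[str]],
                       known_mtb: Set[str]) -> Set[str]:
    """Return higher-level taxa whose every leaf descendant is a known MTB."""

    def leaves(node: str) -> Set[str]:
        # A node without children is a leaf (lowest-level taxon).
        kids = children_of.get(node, [])
        if not kids:
            return {node}
        found: Set[str] = set()
        for kid in kids:
            found |= leaves(kid)
        return found

    return {
        node
        for node, kids in children_of.items()
        if kids and leaves(node) <= known_mtb
    }


# Placeholder lineage: one family holding two genera with two species each.
toy_children_of = {
    "FamilyX": ["GenusA", "GenusB"],
    "GenusA": ["sp1", "sp2"],
    "GenusB": ["sp3", "sp4"],
}
# sp4 is not a known MTB, so GenusB and FamilyX should be excluded.
toy_known_mtb = {"sp1", "sp2", "sp3"}

toy_specific = mtb_specific_ranks(toy_children_of, toy_known_mtb)
print(toy_specific)
```

Only GenusA passes in this example: a single non-MTB species disqualifies both its genus and the family above it, which keeps the read counts for a higher-level TaxId from absorbing non-MTB lineages.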
Next, the counts of next-generation sequencing reads assigned to these TaxIds were obtained from each accession in the NCBI SRA database [8]. To perform this search, the Google Cloud Platform BigQuery (https://cloud.google.com/bigquery) was utilized. In BigQuery, the SRA cloud dataset "nih-sra-datastore" was first linked to the project. Next, a single search for the 214 TaxIds was performed using standard SQL syntax on June 15, 2023 (see the Methods.pdf file in the published dataset for the full SQL script). The query processed 364.5 GB of cloud memory and took a total of 28 s. A second query was then initiated to gather the metadata for every SRA accession in the nih-sra-datastore. This query processed 5.96 GB of cloud memory and took 40 s to complete. This dataset was stored as 140 csv files in a "bucket" in Google's cloud services, then downloaded, merged, and compressed into a single file using the "gsutil" tool in Google's Cloud SDK tools (https://cloud.google.com/sdk). The results from the two queries were merged using the data.table v1.14.9 package [13] in R v4.2.1 [14]. Finally, taxonomic ranks were obtained and added for the origin of each SRA accession using the taxize v0.9.100 package [15] in R, and subsets of the dataset were generated for nine common metazoan phyla and 17 classes.
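The merge of the two query results can be pictured as a left join on the SRA accession followed by a sort. The authors performed this step with data.table in R (see Methods.pdf for their code); the sketch below is a plain-Python stand-in for the same idea. The column names "acc" and "self_count" and the output column "mtb_self_count" are assumptions made for illustration, and the rows are made up.

```python
from typing import Dict, Iterable, List


def merge_counts(metadata_rows: Iterable[Dict[str, str]],
                 mtb_rows: Iterable[Dict[str, str]]) -> List[Dict[str, object]]:
    """Left-join metadata with per-accession MTB counts, then sort descending.

    Accessions with no MTB-assigned reads are kept with a count of zero.
    """
    # Sum self_count over all MTB taxa found in each accession.
    cumulative: Dict[str, int] = {}
    for row in mtb_rows:
        acc = row["acc"]
        cumulative[acc] = cumulative.get(acc, 0) + int(row["self_count"])

    merged = [
        dict(meta, mtb_self_count=cumulative.get(meta["acc"], 0))
        for meta in metadata_rows
    ]
    # Order accessions by decreasing cumulative MTB self_count.
    merged.sort(key=lambda r: r["mtb_self_count"], reverse=True)
    return merged


# Hypothetical rows standing in for the two query results.
toy_metadata = [
    {"acc": "SRR0000001", "organism": "sample A"},
    {"acc": "SRR0000002", "organism": "sample B"},
    {"acc": "SRR0000003", "organism": "sample C"},
]
toy_mtb = [
    {"acc": "SRR0000002", "tax_id": "1", "self_count": "7"},
    {"acc": "SRR0000002", "tax_id": "2", "self_count": "3"},
    {"acc": "SRR0000003", "tax_id": "1", "self_count": "4"},
]

toy_merged = merge_counts(toy_metadata, toy_mtb)
for row in toy_merged:
    print(row["acc"], row["mtb_self_count"])
```

A left join is used here because the complete dataset covers every queried accession rather than only the MTB-positive ones. For the real files, which exceed 2 GB compressed, a streaming or chunked approach would be needed instead of holding all rows in memory.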

Fig. 1. Proportion of Sequence Read Archive accessions that contain magnetotactic bacteria among various metazoan phyla and classes.