Genomic DNA of mock bacterial community
A bacterial DNA cocktail of a mock community (DNA-Mock-001) was obtained from the National Institute of Technology and Evaluation’s Biological Resource Center (NBRC, Tokyo, Japan). It comprises genomic DNA prepared from 15 bacterial strains (Table S4).
Fecal sampling and DNA extraction
Fecal samples were collected from three female Thoroughbred horses (Equus ferus caballus): A and S (2 years old) and L (9 years old). All horses were clinically healthy, with no apparent history of intestinal problems. No antimicrobials had been administered to them for at least a month before sampling. The samples from horses A and S were immediately sampled just after being dropped on the straw bed in the stable. The sample from horse L was collected directly from the rectum. The samples were immediately placed on ice and the genomic DNA of each was extracted within 1 h after collection. For DNA extraction, 3 g of wet fecal sample was suspended vigorously in 40 mL of PBS and then left for 1 min at room temperature to remove large plant debris. Then 4 mL of the suspension was centrifuged at 13 000× g for 2 min at 4 °C, and the pellet was washed twice with TE buffer containing 10 mM Tris and 1 mM EDTA. The final pellet was suspended in 800 µL of distilled water. The genomic DNA was extracted from the suspension by using a Quick-DNA Fecal/Soil Microbe Kit (Zymo Research, Tokyo, Japan) following the manufacturer’s instructions. It was purified with an Agencourt AMPure XP purifier (Beckman Coulter, Brea, CA, USA), and short DNA (<10 kb) was depleted by using a short-read eliminator (XS; Circulomics, Baltimore, MD, USA).
All samplings and experiments were conducted in accordance with ethical and welfare regulations of the Animal care committee of the Equine Research Institute. The Animal care committee approved all experimental protocols. We also complied with the ARRIVE guidelines (https://arriveguidelines.org/).
Sequencing and base-calling for mock community
Genomic libraries for two target regions (full-length 16S rRNA gene and 16S-ITS-23S rRNA operon) were prepared by using the four-primer PCR method protocol, version FFP_9038_v108_revN_14Aug2019 (Oxford Nanopore Technologies [ONT], Oxford, UK) with slight modifications. The four-primer PCR uses two target-specific inner primers with a 5′ tail and two universal outer primers which prime off the tail on the 5′ end of the inner primers (Table S5), resulting in the generation of target amplicons with barcodes15,32,36. PCR amplifications were conducted using either LongAmpTM Taq 2× Master Mix (New England Biolabs, Ipswich, MA, USA) or the KAPA2GTM Robust HotStart Ready Mix PCR Kit (Kapa Biosystems, Wilmington, MA, USA). PCR was performed in a total volume of 25 µL containing the inner primers (50 nM each), the barcoded outer primer mixture (300 nM) from the PCR barcoding kit (SQK-PBK004; ONT), and the DNA cocktail (1 ng) as template. PCR conditions are shown in Table S6. The PCR amplicons were purified with the Agencourt AMPure XP purifier and quantified by a NanoDrop spectrophotometer (ThermoFisher Scientific), and the libraries were sequenced on a MinION sequencer using R9.4.1 flow cells (FLO-MIN106D; ONT) following the manufacturer’s instructions. Base-calling of raw fast5 data from the MinION was carried out in Guppy v. 3.6.1 software (ONT) with its “‑‑trim_barcodes” option for removing sequencing adapters and barcodes. Details on each sample, including number of reads, median read lengths, and median read quality, are shown in Table S7.
Sequencing and base-calling for fecal samples
For fecal samples, both amplicon sequencing and shotgun sequencing were conducted. Genomic libraries for amplicon sequencing targeting the full-length 16S rRNA gene or the 16S-ITS-23S rRNA operon were prepared as for the mock community by using a KAPA2G Robust HotStart Ready Mix PCR Kit. Genomic libraries for shotgun sequencing were prepared by using a Rapid Barcoding Kit (SQK-RBK004; ONT) following the manufacturer’s instructions. All libraries were sequenced on the MinION sequencer using R9.4.1 flow cells. Base-calling of raw fast5 data was carried out as above; details are shown in Table S7.
Sequencing data processing for mock community
The raw FASTQ files were pretrimmed in Seqkit v. 0.12.0 software37 to filter by quality scores of 10, with lengths of 1,300 bp for the full-length 16S rRNA gene and 3,500 bp for the 16S-ITS-23S rRNA operon. Afterward, 30,000 reads were subsampled in Seqtk v. 1.3. software (https://github.com/lh3/seqtk), and chimera reads were removed in yacrd v. 0.6.1 software38. Twenty thousand quality-controlled reads were selected by quality score > 12 and by size: 1,300–1,950 bp for the full-length 16S rRNA gene and 3,500–5,000 bp for the 16S-ITS-23S rRNA operon. Accurate taxonomic assignments of the quality-controlled read sets were performed with the CCMetagen pipeline by coupling with KMA19 and CCMetagen20 software, i.e., read mapping to a reference database in KMA software (‑mem_mode, ‑bcNano, and ‑1t1 options), specifying the minimum phred score (-mp 20), minimum alignment score (‑mrs 0.0), and base-calling option (‑bc 0.7), followed by a quality-filtering step in CCMetagen software with default settings. Two reference databases—rrn DB18 and ncbi_202006 DB, described below—were used in the taxonomic assignment step. The dissimilarity indices between the percentage of each sequencing condition excluding mis- or unidentified values and that of the expected abundance of the mock community were calculated by the vegan package in R v. 3.6.1 software with default settings39,40.
Sequencing data processing for fecal samples
For the FASTQ files from the amplicon sequencing, pretrimmed reads were created in Seqkit software to filter by quality scores of 10, with lengths of 1,300 bp for the full-length 16S rRNA gene and 3,500 bp for the 16S-ITS-23S rRNA operon. Chimera reads were removed in yacrd software. Quality-controlled reads were selected by quality score > 11 and by size: 1,300–1,950 bp for the full-length 16S rRNA gene and 3,500–5,000 bp for the 16S-ITS-23S rRNA operon. Taxonomic assignments were performed as above.
For the FASTQ files from the shotgun sequencing, quality-controlled data sets were created in Seqkit software to filter by lengths of 1,000 bp and quality scores of 10. After conversion from FASTQ format to FASTA format, rRNA genes were detected in each sequence read in Barrnap v. 0.9 software (https://github.com/tseemann/barrnap) with default settings, and the reads which contained rRNA genes were extracted from the raw FASTA files in Seqkit software: 2,333 sequence reads from horse A, 4,417 reads from horse S, and 8,700 reads from horse L. The taxon assignments of each read against ncbi_202006 DB were performed as above. For further investigations of unlinked rRNA genes, the reads that could be classified to at least phylum level were kept, and we classified reads as containing unlinked rRNA genes following the previous criteria17, namely if there was >1,500 bp between the 16S and 23S rRNA genes, or if there was no 23S domain found by 1,500 bp after the end of the 16S rRNA. We removed the sequence reads for final analyses that could not be judged to contain linked or unlinked rRNA genes (i.e., with <1,500 bp after the 3′ end of the 16S rRNA gene), and we kept only reads that included a 16S rRNA gene to avoid potential double-counting organisms with unlinked 16S and 23S rRNA genes. Information on all long-read sequences included is shown in Table S1.
The ggplot2 package in R v. 3.6.1 software was used to depict the percentages of quality-filtered reads from the CCMetagen pipeline in each taxon and boxplots for the top 20 orders with the highest average percentage41. The NMDS analysis was carried out using the vegan package in R software with default settings39. A heat map showing the percentage of each order identified was plotted in ComplexHeatmap software in R, specifying the distance measure “spearman” and clustering method “ward.D2” for hierarchical clustering42.
Taxonomic reference databases
We used two databases which contain sequences of the 16S-ITS-23S rRNA operon: (1) rrn DB, which includes sequences from the whole ribosomal operon of 22,351 bacterial species retrieved from GenBank18, and (2) ncbi_202006 DB, which was created in this study. To create ncbi_202006 DB, all of the bacterial and archaeal data were downloaded from NCBI in June 2020 in genome_updater v. 0.2.2 software (https://github.com/pirovc/genome_updater), and rRNA genes were predicted and extracted in Barrnap software. Each rRNA gene in the same nucleotide sequence was concatenated if the distance between rRNA genes was <2,000 bp, because the lengths of ITS regions in linked rRNA genes were by definition <1,500 bp17. All sequences whose lengths were <1,000 bp were removed from the database. The final ncbi_20206 DB contains 493,329 sequences. Accessions for ncbi_20206 DB are available at https://bitbucket.org/ykinoshita1984/ncbi_db/downloads/.
Statistical analysis
The vegan package in R software was used to conduct permutational multivariate analysis of variance (PERMANOVA), specifying the Bray–Curtis method and 10,000 permutations39. P < 0.05 was considered to indicate a significant difference.