Extraction of CRISPR-targeted sequences from the metagenome

Summary Homology-based search is commonly used to uncover mobile genetic elements (MGEs) from metagenomes, but it heavily relies on reference genomes in the database. Here we introduce a protocol to extract CRISPR-targeted sequences from the assembled human gut metagenomic sequences without using a reference database. We describe the assembling of metagenome contigs, the extraction of CRISPR direct repeats and spacers, the discovery of protospacers, and the extraction of protospacer-enriched regions using the graph-based approach. This protocol could extract numerous characterized/uncharacterized MGEs. For complete details on the use and execution of this protocol, please refer to Sugimoto et al. (2021).


SUMMARY
Homology-based search is commonly used to uncover mobile genetic elements (MGEs) from metagenomes, but it heavily relies on reference genomes in the database. Here we introduce a protocol to extract CRISPR-targeted sequences from the assembled human gut metagenomic sequences without using a reference database. We describe the assembling of metagenome contigs, the extraction of CRISPR direct repeats and spacers, the discovery of protospacers, and the extraction of protospacer-enriched regions using the graph-based approach. This protocol could extract numerous characterized/uncharacterized MGEs. For complete details on the use and execution of this protocol, please refer to Sugimoto et al. (2021).

BEFORE YOU BEGIN
Hardware This section describes the hardware specifications, operating system, and interpreters required to execute the protocol.

Computer.
a. At least 128 GB of memory. b. A hyper-threading CPU with at least 8 threads. 2. Operating system. a. Linux distribution capable of executing scripts written in the languages of the interpreters specified below. 3. Interpreters. a. Python3. b. Java. c. Bash.
Note: The computer must have sufficient memory for metagenome assembly. A typical human gut metagenome library ranges in size from a few GB to hundreds of GB. Therefore, assembling a large library might require more than 128 GB of memory, which is likely unavailable for a standard desktop machine. In this case, consider using high-capacity servers in a data center.

Software
The software and python packages listed in the key resources table must be downloaded and installed on the system accordingly.

MATERIALS AND EQUIPMENT
Alternatives: Our repository also includes custom Bash scripts used in the previous study to automate the analysis pipelines. With modifications, these scripts could be used as starting points to build the custom pipeline for the reader's purpose.
CRITICAL: These automation pipeline scripts are mainly written in Bash language (files ending their names with .sh) and require modifications to specify the file paths in their codes. The parts that require changes are noted with comments in the Bash scripts (see REAGENT  Alternatives: We also provide FASTA and BED-formatted files, which contain spacers, protospacers, CRISPR direct repeats, and CRISPR-targeted terminally redundant sequences extracted from approximately ten thousand human gut metagenome libraries in the previous study. This dataset includes approximately two million unique spacers, ten thousand unique CRISPR direct repeats, and approximately ten thousand CRISPR-targeted unique sequences. The protocol described in this article does not require these files. However, for the case of human gut metagenome analysis, this previously extracted dataset could be used to skip the processes and reduce the analysis time.

STEP-BY-STEP METHOD DETAILS
Preprocess raw paired FASTQ files

Timing: 1-2 h
This process is quality control of the input sequences.
Alternatives: A pipeline script for preprocessing, assembling, and spacer collection is provided in the repository (pipeline_scripts/assembly_pipeline.sh).
Note: We used the human genome reference FASTA file posted by the author of BBtools.
3. Correct errors in the reads.
CRITICAL: This output FASTQ file is analysis-ready and used for assembly and spacer extraction.
Dataset from the previous study (Sugimoto et al., 2021)
Note: We skipped the error-correction step in the SPAdes program to avoid the occasional endlessly running problem. Please refer to the troubleshooting section for detail.

Assign unique identifiers to all assembled contigs.
CRITICAL: Assigning unique identifiers is important to avoid conflicts in the later analyses.
Here, we used randomly generated strings; however, any unique readable identifier, such as sample IDs, could be used instead.
Optional: We recommend removing contigs shorter than 1k bases to reduce the file size.
Note: The pre and assembling processes are independently conducted for each paired FASTQ file.
Optional: We advise that the system has sufficient disk space to store all preprocessed FASTQ and assembled FASTA files. The rest of the files, including intermediate and temporary files, can be deleted to save the disk space.

Extraction of CRISPR direct repeats from the assembled contigs
Timing: 1-2 h From the assembled contigs, discover CRISPR consensus direct repeats.
7. From the output gff file, extract direct repeats, and convert them into a FASTA file. Note: Again, these direct repeat extractions are independently conducted for each assembled FASTA file.
8. After discovering all direct repeats from each assembled FASTA file, we merge them into a single file, assign unique identifiers, and remove redundancy.
Optional: We provided the direct repeat sequences extracted from the human gut metagenomes in the previous study. The sequences are stored in a FASTA formatted file located in the supplementary folder (supplementary_data/crispr/drs/all_dr.clustered.fasta).

Extraction of CRISPR spacers from the preprocessed reads
Timing: 1 h Alternatives: A pipeline script for the spacer collection is provided in the repository (pipeli-ne_scripts/collect_spacers.sh).
Here, we return to the preprocessed FASTQ files. From them, we extract CRISPR spacers from reads containing CRISPR direct repeats.
9. Extract direct repeat-containing reads. This initial filtering step significantly reduces the number of reads, effectively speeding up the following spacer extraction processes.
10. Second filter to mask the direct repeats in the reads.
11. Extract spacers from the masked reads using our python script.
Note: The python script used above is available from our repository (pipeline_scripts/ extract_spacers.py). Optional: The python program used above does not output short (<20 bp) and long (>50 bp) sequences by default. These parameters can be changed by options (-s and -l).
Note: Spacer extraction is conducted for each preprocessed FASTQ file.
12. Merge all discovered spacers, assign unique identifiers, and remove redundancy.
Optional: It is highly advisable to make each spacer and direct repeat trackable to the original samples, contigs, and/or associated direct repeats. This can be achieved using FASTA descriptions in each record.
Optional: We provided the spacer sequences extracted from the human gut metagenomes in the previous study. The sequences are stored in a FASTA formatted file located in the supplementary folder (supplementary_data/crispr/spacers/all_spacers.clustered.removed_ short.fasta).
13. Search CRISPR direct repeats from the contigs.
14. Mask sequences around direct repeat aligned regions.

OPEN ACCESS
Note: Protospacer search is conducted for each assembled FASTA file.

Merge all discovered protospacers into a single BED-formatted file.
Optional: We provided the protospacers discovered from the assembled human gut metagenome contigs in the previous study. The protospacers are stored in a BED-formatted file located in the supplementary folder (supplementary_data/targeted_sequences/protospacers/all_protospacers.bed).

Timing: 2-4 h
Here, we cluster spacers using a co-occurrence network calculated from the spacer alignment result. From this network, we discover the graph communities, representing subsets of spacers that corresponding protospacers co-occur together across the assembled contigs. We extract the protospacer-enriched regions using these graph communities, i.e., spacer clusters. This process effectively reduces the false-positive ratio by excluding lone or randomly scattered protospacers.
17. Initial clustering is based on distance.
18. Calculate a weighted graph from the spacer co-occurrence and generate an abc formatted file.
Note: The python script used above is available in our repository (graph_clustering/ generate_abc_edgefile.py).
19. Convert the abc file into an mci format.
20. Run an mcl program to discover graph communities.
21. Convert the mcl output to a tabular format.
22. Mark CRISPR-targeted regions using the clustering result and the merged protospacer bed file.
Note: The python script used above is available in our repository (graph_clustering/ mark_crispr_targeted.py).
Note: The fourth column in the output file is formatted as {cluster id}:{protospacer count in the region}.
23. Merge the regions to rescue the fragmented clusters.
Note: The extraction of CRISPR-targeted sequences is conducted for each assembled FASTA file.
Optional: We provided the CRISPR-targeted terminally redundant sequences extracted from the assembled human gut metagenome contigs in the previous study. The sequences are stored in a FASTA formatted file located in the supplementary folder (supplementary_data/ targeted_sequences/tr/tr_sequences.fasta).

EXPECTED OUTCOMES
The final output FASTA files contained CRISPR-targeted sequences. These sequences typically range in size between a few hundred and several hundred thousand bases. The output sequences likely contain multiple kinds of MGEs (Figure 1).

LIMITATIONS
It is possible to predict the CRISPR-targeting hosts by searching protospacer-associated direct repeats in the reference genome database. However, CRISPR undergoes intense horizontal gene transfer. Therefore, combining it with other methods, such as tRNA alignment, is highly advisable to predict the infecting host.

Potential solution
The time metagenome assembly takes strongly varies between libraries. If an assembly takes longer than a day, and the library size is very large (>100 Gb), one can downsample the FASTQ file by randomly selecting pairs from the original files or normalizing the depth in silico. Both approaches can reduce the file size and potentially the assembling time.

Problem 2
Why did you skip the error-correction step in the SPAdes program (step 4)?

Potential solution
We found that the SPAdes program is sometimes trapped in the error-correction step and seems to endlessly run for more than four or five days. This phenomenon is sporadic, hard to replicate, and might depend on the system specifications. Such a problem was highly problematic when we were analyzing more than a thousand samples. To avoid this issue, we skipped the error-correction The most popular elements among the large (>20 kb) CRISPR-targeted sequences in the human gut metagenome were tailed phages. Conversely, ssDNA viruses, plasmid-like elements, and many unclassified sequences were common among the small (<20 kb) sequences. This result was obtained from our previous study (Sugimoto et al., 2021).

OPEN ACCESS
STAR Protocols 3, 101525, September 16, 2022 step in the SPAdes program and instead used the tadpole program for the preassemble error correction. We compared the statistics between the assembled contigs, using SPAdes or tadpole, and found no significant difference between them.

Potential solution
The less populated organisms in the sample are likely to be shallowly sequenced, leading to highly fragmented and partial contigs. Therefore, they might require deeper sequencing to assemble the complete genome. Conversely, the metadata of SRA recorded libraries could be wrong sometimes. For example, we encountered several libraries likely produced from 16S amplicon sequencing but recorded as a whole genomic sequencing. It is advisable to avoid such libraries and/or refer to the experimental method described in the original article if necessary.

Problem 4
The constructed spacer co-occurrence network is small and fragmented (step 21).

Potential solution
In the original study, we used millions of spacers extracted from thousands of human gut metagenome datasets to construct a network. A significant number of spacers and spacer-aligned contigs are required to build a versatile network and predict graph communities. If you are attempting to analyze other than the human gut metagenome using a few materials, it might be advisable to skip the network building process entirely and manually check each protospacer containing contigs.

Problem 5
Protein homology searches do not hit the database (quantification and statistical analysis).

Potential solution
Many MGE encoded protein sequences such as capsids and polymerases are extremely diversified, therefore it is often difficult to detect their homology to known sequences using the pairwise alignment-based method which is adopted by such as the BLAST program. In order to overcome this issue, hidden Markov models (HMMs) based programs such as HMMER (Eddy, 2011) or HH-suite (Steinegger et al., 2019) could be used for the higher sensitivity homology search. Using the predicted protein sequences from the discovered CRISPR-targeted sequences as seeds, one could iteratively search the bulk of metagenome-derived protein sequence databases such as Metaclust (Steinegger and Sö ding, 2018). The aligned sequences were used to build an HMM, which is used as a query to search a curated protein database such as Protein Data Bank.

Materials availability
CRISPR spacers and direct repeats extracted in the previous study are available from our repository.

Data and code availability
The source codes used in this protocol are available from our repository.