Dynamics of CRISPR-mediated virus–host interactions in the human gut microbiome

Abstract

Arms races between mobile genetic elements and prokaryotic hosts are major drivers of ecological and evolutionary change in microbial communities. Prokaryotic defense systems such as CRISPR-Cas have the potential to regulate microbiome composition by modifying the interactions among bacteria, plasmids, and phages. Here, we used longitudinal metagenomic data from 130 healthy and diseased individuals to study how the interplay of genetic parasites and CRISPR-Cas immunity reflects on the dynamics and composition of the human gut microbiome. Based on the coordinated study of 80 000 CRISPR-Cas loci and their targets, we show that CRISPR-Cas immunity effectively modulates bacteriophage abundances in the gut. Acquisition of CRISPR-Cas immunity typically leads to a decrease in the abundance of lytic phages but does not necessarily cause their complete disappearance. Much smaller effects are observed for lysogenic phages and plasmids. Conversely, phage-CRISPR interactions shape bacterial microdiversity by producing weak selective sweeps that benefit immune host lineages. We also show that distal (and chronologically older) regions of CRISPR arrays are enriched in spacers that are potentially functional and target crAss-like phages and local prophages. This suggests that exposure to reactivated prophages and other endemic viruses is a major selective pressure in the gut microbiome that drives the maintenance of long-lasting immune memory.

To cluster similar time series and obtain representative trajectories, we performed pairwise alignments of all strings using the following scores: 1 per match, -2 per mismatch, -2 per gap opening, and -1 per gap extension. These scores were chosen to prevent spurious pairing between short trajectories, which would occur if the penalties for mismatches and gap openings were not sufficiently large compared to the reward for matches. (Indeed, the combination [1,-1,-1,0] leads to a single cluster that contains all the time series, and [1,-1,-1,-1] produces three very large clusters encompassing 60% of all time series.) The gap extension penalty was included to avoid nesting short strings into longer strings. Note that, because internal empty states were collapsed during preprocessing, the length of a string is informative of the temporal prevalence of lineages and targets. Therefore, by penalizing gap extension, short trajectories that represent transitory infections can be distinguished from long trajectories that represent endemic or recurrent infections.
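The effect of the gap penalties can be illustrated with a minimal sketch of affine-gap alignment scoring. It is not the alignment code used in the study: it assumes a global (end-to-end) alignment, assumes that the first position of a gap costs the gap-opening penalty and each additional position costs the gap-extension penalty, and uses made-up trajectory strings. The function name `affine_alignment_score` is hypothetical.

```python
NEG_INF = float("-inf")

def affine_alignment_score(a, b, match=1, mismatch=-2, gap_open=-2, gap_extend=-1):
    """Best global alignment score of strings a and b with affine gap costs.

    Assumed convention: a gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] aligned to a gap; Y: b[j-1] aligned to a gap
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])

# Hypothetical trajectories: a short string nested inside a longer one.
strict = affine_alignment_score("ABCD", "AD")                 # scheme [1,-2,-2,-1]
lenient = affine_alignment_score("ABCD", "AD", 1, -1, -1, 0)  # scheme [1,-1,-1,0]
print(strict, lenient)
```

Under these assumptions, the nested pair receives a negative score with [1,-2,-2,-1] but a positive score with [1,-1,-1,0], which is the kind of spurious pairing that the gap-extension penalty is meant to suppress.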
As shown in Figure S3, alternative scoring schemes produce similar clusters (in terms of the normalized variation of information) if they fulfil two conditions: (i) the penalty for mismatches and gap opening is greater than the reward for matches, and (ii) there is a cost for gap extension. If the cost of mismatches and gaps is too low, most time series are grouped into a few large (and poorly informative) clusters. If there is no cost for gap extension, the algorithm is unable to separate transitory from recurrent or endemic dynamics. The scoring scheme [2,-4,-4,-2] was included in Figure S3 to show that it produces almost identical results to [1,-2,-2,-1]. The reason is that the network-based algorithm that we used for clustering (Infomap) is insensitive to the global scaling of weights. As a result, scoring schemes that differ by a constant multiplicative factor are fully equivalent in practice.
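The equivalence of rescaled scoring schemes can be illustrated with a small sketch. Infomap partitions a network according to random-walk flow, which depends only on relative edge weights. The sketch assumes that edge weights are proportional to alignment scores (the source does not specify how scores are converted to weights) and uses a hypothetical three-node network; `transition_probabilities` is an illustrative helper, not part of Infomap.

```python
from fractions import Fraction

def transition_probabilities(weights):
    """Random-walk step probabilities on an undirected weighted network.

    weights maps (u, v) pairs to positive integer edge weights.
    """
    strength = {}
    for (u, v), w in weights.items():
        strength[u] = strength.get(u, 0) + w
        strength[v] = strength.get(v, 0) + w
    probs = {}
    for (u, v), w in weights.items():
        probs[(u, v)] = Fraction(w, strength[u])  # step from u to v
        probs[(v, u)] = Fraction(w, strength[v])  # step from v to u
    return probs

# Hypothetical similarity network among three trajectories.
edges = {("t1", "t2"): 3, ("t2", "t3"): 1, ("t1", "t3"): 2}
# Doubling every score, as in [2,-4,-4,-2] versus [1,-2,-2,-1], doubles every weight.
doubled = {pair: 2 * w for pair, w in edges.items()}

same_flow = transition_probabilities(edges) == transition_probabilities(doubled)
print(same_flow)
```

Because the step probabilities are ratios of weights, the common factor cancels and the random walk is unchanged.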

Figure S1: Comparison of CRISPR arrays obtained by CasCollect (followed by cleaning with CRISPRCasTyper) with those obtained by directly running CRISPRCasTyper on sample-level assemblies (SLA) and individual-level coassemblies (ILCoA). (a) Venn diagram showing the overlap in CRISPR repeats (clustered at 90% identity) among the three approaches. (b) Venn diagram showing the overlap in CRISPR spacers (clustered at 90% identity) among the three approaches. (c) Length frequency distribution of CRISPR arrays obtained with CasCollect and SLA.

Figure S2: Robustness of orientation-based results with respect to the methods used to infer CRISPR orientation. (a) Consistency of predicted orientations with respect to the expected trend of reduced transcription in middle and distal regions of the array. (b) Consistency of CRISPRDirection and CRISPRLoci with respect to the predictions of CRISPR Orientation. CRISPRDirection and CRISPRLoci failed to predict the orientation of approximately 10% of the arrays in the proximity of Cas genes (black). The percentages in (a) and (b) indicate the fraction of consistent orientations with respect to all available predictions. (c) Relative position of locally adapted spacers within the CRISPR array. L: leading end; T: trailing end.

Figure S3: Comparison of the clustering of temporal dynamics obtained with different parameters. The normalized variation of information takes values from 0 (if two sets of clusters are identical) to 1 (if both sets are completely uncorrelated). Four-digit codes indicate the parameters used to align the temporal dynamics: the first digit corresponds to the reward per match, the second is the penalty for mismatch, the third is the penalty for gap opening, and the fourth is the penalty for gap extension. The red rectangle indicates the parameter combination used in the article.
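As an illustration of the metric in this figure, the sketch below computes a normalized variation of information between two clusterings of the same items. The caption does not state which normalization was used. Dividing by the joint entropy is an assumption, chosen because it yields 0 for identical partitions and 1 for statistically independent ones. The function names and example labels are hypothetical.

```python
from collections import Counter
from math import log

def entropy(counts, n):
    """Shannon entropy (in nats) of a partition given its cluster sizes."""
    return -sum((c / n) * log(c / n) for c in counts)

def normalized_vi(labels_a, labels_b):
    """Variation of information divided by the joint entropy (assumed normalization)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    n = len(labels_a)
    h_a = entropy(Counter(labels_a).values(), n)
    h_b = entropy(Counter(labels_b).values(), n)
    h_ab = entropy(Counter(zip(labels_a, labels_b)).values(), n)
    if h_ab == 0:
        return 0.0  # both clusterings place every item in a single cluster
    # VI = H(A|B) + H(B|A) = 2 H(A,B) - H(A) - H(B)
    return (2 * h_ab - h_a - h_b) / h_ab

same = normalized_vi([0, 0, 1, 1], [5, 5, 7, 7])         # identical up to relabeling
independent = normalized_vi([0, 0, 1, 1], [0, 1, 0, 1])  # statistically independent
print(same, independent)
```

Cluster labels only need to be hashable, so the output of any clustering tool can be compared as long as both label lists follow the same item order.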

Figure S4: Differential expression of spacers as a function of their location within the CRISPR array. (a) Normalized transcript coverage of leading (L), middle (M), and distal (D) regions of the array. Each region was defined as one third of the total array length. To compare among arrays with different overall expression levels, transcript coverages were separately normalized in each array, so that the total coverage per array was equal to 1. (b) Transcript coverage ratio, indicating the relative expression of the distal third of the array with respect to the leading third.
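The per-array normalization described in this caption can be sketched as follows. The sketch assumes one coverage value per spacer, ordered from the leading end to the distal end, and assigns spacers to thirds by their index; how thirds were delimited in the original analysis, and whether regional coverage was summed or averaged, is not stated. The function names and input values are hypothetical.

```python
def region_coverage(coverage):
    """Share of an array's total transcript coverage in its L, M, and D thirds.

    coverage: per-spacer coverage values, leading end first (assumed ordering).
    The shares sum to 1, so arrays with different expression levels are comparable.
    """
    n = len(coverage)
    total = sum(coverage)
    if n < 3 or total <= 0:
        raise ValueError("need at least three spacers and non-zero total coverage")
    shares = {"L": 0.0, "M": 0.0, "D": 0.0}
    for i, value in enumerate(coverage):
        region = "LMD"[3 * i // n]  # thirds by spacer index (assumption)
        shares[region] += value / total
    return shares

def distal_to_leading_ratio(coverage):
    """Coverage of the distal third relative to the leading third."""
    shares = region_coverage(coverage)
    if shares["L"] == 0:
        raise ValueError("leading third has zero coverage")
    return shares["D"] / shares["L"]

example = [40, 20, 20, 10, 10, 0]  # hypothetical coverage values
print(region_coverage(example), distal_to_leading_ratio(example))
```

A ratio below 1 corresponds to lower expression at the distal end than at the leading end.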

Figure S5: Effect of CRISPR-Cas immunity on MGE abundance and host diversity (see also Fig. 4). To control for possible biases due to different lineage sizes, for each condition (see below), we grouped spacers by lineage and calculated their median normalized abundance and mean prevalence. As a result, each lineage contributed at most one data point to each condition. (a) Median normalized abundance of targeted MGE (plasmids, lytic/non-lysogenic phages, and lysogenic phages) before the spacer was first observed in the host population ("Before"), in the sample in which the spacer was observed for the first time ("Acquisition"), in subsequent samples in which the spacer was present ("After - With spacer"), and in subsequent samples that contained the same CRISPR lineage but not the spacer ("After - No spacer"). Error bars indicate the 95% confidence intervals for the median, based on 2,000 bootstraps. All comparisons shown above the plot are statistically significant with p < 10⁻⁴ (Mann-Whitney test). (b) Mean prevalence of targeted MGE, calculated as the fraction of samples in which the target is detected. Error bars indicate the 95% confidence intervals for the mean. Statistical significance based on Student's t-test: p < 0.05 (*), p < 0.01 (**), p < 0.001 (***).
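A bootstrap confidence interval for a median, of the kind reported in panel (a), can be sketched as below. The percentile method, the fixed random seed, and the input values are assumptions made for illustration. The caption specifies only the number of resamples (2,000) and the 95% level. The function name `bootstrap_median_ci` is hypothetical.

```python
import random
from statistics import median

def bootstrap_median_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the median (assumed method)."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the median of each resample.
    medians = sorted(median(rng.choices(values, k=n)) for _ in range(n_boot))
    lower = medians[int((alpha / 2) * n_boot)]
    upper = medians[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical per-lineage normalized abundances for one condition.
abundances = [0.2, 0.5, 0.9, 1.1, 1.3, 1.8, 2.4, 3.0, 4.2, 7.5]
low, high = bootstrap_median_ci(abundances)
print(median(abundances), low, high)
```

Each input value here stands for one lineage, in line with the grouping described in the caption, so lineages with many spacers do not dominate the interval.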

Figure S6: Relative representation of different CRISPR-Cas subtypes in the representative trajectory clusters (TC) shown in Figure 5.