Whole genome comparison of a large collection of mycobacteriophages reveals a continuum of phage genetic diversity

The bacteriophage population is large, dynamic, ancient, and genetically diverse. Limited genomic information shows that phage genomes are mosaic, and the genetic architecture of phage populations remains ill-defined. To understand the population structure of phages infecting a single host strain, we isolated, sequenced, and compared 627 phages of Mycobacterium smegmatis. Their genetic diversity is considerable, and there are 28 distinct genomic types (clusters) with related nucleotide sequences. However, amino acid sequence comparisons show pervasive genomic mosaicism, and quantification of inter-cluster and intra-cluster relatedness reveals a continuum of genetic diversity, albeit with uneven representation of different phages. Furthermore, rarefaction analysis shows that the mycobacteriophage population is not closed, and there is a constant influx of genes from other sources. Phage isolation and analysis was performed by a large consortium of academic institutions, illustrating the substantial benefits of a disseminated, structured program involving large numbers of freshman undergraduates in scientific discovery. DOI: http://dx.doi.org/10.7554/eLife.06416.001


Introduction
Bacteriophages are the dark matter of the biological universe, forming a vast, ancient, dynamic, and genetically diverse population, replete with genes of unknown function (Pedulla et al., 2003). Phages are the most abundant organisms in the biosphere, and the ∼10 31 tailed phage particles participate in ∼10 23 infections per second on a global scale, with the entire population turning over every few days (Suttle, 2007). The population is not only vast and dynamic, but comparisons of virion structures suggest that it is also extremely old (Krupovic and Bamford, 2010). It is thus not surprising that bacteriophages are genetically highly diverse, although their comparative genomics has lagged behind that of other microbes, largely due to the lack of individual isolates for genomic analyses . To date, there are approximately 2000 completely sequenced bacteriophage genomes in the GenBank database, a small number relative to the more than 30,000 sequenced prokaryotic genomes (http://www.ncbi.nlm.nih.gov/genome/browse/), in spite of phage genomes being only 1-5% of the size of their host genomes.
Double-stranded DNA tailed phages are proposed to have evolved with common ancestry but with different phages having differential access to a large common gene pool (Hendrix et al., 1999).
Phage genomes are typified by their mosaic architectures generated by gene loss and gain through horizontal genetic exchange; however, the parameters influencing access to the common gene pool are numerous and likely include host range, genome size, replication mode, and life style (temperate vs lytic). Migration to new hosts is probably common, but is affected by local host diversity and mutation rates, as well as resistance mechanisms such as receptor availability, restriction, CRISPRs, and abortive infection systems (Buckling and Brockhurst, 2012;Jacobs-Sera et al., 2012;Hoskisson et al., 2015). Constraints on gene acquisition may also be imposed by synteny-particularly among virion structural genes-and by size limits of DNA packaging (Juhala et al., 2000;Hatfull and Hendrix, 2011).
We have previously described comparative analyses of modest numbers of mycobacteriophages and shown that they can be sorted by nucleotide sequence and gene content comparisons into groups of closely related genomes referred to as 'clusters' (designated Cluster A, B, C, etc.); phages without any close relatives are referred to as 'singletons'. Some of the clusters can be further divided into subclusters (e.g., Subcluster A1, A2, A3, etc.) according to nucleotide sequence relatedness (Pedulla et al., 2003;Hatfull et al., 2006Hatfull et al., , 2010Pope et al., 2011b). The genomes are mosaic whereby individual phages are constructed as assemblages of modules, many of which are single genes (Pedulla et al., 2003). Each mycobacteriophage cluster has features particular to that cluster (e.g., regulatory systems, repeated sequences, tRNA genes, etc. [Pope et al., 2011a[Pope et al., , 2011b[Pope et al., , 2013[Pope et al., , 2014a[Pope et al., , 2014b), but eLife digest Viruses are unable to replicate independently. To generate copies of itself, a virus must instead invade a target cell and commandeer that cell's replication machinery. Different viruses are able to invade different types of cell, and a group of viruses known as bacteriophages (or phages for short) replicate within bacteria. The enormous number and diversity of phages in the world means that they play an important role in virtually every ecosystem.
Despite their importance, relatively little is known about how different phage populations are related to each other and how they evolved. Many phages contain their genetic information in the form of strands of DNA. Using genetic sequencing to find out where and how different genes are encoded in the DNA can reveal information about how different viruses are related to each other. These relationships are particularly complicated in phages, as they can exchange genes with other viruses and microbes.
Previous studies comparing the genomes-the complete DNA sequence-of reasonably small numbers of phages that infect the Mycobacterium group of bacteria have found that the phages can be sorted into 'clusters' based on similarities in their genes and where these are encoded in their DNA. However, the number of phages investigated so far has been too small to conclude how different clusters are related. Are the clusters separate, or do they form a 'continuum' with different genes and DNA sequences shared between different clusters?
Here, Pope, Bowman, Russell et al. compare the individual genomes of 627 bacteriophages that infect the bacterial species Mycobacterium smegmatis. This is by far the largest number of phage genomes analyzed from a single host species. The large number of genomes analyzed allowed a much clearer understanding of the complexity and diversity of these phages to be obtained. The isolation, sequencing and analysis of the hundreds of M. smegmatis bacteriophage genomes was performed by an integrated research and education program, called the Science Education Alliance Phage Hunters Advancing Genomics and Evolutionary Science (SEA-PHAGES) program. This enabled thousands of undergraduate students from different institutions to contribute to the phage discovery and sequencing project, and co-author the report. SEA-PHAGES therefore shows that it is possible to successfully incorporate genuine scientific research into an undergraduate course, and that doing so can benefit both the students and researchers involved.
The results show that while the genomes could be categorized into 28 clusters, the genomes are not completely unrelated. Instead, a spread of diversity is seen, as genes and groups of genes are shared between different clusters. Pope, Bowman, Russell et al. further reveal that the phage population is in a constant state of change, and continuously acquires genes from other microorganisms and viruses.
because of the pervasive mosaicism, the relationships among phages within clusters and between clusters are complex. Collections of phages have been isolated on other hosts such as Bacillus spp., Escherichia coli, Pseudomonas spp., Propionibacterium spp. and Staphylococcus spp. (Kwan et al., 2005(Kwan et al., , 2006Kropinski et al., 2007;Marinelli et al., 2012;Hatfull et al., 2013;Grose and Casjens, 2014;Lee et al., 2014) and these can be similarly divided into clusters based on DNA similarity. Recent analysis of 337 phages infecting 31 bacterial species within the Enterobacteriaceae (Grose and Casjens, 2014) reveals 56 clusters of phage genomes. It is thus clear that there is substantial diversity within the phage population, even when comparing phages of a common host and which are expected to be in direct genetic contact with each other in their natural environment . Nonetheless, the numbers of genomes isolated on a particular host generally are too small to define the nature and the size of the populations at large with any substantial resolution.
Viral metagenomic studies provide valuable insights into phage diversity and population dynamics, but typically generate few complete genome sequences or any specific information relating viral genomes to specific bacterial hosts (Hambly and Suttle, 2005;Rodriguez-Brito et al., 2010;Mokili et al., 2012). A recent analysis of Synechococcus phages using metagenomic analysis coupled with viral tagging showed that there are multiple 'populations' of these phages (similar to the clusters described above), but suggested that these represent distinct groups of related phages rather than a continuous spectrum of diversity (Deng et al., 2014). This differs from prior predictions that the phage population as a whole likely spans a continuum of diversity-albeit with uneven representation of different groups of related phages-because of genomic mosaicism (Hendrix, 2003;Hatfull, 2010Hatfull, , 2012. However, as the Synechococcus phage data are derived from a single sample using a single host, it is unclear if this extends to phages of other hosts (Deng et al., 2014).
Here we describe the comparative analysis of a large number of completely sequenced mycobacteriophage genomes and demonstrate that they represent a spectrum of diversity and do not constitute discrete populations. Rarefaction analyses of their constituent genes are consistent with populations of gene families shared among mycobacteriophages being augmented by the introduction of new gene families from outside sources. The assembling of a large and highly informative collection of bacteriophages by a consortium of students and faculty at multiple institutions demonstrates that a course-based research experience (CRE) can be successfully implemented at large scale without compromising the authenticity or richness of a scientific investigation imbued with discovery and project ownership.

Results and discussion
A genome-by-genome approach to defining phage diversity Exploring phage diversity using a genome-by-genome approach has notable advantages and some potential disadvantages. The main advantage is that complete genome sequences give information about genome length and composition, providing key insights into genome mosaicism and how genome segments are shared and exchanged. A difficulty is that there are not large extant phage collections available for most bacterial hosts, and isolation, purification, and characterization of phages can be slow and time-consuming. Because isolation typically requires plaque formation and growth in the laboratory, some naturally occurring phages may escape isolation using standard methods. Thus, although the diversity of phages isolated and propagated in the laboratory may not capture all types of phage, it represents a minimum, not a maximum, index of diversity.

Authentic research in a CRE
The 2012 report from the President's Council of Advisors on Science and Technology (PCAST) focused on the poor retention of undergraduate students in science, technology, engineering and mathematics (STEM) as an impediment to meeting US economic demands (PCAST, 2012). One of the PCAST recommendations is to replace traditional introductory laboratory courses with research-based experiences that would inspire freshman students and promote STEM retention. A powerful strategy is to engage students in scientific discovery through CREs. The successful implementation of this strategy depends on (i) identifying research questions that can engage students in contributing genuine advances in scientific knowledge without requiring prior expert knowledge, and (ii) designing the project so that large numbers of students can participate in a meaningful fashion.
We have previously described the Howard Hughes Medical Institute (HHMI) Science Education Alliance Phage Hunters Advancing Genomics and Evolutionary Science (SEA-PHAGES) program, in which beginning undergraduate students isolate, purify, sequence, annotate, and compare bacteriophages, and have described its educational advantages (Jordan et al., 2014). By taking advantage of the massive diversity of the phage population so that each student can isolate a unique phage, the program encourages student ownership of their science. And because the collective discoveries by many students generate new scientific insights, the program creates a scientific community of students engaged in authentic research.
The SEA-PHAGES program has contributed to the growth of the collection of sequenced mycobacteriophages to nearly 700 individual isolates (http://phagesdb.org), of which 627 were selected for a detailed analysis (Supplementary file 1). This is by far the largest collection of sequenced phage genomes for any single host and thus promises to substantially advance our understanding of phage diversity. The phages were isolated using either direct plating or by enrichment using Mycobacterium smegmatis mc 2 155 as a host, and sequenced using next-generation approaches (see 'Materials and methods'). More than 5000 students-primarily freshmen-at 74 institutions have been involved since inception of the SEA-PHAGES program in 2008, and the phages isolated represent a broad geographical distribution ( Figure 1) and a variety of viral morphotypes (http://phagesdb.org). The new insights gained from comparative genomic analyses of these phages-as described below-demonstrate the effectiveness of viral discovery and genomics as a model for CRE development.

Assembling mycobacteriophages into clusters and subclusters
Using previously reported parameters based primarily on nucleotide sequence similarity spanning >50% genome length (Hatfull et al., 2006), the 627 genomes were assembled into 20 clusters (A-T) and eight singletons (with no close relatives) ( Figure 2, Supplementary file 1); 11 clusters were subdivided into 2 to 11 subclusters ( Table 1). There is considerable variation in cluster size with substantial differences in the numbers of genomes in each cluster (2-232), but there is relatively little variation in either genome length or the numbers of genes per genome in any given cluster ( Table 1). Cluster assignment is of practical utility and is generally robust, with clustered phages typically sharing genome architectures, as noted for the Enterobacteriacea (Grose and Casjens, 2014). For example, Cluster A phages are similar in size and transcriptional organization, and share an unusual immunity system (Brown et al., 1997;Pope et al., 2011b). Cluster M phages all contain large numbers of tRNA genes (Pope et al., 2014a), Cluster K (Pope et al., 2011a) and Cluster O (Cresawn et al., 2015) phages have different but characteristic repeated sequences, and Cluster J phages have an unusual capsid with a triangulation (T) number of 13 (Pope et al., 2013). Therefore, the organization of related mycobacteriophages into clusters provides a framework for identifying and interpreting gene trafficking within and among potentially distinct groups of genomes.

Gene content relationships among sequenced mycobacteriophages
Genome mosaicism is more apparent from comparison of gene product amino acid sequences than nucleotide sequence comparisons because of the accumulation of genome rearrangements over a longer period of evolution, during which indications of DNA similarity are lost. To compare mycobacteriophage gene contents we grouped related genes into protein families ('phamilies' or 'phams') using Phamerator (Cresawn et al., 2011), which we modified to use kClust (Hauser et al., 2013) so as to easily accommodate the large numbers of comparisons. The 69,633 genes assembled into 5205 phams of which 1613 (31%) are orphams (single-gene phamilies ). Approximately 25% of phams can be assigned functions in viral structure and assembly, DNA metabolism, integration, lysis, and regulation, but the vast majority are of unknown function. Representation of gene content relationships among all 627 phages as a network phylogeny reveals relationships that are in accord with the cluster and subcluster designations derived from nucleotide sequence comparisons ( Figure 3). The multiple branches between clusters/subclusters reflect the phylogenetic complexities that arise from genome mosaicism, where genes within a genome have distinct evolutionary histories. The distribution of orphams (genes without mycobacteriophage homologues) provides additional support for cluster/subcluster assignments; Figure 4). A relatively high proportion of orphams is a characteristic of both singleton genomes and single-genome subclusters (Figure 4). At least 30% of genes in all of the singleton genomes are orphams, and the single-genome subclusters have a minimum of 15% orphams; genomes in other clusters and subclusters typically have fewer than 10% orphams ( Figure 4). The presence of numerous orphams ensures that the lack of cluster inclusion did not result from sequence errors or insufficient or inappropriate gene annotation. Notable exceptions are Predator (Subcluster H1) and Mendokysei (Cluster T), both of which are in very small clusters/subclusters, and KayaCho (Subcluster B4). KayaCho may warrant separation into a new subcluster (e.g., B6), but overall the orpham distribution is consistent with the cluster/subcluster designations.

The diversity of different clusters is highly varied
To determine the extent to which the various clusters/subclusters represent discrete groups, we generated a heat map showing pairwise shared gene content ( Figure 5) and quantified the cluster/subcluster diversity (Table 1, Figure 6). The heat map strikingly illustrates that diversity is non-uniform, with genomes in some clusters (e.g., Subclusters B1, C1) being very closely related, whereas in others they display substantial differences (e.g., Subclusters A1, F1). The variation is also evident within the large Cluster A group, with some subclusters having low diversity (e.g., A4, A5, A6), some being highly diverse (e.g., A1, A2), and some plausibly further splitting into subgroups (A3) ( Figure 5).
We quantified the cluster diversity using three different measures, CLuster Average Shared Phamilies (CLASP), Cluster Associated Phamilies (CAP), and Cluster Cohesion Index (CCI) (Tables 1, 2, Figure 6A). Both CAP (the number of phams present in all genomes within a cluster divided by the average number of genes per genome) and CCI (the average number of genes per genome as a percentage of the total number of phams in that cluster) show substantial variation between clusters ( Table 1, S2), and little evidence for commonly conserved 'core genes', as suggested for T4-related phages (Petrov et al., 2010). However, both of these parameters are somewhat influenced by cluster/ subcluster size, which varies from cluster to cluster. In contrast, CLASP (the percentage of phamilies shared between two genomes, then averaged across all possible pairs within a cluster or subcluster) is relatively insensitive to cluster/subcluster size (as seen by a resampling analysis; Figure 6-figure supplement 1), but still shows substantial variation from one cluster to another (Table 1, Figure 6A).

The discreteness of different clusters is highly varied
The heat map of genome comparisons ( Figure 5) also illustrates the degrees to which clusters and subclusters share gene content, a reflection of cluster discreteness, or how isolated discrete clusters are from each other. For example, although the Cluster A phages are highly diverse, they also appear relatively isolated and share relatively few genes with other clusters ( Figure 5). In contrast, phages in Cluster E share substantial numbers of genes with other clusters, including those in Clusters F, J, L, P, and several singletons. We have quantified these relationships with the Cluster Isolation Index (CII, the percentage of phams present within a cluster that are not present in other mycobacteriophage genomes), which demonstrates the considerable variation in isolation from phages of other clusters/ subclusters (Table 1, Figure 6B). For example, at one extreme, 84.6% of Cluster C gene phamilies are found only in Cluster C and not elsewhere. At the other extreme, only 23.8% of Cluster I gene phamilies are constrained to that cluster, with the remainder having relatives present in genomes in other clusters. Other clusters form a spectrum of relationships between these extremes (Table 1, Figure 6B), and clusters such as I and P-which share recognizable DNA sequence similarity (Figure 2-figure supplement 1)-share >60% of their genes with other phages (low CII values; Table 1). Thus, although some clusters could be considered as discrete groups-as reported for the Synechococcus phages (Deng et al., 2014)-this is far from being a universal or characteristic feature of groups of related phages.
Cluster isolation analyses reveal additional complexities arising from highly mosaic genomes. For example, the singleton Dori is clearly related to Cluster B phages (Figure 3) with which it shares Figure 2. Nucleotide sequence comparison of 627 mycobacteriophages displayed as a dotplot. Complete genome sequences of 627 mycobacteriophages were concatenated into a single file which was compared with itself using Gepard (Krumsiek et al., 2007) and displayed as a dotplot using default parameters (word length, 10). The order of the genomes is as listed in Supplementary file 1. Nucleotide similarity is a primary component in assembling phages into clusters, which typically requires evident DNA similarity spanning more than 50% of the genome lengths.  Figure 6B). Likewise, the singleton MooMoo has segments of DNA similarity and shares ∼20% of its gene content (as determined by shared phams) with Cluster F phages (Figure 3, Figure 2-figure supplement 3, Figure 4-figure supplement 1), but also has similarity to Clusters N and I, as well as a low CII (Table 1, Figure 6B). It has low DNA similarity to Cluster O (Figure 2-figure supplement 3), but has several phams in common with the Cluster O phages, and has the same unusual prolate morphology (Figure 3). Complex relationships are also seen in the singletons Gaia and Sparky (Figure 4-figure supplement 2). Taken together, the analyses of both cluster diversity and cluster isolation show that mycobacteriophage populations contain a continuum of diversity, with non-uniform abundance of different types of phages. The prevalence of isolated phages may not necessarily reflect the proportions of different types of phages in the environment, but the availability of a large collection of isolated phages enables capture and whole genome analysis of relatively rare phages that are critical to understanding the complexities of genome relationships. We recently reported genomic analysis of the singleton mycobacteriophage Patience, which has a substantially lower GC% than its host (50.3% vs 67.4%), has a different codon usage profile, but is undergoing codon selection for growth in a high GC% environment (Pope et al., 2014b). If there is a flux of phage genomes and genes entering the mycobacterial neighborhood, then we predict that the phages of a single host do not reflect a closed system with discrete populations, but one that is open with ever-expanding diversity. The mycobacteriophage population is not a closed system Both the huge diversity of phamilies in mycobacteriophages and the high frequency of orphams suggest that genes are constantly added to phage genomes from outside sources just as genes are added to the genomes of their bacterial hosts via horizontal gene transfer. Such gene influx-for example, from host-jumping phages such as Patience (Pope et al., 2014b)-would provide genetic novelty and enable phages to adapt to their ever-changing hosts. To examine gene flux into the mycobacteriophage population, we performed a rarefaction analysis by re-sampling the gene phamilies within the phage population (Figure 7). Remarkably, the rarefaction curves of the entire collection-including the 95% confidence limits-do not fit a hyperbola as would be expected if the mycobacteriophages were limited to an isolated set of genes, and about 2.5 new gene phamilies are predicted to be identified with each newly isolated phage ( Figure 7A). Similar independent analyses on the phages of Cluster A or the phages of Cluster B show that this is also observed within these clusters ( Figure 7B,C). Thus both individual clusters and the collection as a whole are not genetically fixed, but are in constant flux. While a hyperbola can model sampling of gene phamilies from a finite pool, it does not accommodate the influx of new phamilies. The addition of a linear term (see 'Materials and methods'), representing the introduction of new phamilies from outside sources, results in a non-asymptotic curve which predicts the continual identification of new phams even after large numbers of genomes have been sampled (R > 0.999; Figure 7D). This linear term acts as  Figure 6. continued on next page a surrogate for the linear range of a second hyperbolic curve, one representing the resampling of a much larger set of gene phamilies available for introduction into mycobacteriophage genomes. Unfortunately, the current dataset remains insufficient to confidently extrapolate to give an estimate of the total number of viral protein families in the biosphere, which has been previously estimated to be anywhere between a half a million and 2 billion (Rohwer, 2003;Ignacio-Espinoza et al., 2013).
We note that because of the generally slow pace of the advancement of phage genomics, we have little insight into the phage populations of other hosts. We retrieved all double-stranded DNA tailed phage genomes in GenBank that we could identify (a total of 1781), corresponding to about 120 host bacterial genera, with a median number of phages per host genus of two. Using similar parameters for pham building as described above, the 181,717 predicted genes assemble into 47,479 phamilies. The relatively low representation of each phamily (3.8 genes/phamily) compared to the mycobacteriophages (13.4 genes/phamily) is a further reflection of the gross under-sampling of the phage population as a whole.

Implications for bacteriophage taxonomy
Bacteriophage taxonomic classification reflecting phylogeny presents substantial challenges because of genome mosaicism (Lawrence et al., 2002). Classification by viral morphology is well established, but may not accurately reflect the genetic relationships, as illustrated for the prolate-headed MooMoo (Figure 3). We also note that the mycobacteriophage myoviruses have a high CII and form a discrete group (Table 1) as do the Synechococcus myophages (Deng et al., 2014), perhaps reflecting a virulent lifestyle that constrains productive gene exchange; T4-related phages from diverse hosts share a core set of 15-20% of their genes, and whole genome comparisons reveal extensive mosaicism (Petrov et al., 2010). Host range mutability thus may differ in phages with different morphotypes, limiting access to the gene pool, and although grouping phages into clusters and subclusters provides analytical advantages because of the wide range in prevalence of different phages (Table 1), it is not suitable as a broadly applicable hierarchical taxonomic system. The comparative analysis of these mycobacteriophages thus supports reticulate taxonomies that more accurately reflect the phylogenetic complexities (Lawrence et al., 2002;Lima-Mendez et al., 2007).

Implications for student learning through research experiences
A research experience can be a powerful vehicle that enables a person to gain an understanding of the process of science (Hunter et al., 2007). When the research experience occurs early and at a large scale, as described here, the focus can shift from selecting a few 'qualified' students to exploring the potential interests of many students. Clearly, an essential ingredient is the nature of the research project, as definitions of research may vary from an inquiry-based exercise to authentic research with the potential to contribute publishable findings. To optimize the educational benefits, the research project must be intellectually and technically accessible to beginning students (i.e., few prerequisites) and scalable so that many students can simultaneously make progress in parallel, yet independently (Hatfull et al., 2006). Importantly, each student's findings should contribute to a scientific question with integration of all students' discoveries advancing a scientific question of significance, as judged by scientific peer review. This, we  believe, defines an 'authentic' research experience. We note that in the SEA-PHAGES platform, substantial student effort is invested in arriving at high-quality genome annotations by close manual inspection followed by expert verification, a critical component of the detailed comparative analysis of phage gene content described here.

Concluding comments
Bacteriophage genomics has progressed relatively slowly compared to that of other microbes in spite of their relatively small genome sizes. Here we have demonstrated that programmatically integrating the research and education missions at large scale provides an effective solution to expanding our knowledge of viral diversity, with a multitude of insights gained as a consequence of the scale of phage discovery. The nature of different genomic types, the variations of the diversity both within clusters and shared genome content among clusters, and the expanse of the mycobacteriophage population can be viewed at an unprecedented level of resolution. Our conclusions align well with comparative analyses of phages of Enterobacteriacea (Grose and Casjens, 2014) and Bacillus spp.  and we predict that these are general parameters of bacteriophage diversity, at least when sampling broadly across the environment. Both the rarefaction analysis described here and preliminary analysis of phamilies of all sequenced DNA phages illustrate how little of the global phage population has been genomically sampled. With a near endless supply of diverse viruses readily accessible for isolation and analyses, integrated research/education programs will continue to play substantial roles in defining the nature of the virosphere.

Phages and genomes
In addition to extant GenBank sequence information, mycobacteriophages were isolated, sequenced, and annotated in the Phage Hunters Integrating Research and Education (PHIRE) or SEA-PHAGES programs. Phage genomes were shotgun sequenced using either 454, Ion Torrent, or Illumina platforms to at least 20-fold coverage. Shotgun reads were assembled de novo with Newbler versions 2.1 to 2.9. Assemblies were checked for low coverage or discrepant areas, and targeted Sanger reads were used to resolve weak areas and identify genome ends. All genome sequences are publically available at phagesDB.org or in GenBank. Nucleotide comparisons used BLASTN or Gepard (Krumsiek et al., 2007).

Database construction
To create Phamerator database Mykobacteriophage_627, phamilies were constructed by first clustering the entire database of 69,574 genes using strict kClust parameters (70% clustering threshold and 0.25 alignment coverage of the longer sequence). This was followed by multiple sequence alignment of each preliminary cluster using Kalign (Lassmann and Sonnhammer, 2005). Consensus sequences were then extracted using HHmake and HHconsensus (Remmert et al., 2012). The resulting list of sequences was subjected to a second-and less strict-round of clustering via kClust (30% clustering threshold and 0.5 alignment coverage of the longer sequence) to obtain the final phamily assignments. Network phylogeny constructions were made using the NeighborNet function with default parameters in SplitsTree (Huson, 1998;Huson and Bryant, 2006). The numbers of phamilies are reported for between 1 and 627 phage genomes sampled at random without replacement; the mean of 10,000 iterations is shown in red; gray lines indicate a confidence interval of two standard deviations. The black line shows a hyperbolic curve fit to the data from phage counts 1 to 314. The inset shows the number of new phams encountered upon the inclusion of each phage, with the mean number for the 10,000 iterations shown in blue and the predicted value from the hyperbolic curve shown in black. (B) Rarefaction analysis of 232 Cluster A phages. The total numbers of phamilies are reported for between 1 and 232 phages sampled at random without replacement from Cluster A; the mean of 10,000 iterations is shown in red; gray lines indicate a confidence interval of two standard deviations. The black line shows a hyperbolic curve fit to the data from phage counts 1 to 117. The inset shows the number of new phams encountered upon the inclusion of each phage, with the mean number for 10,000 iterations shown in blue and the predicted value from the hyperbolic curve shown in black. (C) Rarefaction analysis of 108 Cluster B phages; the hyperbolic curve was fit to the data from phage counts 1 to 54. (D) Fits of the hyperbolic (Equation 1) and hyperbolic with linear (Equation 2) models for phamily identification within genome samples. DOI: 10.7554/eLife.06416.023 The following source data is available for figure 7: Source data 1. Datasets for determination of rarefaction curves. DOI: 10.7554/eLife.06416.024

Cluster diversity and isolation indices
Four parameters were used to evaluate cluster diversity. The first is the CLASP index that calculates the percentage of phamilies shared between two genomes, then averages across all possible pairs within a cluster or subcluster. Because the pairwise similarities are averaged, CLASP is relatively insensitive to either the overall size of the cluster, or the heterogeneity of its diversity (such as in Cluster C in which of the 45 genomes in total, 44 are in Cluster C1, and only one is in Cluster C2). CLASP robustness with respect to cluster size was demonstrated through a resampling analysis. For each cluster with more than 30 members, a random subset (of 5, 10, 20, or 30 genomes) was selected and CLASP was calculated. For each sample size, 20 iterations were performed with replacement. As expected, there is substantial deviation among the iterations, especially at smaller sizes. However, there is little change in the average CLASP values with different sample sizes (Figure 4-figure supplement 1), showing that cluster size is not a primary driver of diversity. The resampling analyses also suggest that while a greater number of genomes helps refine the CLASP value, there is still predictive power when only 10 genomes are compared. On average, the maximum and minimum iteration values at a sample size of 10 genomes were within 8% of the whole-cluster CLASP value. This implies that, for example, increasing Cluster D from 10 to 50 or 100 genomes may raise or lower its current CLASP value of 88.1, but that value is likely to remain between ∼80 and ∼96.
The second measure used is the CAP, which is calculated as the number of phamilies present in all genomes within a cluster divided by the average number of phamilies per genome. These cluster-conserved genes could correspond to core genes that define a particular phage group such as cluster or subcluster. However, for those clusters with sufficient diversity to detect such core genes, these values are low. For example, among the 66 Cluster F genomes, only five phamilies are present in all genomes. None are virion structural genes, one is a glycosyltransferase whose role is unknown, one is a putative regulator, and the others are small proteins of unknown function. For the Cluster A genomes, 11 phamilies are conserved, seven of which are virion structural proteins, three are involved in DNA metabolism (DNA Pol, Helicase, Rec-Like protein), and one is of unknown function.
The third parameter is the Cluster Phamily Variation (CPV) index, which is the proportion of phams that are not present in all members of the cluster. CAP and CPV are inversely related but imperfectly as CPV varies with cluster size even among similarly diverse clusters; a plot of CAP values against CPV values is shown in Figure 6-figure supplement 2.
The CCI is calculated as the average number of genes per genome as a percentage of the total number of phams in that cluster. Thus if all genomes in a cluster are identical (and if phamilies occur only once in a genome), CCI would be 100; the CCI for two sets of five randomly chosen genomes is ∼2. CCI values correlate with cluster size, but similarly sized clusters as such G, J, and L, or E and K have substantially different CCI values ( Table 1).
The CII is the percentage of phams present within a cluster that are not present in other mycobacteriophage genomes.

Rarefaction analysis
Rarefaction analysis was performed by randomly selecting subsets (without replacement) of between 1 and 627 (all), 232 (Cluster A) or 108 (Cluster B) mycobacteriophages and determining the numbers of phamilies represented. This was repeated 10,000 times to generate a mean number of phamilies observed given a number of phage genomes selected. The means of the accumulated numbers of phams and the numbers of new phages identified are plotted as the function of the number of genomes selected at random. The observed numbers were fit to a hyperbolic function for 50% of the sample (i.e., 1 to 314, 116 or 54 genomes for all, Cluster A or Cluster B phages, respectively); Hanes-Woolf regression was used to estimate Pham Max and K m of the hyperbola: where N Genomes is the number of genomes sampled, N Phams is the number of total phams seen within those genomes, Pham Max is the total number of phams among all mycobacteriophage genomes, and K m is the number of genomes required to sample one half of Pham Max . The lack of fit of the observed data to the hyperbola-with the observed data reflecting infinite size-suggests that the overall population is dynamic. The lack of hyperbolic fit of the data does not result from outliers such as phages with highly deviant GC%, because removing these does not improve the fit. The fit is also not substantially improved by analysis of the two largest clusters, Cluster A and Cluster B (Figure 7), suggesting that the dynamic nature of the gene pool is not an artifact of examining independent phage clusters with separate gene pools.
To model this behavior, we modified Equation 1 to include the introduction of novel phams via recombination with outside, non-mycobacteriophage genomes: where C Phage is the number of outside phams seen in each phage. The value of C Phage was estimated from Figure 7B and new values for Pham Max and K Pham were estimated by Hanes-Woolf regression following data normalization.