Genomic analyses of pneumococci reveal a wide diversity of bacteriocins – including pneumocyclicin, a novel circular bacteriocin

One of the most important global pathogens infecting all age groups is Streptococcus pneumoniae (the ‘pneumococcus’). Pneumococci reside in the paediatric nasopharynx, where they compete for space and resources, and one competition strategy is to produce a bacteriocin (antimicrobial peptide or protein) to attack other bacteria and an immunity protein to protect against self-destruction. We analysed a collection of 336 diverse pneumococcal genomes dating from 1916 onwards, identified bacteriocin cassettes, detailed their genetic composition and sequence diversity, and evaluated the data in the context of the pneumococcal population structure. We found that all genomes maintained a blp bacteriocin cassette and we identified several novel blp cassettes and genes. The composition of the ‘bacteriocin/immunity region’ of the blp cassette was highly variable: one cassette possessed six bacteriocin genes and eight putative immunity genes, whereas another cassette had only one of each. Both widely-distributed and highly clonal blp cassettes were identified. Most surprisingly, one-third of pneumococcal genomes also possessed a cassette encoding a novel circular bacteriocin that we called pneumocyclicin, which shared a similar genetic organisation to well-characterised circular bacteriocin cassettes in other bacterial species. Pneumocyclicin cassettes were mainly of one genetic cluster and largely found among seven major pneumococcal clonal complexes. These detailed genomic analyses revealed a novel pneumocyclicin cassette and a wide variety of blp bacteriocin cassettes, suggesting that competition in the nasopharynx is a complex biological phenomenon.


Background
The pneumococcus is among the most important pathogens worldwide: in 2000, ~14.5 million estimated cases of life-threatening pneumococcal diseases like pneumonia, bacteraemia and meningitis occurred and ~826,000 children died [1]. Pneumococcal disease can be treated with antibiotics, but antibiotic-resistant pneumococci are found worldwide, e.g. 60 % of pneumococci recovered in Asia are multidrug-resistant [2]. Pneumococcal conjugate vaccines (PCVs) are administered to children in many developed countries and some resource-poor countries, which has significantly reduced morbidity and mortality [3]; however, current PCVs only protect against 10 or 13 pneumococcal types, depending on the vaccine. Pneumococcal types are defined by an antigenic polysaccharide capsule or 'serotype': over 90 different serotypes have been characterised and new ones continue to be discovered [4,5]. Consequently, after PCV is introduced, vaccine-serotype disease decreases and nonvaccine-serotype disease often increases [6]. The prevalence of commensal pneumococci in the paediatric nasopharynx, their ecological niche, generally remains the same post-PCV, but reorders in favour of nonvaccine serotypes [7]. Vaccine escape is also possible and new variants can spread rapidly [8][9][10].
Nasopharyngeal colonisation by one or more pneumococcal serotypes is ubiquitous among children and usually asymptomatic [11]. The composition of colonising pneumococci fluctuates over time, indicating the importance of intraspecies competition in pneumococcal ecology [12,13]. The dynamics of this competition must be understood in order to explain how perturbations such as vaccine introduction affect the pneumococcal population structure and change which pneumococci compete for space and nutrients in the nasopharynx.
Bacteriocins are small, ribosomally-synthesised antimicrobial peptides or proteins produced by bacteria to inhibit other bacteria, and the producer strain has a dedicated immunity system that protects it from its own bacteriocin. Bacteriocins are a diverse group of compounds in terms of size, mode of action and immunity mechanisms, and are produced by both Gram-positive and Gram-negative bacteria. Given that the producer and target strains are often of the same bacterial species, the bacteriocins are largely believed to be involved in competition for limited resources within an ecological niche, but they may also be contributing to the maintenance of microbial diversity at a population level [14][15][16].
Pneumococci have been shown to mediate intraspecies competition by the production of bacteriocins encoded by the highly variable blp (also known as spi/pnc) cassette [17][18][19][20][21][22]. This cassette typically contains several bacteriocin-like peptide genes, clustered together with genes for putative membrane proteins that may have functions in immunity, in a region labelled the 'bacteriocin/immunity region' (BIR). The predicted pneumococcal bacteriocin peptides show homology to type II bacteriocin precursor peptides, and consist of a conserved N-terminal leader sequence ending in a double glycine motif and an alanine/glycine-rich mature peptide [17]. Previous studies have demonstrated the existence of genes for at least 10 such peptides within the blp cassette [17,20,21,22].
The blp locus is regulated by quorum sensing, in a manner reminiscent of the regulation of competence by ComCDE [17,18]. This is typically effected by a three-component system consisting of sensor histidine kinase BlpH (SpiH), response regulator BlpR (SpiR2), and peptide pheromone BlpC (SpiP). At high extracellular concentrations of BlpC the pheromone binds to BlpH, in turn activating BlpR [23]. BlpR then binds to conserved motifs in promoters, inducing expression of all genes in the locus, resulting in the production of more BlpC and bacteriocins. Similar to the bacteriocin-like peptides, BlpC has an N-terminal leader sequence [17,18]; upon recognition by the dedicated ABC transporter BlpAB (SpiABCD), the signal peptide is cleaved off, and the mature peptide is transported out of the cell [24]. Sequence analyses and induction experiments showed that different BlpC sequences correspond to separate pherotypes, with no cross-induction of the blp genes [17,18].
The aims of our study were to: i) provide a detailed characterisation and comparison of the blp bacteriocin cassettes from a large and diverse set of historical and modern pneumococcal genomes; ii) investigate cassette diversity in the context of the pneumococcal population structure; and iii) investigate the genetic stability of blp cassettes over time. We classified the blp bacteriocin cassettes based on gene content and nucleotide sequence, and revealed several novel cassette types. Surprisingly, we also discovered evidence for a circular bacteriocin cassette, which has not previously been described among pneumococci.

Results
Presence, nomenclature and classification of blp cassettes
All 336 of the pneumococcal genomes investigated (Additional file 1) had a blp bacteriocin cassette. In the literature there are two sets of gene names associated with these cassettes in pneumococci and we have identified additional novel genes in this study, adding to the confusion around nomenclature. Therefore, the gene names used in this study are presented in Table 1. Unresolvable ambiguities were encountered in seven cassette assemblies and these were excluded from further analyses. Sequences for the 329 remaining blp cassettes were separated by gene content into 33 Categories and by sequence similarity into 79 Groups. The 79 Group prototype cassettes were analysed further.

Description of the blp cassettes at the level of gene organisation
The length of the blp bacteriocin gene cassettes ranged from 9.1 to 17.5 kb (Fig. 1). All Categories had a number of genes in common at the start and end of the cassette, including the regulatory, ABC transporter and CAAX protease genes (possibly related to bacteriocin self-immunity [25]), and a membrane protein gene putatively associated with immunity (Tables 1 and 2). One exception to this was that blpY and blpZ were missing in Category 4, having been replaced by the remnants of an insertion sequence (IS) element.
The demarcation of the ABC transporter genes (blpA) and transport accessory protein gene (blpB) was highly variable even within Categories, and the frequent division of blpA into multiple ORFs was consistent with findings from previous studies [21,22]. Son et al. analysed the sequences of the transporter genes and linked the frequent presence of frameshift-causing repeats and deletions in blpA to 'cheater' (immunity only, non-inhibitory) phenotypes.
Located between the common genes, the composition of the BIR was highly variable among Categories (Table 2). The number of BIR genes ranged from 1-15 and included structural bacteriocin genes, putative immunity genes, a CAAX protease and genes with products of unknown function.
Overall, most cassettes had one copy of each gene, although Categories 22 and 28 had two copies of pncG and Categories 22, 32 and 33 had two copies of blpL. In every case, the paralogues were separated by multiple genes and their sequences differed considerably. Furthermore, both blpK and a hybrid gene blpKN were present in Category 15. We found three additional hybrid blp bacteriocin genes, resulting from the contraction of blpM + blpN in Categories 21 and 30, and blpQ + blpM in Category 26. A deletion in Category 20 fused blpJ + blpK, but left the coding sequences in separate frames, thus resulting in a different ending for blpJ and the loss of function of blpK.
Interestingly, the BIR of Categories 30-33 contained genes predicted to encode two novel class II bacteriocin precursors (blpD and blpE), a novel putative immunity protein (blpF), a CAAX protease (blpG) and a hypothetical protein (blpV; Tables 1 and 2). blpV appeared to be a partial bacteriocin precursor gene: its amino acid sequence contained two GxxxG motifs, and a typical class II bacteriocin leader sequence was located directly upstream but in a different frame. Another novel bacteriocin precursor gene (blpW) was found in the BIR of Category 29. A search of the NCBI non-redundant protein database yielded no homologous sequences in other pneumococcal genomes, but orthologues were found in Streptococcus pseudopneumoniae [RefSeq:WP_001093252; 92 % identity] and Streptococcus mitis (GenBank:KEQ38555.1; 91 % identity).
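The sequence features used above to recognise putative bacteriocin peptides (GxxxG motifs and an Ala/Gly-rich sequence) can be screened for computationally. The following is a minimal sketch and not the method used in this study: the function names and the toy peptide are hypothetical, and no threshold for calling a peptide "Ala/Gly-rich" is implied.

```python
import re


def count_gxxxg(peptide: str) -> int:
    """Count GxxxG motifs (G, any three residues, G), including overlapping ones."""
    # A zero-width lookahead lets overlapping motifs each be counted once.
    return len(re.findall(r"(?=G[A-Z]{3}G)", peptide.upper()))


def ala_gly_fraction(peptide: str) -> float:
    """Fraction of residues that are alanine (A) or glycine (G)."""
    peptide = peptide.upper()
    if not peptide:
        return 0.0
    return (peptide.count("A") + peptide.count("G")) / len(peptide)


if __name__ == "__main__":
    # Hypothetical toy sequence for illustration only; not a real Blp peptide.
    toy_peptide = "AIGAAGGLAGAVAGLLK"
    print("GxxxG motifs:", count_gxxxg(toy_peptide))
    print("Ala/Gly fraction:", round(ala_gly_fraction(toy_peptide), 2))
```

Such a screen only flags candidates; whether a gene such as blpV encodes a functional precursor also depends on the reading frame and the presence of an in-frame leader sequence, as described above.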
Cassettes belonging to Categories 9, 10, 28, 29, 32 and 33 contained one or two ORFs that, when considered together as a single gene (tdpA), encoded a protein with a thioredoxin domain and a Gram-positive anchor domain. The sequence of the complete gene from Category 29 was found to be homologous to genes found in S. pseudopneumoniae and Streptococcus mitis (99 % and 94 % amino acid sequence identity, respectively). The products of these orthologues were labelled as putative bacteriocin transport accessory proteins [GenBank:EID24163; GenBank:EID31510] or transposases [RefSeq:WP_000744512; RefSeq:WP_000795769].
Sequence diversity of predicted products of blp genes (excluding ABC transporter genes and tdpA)
The blp bacteriocin cassettes each had four regulatory genes: blpS, blpR, blpH and blpC. Based on amino acid sequences of the deduced products, 13-26 allelic variants were revealed for each protein, although each protein had 2-3 predominant alleles (Additional file 2). It was previously shown that allelic diversity of the peptide pheromone BlpC corresponds to distinct pherotypes; based upon the BlpC peptide sequences there was evidence for four distinctly different pherogroups A-D (Table 2), which corresponded to the 6A, R6, P164 and TIGR4 pherotypes plus minor variants, respectively, as detailed in Additional file 3 [17,18,22]. The diversity of BlpC was concentrated in the C-terminal half (the mature peptide), after cleavage by the ABC transporter. The BlpC pherogroups were associated with specific allelic versions of histidine kinase BlpH and response regulator BlpR. There was also a heterogeneous fifth group with mismatched BlpC/BlpH combinations and proteins encoded by sequence regions derived from multiple pherogroups.
Sixteen different putative blp bacteriocin precursor genes were identified among the cassettes: ten were described previously [17,18,21]; three were novel (blpD, blpE, blpW); and three were newly-identified hybrid genes (blpMN1, blpMN2, blpQM; Table 1). blpQ, pncT and pncW were classified as putative bacteriocin precursor genes due to the encoded peptides containing a typical leader sequence, although they lacked other salient features of bacteriocins, i.e. Ala/Gly-rich sequence and GxxxG motifs [17]. The gene pncW was previously described as a fusion gene with a mutated cleavage motif [21]; however, in our dataset there were multiple prototypes in which PncW had a presumably functional double-glycine motif and thus we included it among the putative bacteriocin precursors.
Table footnotes: Genes with products marked with boldface font are found in all cassettes; the remaining genes are found in the bacteriocin/immunity region (BIR). The indicated function is supported by experimental evidence.
Table footnotes: Letters represent blp genes; those marked with an asterisk represent genes with only a pnc name. Letters in boldface font represent bacteriocin precursor genes; those underlined represent membrane protein genes (with a putative function in immunity).
The number of blp bacteriocin precursor genes ranged from 1-6 per cassette (Table 2). Most had 1-2 predominant amino acid alleles, and some also had minor allelic variants (Additional file 2). The peptides BlpM and BlpN were previously shown to contribute to intraspecies competition in a murine model of colonisation, and a difference of five amino acids in bacteriocin BlpMN was sufficient to change the inhibitory properties of a strain in overlay assays. Further in vitro mutagenesis work indicated that the product of blpP, located directly downstream of blpMN, mediates immunity against this bacteriocin [20].
The number of genes encoding putative immunity proteins ranged from 1-7 per cassette (Table 2). Six such genes were found within the BIR, and common gene blpZ was located near the end of the bacteriocin cassette. Based on the amino acid sequences, 2-20 alleles per protein were identified, although each putative immunity protein had 1-3 predominant alleles (Additional file 2).
Three genes in the bacteriocin cassettes coded for putative CAAX amino terminal proteases: blpY and pncP, located at the end of every cassette, and a novel gene (blpG) within the BIR of prototypes 30-33. Several major and minor alleles for BlpY and PncP, but only one BlpG allele, were identified. Roles in immunity against bacteriocins have been suggested for CAAX proteases: they may contribute to immunity by processing or degrading proteins, but their targets are unknown [17,21,25,26]. In vitro site-directed mutagenesis indicated that CAAX protease BlpY is essential for immunity and bacteriocin activity [21].
Finally, two genes encoding hypothetical proteins were identified in the blp bacteriocin cassettes: blpT is the first gene present in all cassettes and most prototypes possessed one of two major amino acid alleles at this locus; and blpV was newly identified in this study in four prototypes, all of which had identical BlpV sequences (Additional file 2).

Molecular epidemiology of the blp bacteriocin cassettes
Category 14 was the most prevalent and widely-distributed cassette: it was present in 94 pneumococci of 13 serotypes, recovered from 1939-2007 in 18 countries around the world (Table 3). These pneumococci were members of 15 different clonal complexes (CCs), seven of which are major pneumococcal CCs circulating globally. Category 19 was genetically the most diverse type of blp cassette: these cassettes were found in pneumococci of 11 serotypes isolated from 1952-2006 in eight countries, and the pneumococci were members of 11 CCs. Thirteen different Groups could be identified based on variation among ~1700 nucleotides across the 14.5 Kb cassette, and these Groups could be clustered into three major phylogenetic clusters based on the nucleotide sequences (Additional file 4).
In contrast, Categories 30, 6 and 11 were found in pneumococci isolated from several countries, but from a single CC associated with 1-2 serotypes, as shown in Tables 3 and 4. All eight examples of cassette Category 6 had nearly identical nucleotide sequences across the 11.2 Kb blp cassette, and they were found among CC191 7F/A pneumococci recovered over the last seven decades. Category 30 and 11 blp cassettes were also nearly identical among modern pneumococci, although genomes of older isolates of either CC236/271/320 19F/A or CC180 3 were not available to assess whether stability among Categories 30 and 11 persists over longer time periods, and the same blp cassette sequences were not found among any of the historical pneumococci of other CCs. Table 4 also shows that CCs generally possessed a blp cassette of only one, or one predominant, Category. This was not true for serotype and blp Category: among the most prevalent serotypes each was associated with several blp Categories (Additional file 5).
Category 6 cassettes were genetically stable, but there were also Groups within Categories that appeared to be similarly stable over several decades and found in widely-circulating CCs. Three examples were: i) Group 14a, identified in 61 genomes from 1952-2007, all but two of which were in CC81 23F (n = 39), CC199 19A (n = 12) and CC66 14/19F/9V (n = 8); ii) Group 23a, found in 21 genomes dating from 1967-2006, all but two of which were CC15 14; and iii) Group 24d, identified in 9 genomes from 1939-1999, all of which were CC113 18C. Further details are provided in Additional files 6 and 7.
Moreover, pneumococci are known to be recombinogenic and able to exchange large DNA fragments between unrelated pneumococcal lineages [8,10,28,29]; therefore, we were interested in whether we could also identify evidence for regions of blp cassette sequence that were shared between unrelated CCs. The assembled prototype blp cassette sequences were thus aligned and inspected, and regions of identical or nearly identical sequence were indeed found between sequences of different prototypes. Several examples are shown in Fig. 2, which depicts blp cassette sequences shared between some of the major pneumococcal CCs in a pattern consistent with recombination. Other examples of putative blp cassette recombination can be found in Additional file 8.

Discovery of a novel pneumococcal circular bacteriocin (pneumocyclicin) cassette
During the investigation of the blp cassette, we identified a cluster of six genes located upstream of comAB in several strains: sequence analyses and BLAST searches suggested that this gene cluster, with a length of ~4.4 Kb, encodes the biosynthetic locus of a novel circular bacteriocin, which we provisionally named pneumocyclicin. This is the first report of circular bacteriocin cassettes among pneumococci, although they have been described in other Gram-positive species, including other Streptococcus spp., as explained below.
The pneumocyclicin genes were designated pcyA-E (pneumocyclicin A-E) and pfgR (pcyA-flanking gene response regulator). Predicted functions and physicochemical properties of their putative products are presented in Table 5. The gene organisation of the pneumocyclicin cassette was similar to that of previously identified circular bacteriocin cassettes [30][31][32], particularly that of uberolysin from Streptococcus uberis [33] and circularin A from Clostridium beijerinckii [34], as shown in Fig. 3a.
The first gene in the pneumocyclicin cassette, pfgR, encodes a transcriptional regulator with an N-terminal helix-turn-helix XRE-family-like domain. pcyA encodes the pneumocyclicin precursor, a 98-amino acid polypeptide from which pneumocyclicin could be derived by removal of an N-terminal leader sequence and circularisation of the remaining peptide [30][31][32]. Sequence analyses, a high isoelectric point (pI) and positively-charged residues indicated that pneumocyclicin belongs to the class IIc(i) circular bacteriocins, which have limited sequence similarity but share a common protein architecture consisting of four or five α-helices that form a saposin fold (Fig. 3b) [35]. The structural bacteriocin gene was followed by three genes encoding two putative membrane proteins (pcyB and pcyC) and one soluble ABC transporter ATP-binding protein (pcyD; Fig. 3a). The specific functions of PcyB and PcyC equivalents in other known cassettes are still unclear, but evidence points to roles in bacteriocin maturation and/or transport and immunity. PcyC, like CclC, As-48C, CirC and UblC, is a member of a protein family containing the domain of unknown function 95 (DUF95) [36]. The genes as-48B and cirBCD have been shown to be essential for production of AS-48 and circularin A, respectively, and as-48C and cirBD are required for full immunity to these bacteriocins [34,37]. Finally, pcyE encodes another putative membrane protein. In the AS-48 and circularin A cassettes, the equivalents as-48D1 and cirE encode small hydrophobic peptides confirmed to be involved in immunity, but not sufficient for full resistance to these bacteriocins [34,37]. However, the predicted PcyE protein was larger and contained more putative membrane spanning helices than either As-48D1 or CirE.
The cassettes for AS-48, carnocyclin A and circularin A each contain genes for a multicomponent ABC transporter downstream of the genes minimally required for bacteriocin production [34,36,38]. Although not essential, these genes were shown to enhance both production of and immunity to AS-48 [38]. Interestingly, the comAB operon located downstream of the pcy genes also encodes an ABC transporter, which is known to process and transport the competence-stimulating peptide CSP (Fig. 3a) [39].

Prevalence and sequence diversity of the pneumocyclicin cassette
The pneumocyclicin cassette was present in 115 (34 %) of the 336 pneumococcal genomes in our dataset. Forty distinct nucleotide alleles described the 115 pneumocyclicin cassettes and they formed two sequence clusters, as shown in Fig. 3c. The overall nucleotide diversity was low, with a mean p-distance of 0.002 among all alleles and 0.001 within each sequence cluster. Sequence differences within and between clusters were concentrated within pfgR and the non-coding sequence between pfgR and pcyA. Four adjacent ATTT repeats were identified within pcyB (encoding one of the membrane proteins); several alleles from both clusters had ±1 copy of this repeat, which resulted in a frameshifted coding sequence. This is reminiscent of the four-nucleotide repeat seen within blpA of the blp bacteriocin cassette and associated with the 'cheater phenotype' by Son et al. [22].
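The p-distance reported above is the proportion of aligned nucleotide sites at which two sequences differ, and the mean p-distance averages this over all pairs of alleles. A minimal sketch of that calculation follows. It assumes pre-aligned, equal-length sequences and skips sites with gaps or ambiguous bases (pairwise deletion), which is an assumption of this example and not a documented setting of the study; the function names and toy alleles are hypothetical.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of comparable aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for base_a, base_b in zip(seq_a.upper(), seq_b.upper()):
        # Assumption: only unambiguous A/C/G/T sites are compared.
        if base_a in "ACGT" and base_b in "ACGT":
            compared += 1
            if base_a != base_b:
                differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def mean_p_distance(sequences: list[str]) -> float:
    """Mean p-distance over all unordered pairs of sequences."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical toy alleles for illustration only.
    toy_alleles = ["ATTTATTTGC", "ATTTATTTGA", "ATTTACTTGA"]
    print(round(mean_p_distance(toy_alleles), 3))
```

The same arithmetic explains the frameshift noted above: gaining or losing one ATTT copy changes the sequence length by four nucleotides, which is not a multiple of three, so the downstream reading frame shifts.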

Molecular epidemiology of the pneumocyclicin cassettes
Pneumocyclicin cassettes of both phylogenetic clusters were found in genomes dating from 1939 onwards. Cluster A alleles were by far the more prevalent of the two clusters, detected in 106 pneumococci recovered from 1939-2008 in 12 different countries. The pneumococci with cluster A alleles were members of 21 different CCs and were of 20 different serotypes. The nine cluster B alleles were found in pneumococci of seven different CCs and serotypes, and were recovered from 1939-2007. Among the 115 genomes with a pneumocyclicin cassette, 80 % (n = 92) were members of one of seven CCs, six of which are major CCs circulating globally and one of which, CC1094, is a major South African CC (Table 6; [27]). Apart from one exception in a CC124 genome, all major CCs possessed cluster A pneumocyclicin alleles. Seven of these alleles were the most prevalent and together they represented 71 % (n = 75) of all 106 cluster A alleles. The sequences of these seven alleles were very similar, differing at only 12 nucleotides in total across the ~4.4 Kb pneumocyclicin cassette.
Fig. 3 The pneumocyclicin cassette putatively encoding a newly-identified circular bacteriocin within the pneumococcal population. a. Comparison of the gene organisation of the pneumocyclicin cassette with those of class IIc(i) circular bacteriocins AS-48, carnocyclin A, circularin A and uberolysin. The genes comA and comB, adjacent to but not considered to be part of the pneumocyclicin cassette, were included to demonstrate that the synteny with other circular bacteriocin cassettes extends beyond the minimal cassettes and includes the downstream ABC transporter genes. b. Comparison of the pneumocyclicin precursor peptide sequence and predicted secondary structure with that of other class IIc(i) circular bacteriocins. Figure adapted from Martin-Visscher et al. [35]. c. Neighbour-joining tree demonstrating two distinct clusters of alleles representing the 115 pneumocyclicin cassettes in the study dataset.

Discussion
Bacteriocins have generated renewed interest because of the major problems associated with antibiotic-resistant bacteria and the possible role of bacteriocins as alternatives to conventional antibiotics. There are many publications that describe the array of bacteriocins produced by many Gram-positive and Gram-negative bacteria. Comparatively less is known about pneumococcal bacteriocins, although several studies have delineated the genetic structure, function of some genes, and diversity of pneumococcal blp bacteriocin cassettes on a small number of pneumococcal strains, at a time when large scale sequencing was a challenge and limited genome data were available [17,20,21,22]. In this study we found that blp bacteriocin cassettes were ubiquitous among a diverse set of pneumococcal genomes that dated back to 1916, in a variety of permutations with respect to both the genetic background of the host pneumococcus and the genetic composition of the blp bacteriocin cassette. Several novel genes and blp bacteriocin cassette types were also revealed.
Most surprisingly, we discovered that in addition to a blp bacteriocin cassette, a third of the pneumococcal genomes also possessed a cassette encoding a putative circular bacteriocin. To date, circular bacteriocins have predominantly been identified among the phylum Firmicutes (which includes pneumococci) and are believed to be involved in niche competition. Circular bacteriocins in other Gram-positive bacterial species have been shown to permeabilise the bacterial cell membrane and cause cell death. They are ribosomally-synthesised and post-translationally modified to form a circular structure: as compared to a linear structure, the circular form is more stable, less susceptible to protease degradation and therefore more active. Circular bacteriocins potentially have a role in drug design, delivery and therapeutics, although many questions related to their structure, function and mechanisms of action remain to be determined [14,31,32].
The pneumocyclicin cassettes we discovered were similar in genetic structure and predicted proteins to circular bacteriocin cassettes characterised in other Gram-positive species. Nucleotide sequence similarity among the pneumocyclicin cassettes was high and the majority of cassettes were found in just seven pneumococcal CCs. It is curious that this cassette is just upstream of comAB, genes that have been shown to be essential to the development of competence, which is a specific point in the growth cycle during which a pneumococcus can take up and incorporate exogenous DNA into its genome [39,40]. Recombination and transformation play a major role in the evolution of the pneumococcus; therefore, it will be crucial to understand not just the structure and function of pneumocyclicin but the potential impact that expression of pneumocyclicin genes may or may not have on competence induction.
Table fragment: Years of isolation: 1978-1988, 1952-2005, 1999-2005, 1952-2007, 1939-1999, 1962-2008, 1945-1996, 1916-2007, 1916-2008. Each of the 'Other' alleles was detected 1 or 2 times.
Among blp bacteriocin cassettes, there was a wide repertoire of putative bacteriocin and immunity genes in the BIR. Many individual cassettes possessed multiple putative bacteriocin genes and immunity genes: for example, Category 19 possessed six bacteriocin genes, six immunity genes and two CAAX protease genes, and genomes in six of the Category 19 Groups also possessed a cluster A pneumocyclicin cassette. Further work will be required to confirm the function of the genes in the BIR and understand the biology that underpins the diversity of bacteriocin and immunity proteins. Does the possession of multiple bacteriocin and immunity genes simply mean the pneumococcus has an expanded bacteriocin arsenal and broadened immunity?
A recent paper provided theoretical support for high bacteriocin diversity within a bacterial population and demonstrated that the maintenance of multiple bacteriocin and immunity types relied upon two key factors: circulating strains that were immune to the toxic effect of bacteriocins; and individual strains/lineages that were able to produce multiple bacteriocins and/or immunity proteins [41]. Is there a fitness cost associated with possession of multiple bacteriocin and immunity genes? Do some of these genes have alternative functions? Interestingly, a recent study demonstrated upregulation of blp bacteriocin genes in a pneumococcal infection animal model, possibly suggesting a role for some blp genes in pathogenesis and/or virulence [42].
We also evaluated the blp cassette diversity in the context of the pneumococcal population structure and found that some blp bacteriocins were found in many CCs whilst others seem to be restricted to one predominant CC. There was no obvious association between blp cassette type and serotype. Interestingly, some blp bacteriocin cassettes were genetically stable over many decades; although not surprisingly, patterns of putative large-fragment recombination similar to that previously reported in other recent pneumococcal studies were also identified [8,10,43].
Pneumococcal competition appears to be more complicated than just producing a bacteriocin peptide to kill competitors and an immunity protein to protect itself. Is the observed diversity of pneumococcal bacteriocins and/or the possession of multiple bacteriocin and immunity genes a reflection of a wider target specificity designed for nasopharyngeal competition? The paediatric nasopharynx is colonised by a variety of different microorganisms so it may be that the targets of the blp bacteriocins and pneumocyclicin are not solely pneumococci, but also viridans streptococci, Haemophilus, Moraxella, and others [13]. In previous work, Lux and colleagues investigated the in vitro inhibitory activity of pneumococci that produced bacteriocins using Micrococcus luteus and Lactococcus lactis as indicator strains, and they tested three blp bacteriocin-producing pneumococci against other oral streptococci and observed some inhibition [21]. Therefore, it is possible that the bacteriocins described here are predominantly mediating pneumococcal population-level interactions, but that they are also central to the interactions between pneumococci and other bacterial species residing in the nasopharynx.
Moreover, there is evidence that bacteria can engage in cooperative efforts within an ecological niche, often by means of quorum sensing whereby an individual detects and responds to an extracellular signal. However, this can result in 'social cheaters', individuals who benefit from the cooperative efforts of the population without the fitness cost of exhibiting the specific traits themselves [15]. Son and colleagues demonstrated potential cheating behaviour among pneumococcal strains that produced immunity proteins but not the signalling pheromone or bacteriocins (due to frameshifted sequences), thereby avoiding costly bacteriocin production [22]. Intriguingly, we noticed that some alleles of pcyB in the pneumocyclicin cassette were frameshifted in a similar manner. Future studies will need to be designed to determine the intra- and interspecies activity of pneumococcal bacteriocins and the potential for a cheater phenotype among pneumococci with pneumocyclicin cassettes.

Conclusions
Vast quantities of bacterial genome data have been generated in recent years and we used bioinformatics and computational biology tools to decipher the sequence-based evidence for bacteriocins, after which experimental studies can be designed to investigate specific hypotheses using carefully-selected candidate strains. One can look both forward and backward using these data: existing experimental evidence guides the interpretation of the genomic data, the design of new studies and the selection of test strains, but existing experimental data can also be reinterpreted based on genome sequence data. The sequence data are only predictive of function, but access to such comprehensive data is an efficient and cost-effective starting point to the challenging experimental work and should increase the likelihood of successful laboratory experiments.

Genome collection
The study dataset consisted of whole genome sequence data for 336 pneumococcal isolates recovered from 1916 to 2008 in 32 different countries (Additional file 1: Table S1): 206 published genomes [10,43-46] and 130 GenBank genomes [47]. Ethical approval was not required to use any of the isolates in this study. Where not previously published or available online, serotype/group, multilocus sequence type and clonal complex were assigned as previously described [43].

Classification of blp cassettes and prototype selection
Genomes were initially divided into groups based on approximate allelic profiles, as determined by a combination of two BIGSdb Genome Comparator [48] analyses, using as references the genes blpS through to pncP from the TIGR4 genome, and blpQ, pncT and pncW from the 2306 genome [21]. After sequence assembly, cassettes were divided into 'Categories' on the basis of gene presence and synteny, and into 'Groups' on the basis of sequence similarity, ignoring the presence of IS elements. When Categories were assigned, predictions of functionality were not taken into account.
Based on sequence alignments, separate Groups were created for sequences that differed by >15 nucleotide substitutions. Where the presence of an indel changed gene organisation, separate Categories were assigned. A Group prototype was chosen based upon two criteria: i) fully sequenced, assembled and gap-free cassette (although gaps within IS elements were allowed); and ii) the cassette sequence from the oldest pneumococcus.
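The Group threshold and the two prototype criteria can be illustrated with a minimal Python sketch. It assumes the cassette sequences are already aligned to equal length, and it compares each sequence against one representative per Group in a greedy fashion; that grouping strategy is an assumption made for illustration, not the procedure used in this study. The function names and the record fields (`year`, `gap_free`) are hypothetical.

```python
from typing import Dict, List, Optional

IGNORED = {"-", "N"}  # assumption: gap and ambiguous columns are not counted


def count_substitutions(a: str, b: str) -> int:
    """Count mismatching positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x != y and x not in IGNORED and y not in IGNORED
    )


def assign_groups(aligned: Dict[str, str], threshold: int = 15) -> List[List[str]]:
    """Greedy grouping: a sequence starts a new Group when it differs from
    every existing Group representative by more than `threshold` substitutions."""
    groups: List[List[str]] = []
    for name, seq in aligned.items():
        for members in groups:
            representative = aligned[members[0]]
            if count_substitutions(representative, seq) <= threshold:
                members.append(name)
                break
        else:
            groups.append([name])
    return groups


def select_prototype(cassettes: List[dict]) -> Optional[dict]:
    """Among gap-free cassettes (criterion i), pick the oldest isolate (criterion ii)."""
    complete = [c for c in cassettes if c["gap_free"]]
    return min(complete, key=lambda c: c["year"]) if complete else None
```

With a greedy scheme the result can depend on input order; a full pairwise distance matrix with a clustering step would avoid that at the cost of more comparisons.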

Identification and assembly of blp cassette sequences
The blp cassette was defined as the genes between and including blpT and pncP. Completeness of the assemblies was checked by comparing automated gene predictions to expected gene presence based on the allelic profile, and by interrogating each genome for all known blp genes and those newly identified in this study. We also performed a second round of assemblies, this time using high throughput sequencing reads and their quality scores as coded in fastq files. Selected cassettes were re-assembled by mapping of the Illumina reads to the original Velvet contigs with SMALT, followed by inspection, correction, and manual joining of relevant contigs in Gap5 [49,50]. IS elements within blp cassettes were not assembled but identified by their end sequences and left as gaps in the sequence database. After extraction of the cassette sequences into fasta files, the 5' and 3' ends of each IS element were trimmed to 25 bp and joined by Ns, so as to match the length listed for the specific IS element by ISfinder [51].
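The IS element placeholder described above (25 bp retained at each end, joined by Ns to the length listed by ISfinder) could be constructed roughly as follows. This is a sketch only: the function name and arguments are hypothetical, and the element length must be supplied by the user from the ISfinder entry.

```python
def is_element_placeholder(five_prime_end: str, three_prime_end: str,
                           isfinder_length: int, keep: int = 25) -> str:
    """Build a placeholder for an unassembled IS element.

    Keeps `keep` bases from each identified end and pads the middle with Ns
    so that the total length equals `isfinder_length`.
    """
    if len(five_prime_end) < keep or len(three_prime_end) < keep:
        raise ValueError("each end sequence must contain at least `keep` bases")
    if isfinder_length < 2 * keep:
        raise ValueError("element length is shorter than the two retained ends")
    left = five_prime_end[:keep]       # first 25 bp of the 5' end
    right = three_prime_end[-keep:]    # last 25 bp of the 3' end
    return left + "N" * (isfinder_length - 2 * keep) + right
```

Padding to the catalogued length keeps downstream coordinates and cassette lengths comparable between genomes even though the element itself was not assembled.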

Gene prediction and annotation
Genes were predicted with Prokka [52], Artemis was used for sequence visualisation and manual editing of annotations [53], and RATT was used to transfer annotations between genomes with similar cassettes [54]. The coding sequence (CDS) features predicted by Prokka were modified in some cases: i) blpO and pncM, inconsistently predicted, were manually annotated; ii) blpP, not predicted in any genomes, was manually annotated where its sequence was present because of prior experimental evidence for a role in immunity [20]; iii) bacteriocin CDSs with multiple putative start codons close together were adapted to start at M(D/N)T; iv) CDSs related to IS elements were replaced by mobile element features; and v) frameshifted genes were labelled, except within the transport region where CDSs were ambiguously organised and thus left as originally predicted by Prokka. Functional annotation was based on previously published literature, BLAST searches of non-redundant databases, and interrogation of the BAGEL3 [55] and BACTIBASE databases [56]. Sequence and annotation files for blp cassette prototypes are found in Additional file 9.
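Modification iii) can be sketched as a simple motif search on the translated CDS. The example below is illustrative only: it assumes the translation begins at the most upstream predicted start codon and that the first M(D/N)T occurrence is the one wanted; neither assumption is stated in the text, and the function name is hypothetical.

```python
import re
from typing import Optional

# M followed by D or N, followed by T, i.e. the M(D/N)T motif
START_MOTIF = re.compile(r"M[DN]T")


def adjusted_start_offset(protein: str) -> Optional[int]:
    """Return the nucleotide offset (0-based, relative to the original CDS
    start) of the first M(D/N)T motif, or None if the motif is absent."""
    match = START_MOTIF.search(protein.upper())
    return None if match is None else match.start() * 3
```

The returned offset would be added to the 5' coordinate of a forward-strand CDS (or subtracted from the 3' coordinate on the reverse strand) when editing the annotation.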

Identification and annotation of pneumocyclicin cassettes
The pneumocyclicin (pcy) cassette was defined as the genes between and including pfgR and pcyE and identified as described above. The cassettes were separated into nucleotide alleles using the NRDB tool [57] and assigned to phylogenetic clusters with MEGA5 [58]. Gene prediction and annotation with Prokka was complemented with manual BLAST searches and analysis through BAGEL3 to confirm its nature as a bacteriocin cassette. Sequence and annotation files for all pcy alleles are found in Additional file 10.
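Separating cassettes into nucleotide alleles amounts to collapsing identical sequences and numbering the distinct ones. The sketch below shows that idea in plain Python; it is not the NRDB tool itself, and the function name and the numbering scheme (order of first appearance) are assumptions.

```python
from typing import Dict


def assign_alleles(sequences: Dict[str, str]) -> Dict[str, int]:
    """Map each genome name to an allele number; identical nucleotide
    sequences (case-insensitive) receive the same number."""
    allele_by_sequence: Dict[str, int] = {}
    allele_by_genome: Dict[str, int] = {}
    for genome, seq in sequences.items():
        key = seq.upper()
        if key not in allele_by_sequence:
            # New distinct sequence: give it the next allele number
            allele_by_sequence[key] = len(allele_by_sequence) + 1
        allele_by_genome[genome] = allele_by_sequence[key]
    return allele_by_genome
```

For example, `assign_alleles({"g1": "ACGT", "g2": "acgt", "g3": "ACGA"})` places `g1` and `g2` in one allele and `g3` in a second.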

Software used for sequence analyses and visualisation
BIGSdb [48] was used to store and query assembled blp cassette prototype sequences and associated metadata. BLAST searches were performed in BIGSdb or using BioEdit [59]. Sequence alignments and phylogenetic analyses were performed with MEGA5 and progressiveMauve [60]. Bacteriocin cassettes were visualised with DNAPlotter [61] and Inkscape [62].