A cross-species genome-wide analysis of sequences similar to those involved in DNA uptake bias in the Pasteurellaceae and Neisseriaceae families of pathogenic bacteria

Acquiring new DNA allows the emergence of drug resistance in bacteria. Some Pasteurellaceae and Neisseriaceae species preferentially take up specific sequence tags. The study of such sequences is therefore relevant. They are over-represented in the genomes of the corresponding species. I found similar sequences to be present only in, but not in all, the genomes of the Pasteurellaceae and Neisseriaceae families. The genomic densities of these sequences are different both between species and between families. Interestingly, the family whose genomes harbor more of such sequences also shows more sequence types. A phylogenetic analysis allowed inferring the possible ancestral Neisseriacean sequence and a nucleotide-by-nucleotide analysis allowed inferring the potential ancestral Pasteurellacean sequence based on its genomic footprint. The method used for this work could be applied to other sequences, including transcription factor binding and repeated DNAs.


Introduction
DNA acquisition by bacteria not only has significant clinical, epidemiological and economic implications, it is also of interest to different research areas, including research on bacterial acquisition of drug resistance and virulence, recombination, genome dynamics and evolution, speciation, etc. In fact, when it was discovered by Griffith in 1928 [1], the bacterial ability to acquire DNA provided the first proof of the physical nature of the hereditary material. Bacteria can acquire new DNA via three mechanisms. Conjugation and transduction result in DNA entry into new bacterial cells thanks to the transfer of plasmids between cells and to infection by bacteriophages, respectively. The third mechanism, competence for transformation, relies on uptake of extracellular DNA by the bacterial cell itself. Transformation of bacteria in laboratories is usually artificial and requires chemical (e.g., calcium chloride) or physical (heat shock, electric pulses) treatments; the treated bacteria are thus artificially made competent for transformation. However, many bacterial species are naturally competent for spontaneous transformation after uptake of extracellular DNA (for a review see [2]). Some of these bacteria seem constitutively competent, whereas others are competent only when certain (generally stressful) conditions are met.
Among the bacteria that are naturally competent for transformation, species from the Pasteurellaceae and Neisseriaceae families are special since they exhibit bias in DNA uptake. They preferentially take up DNA that contains specific short sequences (tags) called Uptake Signal Sequences for the Pasteurellaceae and DNA Uptake Sequences for the Neisseriaceae. Following the logic explained in [3], we will refer to both of these sequences using the unifying name of DNA Uptake Enhancing Sequences (DUESs). The protein that possibly binds the extracellular DNA has been pinpointed relatively recently [4,5], but the mechanistic reason for the bias towards the DUESs is still a matter of ongoing research. Still, we know that the preferential uptake of DNA fragments that contain the preferred DUES leads to accumulation of these sequences in the genome which, in turn, results in preferential uptake of DNA from conspecific cells (see [3,6]).
The Pasteurellacean and Neisseriacean DUESs thus play a crucial role in the horizontal movement of DNA between competent cells, strains and species of these two important, pathogen-containing bacterial families. DUESs therefore have a fundamental impact on all the practical and scientific aspects relating to the acquisition of new genetic material by these bacteria (among which the acquisition of antibiotic resistance and virulence stand out). Similarity in DUESs would ease the uptake of DNA across competent cells, strains and species, whereas differences between DUESs would form partial or even full barriers that would hamper cross-species DNA uptake and horizontal transfer of DNA between species. Determining the nature of the DUESs could therefore be important for predicting, and planning strategies to deal with, the exchange of genetic material between species of these bacterial families. Surprisingly, the real nature of these sequences has not been fully studied and discrepancies exist between works. The Neisseria DUES was suggested to be the 10 bp sequence GCCGTCTGAA [7,8] until 2007, when I suggested that the 12 bp sequence ATGCCGTCTGAA seems more likely to be the DUES of N. gonorrhoeae [3]. Further works [9][10][11] confirmed that suggestion and, in 2013, other 12 bp DUES-like sequences were identified in other Neisseriaceae species [12]. As for the Pasteurellacean DUES, it was first suggested to be the 11 bp sequence AAGTGCGGTCA [13]. Later data seemed to suggest that only a 9 bp sequence is needed for efficient DNA uptake [14] and, when the Haemophilus influenzae genome was sequenced (becoming the first genome of a free-living organism to be sequenced [15]), the genomic data seemed to confirm that the H. influenzae DUES is the 9 bp sequence AAGTGCGGT [16][17][18]. A later work added the 9 bp sequence ACAAGCGGT as another Pasteurellacean DUES type [19] and, a year later, I suggested that the main Pasteurellacean DUES sequence is probably the 10 bp AAAGTGCGGT [3]. The picture is therefore still confusing, and semi-circularity and inadequacy of the logic and methods of the analyses might be hiding the real nature of the sequences that naturally competent Neisseriaceae and/or Pasteurellaceae species actually prefer to take up. Determining the nature of the preferred sequence experimentally would require tedious, time-consuming and scarcely rewarding combinations of mutagenesis and DNA uptake tests of sequences of different sizes and compositions. It would also face the problem of the species that are still uncultivable and of the species whose requirements for becoming competent are not yet known. The availability of genomic sequences and the affordability of adequate computing capacity mean that a computer-based approach to finding the real DUES should be more accessible than the 'wet-lab' approach, especially if a multi-species analysis is planned. The key issue, though, is to have well-designed (i.e., innovative) strategies that are based on what we experimentally know about the core sequence and about the function and consequences of the DUES, while minimizing, but not abolishing, the biased guiding of the search target sequences and methods towards the already known DUESs.
Here, I start by screening for the presence of DUES-like sequences in the complete and unfinished genome sequences of the Pasteurellales and Neisseriales species available in the GenBank database. This allowed me to determine the 10 or 12 bp potential DUES in each genome, based on the logic that it must be the most over-represented DUES-like sequence in the corresponding genome. In a further step, and for each type of potential 10 or 12 bp DUES, I determine what should be the corresponding real DUES, based on the logic that it should be the most over-represented among the sequences that either contain or are contained by the potential DUES previously determined for the respective species. I then establish the phylogenetic relationships between the different DUESs and infer their ancestral sequences.

Material and methods
Fig. 1 schematizes the main steps of the workflow followed for this work. These were:

Obtaining the genome sequences
The complete and draft genome sequences of the species used in this work were downloaded from GenBank's FTP site.
M. Bakkali

Counts of the DUES-like sequences
Each complete and draft genomic sequence was searched using a 10 or 12 bp sliding window approach and the computer script Sequences_Extractor.pl from [3]. The number of observed sequences in each genome was compared to the expected number of the same sequence in the same genome in order to detect the most over-represented DUES-like sequence in the studied genome. The expected number of each given sequence in a genome was calculated using the formula used in previous works (e.g., Bakkali 2007), that is, $2L \cdot A^{a} T^{t} C^{c} G^{g}$, where, for each nucleotide, the capital letter is the genomic proportion of that nucleotide, the superscript letter is the number of occurrences of that nucleotide in the analyzed (potential DUES) sequence and L is the genome's length in bp. The most over-represented DUES-like sequence in each genome was determined based on the value of the Chi-squared in the comparison between the observed and expected numbers of the sequence in question.
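The expected-number formula and the Chi-squared comparison described above can be sketched as follows. This is a minimal illustration, not the Sequences_Extractor.pl script from [3]; the function names are hypothetical, and the Chi-squared is assumed to be the single-term statistic (obs − exp)²/exp.

```python
from collections import Counter


def base_frequencies(genome):
    """Genomic proportion of each nucleotide (A, T, C, G) in the sequence."""
    counts = Counter(genome)
    total = sum(counts[b] for b in "ACGT")
    return {b: counts[b] / total for b in "ACGT"}


def expected_count(kmer, genome_length, freqs):
    """Expected number of `kmer`: 2 * L * A^a * T^t * C^c * G^g."""
    expected = 2 * genome_length
    for base in kmer:
        # Multiplying once per base is the same as raising each
        # proportion to the number of times the base occurs.
        expected *= freqs[base]
    return expected


def chi_squared(observed, expected):
    """Single-term Chi-squared (assumed form): (obs - exp)^2 / exp."""
    return (observed - expected) ** 2 / expected


# Hypothetical usage with uniform base composition and a 1 Mb genome.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
exp_n = expected_count("AAAGTGCGGT", 1_000_000, uniform)
```

Note that this statistic is also large when a sequence is under-represented, so the sign of obs − exp has to be checked separately when ranking candidates.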

Identification of the real DUES
Identification of the real DUES was based on identifying the most over-represented sequence among those that either contain or are contained by the most over-represented 10 or 12 bp sequence in the respective genome. The value of the Chi-squared in the comparison between the observed and expected numbers was used as the quantifier of the over-representation. To that end, a sliding window screening for all the 3 to 20 base sequences was carried out in each of the studied DUES-containing Pasteurellacean and Neisseriacean genomes. The expected and observed numbers of each 3 to 20 base sequence that contains or is contained in the corresponding DUES-like sequence were used to estimate the over-representation of each potential DUES at each size level.
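The screening step described above can be sketched as a sliding window that keeps only the k-mers related to the core DUES-like sequence. This is a hedged illustration under simplifying assumptions: the function name is hypothetical, only one strand is scanned, and the toy genome is invented for the example.

```python
from collections import Counter


def related_kmer_counts(genome, core, k):
    """Count k-mers that contain `core` (k >= len(core))
    or are contained in `core` (k < len(core))."""
    counts = Counter()
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        if core in kmer or kmer in core:
            counts[kmer] += 1
    return counts


# Hypothetical toy genome carrying one copy of the 10 bp core.
toy_genome = "TTAAAGTGCGGTCC"
core = "AAAGTGCGGT"

# Screen every window size from 3 to 20 bases, as in the text.
by_size = {k: related_kmer_counts(toy_genome, core, k) for k in range(3, 21)}
```

Each observed count would then be compared with its expected number to obtain the over-representation at that size level.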

Phylogenetics
In order to infer the ancestral DUES sequence I mapped the distribution of the different actual DUESs on the 16S-based species phylogeny and complemented the results with an analysis of the alignment of the different DUESs, in the case of the Neisseriaceae, and an analysis of the over-representation of nucleotides, in the case of the Pasteurellaceae.
For the species phylogenies, the sequences of all the 16S ribosomal DNAs of the different strains and species of the Pasteurellaceae and Neisseriaceae families were downloaded from the NCBI database and used for alignment and maximum likelihood phylogenetic tree building using MAFFT [20,21]. A similar tree was also built for the DUES-harboring species. In both cases the 16S ribosomal DNA from E. coli was used as outgroup. MEGA [22] was used for tree drawing and editing.
Given their short size and small number, the alignment of the different DUESs was made and edited using BioEdit [23]. The analysis of the nucleotide over-representation for the pasteurellacean DUESs was more laborious. It was carried out in two different ways. In the first, the Sequences_Extractor.pl script from [3] was used to separately extract from the H. influenzae and A. pleuropneumoniae genomes all the 50 bp sequences containing the sequence GCGGT at position 21 (the most conserved nucleotides of the pasteurellacean DUES, see [3]). An over-representation score was calculated for each nucleotide at each position according to the formula EQ1 = nucs/seqs*AT/2 for As and Ts, and EQ2 = nucs/seqs*GC/2 for Gs and Cs, where nucs is the total number of the particular nucleotide in the position in question, seqs is the number of sequences, and AT and GC are the A + T and G + C frequencies in the genome, respectively. The second way was to analyze nucleotide over-representation in the sets of all the 11 bp sequences that carry a mismatched DUES at any of the positions 1 to 10 (the 11th, extra, DUES position was at the 3′ end of the sequence). I hence used the Sequences_Extractor.pl script from [3] and separately extracted from the H. influenzae and A. pleuropneumoniae genomes all the 11 bp sequences containing a mismatched DUES. For this latter analysis I first calculated the expected number of sequences at each mismatch level using Eq. (1) in [3], henceforth EQ3. Then, the expected number of nucleotides at each position was calculated for As and Ts and for Gs and Cs, both as matches and as mismatches, in both the AT and the GC positions of the DUES. For that, I calculated the number of sequence categories at each mismatch level as EQ4 = (n)!/(m!(n-m)!). I then calculated the expected number of matches per position of the DUES at each mismatch level as EQ5 = ((n-1)!/(m!(abs((n-1)-m))!)). The expected number of A or T matches in each A or T position was then calculated as EQ6 = EQ3(EQ5/EQ4)(1 + ((AT-GC)/4)). The expected number of G or C matches in each G or C position was EQ7 = EQ3(EQ5/EQ4)(1-((AT-GC)/4)). The expected number of A or T mismatches in each A or T position of the DUES at each mismatch level was calculated as EQ8 = (EQ3-EQ6)((AT/2)/(1-(AT/2))). The expected number of A or T mismatches in each G or C position was EQ9 = (EQ3-EQ7)((AT/2)/(1-(GC/2))). The expected number of G or C mismatches in each A or T position was EQ10 = (EQ3-EQ6)((GC/2)/(1-(AT/2))). Finally, the expected number of G or C mismatches in each G or C position was EQ11 = (EQ3-EQ7)((GC/2)/(1-(GC/2))).
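The chain from EQ4 to EQ11 can be written compactly as a single function. The sketch below is only an illustration: the function name is hypothetical, EQ3 is passed in as a number because its formula is defined in [3] and not reproduced here, n is taken to be the DUES length and m the number of mismatches, and EQ11 is computed from EQ3 and EQ7 by analogy with EQ9, which is an assumption.

```python
from math import factorial


def expected_position_counts(eq3, n, m, at, gc):
    """Expected matches/mismatches per DUES position (EQ4 to EQ11).

    eq3: expected number of sequences at mismatch level m (from [3])
    n:   DUES length (assumed meaning); m: number of mismatches
    at, gc: genomic A+T and G+C frequencies
    """
    eq4 = factorial(n) / (factorial(m) * factorial(n - m))
    eq5 = factorial(n - 1) / (factorial(m) * factorial(abs((n - 1) - m)))
    eq6 = eq3 * (eq5 / eq4) * (1 + (at - gc) / 4)   # A/T matches, A/T position
    eq7 = eq3 * (eq5 / eq4) * (1 - (at - gc) / 4)   # G/C matches, G/C position
    eq8 = (eq3 - eq6) * ((at / 2) / (1 - at / 2))   # A/T mismatches, A/T position
    eq9 = (eq3 - eq7) * ((at / 2) / (1 - gc / 2))   # A/T mismatches, G/C position
    eq10 = (eq3 - eq6) * ((gc / 2) / (1 - at / 2))  # G/C mismatches, A/T position
    eq11 = (eq3 - eq7) * ((gc / 2) / (1 - gc / 2))  # G/C mismatches, G/C position (assumed EQ3)
    return {"EQ4": eq4, "EQ5": eq5, "EQ6": eq6, "EQ7": eq7,
            "EQ8": eq8, "EQ9": eq9, "EQ10": eq10, "EQ11": eq11}


# Hypothetical inputs: 1000 expected sequences, 10 bp DUES, 2 mismatches.
example = expected_position_counts(1000, 10, 2, 0.5, 0.5)
```

For m < n, the ratio EQ5/EQ4 reduces to (n − m)/n, the fraction of positions expected to match at that mismatch level.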
An estimate of the over-representation of each nucleotide type at each position of the DUES at each mismatch load was based upon the relative excess (obs-exp)/exp, where obs and exp are the observed and expected numbers of the nucleotide in question at the particular position and mismatch level. The overall genomic over-representation of each nucleotide at each DUES position is therefore EQ12 = $\sum_{m=1}^{n} (obs_m - exp_m)/exp_m$, that is, the relative excess summed over the mismatch levels m.
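As a small worked illustration of this last step, the per-level relative excess and its sum over mismatch levels could be computed as below. The function names and the numbers are hypothetical.

```python
def relative_excess(obs, exp):
    """Over-representation at one mismatch level: (obs - exp) / exp."""
    return (obs - exp) / exp


def overall_over_representation(observed_by_level, expected_by_level):
    """Sum of the relative excess over all mismatch levels (EQ12)."""
    return sum(
        relative_excess(o, e)
        for o, e in zip(observed_by_level, expected_by_level)
    )


# Hypothetical counts of one nucleotide at one DUES position,
# for mismatch levels m = 1 and m = 2.
total = overall_over_representation([150, 80], [100, 100])
```

A positive total indicates a nucleotide that is, overall, more frequent than expected at that position; a negative total indicates the opposite.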

Distribution of the DUES-like sequences
For this analysis the k-mer sizes 10 and 12 bases were considered for the Pasteurellacean and Neisseriacean DUESs, respectively, as data from [3] seemed to indicate that these are the likely core sequence lengths of the DUESs in species from those two families.
The results (Table 1) show that while no DUES-like sequence was noticeably over-represented in the examined non-Pasteurellacean and non-Neisseriacean genomes, most but not all Pasteurellacean and Neisseriacean species do have DUESs.
Two 10 bp DUES-like variants could be identified among the Pasteurellacean genomes analyzed here. Their frequencies vary between species, and their genomic densities ranged between 0.036% of the genome of Haemophilus pittmaniae HK 85 ctg1129913985426 and 0.338% of the Aggregatibacter segnis ATCC 33393 genome, about a 10-fold difference between the least and the most dense DUES-containing Pasteurellacean genomes (Table 1). There was no ambiguity as to the nature of the most over-represented DUES-like sequence in the Pasteurellacean genomes that contain the DUES type AAAGTGCGGT (henceforth called H. influenzae-type DUES, in accordance with [19]). In the Pasteurellacean genomes that contain AACAAGCGGT (henceforth called A. pleuropneumoniae-type DUES1, in accordance with [19]), that DUES was consistently closely followed (in terms of frequency) by another sequence, ACAAGCGGTC (which I will henceforth call A. pleuropneumoniae-type DUES2); the sequence of the DUES of these genomes is thus ambiguous. When compared to the H. influenzae-type DUES, the A. pleuropneumoniae-type DUES1 has the same number of differences (four) as the A. pleuropneumoniae-type DUES2. Parsimony hence does not allow discarding either of these sequences as the potential real DUES of the corresponding species.
For the Neisseriaceae species however, and in wide accordance with the data in Frye et 2013) as the DUES-like sequence in that species (AG-simDUS), and which I found in just 857 copies in the same genome. As in the case of the DUES-containing Pasteurellacean genomes, the densities of the DUES-like sequences were very different between Neisseriacean species and ranged between 0.063% of the Lutiella nitroferrum genome and 1.06% of the Kingella oralis ATCC 51147 genome, about a 17-fold difference between the least and most dense DUES-containing Neisseriacean genomes (Table 2).
From Fig. 2 one can see how, in both bacterial families, the overall results were similar between the finished and unfinished genomes used for the current work. Nonetheless, noticeable differences in genomic DUES densities can be observed both between the two bacterial families and between the DUES types in each family. The genomes of the DUES-containing Neisseriaceae species seem to have a higher DUES density than the genomes of the DUES-containing Pasteurellaceae species, so that DUES-containing Neisseriaceae species, as a whole, show over 60-fold more over-representation of the DUES-like sequences than DUES-containing Pasteurellaceae species. The latter show higher densities for the H. influenzae-type of DUES than for the A. pleuropneumoniae-types. In the case of the DUES-containing Neisseriaceae, the N. gonorrhoeae-type of DUES seems slightly less frequent than the other variants, with Kingella oralis ATCC 51147 being the Neisseriacean bacterium whose genome has the most DUESs. The Pasteurellaceae species Haemophilus pittmaniae HK 85 ctg1129913985426 is also notable for its very low DUES density.
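The density comparisons in this section can be illustrated with a short sketch. It assumes, purely for illustration, that the genomic density of a DUES is the percentage of the genome length covered by its copies (copies × sequence length / genome length); the text does not spell out the exact definition, and the function names are hypothetical.

```python
def density_percent(copies, kmer_length, genome_length):
    """Percentage of the genome covered by the copies of a k-mer
    (assumed definition of genomic density)."""
    return 100.0 * copies * kmer_length / genome_length


def fold_difference(highest, lowest):
    """Ratio between the highest and the lowest density."""
    return highest / lowest


# Densities reported in the text for the extreme genomes of each family.
pasteurellaceae_fold = fold_difference(0.338, 0.036)
neisseriaceae_fold = fold_difference(1.06, 0.063)

# Hypothetical example: 1000 copies of a 10 bp sequence in a 2 Mb genome.
example_density = density_percent(1000, 10, 2_000_000)
```

Under a different definition of density (for instance, counting both strands or overlapping copies differently), the absolute percentages would change but the fold differences between genomes would be computed the same way.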

Identification of the real DUES
Because of the preferential uptake of DNA fragments that contain the DUES (see the introduction section and the references therein), it was shown that natural competence for transformation would inevitably enrich the bacterial genome with the preferred DUES sequence [3,6].
The present analysis is based on the logical assumption that the real DUES would be the most over-represented sequence among the sequences contained in or containing the DUES-like sequence in a genome. I thus screened all the DUES-containing Pasteurellacean and Neisseriacean genomes for all the 3 to 20 base sequences that contain or are contained in the corresponding DUES-like sequence. This allowed estimating the degree of over-representation of each sequence at each of those sequence (window) size levels.
For the DUES-containing Pasteurellaceae species, the over-representation values kept increasing with the size of the sliding window up to the 10 base window size. The result was the same for both the H. influenzae-type and the A. pleuropneumoniae-type DUES variants (Fig. 3a and Supplemental Table 1).
In the case of the Neisseriaceae species, however, the picture was more complex, although the patterns were similar to those observed for the analyzed Pasteurellaceae genomes. Here also, the over-representation values kept increasing with the size of the sliding window up to a limit. However, in the Neisseriaceae case, that limit differed between genomes (i.e., between DUES-like sequences). It ranged between 14 and 19 bases (Fig. 3b, Table 3 and Supplemental Table 2). It is notable that the seemingly different sizes of the Neisseriacean DUESs affect even the same core DUES-types so that, for instance, the N. gonorrhoeae and N. lactamica DUES seems to be the sequence AAAAATGCCGTCTGAAAC whereas the N. meningitidis DUES seems to be the sequence AAAAATGCCGTCTGAAA, although these species show the same 12 bp core sequence identified in [12] and confirmed here in Table 2.
It is relevant to note that the same analysis, when applied to non-DUES sequences in DUES-lacking genomes (negative controls), did not show the size-correlated increase in sequence over-representation seen for the sequences containing or contained in the DUESs of the Pasteurellacean and Neisseriacean genomes (Fig. 3c).

Phylogenetic relationships between DUESs and inference of their ancestral sequences
The 16S maximum likelihood phylogenetic tree of the different Pasteurellales and Neisseriales strains and species for which 16S sequences were available in the database showed a clear separation between the members of both orders, and within-order and within-family arrangements in accordance with the taxonomic identification of the bacteria included in the analysis (Fig. 4a). When it comes to the DUESs, the tree also shows a notable concordance between the phylogenetic distribution of the species and the DUES variants that these harbor. As a result, a marked clustering of the different DUES variants can be observed both in the Pasteurellaceae and in the Neisseriaceae families. These results are even clearer if the phylogenetic tree is built using only the 16S sequences of the species found here to carry DUESs (Fig. 4b). That tree shows how the species of the Pasteurellacean family can be divided into five different clades. The largest one is composed mainly of species harboring the H. influenzae-like DUES, plus H. parasuis and H. parainfluenzae, which harbor the A. pleuropneumoniae-type DUESs, while the four other clades, with fewer species, contain only species with the A. pleuropneumoniae-type DUESs. No DUES dendrogram can be built for just two Pasteurellacean DUESs, and the obvious consensus of both sequences is NAMRWGCGGTN (Fig. 4c). The Neisseriaceae family, however, shows dispersion of the different DUESs, with the exception of the N. gonorrhoeae- and N. macacae-types, which appear clustered in their respective clades (Fig. 4b). The Neisseria bacilliformis-type DUES, despite its prevalence and clustering, appears in clearly separated branches. That DUES-type also appears at one extreme (i.e., the possible origin) of a dendrogram of the Neisseriacean DUESs (Fig. 4d), and the inferred consensus sequence of the Neisseriacean DUESs seems to be NAAAAAGGCYGYCTGAAAAC (Fig. 4c).
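A consensus such as NAMRWGCGGTN can be derived column by column from aligned sequences using the IUPAC ambiguity codes. The sketch below is illustrative only: the function name is hypothetical, gapped columns are mapped to N as an assumption, and the three-row alignment (the H. influenzae-type DUES plus the two A. pleuropneumoniae-type variants, padded with gaps) is one alignment that is consistent with the reported consensus, not necessarily the one used for Fig. 4c.

```python
# IUPAC nucleotide ambiguity codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def iupac_consensus(aligned):
    """Column-wise IUPAC consensus of equal-length aligned sequences.
    Columns containing a gap ('-') are reported as N (assumption)."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must have the same length")
    consensus = []
    for column in zip(*aligned):
        if "-" in column:
            consensus.append("N")
        else:
            consensus.append(IUPAC[frozenset(column)])
    return "".join(consensus)


# Hypothetical alignment of the Pasteurellacean DUES variants.
alignment = [
    "AAAGTGCGGT-",  # H. influenzae-type
    "AACAAGCGGT-",  # A. pleuropneumoniae-type DUES1
    "-ACAAGCGGTC",  # A. pleuropneumoniae-type DUES2
]
pasteurellacean_consensus = iupac_consensus(alignment)
```

With this alignment the function returns NAMRWGCGGTN, which matches the consensus given in the text.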
In order to infer the ancestral sequence of the Pasteurellacean species, an analysis of the over-representation of each nucleotide along each of the 11 potential ancestral DUES positions (as inferred from the Pasteurellacean DUES consensus sequence) was carried out for all the current mismatched DUESs in the corresponding H. influenzae and A. pleuropneumoniae genomes. As Table 4 shows, the sequences inferred from the nucleotide over-representation values in each of the analyzed genomes are highly concordant with the real DUES, especially in the GCGG part. The result indicates that AAAAAGCGGT seems to be, or to contain, the ancestral Pasteurellacean DUES sequence.

Discussion
It is known that, apart from a few exceptions (such as Azotobacter vinelandii [24], Campylobacter coli [25] and Pseudomonas stutzeri [26]), specificity of the DNA uptake by naturally competent bacteria depends on the presence in the extracellular DNA of defined sequences, called DNA Uptake Enhancing Sequences (DUESs). It is also known that such bias is not strict, so that mismatched DUESs are also taken up, although less efficiently (see [3]). Previous works demonstrated that mutations generate new "perfect" and mismatched DUESs that end up accumulating in the corresponding genome due to the biased uptake, by competent bacteria, of the DNA that contains these sequences [3,6]. The mechanism responsible for such accumulation was suggested to resemble a molecular drive that gradually "grows" the preferred DUES in regions of the genome where these sequences would not be disruptive enough to affect the cell's fitness [3,6].
Over-representation of specific short DNA sequences in the genome, understood as the difference between the expected and the observed numbers in a genome, could therefore be used to identify the actual sequence of the preferred DUES. The logic that I applied here is based on the expectation that adding an actual DUES nucleotide to the right position of a partial DUES would result in a more complete and, thus, more over-represented sequence in the genome, while adding a nucleotide that does not form part of the actual DUES to a partial or complete DUES would result in a less "perfect" DUES and, thus, a less over-represented sequence. This way one can identify the actual DUES as the DUES-like sequence that loses over-representation in the genome if we remove any nucleotide from it or add any nucleotide to it. As to the inferences on competence itself, it is legitimate to take the over-representation of the DUES as an indicator of the likelihood of a bacterium being competent for uptake of specific DNA, of the efficiency of the DNA uptake, and of the frequency of the competence and DNA uptake episodes (i.e., of how competent the bacterium is). The logic here is that a bacterium that ceases to be competent would see the DUESs degenerate and disappear in parts or all of its genome. Conversely, the more often a bacterium becomes competent, the more DNA it takes up, and the stronger the bias of its DNA uptake towards a DUES, the more DUESs its genome would accumulate. With such logic in mind, and according to the results of the current work, competence for DUES-biased DNA uptake seems confined to species from the Pasteurellaceae and Neisseriaceae families, since over-representation of short DUES-like DNA sequences was not found in the sequenced genomes of any bacterium belonging to families other than Pasteurellaceae and Neisseriaceae, not even other Pasteurellales or Neisseriales families.
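The identification criterion described above (the actual DUES loses over-representation when a nucleotide is removed from it or added to it) amounts to a local-maximum test. The sketch below is a hedged illustration: the function name is hypothetical, sequences missing from the score table are treated as having zero over-representation (an assumption), and the scores are invented for the example.

```python
def is_local_peak(seq, score):
    """True if `seq` is more over-represented than every sequence
    obtained by trimming one end base or adding one base at either end.

    score: dict mapping sequence -> over-representation value
    (sequences absent from the dict are assumed to score 0).
    """
    neighbours = [seq[1:], seq[:-1]]
    neighbours += [base + seq for base in "ACGT"]
    neighbours += [seq + base for base in "ACGT"]
    return all(score.get(nb, 0.0) < score[seq] for nb in neighbours)


# Hypothetical over-representation values around a 10 bp candidate.
scores = {
    "AAGTGCGGT": 50.0,
    "AAAGTGCGG": 40.0,
    "AAAGTGCGGT": 90.0,
    "AAAAGTGCGGT": 60.0,
    "AAAGTGCGGTC": 55.0,
}
ten_mer_is_peak = is_local_peak("AAAGTGCGGT", scores)
nine_mer_is_peak = is_local_peak("AAGTGCGGT", scores)
```

In this toy table the 10-mer passes the test, while the 9-mer fails because extending it by one A gives a more over-represented sequence.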
I want to draw attention to the fact that there was no difference in the density of the DUESs between the finished and unfinished genomic sequences used for this work, meaning that the unfinished sequences considered here are representative samples of the corresponding full genomes. This is also supported by the inferred proportion of the genomes that was sequenced (all the analyzed sequences were over 1.8 Mb long, for genome sizes that range between 2 and 6 Mb). Still, I cannot fully discard potential, although unlikely, biases in the availability of genomes that might have led to the current results. For instance, I cannot discard that there could be one or more non-pasteurellacean species that harbor DUESs but whose genomes have not been sequenced. Neither can I discard the possibility that sequencing was more biased towards DUES-harboring species in the Neisseriaceae family than in the Pasteurellaceae. Similarly, I cannot discard the possibility that bias in the sequenced species could be behind the differences between the Neisseriaceae and Pasteurellaceae results. Still, with those cautionary observations made, there are reasons to think that such biases are not likely to be the cause of the results reported in this work. The fact that the number of species analyzed here is similar between the two bacterial families argues against any significant effect of a potential bias in data availability on the results obtained here.

Table 3
The prevalence and full DUES sequences corresponding to each of the core sequences identified for the Neisseriaceae species in Table 2.

Sorting the analyzed species by the density of the DUESs in their genomes should reflect the efficiency of their DNA uptake bias and the frequency of the episodes of their spontaneous competence for transformation. With that in mind, two direct aspects of the results reported here on the species distribution of the DUESs deserve highlighting. The first is that the results show higher genomic densities of the DUESs in the Neisseriaceae species than in the Pasteurellaceae species, suggesting that bacteria of the first family have a higher susceptibility to become competent and/or a more efficient DNA uptake (i.e., they take up more DNA) than bacteria of the latter family. In such a case, competent species of the Neisseriaceae could have a higher propensity for horizontal gene transfer and its consequences. The second is the higher diversity of the DUESs in the Neisseriaceae family than in the Pasteurellaceae. The Neisseriacean DUESs seem to differ not only in their nucleotide composition but also in their sizes. In fact, while each of the two Pasteurellaceae DUESs is 10 bp long, the eight Neisseriaceae sequence types have sizes ranging between 14 and 19 bp. This result is also concordant with, and reinforces, the conclusion that the Neisseriaceae are more competent and/or more efficient at DNA uptake than the Pasteurellaceae. Conversely, and probably more interestingly, given that differences in DUESs would hamper inter-species uptake of DNA, the results of this work suggest that inter-specific DNA uptake, and thus cross-species movement of DNA, seems to be more frequent and efficient between Pasteurellaceae species than between Neisseriaceae species. The situation therefore seems to reflect some sort of "compromise" between the efficiency of competence for transformation by DUES-biased DNA uptake and the diversification of the DUESs. One can interpret that bacterial species of the "more competent" taxon have different DUESs in order to prevent excessive cross-species movement of DNA, whereas species of the "less competent" taxon have fewer DUES types because their less efficient and/or less frequent competence state makes the need for barriers against uptake of DNA from other species too weak to require (select for) the evolution of new DUES types. Of course, this is a purely evolutionary way of interpreting the results. A mechanistic interpretation of such a result (i.e., more frequent and more diverse DUESs in one family compared to the other) could be that there might be more DNA uptake and a higher mutability and/or lower stabilizing selection (and thus more diversification) of the molecule (protein) that binds the extracellular DUES in species of the Neisseriaceae family than in Pasteurellaceae species. Could there be a relation between the efficiency or frequency of the DUES-biased DNA uptake events (competence) and the appearance of new DUES types?

The answer is very likely yes! Taking as an indicator the fact that H. influenzae shows a growth-phase dependent transformation frequency of 10⁻⁴ [27] while N. gonorrhoeae has a growth-phase independent transformation frequency of ~2 × 10⁻¹ [28], it is legitimate to conclude that the in silico-based data and interpretations reported here seem to suggest that, while within-species uptake of DNA (uptake of conspecific DNA) seems more likely for the seemingly more competent Neisseriaceae species, cross-species uptake of DNA is more likely between the seemingly less competent Pasteurellaceae species. It is therefore possible that the appearance, maintenance and accumulation of new DUES types become selected for when the transformation efficiency and inter-species movement of DNA surpass some threshold; DUESs could in such a case be seen as barriers against excessive inter-specific DNA exchange. Such evolutionary dynamics would necessarily be based on mutability and selection of the protein that binds the extracellular DNA.
In line with the interpretation made above, the results show how the A. pleuropneumoniae-type DUES-harboring species have two different DUES-like sequences with very similar densities, probably because of some higher flexibility in the specificity of the DNA uptake bias (i.e., in the DNA binding by the receptor) of the competent cells of those bacteria. The results also show that neither all the Pasteurellaceae species nor all the Neisseriaceae species have DUESs. Thus, either the DUES-lacking species have lost the DUESs, very likely after losing the capacity to become spontaneously competent for DNA uptake, or the DUES-harboring species have acquired these sequences while evolving DNA uptake bias concomitant with or posterior to the evolution of competence. The fact that, within each of these two DUES-harboring families, the genomes of most of the species do contain potential DUESs means that biased DNA uptake is ancestral in each of these families; the probability of DNA uptake bias and DUESs being acquired independently (in parallel) in so many species is negligible. DUESs, and very probably competence before them, have therefore very likely been lost in some Pasteurellaceae and Neisseriaceae species, rather than having evolved independently in so many competent and DUES-harboring bacteria.
The Pasteurellales and Neisseriales 16S phylogenetic tree highlights the absence of DUESs in non-neisseriacean and non-pasteurellacean species. Together with the 16S tree of the DUES-harboring species, it supports the ancestral nature of the neisseriacean and pasteurellacean DUESs in their respective families, as these sequences appear scattered throughout the respective clades (i.e., the Pasteurellaceae and the Neisseriaceae clades). Such a distribution is more likely explained by common ancestry than by an unlikely independent evolution. Given the association of DUESs with biased uptake of DNA, we can infer that the biased uptake of DNA is ancestral in both bacterial families. While the Pasteurellales branch of the tree shows only Pasteurellaceae species and, as a result, DUES-harboring species in all its clades, the Neisseriales tree shows subdivision into two large clades, one containing DUES-harboring species and the other completely lacking DUES-harboring bacteria. While the absence of 16S sequences from non-Pasteurellaceae species explains the Pasteurellales branch, the absence of DUESs in the genomes of non-Neisseriaceae species explains the Neisseriales branch. The scattering of the DUESs through the Pasteurellaceae and Neisseriaceae clades suggests that DNA uptake, or at least its bias towards DUESs, is not posterior to the diversification of either of these families. The lack of DUESs in the genomes of Pasteurellales and Neisseriales species other than Pasteurellaceae and Neisseriaceae, if not an unlikely product of a sample size issue, would suggest that the emergence of the DNA uptake bias towards the DUESs might have been concomitant with, or just after, the split of these families from their sister species (i.e., the Pasteurellaceae ancestor and the Neisseriaceae ancestor evolved DNA uptake bias and DUESs).
The presence of A. pleuropneumoniae-type DUES-harboring species in the H. influenzae-type DUES-harboring part of the tree, and the absence of H. influenzae-type DUES-harboring species in the A. pleuropneumoniae-type DUES-harboring part of the tree, seems to suggest that the A. pleuropneumoniae-type DUES could be the ancestral pasteurellacean DUES. However, (i) the A. pleuropneumoniae-type DUES-harboring part of the tree is small, (ii) there are only two DUES types in the tree, (iii) only two species appear in a part of the tree that is not congruent with the DUES type that these species harbor, and (iv) there is no clear clade of species harboring either of the two Pasteurellaceae DUES types. All this makes it safer to consider that the 16S tree does not allow pinpointing, with enough confidence, either of the two pasteurellacean DUESs as the ancestral one. For its part, the Neisseriaceae tree points towards either the N. bacilliformis-type or the K. denitrificans- or E. corrodens-types of DUES as possibly ancestral (because they are the Neisseriaceae DUES types that appear in separate parts of the tree). Not only is the dispersion of the N. bacilliformis-type-harboring species in the tree higher, but the unrooted dendrogram of the Neisseriaceae DUESs also seems to support that DUES type as the probable ancestral sequence for the Neisseriaceae. The consensus sequence of all the Neisseriaceae DUES types also shows clear resemblance to the N. bacilliformis-type DUES.
If the DNA uptake bias is ancestral, as concluded here, there must be an ancestral DUES for each family. While the most likely ancestral DUES of the Neisseriaceae species could be inferred from phylogeny and sequence similarities, these results do not allow pinpointing which Pasteurellaceae DUES type could be the ancestral one, or whether both emerged from an extinct ancestral sequence. I therefore opted for a different strategy, based on the logic that the ancestral sequence could have left its footprint in the pasteurellacean genomes no matter their current DUES. I looked for such a footprint at each position of the DUES. When a position of the aligned sequences of both Pasteurellaceae DUES types shows the same nucleotide, the decision is simple: that nucleotide is ancestral. However, because the nucleotides differ at some positions of the different pasteurellacean DUESs, I screened the over-representation levels of each of the four nucleotides in each of the ten positions of the DUESs and their mismatched forms. This way I could reveal the second most important (over-represented) nucleotide after the DUES nucleotide at each position of the pasteurellacean DUESs. When the second most important nucleotide at a position of one DUES type is the same as the most important nucleotide at the same position of the other DUES type, that nucleotide is considered the most likely ancestral one. The results of this search for a footprint of an ancestral sequence suggest that the ancestral pasteurellacean DUES is likely AAAAAGCGGTN. The results of this analysis also highlight the GCGG sequence as the most conserved core of the pasteurellacean DUES, just as in silico and wet-lab experiments have demonstrated [3]. Given that the logic and method used here could be applied to the analysis of other sequences (protein binding, transcription factor binding, repeated…), and hoping that it might be inspiring and helpful to fellow scientists, I have to highlight that its success depends on the compared species and sequences not being too distant (i.e., divergence is expected to significantly erase possible old footprints in the genome).
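The position-by-position decision rule described above can be illustrated with a minimal Python sketch. It is not the code used for this work: the function names, the input layout (one dictionary of over-representation scores per aligned position and per DUES type), the toy scores, and the choice of writing N for positions where the rule gives no single answer are all assumptions made for illustration.

```python
from typing import Dict, List

NUCLEOTIDES = "ACGT"

# Assumed input layout: one dict per aligned DUES position,
# mapping each nucleotide to its over-representation score.
Profile = List[Dict[str, float]]


def rank_nucleotides(scores: Dict[str, float]) -> List[str]:
    """Return the four nucleotides sorted from most to least over-represented.

    Ties keep the A, C, G, T order (an arbitrary choice of this sketch).
    """
    return sorted(NUCLEOTIDES, key=lambda n: scores.get(n, 0.0), reverse=True)


def infer_ancestral(profile_a: Profile, profile_b: Profile) -> str:
    """Apply the footprint rule at each position; 'N' marks undecided positions."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same aligned positions")
    ancestral = []
    for scores_a, scores_b in zip(profile_a, profile_b):
        top_a, second_a = rank_nucleotides(scores_a)[:2]
        top_b, second_b = rank_nucleotides(scores_b)[:2]
        if top_a == top_b:
            # Both DUES types agree: that nucleotide is taken as ancestral.
            ancestral.append(top_a)
            continue
        # The runner-up in one type matching the top nucleotide of the
        # other type is read as a footprint of that nucleotide.
        a_supported = second_b == top_a
        b_supported = second_a == top_b
        if a_supported and not b_supported:
            ancestral.append(top_a)
        elif b_supported and not a_supported:
            ancestral.append(top_b)
        else:
            # No footprint, or conflicting footprints (assumption: report N).
            ancestral.append("N")
    return "".join(ancestral)


# Hypothetical toy scores for three positions (not data from this study).
toy_a = [
    {"A": 10, "C": 1, "G": 2, "T": 0},
    {"A": 1, "C": 0, "G": 9, "T": 4},
    {"A": 3, "C": 8, "G": 1, "T": 0},
]
toy_b = [
    {"A": 7, "C": 0, "G": 1, "T": 2},
    {"A": 1, "C": 3, "G": 0, "T": 9},
    {"A": 1, "C": 0, "G": 8, "T": 5},
]
print(infer_ancestral(toy_a, toy_b))
```

In this toy input the first position agrees between the two types, the second shows a footprint of the other type's nucleotide, and the third shows none, so the three cases of the rule are each exercised once.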
To conclude, here I show that the DUESs, and hence the biased DNA uptake, seem confined to the Pasteurellaceae and Neisseriaceae families. Most, but not all, bacteria of these families harbor DUESs, suggesting that the species of these families are mainly competent and have DNA uptake bias towards DUES-containing DNAs. Within each family, the densities of DUESs are very different between species, implying differences in the efficiency of the DNA uptake or its bias. One can infer that there seems to be a sort of compromise between the efficiency of the DNA uptake (or its bias) and the variety of DUESs that a group of bacteria shows: the Pasteurellaceae, which have few DUES types, would have less efficient uptake, whereas the Neisseriaceae, which have more DUES types, would have more efficient uptake, and their diverse DUESs might very likely have evolved in order to prevent excessive cross-species passage of DNA. One could speculate that this situation suggests the existence of thresholds for allowing cross-species transfer of DNA. Indeed, it is logical to expect mechanisms that limit horizontal DNA (gene) transfer, as recombination with distant DNAs could be harmful and excessive permeability to exogenous DNAs would ultimately erase species boundaries. Finally, I suggest a way to detect footprints of

Table 4
The possible ancestral DUES of the Pasteurellaceae species as inferred from the nucleotide over-representation values in the mismatched Flu-type and Pleu-type DUESs.

Fig. 1. Schematic representation of the main steps of the workflow leading to the identification and study of the Pasteurellaceae and Neisseriaceae DUESs.

Fig. 2. Density of the DUES-like sequences in the analyzed genomes of the Pasteurellaceae and Neisseriaceae species. X-axis: Species, Y-axis: Percentage of the genome covered by the DUES nucleotides.

Fig. 3. Distribution of the over-representation levels (y-axis) of 3 to 20 base sequences (x-axis) that are contained in or that contain the DUES in the genomes of the Pasteurellaceae (A), Neisseriaceae (B) and control species (C). X-axis: Length of the sequence containing or contained in the DUES, Y-axis: Sequence over-representation value calculated using the Chi-squared formula (see material and methods).

Fig. 4. Maximum likelihood phylogenetic tree of the 16S ribosomal DNA of the Pasteurellales and Neisseriales species (a) and of the species whose genomes harbor DUESs (b), shown in different colors in (a). Each color reflects the same DUES type and the brackets in (b) highlight clustering. In (c) is the dendrogram of the Neisseriaceae DUESs with Pseudogulbenkiania (Lutiella nitroferrum) as Neisseriales outgroup and in (d) are the alignments of these sequences.

Table 1
Sequence and over-representation of the core DUES-like sequences in the finished (F) and unfinished (U) genomes of the Pasteurellales species. No DUES-like sequence was detected in Pasteurellales species other than Pasteurellaceae. Obs.: Observed number of DUESs in the genomic sequence, Exp.: Expected number of DUESs in the genomic sequence, Percent: Percentage of the genomic sequence covered by DUESs.
al. (2013), nine different 12-bp DUES-like sequences were identified. However, there are two potential discrepancies between the present work and Frye et al.'s (2013) data: (i) one is the apparent presence of a DUES-like sequence in Lutiella nitroferrum (reported in the current work but not in Frye et al. 2013) and (ii) the second discrepancy is the fact that AGGCAGCCTGAA, reported as AG-kingDUS in Frye et al. 2013, is more frequent and over-represented in the genome of Simonsiella muelleri ATCC 29453 (924 sequences) than AGGCTGCCTGAA, reported by Frye et al. (

Table 2
Sequence and over-representation of the core DUES-like sequences in the finished (F) and unfinished (U) genomes of the Neisseriales species. No DUES-like sequence was detected in Neisseriales species other than Neisseriaceae. Obs.: Observed number of DUESs in the genomic sequence, Exp.: Expected number of DUESs in the genomic sequence, Percent: Percentage of the genomic sequence covered by DUESs. a: Both are the same species and the sequence does not resemble Neisseriaceae DUESs.

Obs.: Observed number of DUESs in the genomic sequence, Exp.: Expected number of DUESs in the genomic sequence.
