Natural selection and recombination at host-interacting lipoprotein loci drive genome diversification of Lyme disease and related bacteria

ABSTRACT Lyme disease, caused by spirochetes in the Borrelia burgdorferi sensu lato clade within the Borrelia genus, is transmitted by Ixodes ticks and is currently the most prevalent and rapidly expanding tick-borne disease in Europe and North America. We report complete genome sequences of 47 isolates that encompass all established species in this clade while highlighting the diversity of the widespread human pathogenic species B. burgdorferi. A similar set of plasmids has been maintained throughout Borrelia divergence, indicating that they are a key adaptive feature of this genus. Phylogenetic reconstruction of all sequenced Borrelia genomes revealed the original divergence of Eurasian and North American lineages and subsequent dispersals that introduced B. garinii, B. bavariensis, B. lusitaniae, B. valaisiana, and B. afzelii from East Asia to Europe and B. burgdorferi and B. finlandensis from North America to Europe. Molecular phylogenies of the universally present core replicons (chromosome and cp26 and lp54 plasmids) are highly consistent, revealing a strong clonal structure. Nonetheless, numerous inconsistencies between the genome and gene phylogenies indicate species dispersal, genetic exchanges, and rapid sequence evolution at plasmid-borne loci, including key host-interacting lipoprotein genes. While localized recombination occurs uniformly on the main chromosome at a rate comparable to mutation, lipoprotein-encoding loci are recombination hotspots on the plasmids, suggesting adaptive maintenance of recombinant alleles at loci directly interacting with the host. We conclude that within- and between-species recombination facilitates adaptive sequence evolution of host-interacting lipoprotein loci and contributes to human virulence despite a genome-wide clonal structure of its natural populations. IMPORTANCE Lyme disease (also called Lyme borreliosis in Europe), a condition caused by spirochete bacteria of the genus Borrelia, transmitted by hard-bodied Ixodes ticks, is currently the most prevalent and rapidly expanding tick-borne disease in the United States and Europe. Borrelia interspecies and intraspecies genome comparisons of Lyme disease-related bacteria are essential to reconstruct their evolutionary origins, track epidemiological spread, identify molecular mechanisms of human pathogenicity, and design molecular and ecological approaches to disease prevention, diagnosis, and treatment. These Lyme disease-associated bacteria harbor complex genomes that encode many genes that do not have homologs in other organisms and are distributed across multiple linear and circular plasmids. The functional significance of most of the plasmid-borne genes and the multipartite genome organization itself remains unknown. Here we sequenced, assembled, and analyzed whole genomes of 47 Borrelia isolates from around the world, including multiple isolates of the human pathogenic species. Our analysis elucidates the evolutionary origins, historical migration, and sources of genomic variability of these clinically important pathogens. We have developed web-based software tools (BorreliaBase.org) to facilitate dissemination and continued comparative analysis of Borrelia genomes to identify determinants of human pathogenicity.


Table of contents
Table S1A Page 2 Isolates whose genome sequences were determined in this study Table S1B Page 3 Other genomes used in this study Table S2A Page 4 Plasmids sequenced in this report (part 1) Table S2B Page 5 Plasmids sequenced in this report (part 2) Table S3 Page 7 Loci with high recombination rates in the genus Borreliella Table S4 Page 9 Introgressed loci analysis with four-taxon D-statistics Table S5A Page 10 Divergence times and ancestral population sizes Table S5B Page 11 Biogeographic models tested using BioGeoBears Figure S1A   Table S2A.Plasmids sequenced in this report (part 1) Table S2 Plasmids sequenced in this study

Table S2 legend. Plasmids sequenced in this report
The table shows the presence and absence of the 35 plasmid types (columns) in the 47 Borreliella genomes sequences presented in this report (rows).Plasmid names lp28-10 and cp32-2 are not used for historical reasons; no lp28-11 plasmids were found in these genomes.Some cp9 and all lp5 plasmids carry no PFam32 gene, and the only partitioning protein that the lp5 plasmids encode is a PFam57 protein.B. burgdorferi isolate 80a carries two cp9's with different PFam57 proteins.The lp5 in isolate CA446 is about 17 kbp in length and encodes a cp32-like PFam57 protein that may impart different compatibility from lp5 itself.

Table S2 footnotes:
# Chromosome is in multiple contigs.* These plasmid sequences are not complete.The sequence does not extend to the tip of one or more telomeres (in most cases comparison to closely related plasmids suggests that less than 200 bp are missing from these linear plasmid termini.In other incomplete cases contigs from plasmids that were predicted to be circular (e.g., cp32-like plasmids) had no telomere consensus and no terminal direct repeat overlap and so were not circularizable.$ These plasmids have syntenic gene contents with the cp32 plasmid family; however, they encode an lp28-4 type PFam32 protein is usually encoded by on circular cp32 plasmid (see text).† These plasmids carry a PFam32 gene that is typical of, and named for a circular cp32 type on a linear lp32 plasmid.^ These are "inverted repeat plasmids" with two identical or nearly identical halves that are inverted relative to one another (see text).@ This plasmid was not present in the CA690 culture sequenced in this project, but was present in the CA690 culture sequenced independently by Margos et al. (2020) (accession number NZ_CP044542) where it was called "cp32-2" instead of "cp32-7".¢ This plasmid was not present in the 100B40 genome sequenced in Maryland.

Figure S2. Borreliella plasmid PFam32 protein neighbor-joining tree
Selected amino acid sequences from all the currently known Borreliella plasmid PFam32 protein types were aligned, and an unrooted neighbor-joining tree was constructed by Clustal X (Larkin et al., 2007).The different PFam32 types are highlighed with different background colors.
Bootstrap values from 1000 trials are shown above the branches, and branches with bootstrap values below 950 were collapsed to multi-branch points.A fractional distance scale bar is shown at the upper left.Plasmid names are indicated at the right of each branch in large text, and the selected Borreliella isolates carrying them are indicated in small text at the branch tips; red isolate names denote those with different geometry (linear or circular) from others in the same PFam32 type.Red stars mark the four new PFam32 types, cp32-14, lp28-12, -13 and -14 discovered in this study.The green star marks lp28-11 whose PFam32 relationship not been previously published (see Schwartz et al., 2021); it is the same type as the protein encoded by the second "orphan" PFam32 gene on strain B31 lp28-1 that was previously not given a PFam32 type name.
We also note that in the over Borreliella 900 plasmids that we have sequenced (this report and Casjens et al., 2017 and2018) there is only one instance of two apparently intact PFam32 genes of the same type in the same genome.In this unique and unsubstantiated instance B. yangtzensis Okinawa-CW61 contains two cp32-11 type PFam32 genes.Given the extreme rarity of such an occurrence, we suggest that the cp32-11 PFam32 gene or its expression in this isolate's cp32-9+11 plasmid might have been inactivated by point mutation, since this plasmid also carries a cp32-9 type PFam32 gene which could mediate its partition.

Figure S4A. Recombination rates and hotspots on the B. burgdorferi chromosome and plasmids cp26, lp17 and lp54
In the left panels black lines show the posterior mean per-site recombination rates (y-axis) estimated with LDhat (McVean et al., 2002) based on bi-allelic SNV segregation plotted against the location in the genome (x-axis) (see text), and red dotted lines show the 95% confidence intervals.In this figure, seven loci and one intergenic region (BB_0608-BB_0610) from chromosome, three loci from cp26, one locus from lp17 and two loci from lp54 with high mean recombination rates peaks are highlighted in orange color.Table S3 lists all the loci with high mean recombination rates located on the four replicons.

Figure S5. Phylogeny of the circular plasmid cp26
A. A phylogenic tree of cp26 plasmid sequences from 23 Borreliella species.
The tree was constructed by IQ-tree as described in the text and MATERIALS AND METHODS.B. burgdorferi is represented by a single cp26 (strain B31) for clarity of presentation.
All branches shown are supported by a bootstrap value of 80% or higher.The tree was rooted with     The top and bottom major branches show convincing parallel branching, indicating that there has been no transfer of cp26 between these lineages.Since the middle branches (blue lines) were collapsed to the leftmost multi-branch node in the chromosome tree (see legend to Figure 4), it is not possible to conclude that there was no transfer in these isolates, but the alignment shown is consistent with no transfer.Homologs of B. burgdorferi strain B31 lp54 "constant region" genes A01, A03, A74 and A76, respectively are shown as orange boxes.lp54s were examined from all the genomes of isolates in Tables S1A and S1B.In species listed without an isolate name, all isolates examined have very similar lp54s.(#) B. lusitaniae strain PoHL1 has only three PFam60 genes at its left end.Large asterisks indicate that one or more lp54s of that type have a complete telomere sequence that marks a bona fide plasmid end, and small asterisks mark pseudogenes.and Miyazaki4E have different center points; it is not known if this is due to independent origins or different deletions that occurred after a single ancestral duplication event.
In the assembled contigs of these sequences the right arms of the repeats are not full length.assemblies have the same length and are the same length as their orthologs in B31.However there are some differences that appear to be damage in one of the two arms.It is not known if such differences between the two arms are the result of decay of one member of some of the duplicated genes or, perhaps more likely, of sequencing errors that accumulate in one or the other of the two arms during the assembly process.et al. (1996) previously showed with Southern blot/restriction analysis that HO14 lp54 is in fact a head-to-head inverted repeat dimer, and as discussed above our sequence agrees with this observation.In that previous study we mapped nine restriction enzyme cleavage sites (symmetrically present in both halves of the repeat) made by six different enzymes that agree perfectly with the locations of these sites in the HO14 lp54 sequence.This is strong evidence that the inverted repeat in the HO14 lp54 sequence is not an assembly artifact, and that it accurately reflects the actual plasmid structure.Marconi et al. (1996) also showed that isolate IKA2 lp54 was also the size of an expected dimer, suggesting that this peculiar lp54 inverted repeat structure may well be a common feature of all B. japonica isolates.Six-frame ORF maps (formatted as described in figure S3) are shown for the B. burgdorferi lp17 plasmids.A typical isolate of each subtype is indicated on the left, and complete membership in each subtype is given in the key below (names in red are those sequences reported here).Selected apparently intact ORFs and pseudogenes are indicated in red and green, respectively.Background colors and green shading between maps indicate similar sequence.Six-frame ORF maps (formatted as described in figure S3) are shown for selected lp17 plasmids.Isolate names are on the left, and species are indicated on the right .Background colors and green shading between maps indicate similar sequence.B31 lp17 gene names are given above the constant (yellow) region, and PFam names are given for selected ORFs in the left end extensions.Red asterisks mark the novel sequence joint between the constant region and left end extension (see text).

B31 Miyazaki2E
Page 12 Borreliella chromosome terminal extensions -North America Figure S1B Page 14 Borreliella chromosome terminal extensions -Eurasia Figure S2 Page 15 Borreliella plasmid PFam32 protein neighbor-joining tree Figure S3 Page 17 B. californiensis CA446 linear plasmid lp5 Figure S4A Page 18 Recombination rates and hotspots on the B. burgdorferi chromosome and plasmids cp26, lp17 and lp54 Figure S4B Page 20 Cross-species exchange (introgression) between B. burgdorferi and B americana Figure S5 Page 22 Phylogeny of the circular plasmid cp26 Figure S6 Page 24 Phylogeny of the linear plasmid lp54 Figure S7 Page 26 Phylogeny of the linear plasmid lp17 Figure S8 Page 28 Loss of guaAB genes from cp26 in B. sinica and B. andersonii Figure S9 Page 29 Comparison of chromosome and cp26 trees Figure S10 Page 30 Borreliella lp54 terminal differences Figure S11 Page 31 B. japonica lp54 structure Figure S12 Page 33 Comparison of chromosome and lp54 trees Figure S13 Page 34 Organizational subtypes of B. burgdorferi lp17's Figure S14 Page 35 Comparison of chromosome and lp17 trees Figure S15 Page 36 lp17 left end extensions References Page 37

Figure S1 .
Figure S1.Borreliella chromosome terminal extensions S1A.North American clade chromosomal extensions S1B.Eurasian clade chromosomal extensions Maps of the chromosome left and right end regions of newly sequenced genomes are shown with genes indicated by boxes with a pointed end that indicates the direction of transcription.Yellow, orange and light blue genes at the left ends are homologs of B. burgdorferi strain B31 genes bb_001, bb_002 and bb_003, respectively, and pink and blue-green genes at the right ends are homologs of B. burgdorferi strain B31 genes bb_842 and bb_843, respectively.Light green genes are those that are present in the terminal extensions.Isolate names are given in center of the maps.Large asterisks indicate that the sequence extends to the tip of closed hairpin telomere (or sequence end where telomere has not been sequenced) and the numbers at the ends of the chromosomes are the distances from the telomere tip to the bb_001 gene homolog on the left (except in B. valaisiana VS116 and 100B40 where it is to bb_002) and between telomere tip and bb_843 (including its stop codon) on the right.Kbp scales are indicated above with bp 1 as the start of the homologs of genes bb_003 and bb_842, for the left and right ends, respectively.Numbers above the genes are gene lengths (bp) that include the stop codon.Paralogous protein families (PF) (Casjens et al., 2000; 2012) are indicated above the genes in the terminal extensions where small asterisks indicate pseudogenes.Backslashes indicate frameshift differences that could be either accurate pseuodgenes or sequencing errors.The constant region of the Borreliella chromosome typically (except in some B. valaisiana isolates, above) extends from bb_001 through bb_843 (strain B31 gene names).In addition to the newly sequenced isolates, B. burgdorferi strain B31and B. valaisiana VS116 and are shown for comparison in panels A and B, respectively.No new B. burgdorferi sensu stricto right end extension types were found among the newly sequenced genomes; they have the following extension types (see type names figure 6 of Casjens et al. 2017): S1 type, 80a; M-1 type, NIH3, NIH5, NIH8, Bol29, Fr-93-1, NE_5248, NE_5261 and NE_5267; type M-3; Z9, 217_5 and SCW-9 (the latter has about 300 bp of unique DNA at its tip); type L-2, Sh-2-82.The B. carolinensis M7p chromosome sequence did not reach either telomere and is not shown in the figure.

Figure
Figure S2 Figure S3CA446 lp5 76 57* 57* 57* 57 57* 44 B L A 3 2 _ 0 5 5 5 0 Figure S4B.A cross-species genetic exchange (introgression) between speciesB.burgdorferi and B. americana the B. chilensis VA1 cp26 as the outgroup.The branch tip circles are colored according to genome sequencing status (present here or previously sequenced), and the heatmap on the right indicates the geographic origins of the sequenced strains.Branches are labeled with species designations.Only three B. burgdorferi are shown, but see part B below.Asterisks (*) indicate a new OspC type which we name "NE".Also note that isolate ZS7 OspC is indicated by "B", although it is a European restricted subtype of B called B2 (Qiu et al., 2008) or Bb (Travinsky et al., 2010).For the dimerized cp26 plasmids in the three B. lusitaniae genomes (see text), the variant-call pipeline maps of the cp26 monomer that was the closest to the B31 cp26 reference sequence was used.B. A phylogenetic tree of cp26 sequences from 28 B. burgdorferi isolates.B. burgdorferi isolates were extracted from the tree described in figure S5A.The first and second columns of the heatmap on the right indicate the geographic origins of the isolates and the within-species B. burgdorferi SNV groups, respectively.The scale bars indicate changes per site.

Figure S9 .
Figure S9.Comparison of chromosome and cp26 treesEquivalent branches were switched to attain maximum alignment of the chromosome and cp26 SNV trees in Figures4 and S5, respectively, and green horizontal lines connect the same isolate in the two trees.

Figure S11 .
Figure S10 This is presumably due to the assembly algorithm which cannot separate the two very similar or identical arms of the inverted repeat.Long PacBio sequencing runs across the center of the inverted repeat did not reach further into the right arm.The HO14 lp54 assembly has 11833 bp of inverted repeat right arm sequence, and Miyazaki has 10836 bp.Here the right arm is trimmed to show about seven kbp of the inverted repeat.Most lp54 genes in our HO14 and Miyazaki lp54

Table S1A .
Isolates whose genome sequences were determined in this study

Table S1B .
Other genomes used in this study

Table S2B .
Plasmids sequenced in this report (part 2)

Table S4 .
Introgressed loci analysis with four-taxon D-statistics

Table S5A .
Divergence times and ancestral population sizes