GWarrange: a pre- and post- genome-wide association studies pipeline for detecting phenotype-associated genome rearrangement events

Abstract The use of k-mers to capture genetic variation in bacterial genome-wide association studies (bGWAS) has demonstrated its effectiveness in overcoming the plasticity of bacterial genomes by providing a comprehensive array of genetic variants in a genome set that is not confined to a single reference genome. However, little attempt has been made to interpret k-mers in the context of genome rearrangements, partly due to challenges in the exhaustive and high-throughput identification of genome structure and individual rearrangement events. Here, we present GWarrange, a pre- and post-bGWAS processing methodology that leverages the unique properties of k-mers to facilitate bGWAS for genome rearrangements. Repeat sequences are common instigators of genome rearrangements through intragenomic homologous recombination, and they are commonly found at rearrangement boundaries. Using whole-genome sequences, repeat sequences are replaced by short placeholder sequences, allowing the regions flanking repeats to be incorporated into relatively short k-mers. Then, locations of flanking regions in significant k-mers are mapped back to complete genome sequences to visualise genome rearrangements. Four case studies based on two bacterial species (Bordetella pertussis and Enterococcus faecium) and a simulated genome set are presented to demonstrate the ability to identify phenotype-associated rearrangements. GWarrange is available at https://github.com/DorothyTamYiLing/GWarrangehttps://github.com/DorothyTamYiLing/genome_rearrangement.git.

Then, assemblies are re-orientated according to the loca4on and orienta4on of the chosen gene in the BLAST output.

Replacement of repeat sequences by short placeholder sequences across the genome set
To detect boundaries of genome rearrangements, repeat sequences likely to be recombina4on sites are replaced with short placeholder sequences of N x 15 in each genome sequence.This allows for the placeholder sequence itself and its flanking regions to be incorporated into the length of a k-mer.Loca4ons of repeat sequences across the genome set are obtained through BLAST using default seZngs.The more relaxed percentage iden4ty and coverage thresholds here allow the detec4on of fragmented repeat sequences in the genome.The list of repeat sequences used in BLAST is supplied by the user in FASTA format (using the -replist flag) and can be informed by the most frequent repeat sequence categories iden4fied in the selected reference genome in the above step.

Extending and merging neighbouring repeat sequences into blocks prior replacement
Repeat sequences, such as IS elements, can some4mes be found in clusters in bacterial genomes, or different types of repeat sequences can co-locate next to one another forming repeat sequence blocks.Since effec4ve detec4on of genome rearrangement boundaries relies on regions flanking repeats mapping uniquely to query genomes, it is necessary to replace the whole repeat sequence block/cluster with placeholder sequences.This can ensure: 1) flanking regions can be incorporated into the length of a k-mer; 2) unique mapping of flanking regions to genome sequences due to the absence of repeat sequences.Complete replacement of repeat sequence blocks can be achieved by extending genome coordinates of each repeat sequence in each genome by a number of base pairs in both direc4ons, and/or merging repeat sequences that are less than a number of base pairs apart.Different extension and merging parameters can be specified by users.The number of base pairs for extension is recommended to be similar to or larger than the es4mated size of the largest repeat sequence clusters in the selected reference genome (see step above).Unless otherwise specified, repeat sequences that are less than 3bp apart (default) are merged.Note that this also merges overlapping repeat sequences.This results in genome sequences with repeat sequence clusters replaced by placeholder sequences.
Although extending repeat sequences and merging repeat sequences in close proximity can improve sensi4vity in rearrangement boundaries detec4on, too large an extension in both direc4ons and/or merging repeat sequences that are too far apart can lead to loss of resolu4on of exact rearrangement breakpoints.In addi4on, any rearrangement that sits completely within the replaced region could not be detected.By default, a minimal extension of 100bp in both direc4ons is performed, as well as merging overlapping/adjacent repeat sequences of 3bp or less apart is also performed.They produce a second independent set of genome sequences containing placeholder sequences.
The coordinates of each extended and merged repeat sequence block (in *mergedIS.txtfile), as well as their maximum and minimum size, and distances between them across all genomes (in *mergedISstat.txtfile) are recorded.

K-mer genera'on and bGWAS
Next, k-mers are generated from the genome set with repeat sequences replaced using kmer genera4on tools such as fsm-lite (heps://github.com/nvalimak/fsm-lite).The chosen size of k-mer should allow it to contain the placeholder sequence (if there is any), as well as the two flanking sequences.The size of flanking sequences should be long enough to generate confident sequence match with the genome sequences during BLAST alignment, but short enough that they do not contain a second placeholder sequence, or any remaining repeat sequence that could result in mul4ple sequence matches.For example, for B. pertussis and E. faecium, a k-mer size of 200bp (i.e.default value for -fsmlite_arg flag) and minimum flanking sequence of 30bp (i.e.default value for -flk_len flag) can be used.The resul4ng k-mers are then used as input for k-mer-based bGWAS to search for k-mers that are associated with a phenotype of interest.In addi4on, in situa4ons that could poten4ally produce a large number of significant k-mers, such as large rearrangement events or the use of large genome sets, we have also incorporated func4onality for uni4gs-based bGWAS to be performed in parallel with k-mers-based bGWAS (See Example 2 in the Result sec4on).Instead of uncompacted k-mers, the use of uni4gs could increase efficiency in visualisa4on of rearranged sequence content (i.e.translocated or inverted sequences), as well as reduce computa4onal 4me.Uni4gs are generated from genome assemblies using tools such as uni4g-caller (37).To our knowledge, uni4g genera4on tools such as uni4g-caller (37) are not compa4ble with genome sequences containing placeholder sequences.Therefore, uni4gs have to be used in combina4on with k-mers, which are compa4ble with genome sequences containing placeholder sequences and allow visualisa4on of rearrangement boundaries.
Significant k-mers are placed into different mul4FASTA files according to whether they contain placeholder sequences or not.The two files then undergo different downstream processing.When uni4g-based bGWAS is performed, uni4gs will be analysed in the same way as k-mers without placeholder sequences.

Processing significant k-mers containing placeholder sequences (Indicators of rearrangement boundaries)
k-mers that contain placeholder sequences are retained only if they have flanking sequences of a minimum number of base pairs in size on both sides of the placeholder sequence, as indicated by the -flk_len flag.For example, for B. pertussis and E. faecium, 30bp (default) can be used as input for the -flk_len flag.This is to ensure that the flanking sequences are long enough to generate confident, unique blast hits.These k-mers are then blasted against all the genome sequences in the genome set in their original form (without repeat sequence replacement) using default BLAST parameters.
The k-mers are further filtered based on BLAST alignment informa4on using 4 criteria: • the k-mer should produce BLAST alignment with ≥ 95% of the genomes in the dataset (when a k-mer does not show any BLAST alignment with a genome, it is likely to indicate an inser4on-dele4on event and is not informa4ve of genome rearrangements.Therefore, it is equivalent to missing informa4on for the genome.Missing informa4on in no more than 5% of the genomes is allowed for a k-mer to pass the filter) • the k-mer should produce two BLAST alignments in each genome, one for each flanking sequence • each BLAST alignment should be ≥ 90% of the flanking sequence's length • each BLAST alignment should show ³ 95% iden4ty match with an E-value of £ 10e-10 For each k-mer that has passed the filter, a summary derived from its blast alignment to each genome is made for 1) StartL (genome coordinate of the start of leW flanking sequence) 2) EndL (genome coordinate of the end of leW flanking sequence) 3) StartR (genome coordinate of the start of right flanking sequence) 4) EndR (genome coordinate of the end of right flanking sequence) (Figure S2).
Based on the genome coordinate summary of each k-mer and the maximum size of repeat sequence cluster replaced by placeholder sequences (i.e.maxrplsize, can be found in *mergedISstat.txtfile), the flanking sequence behaviour for each k-mer in each genome is determined.Behaviours could be "intact_k" (intact k-mer), "mv_aprt" (flanking sequences moved apart), "swp_flk" (flanking sequences swapped in posi4on) and "mv_flp" (one flanking sequence moved away and flipped/inverted).Behaviour for each k-mer in each genome is defined according to the different orders of StartL, EndL, StartR and EndR genome coordinates, as shown and visualised in Figure 1.K-mers that show flanking sequence behaviours of "mv_aprt", "swp_flk" or "mv_flp" are defined as split k-mers.When a k-mer shows "intact_k" behaviour in some genomes and "mv_aprt" or "swp_flk" behaviours in other genomes, transloca4on is suggested by the k-mer; whereas when a k-mer shows "intact_k" behaviours in some genomes and "mv_flp" behaviour in other genomes, inversion is suggested by the k-mer (Figure 2).Flanking sequence behaviours are defined as "undefined_behave" when none of the rules are fulfilled for that k-mer.This should be unlikely, but if it does happen, BLAST results including StartL, EndL, StartR and EndR for that k-mer in each genome will be generated for inves4ga4on (output file name: myundefine_k.txt in /kmers_withN directory).
For each k-mer that is defined as either "intact_k", "mv_aprt", "swp_flk" or "mv_flp", the number and propor4on of case and control genomes displaying that behaviour are calculated.The most parsimonious genome rearrangement event that is associated with the phenotype is recorded (i.e."mv_aprt" or "swp_flk" represents transloca4ons, whereas "mv_flp" represents inversions, Figure 2).In addi4on, the informa4on of where the flanking sequence behaviours occur in the genomes (in the form of summary sta4s4cs of genome coordinates) is also summarised.
If a behaviour is found in fewer than 20% of both case and control genomes, it is not included in the summary for the k-mer, as it is unlikely to be associated with the phenotype of interest.
Summaries for significant k-mers with placeholder sequences showing split k-mers behaviours in at least 20% of case or control genomes can be found in output file mysplitk_out.txt in /kmers_withN directory.Descrip4ons for each column can be found in the "Pipeline and output file descrip4on" sec4on on GitHub.
For significant k-mers containing placeholder sequences that show "intact_k" behaviour in all genomes, they are analysed the same way as those without placeholder sequence (see below).Summaries of count and propor4on of case and control genomes containing the kmers in forward/reverse orienta4on, as well as summary sta4s4cs of their genome posi4ons are stored in the output file myintactkwithN_out.txt in / kmers_withN directory.

Processing k-mers/uni'gs that do not contain placeholder sequences (Indicators of rearranged sequence content)
It has been observed that for k-mer or uni4g (denoted as k-mers in this sec4on) genera4on tools such as fsm-lite (heps://github.com/nvalimak/fsm-lite)and uni4g-caller (37), a k-mer is only defined as "present" in a genome when it shows an exact match in sequence content and sequence orienta4on.If this region is, for instance, inverted in other genomes, this kmer is defined as "absent" in those genomes.This property of k-mer/uni4g genera4on tools can be leveraged to detect rearranged genome sequences that are associated with phenotype of interest.Given the absence of single nucleo4de polymorphisms and inser4ons-dele4ons, significant intact k-mers from bGWAS reflect rearranged genome sequences that are influenced by an inversion associated with a phenotype.This method is, however, not suitable for detec4ng transloca4ons, as they do not involve any change in sequence orienta4on.
For significant k-mers that do not contain placeholder sequences, they are aligned with the genome sequences by BLAST using default parameters, as previously.The k-mers are then further filtered based on their BLAST alignment informa4on using the same criteria 1-3 as previously but using an alterna4ve criterion 4, whereby the k-mer should produce one unique BLAST alignment per genome.Then, for the k-mers passing the filters, summaries of count and propor4on of case and control genomes containing the k-mers in forward/reverse orienta4on, as well as summary sta4s4cs of their genome posi4ons are generated.They are available in the output file myNoNintactk_out.txt in /kmers_noN directory.Descrip4ons for each column are provided in the "Pipeline and output file descrip4on" sec4on on GitHub.
BLAST results for k-mers with more than one hit per genome are also output for downstream inves4ga4on and the possibility of gene duplica4on (output file: kmer_with_mul4_hits_noN.txt in in /kmers_noN directory).

Visualising genome rearrangements by ploTng deduplicated k-mers/uni'gs
For split k-mers containing placeholder sequences, rearrangements are visualised by ploZng where the flanking sequences are found in case and control genomes.Due to the overlapping property of k-mers, mul4ple split k-mers can be found at a single rearrangement boundary.For output simplicity, these k-mers are deduplicated so that each rearrangement boundary is visualised by the plot of one k-mer.They are deduplicated as follows: each k-mer is given a label, which contains its: count and propor4on of flanking sequences behaviour in case/control genomes, genome posi4on (represented by the mean StartL when the k-mer is intact in control genomes, rounded off to two significant digits, as indicated by the -dedupk flag), and forward/reverse intact k-mers count.Only k-mers with unique labels are kept for visualisa4on.Deduplicated split k-mers can be found in output file myshort_splitk_out_uniq.txt in /kmers_withN directory.Addi4onally, since flanking sequences can show slightly different posi4ons in different genomes due to inser4ons/dele4ons, this range of genome posi4ons is shown by ploZng out flanking sequences' posi4ons in all genomes.To improve the clarity of presenta4on, flanking sequences (of the same colour) that are less than a certain number of base pairs apart (indicated by -dist flag, default is 40 000bp) are merged into one arrow by taking the median genome posi4on value.
For intact k-mers with and without placeholder sequence as well as uni4gs (denoted as kmers), rearranged sequence contents are visualised by ploZng where the k-mers are found in case and control genomes.Intact k-mers are deduplicated as follows: each k-mer is given a label, which contains its: median EndL when the k-mer is in forward orienta4on in control genomes, median EndL when the k-mer is in reverse orienta4on in control genomes, median EndL when the k-mer is in forward orienta4on in case genomes and median EndL when the k-mer is in reverse orienta4on in case genomes.For intact k-mers without placeholder sequence as well as uni4gs, "sStart" values in BLAST output are used instead of EndL.AWer rounding off these values to the closest mul4plier of selected value (e.g.100, 1 000, 10 000), as indicated by the -intkrd flag, only k-mers with unique labels are kept for visualisa4on.The list of significant deduplicated k-mers that are used for visualisa4on can be found in files with suffix *kmer4plot.txt.Figure S12: Schematic diagram of the two di@erent genome structures explained by two translocation events, present in 39 simulated genomes (19 genomes with structure "0" and 20 genomes with structure "1").Ribosomal operons that consist of mainly 16S ribosomal RNA, 23S ribosomal RNA, 5S ribosomal RNA and tRNAs are inserted in each of the rearrangement boundaries of the simulated genomes.
Figure S13: Twelve different significant split k-mers that were mapped to each of the translocation boundaries in example 4, split in case/control genomes, and in forward/reverse orientation.

Figure S2 :
Figure S2: Start and end genome coordinates of left and right flanking sequences of a kmer containing placeholder sequence.

Figure S3 :
Figure S3: Schematic diagram of the two di@erent genome structures explained by two nested inversions, present in a subset of 47 B. pertussis genomes.The outer inversion took place between genome coordinates 430 Kbp and 3600 Kbp, while the inner one took place between 1500 Kbp and 2500 Kbp, one inversion nested within the other.

Figure S4 :
Figure S4: Example plots of four split k-mers that indicate inversion boundaries, from genome set with 7 000bp extension.a) Inversion within genome between 430Kbp and 3600Kbp, 430Kbp boundary, k-mer being intact in majority of case genomes in reverse orientation and split in majority of control genomes b) Inversion within genome between 430Kbp and 3600Kbp, 3600Kbp boundary, k-mer being intact in majority of control genomes in forward orientation and split in majority of case genomes c) Inversion within genome between 1500Kbp and 2500Kbp, 1500Kbp boundary, k-mer being intact in majority of case genomes in reverse orientation and split in majority of control genomes d) Inversion within genome between 1500Kbp and 2500Kbp, 2500Kbp boundary, k-mer being intact in majority of control genomes in forward orientation and split in majority of case genomes

Figure S6 :
Figure S6: 1500 Kbp and 2500 Kbp boundaries in example 1 contain additional repeat sequences (highlighted in red) (i.e.peroxide stress protein YaaA, FUSC family membrane protein, MFS transporter) locating immediately adjacent to IS elements clusters (highlighted in green).MAUVE genome visualisation.

Figure S7 :
FigureS7: Plots of intact deduplicated k-mers (with and without placeholder sequence combined) visualise rearranged sequence content in two genome regions as a result of two nested inversions as depicted in FigureS3, significantly associated with genome structure phenotype.When using 100bp (a and b) and 7000bp (c and d) extension, and when intact k-mers are in forward (a and c) and reverse orientation (b and d) in majority of control genomes (genome phenotype group "0").The direction of arrows indicates orientation of k-mers when mapped to genomes.Height of arrow corresponds to proportion of case (genome phenotype group "1") / control (genome phenotype group "0") genomes.Genome position of k-mer is indicated by the tip of arrow.

Figure S8 :
Figure S8: Schematic diagram of the two genome structures observed in 32 E. faecium genomes in example 2. The two structures are distinguished by an inversion.

Figure S9 :
Figure S9: Plots of four split k-mers that indicate inversion boundaries at 720 Kbp and 2100 Kbp, generated using a 100bp extension.

Figure S10 :
Figure S10: Plots of deduplicated significant unitigs that showed rearranged sequence content significantly associated with predefined genome structure phenotype, when using 100bp (a and b) and 17000bp (c and d) extension, and when unitigs are in forward (a and c) and reverse orientation (b and d) in majority of control genomes.Height of arrow corresponds to the proportion of case (genome phenotype group "1") /control (genome phenotype group "0") genomes.Direction of arrows indicates orientation of unitigs when mapped to genomes.Genome position of unitigs indicated by the tip of arrow.

Figure S11 :
Figure S11: Schematic diagram for example 3 showing location of pertactin autotransporter gene is immediately next to a repeat region in the B. pertussis genomes J405 (accession: GCA_008817455.1), resulting it to be completely embedded within the region for replacement when 7000bp extension was applied.