Fully automated sequence alignment methods are comparable to, and much faster than, traditional methods in large data sets: an example with hepatitis B virus

Aligning sequences for phylogenetic analysis (multiple sequence alignment; MSA) is an important, but increasingly computationally expensive step with the recent surge in DNA sequence data. Much of this sequence data is publicly available, but can be extremely fragmentary (i.e., a combination of full genomes and genomic fragments), which can compound the computational issues related to MSA. Traditionally, alignments are produced with automated algorithms and then checked and/or corrected “by eye” prior to phylogenetic inference. However, this manual curation is inefficient at the data scales required of modern phylogenetics and results in alignments that are not reproducible. Recently, methods have been developed for fully automating alignments of large data sets, but it is unclear if these methods produce alignments that result in compatible phylogenies when compared to more traditional alignment approaches that combined automated and manual methods. Here we use approximately 33,000 publicly available sequences from the hepatitis B virus (HBV), a globally distributed and rapidly evolving virus, to compare different alignment approaches. Using one data set comprised exclusively of whole genomes and a second that also included sequence fragments, we compared three MSA methods: (1) a purely automated approach using traditional software, (2) an automated approach including by eye manual editing, and (3) more recent fully automated approaches. To understand how these methods affect phylogenetic results, we compared resulting tree topologies based on these different alignment methods using multiple metrics. We further determined if the monophyly of existing HBV genotypes was supported in phylogenies estimated from each alignment type and under different statistical support thresholds. Traditional and fully automated alignments produced similar HBV phylogenies. Although there was variability between branch support thresholds, allowing lower support thresholds tended to result in more differences among trees. Therefore, differences between the trees could be best explained by phylogenetic uncertainty unrelated to the MSA method used. Nevertheless, automated alignment approaches did not require human intervention and were therefore considerably less time-intensive than traditional approaches. Because of this, we conclude that fully automated algorithms for MSA are fully compatible with older methods even in extremely difficult to align data sets. Additionally, we found that most HBV diagnostic genotypes did not correspond to evolutionarily-sound groups, regardless of alignment type and support threshold. This suggests there may be errors in genotype classification in the database or that HBV genotypes may need a revision.

HBV has a compact circular genome ~3.2 kb in length, with four genes arranged across 155 overlapping reading frames ( Fig. 1; Norder et al., 1994). Any attempt to use publicly available 156 HBV sequence data for phylogenetic analysis needs to work with the fragmentary nature of the 157 data set and with the issues of limited character sampling (i.e. small genomes). The small size of 158 the genomes coupled with the large number of sequences creates phylogenetic data sets that are 159 "tall and narrow". Furthermore, there is currently no universally agreed starting position for 160 viruses with circular genomes, including HBV, therefore individual researchers are free to report 161 genome sequences starting at any position, and genomes may have arbitrary breakpoints. This 162 results in sequence alignment issues as homologous portions of the genome are not necessarily in 163 the same relative position when represented as linear sequences. 164 To ameliorate these issues in virus MSA, we compared traditional automated-manual 165 alignment approaches to fully automated methods for phylogenetic analyses on a tall and narrow 166 data set generated from publicly available HBV sequence data. Using two different data sets, one 167 comprised exclusively of whole genomes (and therefore with limited missing data) and a second 168 comprised of most publicly available HBV sequences (a highly fragmentary data set) we 169 compared three alignment methods: 1) traditional automated MSA methods , 2) the commonly 170 used hybrid approach of MSA followed by manual adjustment of the alignment, and 3) UPP and 171 PASTA, two modern automated MSA methods. This approach allowed us to assess the 172 effectiveness of different methods for handling the problem of large matrices with highly 173 fragmentary sequence data. Additionally, through MSA evaluation methods we were able to 174 assess the phylogenetic signal of and monophyly within named strains of the virus.  (Edgar 2004). In addition, we used 229 PASTA to align the original 5,244 (i.e. un-linearized) genome sequences. 230 231 Total (Fragments + Genome) alignments 232 Traditional Alignment 233 A traditional, brute force approach to align all 55,353 sequences (including full genomes 234 and fragmentary sequences) was not feasible due to the large number of sequences and alignment 235 gaps from the fragments. Therefore, to align the full set of sequences we first used a custom 236 script to group sequences into 1,269 FASTA files based on their GenBank accession numbers 237 (https://github.com/tacatanach/Hepatitis_B_Virus_alignment). Because GenBank accession 238 numbers are assigned sequentially at submission, a cluster of successive HBV sequences is likely 239 from the same region of the genome. As with the genome sequences, we used our custom script 240 to "linearize" the original GenBank sequences. We also created a consensus sequence of the full 241 HBV genome (3,639 bp) based on the genome alignment, using majority rule (rather than 242 ambiguity codes) for ambiguous sites. This consensus sequence served as a reference for 243 downstream alignment steps. Each author then aligned about 150 files, using the default gap 244 parameters in MUSCLE and checking each alignment by eye against the genome consensus 245 sequence. To facilitate aligning by eye we used custom scripts to "ladderize" the alignments by 246 ordering the sequences according to starting position and length 247 (https://github.com/tacatanach/Hepatitis_B_Virus_alignment) so that in each alignment 248 sequences with similar starting points and lengths were placed next to each other. To combine 249 two alignments we used profile-profile in MUSCLE followed by manual adjustments against the 250 genome consensus. We iteratively repeated this process to gradually combine the sequences. Manuscript to be reviewed 251 This process was successful for several rounds of profile-profile alignments. However, the 252 alignment files eventually became too large for the profile-profile function in MUSCLE to 253 consistently align pairs of files. At this point, we began manually combining pairs of aligned 254 sequence files using the cat Unix command followed by manual adjustments. Adjustments were 255 made by opening gaps in each alignment subset to align with the reference.

256
Additionally, there were 930 individually uploaded sequences ("singletons"; i.e. 257 sequences that did not cluster with other sequences according to accession number). To align 258 these sequences with the larger sequence matrix, we first split all singletons into files of 50-100 259 sequences, each file containing a consensus genome sequence. These sequences were then 260 aligned by eye to the consensus genome sequence, and resulting aligned sequences files were 261 combined as described above. We then manually combined all aligned sequence files as 262 described above, using the genome consensus sequence as a reference.

263
To further check the alignment, we created consensus sequences (with ambiguity codes) 264 for blocks of 73 (the number of lines viewable in a screen without scrolling; for ease of viewing) 265 sequences (https://github.com/tacatanach/Hepatitis_B_Virus_alignment). We then combined all 266 consensus sequences maintaining the alignment structure. By collapsing sets of sequences into a 267 single consensus sequence we could visually check the entire alignment. If consensus sequences 268 were not aligned to each other or contained many ambiguity codes, this indicated a particular 269 portion of the alignment was misaligned. We then adjusted the corresponding alignment 270 accordingly and repeated the process until we could not detect any misaligned regions.

271
During the alignment step, 5,000 sequences could not be aligned because they were 272 unflagged recombinants or from non-humans hosts and were subsequently removed. We also ran 273 five iterations of the NJ QC step and removed an additional 168 sequences. Finally, we checked Manuscript to be reviewed 274 to ensure that sequences from unique hosts, geographic regions, ancient (more than 100 years 275 old) samples, and all genotypes were present in the alignment. If any of these key data were 276 removed through prior filtering steps, they were manually added back into the alignment. 277 Including these sequences is important for having a representative sample, and warranted the 278 additional effort required to incorporate them into our alignment. We also included a sequence of 279 HBV from woolly monkey (GenBank accession JX978431) as an outgroup. The HBV S-region is often sequenced for identification of viral genotype (Galibert et al., 294 1979). Therefore, as a comparison to the genomes-only and total (genomes + fragments) data 295 sets, we identified and analyzed the S-region in the whole manual genome alignment based on 296 the annotation of GenBank accession AJ131956 (Ozaslan et al., 2007). The annotation of the S-297 region combines pre-S1, pre-S2, and the S gene.  To compare the similarity of tree topologies, we did pairwise comparisons of the trees 308 produced using each of the MSA strategies. For each of the trees we collapsed nodes based on 309 three bootstrap support thresholds (50%, 75%, 90%). We then calculated the total number of 310 edges for each tree and the number of edges that were incompatible with the other tree in the 311 pairwise comparison using Phylonet Modified (Mirarab et al., 2014). We also calculated 312 Robinson-Foulds (RF) distances (Robinson & Foulds, 1981) for each tree comparison using 313 CompareTree.pl (Fast Tree-Comparison Tools, 2009). However, due to how RF distances are 314 calculated, comparing this metric across trees can be misleading. For example, two clades which 315 are identical in topology except for the placement of one taxon can have a high RF value if the 316 one non-identical node is placed at the tip of a clade in one tree but at the base in the second tree.
317 Therefore, although we include the RF metric, we interpret our results based primarily on the 318 incompatibility ratios. We performed all possible pairwise comparisons between the different 319 trees for a specific support cutoff. We then converted each comparison to a ratio of the number Manuscript to be reviewed 321 genome and total alignment trees, we pruned the total alignment trees to be the same set of taxa 322 as the genome-only alignments.

323
As a way to assess tree differences in a biologically meaningful context, we mapped 324 genotype labels onto our tree tips. We pruned out tips that did not have any associated genotype 325 data or were recombinant (i.e. were associated with multiple genotypes) based on GenBank 326 metadata. For each genotype, we then identified the smallest possible clade that included all taxa 327 of the genotype, and calculated the percentage of tips in the clade identified as that genotype 328 using a python script mono.py (https://github.com/tacatanach/Hepatitis_B_Virus_alignment). 329 We call this metric the "genotype occupancy proportion." We repeated this analysis for each 330 bootstrap collapse threshold (50%, 75%, and 90%) of each tree.   Table 1 (on Dryad). Trees without any edges 360 collapsed (0% threshold) had the highest incompatibility ratios. The lowest ratio values for a 0% 361 threshold were from comparisons between genome alignment trees, but these were still 362 considerably higher than ratio values from comparisons at all other thresholds. Incompatibility 363 ratios tended to decrease with an increasing bootstrap threshold. The lowest ratio values were for 364 comparisons at the 75% and 90% thresholds. Ratios for comparisons at the 50% threshold tended 365 to be higher than those for comparisons at the 75% and 90% thresholds, but were all 366 considerably lower than comparisons at the 0% threshold. RF values also tended to be higher for 367 comparisons at the 0% threshold, but this was not true across all comparisons. The lowest RF 368 values were from comparisons at higher thresholds. However, there was not a clear pattern Manuscript to be reviewed 369 among comparisons at the 50%, 75%, and 90% thresholds. For example, several 90% threshold 370 comparisons had higher RF values than some 50% and 75% threshold comparisons.

371
Among trees from the different alignment types, trees from the S-region alignment 372 tended to have higher incompatibility ratio values. This was true at each bootstrap threshold. 373 Comparisons including genome alignment trees tended to have the lowest ratios whereas the 374 highest values involved comparisons between the genome and total alignments. Trees from total 375 (genomes + fragments) alignments had intermediate ratios. Trees from the total automated 376 alignments (with UPP) had slightly lower ratios than the total manual alignment, but the two 377 UPP trees had the fewest number of differences between them (Fig. 4 The average absolute difference in genotype occupancy proportions between the 383 different alignments was 4.6% for genomes and S-region trees, and 2.1% for total alignment 384 trees. Overall, the average proportions of genotype occupancy across all alignment types, 385 genotypes, and support thresholds were 52% for genome and S-region trees, and 13.1% for total 386 alignment trees. These patterns are summarized in Fig. 5, and specific genotype occupancy Manuscript to be reviewed 392 S-region trees, but lower than 10% in the total alignment trees. Although these genotypes are 393 relatively rare in GenBank (150 or fewer genome sequence and 530 or fewer total sequences per 394 genotype), they are geographically widespread ). The addition of fragmentary 395 sequence data could have greatly increased the geographic coverage of the data set which could 396 result in an increase in genetic variation within the genotype and an extreme decrease in 397 genotype occupancy. Additionally, as we were limited to GenBank annotations for identifying 398 genotype, it is possible that the inclusion of the fragmentary data, an inclusion that in some 399 instances increased the number of sequences six-fold, resulted in introducing sequences that had 400 been incorrectly annotated.   Manuscript to be reviewed 439 appropriate software (e.g., UPP or MAFFT --addfragments) to align these types of data sets is 440 preferred. The traditional method of checking automated alignments does consume time that 441 could be devoted to other tasks, can be difficult to replicate, and is impractical for large data sets. 442 Yet, accuracy remains the goal of phylogenetic estimation and is dependent on the alignment 443 step and is the driver for retention of manual curation. Here we have demonstrated that manual 444 alignment and curation give comparable results to automated methods. This is encouraging, 445 given that publically available data from high-throughput NGS sequencing projects is increasing 446 exponentially and will likely drive a greater reliance on automated methods. It also suggests that 447 findings based on completely automated methods are comparable to results from earlier studies 448 that may have included a manual step. Integrating complete data with existing fragmentary 449 sequences is crucial for addressing biologically relevant questions and in our data, we have 450 shown this can be done so in a consistent manner across commonly used methodologies.

451
Although minor differences in tree topologies are expected when multiple analytical 452 approaches are used, these differences may not be biologically meaningful (i.e rearrangement of 453 branches within a clade). To test the impact of alignment type on the biological meaning of the 454 resulting phylogenies we used HBV genotypes, a feature often annotated by researchers when 455 submitting HBV sequences to GenBank. We sought to determine if HBV genotypes where stable 456 in their location in the tree (consistent) and if they were monophyletic (accurate with regard to 457 evolutionary relatedness) within our tree. Our results are consistent that the genotype occupancy 458 in a clade remain stable across alignment types for both genome and fragmentary data sets. This 459 indicates the alignment approach for these types of data sets does not have a major effect on the 460 biological interpretation of the inferred topologies, particularly when testing the monophyly of a 461 particular subset of tips. This is consistent with our test of incompatible edges among topologies Manuscript to be reviewed 507 genotypes. Together these results suggest the rapid, automated approach will be useful for other 508 "tall and narrow" data sets, including those of medically important taxa.

509
Another issue facing phylogenetic estimation of HBV is the lack of a universal sequence 510 starting point. Although the genomes are circular, sequences are uploaded to GenBank in linear 511 format. Without a universal starting location, publicly available HBV sequences begin at 512 different places in the genome. Additionally, the 3' end of one sequence may be orthologous to 513 the 5' end of another sequence. In a sequence alignment, the two sequences would only overlap 514 over a fraction of their entire length, and the whole alignment would be much longer than 515 necessary. Indeed, when we initially used PASTA to align HBV genomes downloaded from 516 GenBank, the alignment was approximately 19,000 bp in length, or 6 times longer than the 517 length of the HBV genome. It was therefore necessary to "linearize" the alignments by moving 518 3' sequences to the 5' end before aligning the sequences using a fully automated approach. A 519 similar approach should be taken by future studies focusing on HBV phylogenetics, or for other 520 organisms with similar data issues. Our script chooses an arbitrary cutoff point for 521 "linearization," but perhaps future approaches should choose alignment starting points based on 522 genes or other biologically relevant information. Databases could also implement a standardized 523 starting point for data deposited from these circular genomes.

525
Here we demonstrate that UPP and PASTA, completely automated MSA approaches, 526 resulted in alignments that produced phylogenetic trees that are topologically and biologically 527 similar to those produced using a sequential approach that uses manual interventions. However, 528 the fully automated approaches were completed quickly as opposed to the numerous hours spent 529 on adjusting alignments for the traditional method. Although we used PASTA to generate the   Manuscript to be reviewed    Manuscript to be reviewed Manuscript to be reviewed