Ultra-large alignments using phylogeny-aware profiles

Many biological questions, including the estimation of deep evolutionary histories and the detection of remote homology between protein sequences, rely upon multiple sequence alignments and phylogenetic trees of large datasets. However, accurate large-scale multiple sequence alignment is very difficult, especially when the dataset contains fragmentary sequences. We present UPP, a multiple sequence alignment method that uses a new machine learning technique, the ensemble of hidden Markov models, which we propose here. UPP produces highly accurate alignments for both nucleotide and amino acid sequences, even on ultra-large datasets or datasets containing fragmentary sequences. UPP is available at https://github.com/smirarab/sepp.


S1 Early termination on large datasets
Many alignment methods failed to complete analyses on the larger datasets, but the reasons varied: some ran out of memory, some failed due to software bugs, and some were simply unable to produce an alignment within the 24-hour time limit (i.e., they might have been able to produce an alignment if given more time). This section documents each case.
MAFFT-default. MAFFT-default terminated early on the CRW 16S.B.ALL dataset and three of the Indelible 10000M3 datasets. The error messages produced by MAFFT-default follow the template "Cannot allocate <X> character vector", where X is a large number. MAFFT-default also failed to produce an alignment on the RNASim 100K dataset within the 24-hour time limit on TACC; according to MAFFT's output log, MAFFT was still running when the job was evicted.

S2.2 Backbone size
We examined the impact of the backbone size on alignment and tree accuracy on the RNASim dataset (Table S2.2). UPP run in default mode uses a backbone of size 1000, and run in fast mode uses a backbone of size 100. The comparison between UPP(Default) and UPP(Fast) shows that the larger backbone generally resulted in lower alignment SP-error and tree error than the smaller backbone. However, the larger backbone increased the running time, often substantially.
We found that UPP using PASTA and MUSCLE backbones resulted in the most accurate UPP alignments.
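To make the two modes concrete, the sketch below shows how such a comparison might be driven. It is an illustration only: the flag spellings (-s for the input sequences, -B for the backbone size, -o for the output prefix) are assumptions based on the SEPP/UPP distribution and should be checked against run_upp.py --help.

```python
# Hedged sketch: driving UPP(Default) vs. UPP(Fast). Flag names are
# assumptions based on the SEPP/UPP distribution, not verified here.
import subprocess

def run_upp(input_fasta, backbone_size, output_prefix):
    """Run UPP with the given backbone size: 1000 (default) or 100 (fast)."""
    subprocess.run(
        ["run_upp.py",
         "-s", input_fasta,         # unaligned input sequences
         "-B", str(backbone_size),  # assumed backbone-size flag
         "-o", output_prefix],      # assumed output-prefix flag
        check=True)

run_upp("rnasim.fasta", 1000, "upp_default")  # UPP(Default)
run_upp("rnasim.fasta", 100, "upp_fast")      # UPP(Fast)
```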

S2.4 Backbone and final alignment SP-error
We examined the SP-error of the initial backbone alignment and of the resulting final UPP alignment (Fig. S2.4). We found that the error in the final UPP alignment closely matches the initial error in the backbone alignment. The explanation is that UPP restricts the backbone to sequences considered full-length, so several clades are omitted from the backbone set, and UPP then has difficulty aligning sequences from those clades.
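For reference, SP-error can be computed by comparing the sets of homologous residue pairs induced by the estimated and reference alignments (tools such as FastSP do this at scale). The following is a minimal, naive sketch with toy inputs, not the evaluation code used in the paper.

```python
# Naive sketch of SP-error: compare the homologous residue pairs implied
# by an estimated alignment against a reference alignment. Alignments are
# dicts mapping sequence name -> gapped string; toy data only.
from itertools import combinations

def homologous_pairs(alignment):
    """All residue pairs placed in the same column; a residue is
    identified as (sequence name, ungapped position)."""
    names = list(alignment)
    pos = {n: 0 for n in names}  # ungapped position per sequence
    pairs = set()
    for col in range(len(alignment[names[0]])):
        residues = []
        for n in names:
            if alignment[n][col] != '-':
                residues.append((n, pos[n]))
                pos[n] += 1
        for a, b in combinations(residues, 2):
            pairs.add(tuple(sorted((a, b))))
    return pairs

def sp_error(estimated, reference):
    """SPFN: fraction of reference pairs missed; SPFP: fraction of
    estimated pairs absent from the reference."""
    est, ref = homologous_pairs(estimated), homologous_pairs(reference)
    shared = len(est & ref)
    return 1 - shared / len(ref), 1 - shared / len(est)

ref = {"s1": "AC-T", "s2": "ACGT"}
est = {"s1": "ACT-", "s2": "ACGT"}
print(sp_error(est, ref))  # (0.333..., 0.333...)
```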

S2.5 Query sequence alignment method
We compared three techniques for aligning the query sequences to the backbone alignment within the UPP pipeline: the ensemble of HMMs technique, MAFFT-Profile "--add", and MAFFT-Profile "--addfragments". Table S2.2 and Figure S2.5 show that the ensemble of HMMs technique resulted in lower alignment SP-error and tree error than MAFFT-Profile, whether using "--add" or "--addfragments". In addition, UPP using the ensemble of HMMs technique was able to align 200,000 sequences within 24 hours, whereas UPP using MAFFT-Profile "--add" could not align the 200K dataset in that timeframe, and UPP using MAFFT-Profile "--addfragments" could only align up to 10,000 sequences (Table S2.2). Comparing the two MAFFT-Profile options, we found that "--addfragments" resulted in lower alignment SP-error and tree error than "--add" (Fig. S2.5), though at a large increase in running time (Table S2.2).
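For concreteness, the two MAFFT-Profile variants correspond to MAFFT's documented --add and --addfragments options, which insert new sequences into an existing alignment. The wrapper below is a hedged sketch; file names are placeholders.

```python
# Sketch of the two MAFFT-Profile variants compared above. MAFFT writes
# the merged alignment to stdout; file names here are placeholders.
import subprocess

def mafft_insert(queries, backbone_alignment, output, fragments=False):
    """Add query sequences to an existing backbone alignment.
    --addfragments targets short/fragmentary queries; --add full-length."""
    option = "--addfragments" if fragments else "--add"
    with open(output, "w") as out:
        subprocess.run(["mafft", option, queries, backbone_alignment],
                       stdout=out, check=True)

mafft_insert("queries.fasta", "backbone.aln", "merged_add.aln")
mafft_insert("queries.fasta", "backbone.aln", "merged_frag.aln", fragments=True)
```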

S2.6 Comparison of clade-based versus centroid edge decomposition
The centroid edge decomposition does not guarantee that the resulting sequence subsets form monophyletic clades in the backbone tree. Thus, the subsets can be polyphyletic or paraphyletic, especially if the backbone tree is unbalanced.
In this section, we compared a centroid edge decomposition to a clade-based decomposition.
The clade-based decomposition partitions the backbone tree into two subtrees by breaking the tree at the root node. This process repeats recursively on any subtree that is larger than the maximum alignment subset size (set to 10 for this example).

Figure S2.5 (caption; panel (c): wall clock running time): Impact of backbone size and query sequence alignment method on alignment SP-error, tree error, and running time. Methods labeled "Default" use a backbone size of 1000. Methods labeled "Fast" use a backbone size of 100. Methods labeled "MAFFT" used MAFFT as a profile alignment method (under either the "--addfragments" or "--add" setting) to insert the query sequences into the backbone alignment. ML trees were estimated using FastTree under GTR.
The results on the centroid edge versus clade-based decomposition for the CRW datasets showed that neither method was consistently better than the other (Fig. S2.6). However, the clade-based decomposition resulted in a larger number of alignment subsets (438 versus 280 for 16S.T) and thus required more time to run (6.1 hours versus 3.9 hours for 16S.T).
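To make the contrast concrete, here is a minimal sketch of both strategies (an illustration, not UPP's implementation): trees are nested tuples of two children, leaves are strings, and leaf labels are assumed distinct.

```python
# Sketch contrasting clade-based and centroid edge decomposition on a toy
# rooted binary tree. Both functions return leaf sets of size <= max_size.

def leaves(tree):
    """Leaf labels under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)

def clade_decomposition(tree, max_size):
    """Split at the root into the two child subtrees, recursing on any
    subtree larger than max_size. Every subset is a clade."""
    tips = leaves(tree)
    if len(tips) <= max_size:
        return [tips]
    left, right = tree
    return clade_decomposition(left, max_size) + clade_decomposition(right, max_size)

def subtrees(tree):
    """Every node below the root; each corresponds to one edge."""
    if isinstance(tree, str):
        return []
    left, right = tree
    return [left, right] + subtrees(left) + subtrees(right)

def prune(tree, target):
    """Remove the target subtree, contracting its parent away."""
    left, right = tree
    if left == target:
        return right
    if right == target:
        return left
    return (left if isinstance(left, str) else prune(left, target),
            right if isinstance(right, str) else prune(right, target))

def centroid_edge_decomposition(tree, max_size):
    """Delete the centroid edge -- the edge minimizing the larger side of
    the induced bipartition -- and recurse on both sides. The side away
    from the deleted subtree need not be a clade."""
    tips = leaves(tree)
    if len(tips) <= max_size:
        return [tips]
    total = len(tips)
    best = min(subtrees(tree),
               key=lambda s: max(len(leaves(s)), total - len(leaves(s))))
    return (centroid_edge_decomposition(best, max_size)
            + centroid_edge_decomposition(prune(tree, best), max_size))

toy = ((("a", "b"), ("c", "d")), (("e", "f"), ("g", "h")))
print(clade_decomposition(toy, 3))          # four clades of size 2
print(centroid_edge_decomposition(toy, 3))  # agrees here; differs on unbalanced trees
```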

S2.7 hmmbuild options
We explored two different ways of running hmmbuild. The first was to disable the relative sequence weighting option. By default, hmmbuild downweights similar sequences and upweights divergent sequences when computing the character frequencies in the profile during HMM generation. The motivation is to minimize the impact of biased and uneven sampling caused by too many similar sequences being included in the same model. However, this works against the purpose of using an ensemble of HMMs, since the idea is to form profile HMMs from sequence sets that have been intentionally partitioned into subsets containing similar sequences. We ran UPP with this option turned off (labeled "UPP(Default,NoReweight)").
The second way concerns hmmbuild's computation of the number of effective sequences used to generate the HMM (called "entropy weighting"). By default, HMMER attempts to reach a target of 0.6 bits of entropy per consensus position, and so computes an effective number of sequences that is typically smaller than N, the number of sequences in the subset used to generate the HMM, which has the effect of reducing the HMM score per match. This causes longer alignments to require more hits to receive good scores, but as a side effect short sequences receive lower scores than expected. We ran UPP with this option turned off (labeled "UPP(Default,NoEntropy)"). ML trees were estimated on the alignments using RAxML under the JTT, LG, or WAG models of protein evolution (using ProtEST (5) to select the amino acid substitution model).
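For concreteness, the variants above map onto HMMER's documented hmmbuild weighting options: --wnone disables relative sequence weighting, and --enone disables the effective-sequence-number (entropy weighting) computation. The wrapper below is a sketch only; subset file names are placeholders.

```python
# Sketch of the hmmbuild variants discussed above, using HMMER's
# documented weighting options. Subset file names are placeholders.
import subprocess

def build_hmm(subset_alignment, hmm_out, reweight=True, entropy=True):
    """Build a profile HMM for one alignment subset.
    --wnone disables relative sequence weighting ("NoReweight");
    --enone disables effective-sequence-number weighting ("NoEntropy")."""
    cmd = ["hmmbuild", "--dna"]  # use --amino for protein subsets
    if not reweight:
        cmd.append("--wnone")
    if not entropy:
        cmd.append("--enone")
    subprocess.run(cmd + [hmm_out, subset_alignment], check=True)

build_hmm("subset1.aln", "subset1_default.hmm")                     # UPP(Default)
build_hmm("subset1.aln", "subset1_noreweight.hmm", reweight=False)  # NoReweight
build_hmm("subset1.aln", "subset1_noentropy.hmm", entropy=False)    # NoEntropy
```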
We found that while all PASTA variants resulted in alignments of comparable accuracy, RAxML maximum likelihood trees estimated on alignments from the PASTA variant that uses MUSCLE to merge subalignments were the most accurate (Fig. S3.1). We refer to this version as "PASTA-MUSCLE." We then compared PASTA-MUSCLE to alignments and trees computed using standard PASTA commands.
We present information on the external software we used in running PASTA. Each dataset was aligned (when possible) using Opal (2)