AlignStat: a web-tool and R package for statistical comparison of alternative multiple sequence alignments

Background Alternative sequence alignment algorithms yield different results. It is therefore useful to quantify the similarities and differences between alternative alignments of the same sequences. These measurements can identify regions of consensus that are likely to be most informative in downstream analysis. They can also highlight systematic differences between alignments that relate to differences in the alignment algorithms themselves. Results Here we present a simple method for aligning two alternative multiple sequence alignments to one another and assessing their similarity. Differences are categorised into merges, splits or shifts in one alignment relative to the other. A set of graphical visualisations allow for intuitive interpretation of the data. Conclusions AlignStat enables the easy one-off online use of MSA similarity comparisons or into R pipelines. The web-tool is available at AlignStat.Science.LaTrobe.edu.au. The R package, readme and example data are available on CRAN and GitHub.com/TS404/AlignStat. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1300-6) contains supplementary material, which is available to authorized users.


Mathematical Implementation
Quantifying similarity and identifying the most comparable MSA columns When alignment algorithms make different homology predictions for a set of sequences, the columns of the resulting MSAs will contain different residues. The method for quantitative, pairwise comparison between one MSA with respect to a reference MSA is equivalent for nucleotide or amino acid sequences.
Each MSA of n sequences is treated as a matrix of characters (residues plus a gap character) with the same number of rows. The two matrices are therefore defined as (of dimensions n x p) and (of dimensions n x q), where each row represents an aligned sequence. Residues can occur multiple times in a sequence and so are numbered by occurrence such that each character in a row has a designation ( Figure S4). This ensures that alignment columns that contain a non-homologous occurrence of a residue are correctly distinguished.
The first step in comparing the two alternative MSAs is to identify the most comparable columns. For the matrices and , each column vector pair is compared to calculate the similarity measure defined in Equation 1 (where is the th column of , and is the th column of ). The similarity matrix, (of dimensions p x q), measures the similarity between the columns of and ( Figure S5). The matrix is visualised using the plot_similarity_heatmap function of the AlignStat R package.

Equation 1
Where Sij is the similarity score for each column pair between P and Q, the equivalency function ε is defined in Equation 2.

Equation 2
Quantifying dissimilarity For each column in we find its "match" in by finding the index j at which is maximized. The match column in is analysed in more detail to categorise the causes of difference. This leads to the dissimilarity matrix, (of dimensions n x p x 5) based on the function defined in Equation 3. This dissimilarity matrix categorises the types of difference between the reference and comparison alignments.

Equation 3
Where εk(a,b) is the k th equivalency function as defined in Equation 4.

Equation 4
The first equivalency (ε1) is a 'match', in which the two characters are identical and not gaps. The second equivalency (ε2) is a 'conserved gap', when the both characters are gaps. A 'merge' is when P contains a gap, but Q contains any other character (ε3). Similarly, a 'split' is when Q contains a gap, but P contains any other character (ε4). Finally, a 'shift' is when two characters are not identical and neither are gaps (ε5) ( Figure S6). The matrix is visualised using the plot_dissimilarity_matrix function of the AlignStat R package.

Summarising similarity and dissimilarity
For each of the matrix, the column averages are used to describe the sources of dissimilarity between the reference and comparison alignments. The results matrix (of dimensions 5 x p) therefore measures the average dissimilarities between columns of and as defined by the function in Equation 5 (Figure S7).

Equation 5
Where is the results matrix, each row of which is used to summarise a source of dissimilarity from the matrix.
A single, overall similarity score describes the weighted average similarity of the two MSAs, as defined in Equation 6. The treatment of gaps in MSAs is complex (Simmons and Ochoterena 2000;Egan and Crandall 2008). In this case, the most instructive measure is to exclude conserved gaps, to prevent results being skewed by the "similarity" of conserved gaps in low occupancy columns. Therefore, the overall score is the sum of the match characters as a proportion of characters that are not merely conserved gaps (Figure S8). The match row of the matrix ( ) is visualised using the plot_similarity_summary function of the AlignStat R package. The merge, split and shift rows of the matrix ( , and ) are visualised using the plot_dissimilarity_summary function of the AlignStat R package.

Equation 6
Overall similarity score

Figure S4 | Generating a numbered-character sequence alignment
A sequence alignment (A), where each row represents an aligned sequence, is converted to a matrix of numbered characters (P), where each character is numbered based on the occurrence of that character occurrence in that row. For each column of P, the columns of Q with the highest Sij is compared using the dissimilarity function defined in Equation 3 to generate the dissimilarity matrix (D), which describes the dissimilarity of Q with respect to P.

Figure S7 | Generating a results matrix
The columns of the dissimilarity matrix ( ) are summarised using the summary function described in

Equation 5
to generate a results matrix ( ). Each row of R describes the column averages of a different k of the dissimilarity matrix. The columns when k=5 are simply averaged (describing conserved gaps). All other columns are averaged and divided by the inverse of the column average of k=5 (describing average dissimilarities as a proportion of the column that is not conserved gaps).

0.69
Similarity score Figure S8 | Generating an overall similarity score The dissimilarity matrix ( ) is summarised using the summary function described in Equation 6 to generate a final weighted average score, which measures the overall similarity of the initial MSAs. The score is the weighted average of matching characters out of the total characters that are not conserved gaps.