CIAlign - A highly customisable command line tool to clean, interpret and visualise multiple sequence alignments

Background: Throughout biology, multiple sequence alignments (MSAs) form the basis of much investigation into biological features and relationships. These alignments are at the heart of many bioinformatics analyses. However, sequences in MSAs are often incomplete or very divergent, which leads to poorly aligned regions or large gaps in alignments. This slows down computation and can impact conclusions without being biologically relevant. Therefore, cleaning the alignment by removing these regions can substantially improve analyses.

Results: We present a comprehensive, user-friendly MSA trimming tool with multiple visualisation options. Our highly customisable command line tool aims to give intervention power to the user by offering various options, and outputs graphical representations of the alignment before and after processing to give the user a clear overview of what has been removed. The main functionalities of the tool include removing regions of low coverage due to insertions, removing gaps, cropping poorly aligned sequence ends and removing sequences that are too divergent or too short. The thresholds for each function can be specified by the user and parameters can be adjusted to each individual MSA. CIAlign is complementary to existing alignment trimming tools, with an emphasis on solving specific and common alignment problems and on providing transparency to the user.

Conclusion: CIAlign effectively removes poorly aligned regions and sequences from MSAs and provides novel visualisation options. This tool can be used to improve the alignment quality for further analysis and processing. The tool is aimed at anyone who wishes to automatically clean up parts of an MSA and those requiring a new, accessible way for visualising large MSAs.

presence would increase the time required for phylogenetic analysis without necessarily adding any additional information. Large gaps in some sequences may also result from missing data, rather than true biological differences and, if this is known to be the case, it is often appropriate to remove these regions before performing phylogenetic analysis [18]. Thirdly, one or a few highly divergent sequences can heavily disrupt the alignment and therefore complicate downstream analysis. It is very common for an MSA to include one or a few outlier sequences which do not align well with the majority of the alignment. One example of this is metagenomic analyses identifying novel sequences in large numbers of datasets. It is common to manually remove phylogenetic outliers which are unlikely to truly represent members of a group of interest (see for example [19][20][21]) but this is not feasible when processing large numbers of alignments.

Finally, very short partially overlapping sequences cannot always be reliably aligned using standard global alignment algorithms. It is very common to remove these sequences, manually or otherwise, prior to further analysis.

There are also several common issues in alignment visualisation. Large alignments can be difficult to visualise and a small and concise but accurate visualisation can be useful when presenting results, so this has been incorporated into the software. With many alignment trimming tools it can be difficult to track exactly which changes the software has made, so a visual output showing these changes is generated.

Finally, transparency is often an issue with bioinformatics software, with poor reporting of exactly how a file has been processed [22][23][24]. CIAlign has been developed to process alignments in a transparent manner, to allow the user to clearly and reproducibly report their

CIAlign is a command line tool implemented in Python 3. It can be installed either via pip3 or from GitHub and is independent of the operating system. It has been designed to enable the user to remove poorly aligned regions and sequences from an MSA, to visualise the MSA (including a markup file showing which regions and sequences have been removed), and to interpret the MSA in several ways. CIAlign works on nucleotide or amino acid alignments and will detect which of these is provided. A log file is generated to show exactly which sequences and positions have been removed from the alignment and why they were removed. Users can then adjust the software parameters according to their needs.

CIAlign takes as its input any pre-computed MSA in FASTA format containing at least three sequences. Most MSAs created with standard alignment software will be of an appropriate scale, for example single or multi-gene alignments and whole genome alignments for many microbial species. Measurements on the runtime were conducted for MSAs created by

should be run first as they potentially make removing more positions unnecessary and therefore keep processing to a minimum. For example, divergent sequences often contain many insertions compared to the consensus, so removing these sequences first reduces the number of insertions which need to be removed. Sequences can be made shorter during processing with CIAlign, so sequences that are too short are removed last.

For each column in the alignment, this function finds the most common nucleotide or amino acid and generates a temporary consensus sequence. Each sequence is then compared individually to this consensus sequence. Sequences which match the consensus at a proportion of positions less than a user-defined threshold (default 0.75) are excluded from the alignment (Fig 1B). It is recommended to run the make_similarity_matrix function to calculate pairwise similarity before removing divergent sequences, in order to adjust the parameter value for more or less divergent alignments.
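The consensus comparison described above can be illustrated with a minimal sketch. This is not CIAlign's own code: the function names `column_consensus` and `remove_divergent` and the parameter `min_identity` are hypothetical, and two details are assumptions not specified in the text, namely that gaps are ignored when choosing the most common residue in a column and that the proportion of matches is calculated over the non-gap positions of each sequence.

```python
from collections import Counter

GAP = "-"


def column_consensus(seqs):
    """Build a temporary consensus from the most common residue per column.

    Assumption: gaps are ignored unless the column contains only gaps.
    """
    consensus = []
    for column in zip(*seqs):
        residues = [c for c in column if c != GAP]
        if residues:
            consensus.append(Counter(residues).most_common(1)[0][0])
        else:
            consensus.append(GAP)
    return "".join(consensus)


def remove_divergent(alignment, min_identity=0.75):
    """Keep sequences matching the consensus at >= min_identity of positions.

    `alignment` maps sequence names to aligned strings of equal length.
    Assumption: identity is measured over each sequence's non-gap positions.
    """
    consensus = column_consensus(list(alignment.values()))
    kept = {}
    for name, seq in alignment.items():
        positions = [i for i, c in enumerate(seq) if c != GAP]
        matches = sum(seq[i] == consensus[i] for i in positions)
        identity = matches / len(positions) if positions else 0.0
        if identity >= min_identity:
            kept[name] = seq
    return kept
```

Lowering `min_identity` in this sketch corresponds to relaxing the threshold for alignments of more distantly related sequences.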

Remove Insertions
In order to define a region as an insertion, an alignment gap must be present in the majority of sequences, flanked by a minimum number of non-gap positions on either side, which can be defined by the user (default 5). The minimum and maximum size of insertion to be removed can also be defined by the user (default 3 and 300 respectively) (Fig 1C).

with gaps. This will be described for redefining the sequence start, however crop ends is also applied to the reverse of the sequence to redefine the sequence end.

The number of gap positions separating every two consecutive non-gap positions is compared to a threshold and if that difference is higher than the threshold, the start of the sequence will be reset to that position. This threshold is defined as a proportion of the total sequence length, excluding gaps, and can be defined by the user (default: 0.05) (Fig 1D, Fig 2).

The user can set a parameter that defines the maximum proportion of the sequence for which to consider the change in gap positions (default: 0.1) and therefore the innermost position at which the start or end of the sequence may be redefined. It is recommended to set this parameter no higher than 0.1, since even if there are a large number of gap positions beyond this point, this is unlikely to be the result of incomplete sequences (Fig 2).
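The crop ends procedure described above can be sketched roughly as follows. This is an illustrative reading of the description rather than CIAlign's implementation: the names `find_new_start`, `crop_ends`, `gap_prop` and `window_prop` are hypothetical, and the sketch assumes that the search window is measured as a proportion of the non-gap residues, that the innermost qualifying position within the window is used, and that cropped residues are replaced with gap characters so the alignment length is unchanged.

```python
GAP = "-"


def find_new_start(seq, gap_prop=0.05, window_prop=0.1):
    """Return the index at which the sequence start should be redefined."""
    residues = [i for i, c in enumerate(seq) if c != GAP]
    if not residues:
        return 0
    n_residues = len(residues)
    max_gap = gap_prop * n_residues        # threshold, relative to ungapped length
    window = int(window_prop * n_residues)  # how far into the sequence to look
    start = residues[0]
    # Compare each pair of consecutive non-gap positions inside the window.
    for left, right in zip(residues[:window], residues[1:window + 1]):
        if right - left - 1 > max_gap:
            start = right                   # reset the start past the large gap
    return start


def crop_ends(seq, gap_prop=0.05, window_prop=0.1):
    """Crop both ends by applying the start search to the sequence and its reverse."""
    start = find_new_start(seq, gap_prop, window_prop)
    from_right = find_new_start(seq[::-1], gap_prop, window_prop)
    end = len(seq) - from_right             # exclusive end index
    if start >= end:
        return GAP * len(seq)
    return GAP * start + seq[start:end] + GAP * (len(seq) - end)
```

Applying the same function to the reversed sequence mirrors the statement in the text that crop ends treats the sequence end in the same way as the start.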

Remove short sequences
Remove short sequences removes sequences which have fewer than a specified number of non-gap positions, which can be set by the user (default: 50) (Fig 1E).
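As a minimal sketch of this filter, assuming the alignment is held as a mapping from sequence names to aligned strings with "-" as the gap character (the function name `remove_short` and the parameter `min_length` are hypothetical and are not taken from CIAlign):

```python
def remove_short(alignment, min_length=50, gap="-"):
    """Keep only sequences with at least `min_length` non-gap positions."""
    return {
        name: seq
        for name, seq in alignment.items()
        if sum(c != gap for c in seq) >= min_length
    }
```

Counting non-gap positions rather than the aligned length matters here because every row of an MSA has the same aligned length regardless of how much sequence it contains.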

Remove gap only columns
Remove gap only removes columns that contain only gaps. These could be introduced by manual editing of the MSA before using CIAlign or by running the functions above (Fig 1F). The main purpose of the function is to clean the gap only columns that are likely to be

Mini Alignments
CIAlign provides functionality to generate mini alignments, in which an MSA is visualised using coloured rectangles on a single x and y axis, with each rectangle representing a single nucleotide or amino acid (e.g. Fig 1, Figs 3-5). Even for large alignments, this function provides a visualisation that can be easily viewed and interpreted. Many properties of the resulting file (dimensions, DPI, file type) are parameterised. In order to minimise the memory and time required to generate the mini alignments, the matplotlib imshow function [25] for displaying images is used. Briefly, each position in each sequence in the alignment forms a single pixel in an image object and a custom dictionary is used to assign colours. The image object is then stretched to fit the axes.

consists of many sequences that align well, however there are again a few problems: one sequence has a large insertion, one is very short, one is extremely divergent, and some have multiple gaps at the start and at the end.

In order to test CIAlign on real biological sequences, an alignment was generated based on the COI gene commonly used in phylogenetic analysis and DNA barcoding [30]. As CIAlign addresses some common problems encountered when generating an MSA based on de novo assembled transcripts, which tend to have higher error rates at transcript ends, gaps due to difficult to assemble regions and divergent sequences due to chimeric connections between unrelated regions [11,32], COI-like transcripts were identified by searching the NCBI transcriptome shotgun assembly database. Aligning these transcripts demonstrated several common problems: multiple insertions, poor alignment at the starts and ends of sequences, and a few divergent sequences resulting in excessive gaps (Fig 5A). This alignment was parsed using the default CIAlign settings except the threshold for removing divergent sequences was reset to 50%, as some of the sequences were from evolutionarily distant species. Under these settings, CIAlign resolved several of the problems with the alignment: the insertions and highly divergent sequences were removed and the poorly aligned regions at the starts and ends of sequences were cropped (Fig 5B). One sequence and 6,029

Therefore the tree based on the CIAlign cleaned alignment was generated more quickly, used less memory, and was more similar to the expected tree.

While the functionality of CIAlign has some overlaps with other software, for example Jalview [34], Gblocks [7] and trimAl [8], the presented software can be seen as complementary to these, with some different features and applications. Jalview is designed for manual curation of alignments, but it is unsuitable for a simple overview of large alignments and does not provide the option of editing automatically, which is useful in large batch applications and ensures reproducibility. Gblocks is designed to choose blocks from an alignment that would be suitable for phylogenetic analysis, which is too restrictive for many other purposes. Some functionalities of trimAl overlap with those of CIAlign; however, trimAl is designed to algorithmically define and remove any poorly aligned regions whereas CIAlign is designed to remove specific MSA issues, as defined by the user, for different downstream applications. For highly divergent alignments, trimAl can be too sensitive and remove useful regions. CIAlign also provides additional visualisation options. Therefore, CIAlign should be seen as a tool that aims to fill in the gaps that exist in currently available software.

Having as many parameters as possible to allow as much user control as possible gives greater flexibility. However, this also means that these parameters should be adjusted, which