LINflow: a computational pipeline that combines an alignment-free with an alignment-based method to accelerate generation of similarity matrices for prokaryotic genomes

Background Computing genomic similarity between strains is a prerequisite for genome-based prokaryotic classification and identification. Genomic similarity was first computed as Average Nucleotide Identity (ANI) values based on the alignment of genomic fragments. Since this is computationally expensive, faster and computationally cheaper alignment-free methods have been developed to estimate ANI. However, these methods do not reach the level of accuracy of alignment-based methods. Methods Here we introduce LINflow, a computational pipeline that infers pairwise genomic similarity in a set of genomes. LINflow takes advantage of the speed of the alignment-free sourmash tool to identify the genome in a dataset that is most similar to a query genome and the precision of the alignment-based pyani software to precisely compute ANI between the query genome and the most similar genome identified by sourmash. This is repeated for each new genome that is added to a dataset. The sequentially computed ANI values are stored as Life Identification Numbers (LINs), which are then used to infer all other pairwise ANI values in the set. We tested LINflow on four sets, 484 genomes in total, and compared the needed time and the generated similarity matrices with other tools. Results LINflow is up to 150 times faster than pyani and pairwise ANI values generated by LINflow are highly correlated with those computed by pyani. However, because LINflow infers most pairwise ANI values instead of computing them directly, ANI values occasionally depart from the ANI values computed by pyani. In conclusion, LINflow is a fast and memory-efficient pipeline to infer similarity among a large set of prokaryotic genomes. Its ability to quickly add new genome sequences to an already computed similarity matrix makes LINflow particularly useful for projects when new genome sequences need to be regularly added to an existing dataset.


87
The Life Identification Number (LIN) system is a genome similarity-based system to classify 88 individual organisms based on reciprocal ANI (Marakeby et

96
To analyze the diversity of a collection of prokaryotic genomes, computing all pairwise 97 comparisons cannot be avoided by any of the above methods and their implementations. However, 98 when dealing with a large number of genomes, pairwise comparisons are computationally expensive.
99 Furthermore, because of the ever-growing number of sequenced genomes and their frequent addition 100 to existing datasets, re-analysis of datasets each time new genomes are added becomes necessary.

101
Here we alleviate the bottleneck of pairwise ANI computations by developing LINflow, a pipeline 102 that efficiently constructs highly resolved similarity matrices from 70% to 99.9% ANI by combining the

118
The LINflow flowchart is shown in Figure 1. When a new genome is added to the dataset, 119 LINflow identifies the genome that is most similar to this new genome among the genomes that were 120 previously added using the computationally efficient alignment-free tool sourmash (Brown & Irber 2016; 121 Pierce et al. 2019). This is accomplished via a two-step procedure, which consists of first identifying the 122 LINgroup to which the new genome belongs and then identifying the most similar genome in this 123 LINgroup. By default, LINflow uses the 95%-level LINgroup in this step, but this can be modified by the 124 user based on the expected genomic similarities in a specific dataset. The precise ANI value between the 125 two genomes is then computed using the more computationally expensive, but precise, pyani tool 126 (Pritchard 2014). LINflow uses this ANI value to assign a LIN to the new genome based on the LIN of its 127 most similar genome (which LIN was previously assigned based on the ANI value to its most similar 128 genome when that genome itself was previously added to the dataset). The assigned LINs can then be 129 used to infer all-against-all ANI values even though only a single ANI computation is performed for each 130 genome.

131
To make the results reusable and easily accessible in terms of reading and writing, a relational 132 database managed by MySQL (MySQL) is used to store data, with the schema shown in    149 ranges from 70% ANI to 99.999% ANI to cope with genus to strain level differentiation and is currently 152 3000-position scheme starting at 70% and reaching 99.99% at 0.01% intervals (recommended for 153 constructing highly resolved similarity matrices for strains belonging to the same species), (4) a 300,000-154 position scheme starting at 70% and reaching 99.99% at 0.00001% intervals. The user can also define 155 any custom LIN scheme with any number of positions using any ANI value of choice (however, it is not 156 advised to use ANI values below 70% since ANI does not reflect evolutionary relationships when ANI falls 157 below 70%).  164 The new genome's (G Query ) signature S Query will be first queried against the representative genomes of all 165 existing 95%-level LINgroups (or F LINgroups) with k=21, using sourmash. Based on the analysis of more 166 than 6000 bacterial genomes from different families with k=21, the Jaccard similarity of 0.2475 was 167 found to correspond to 95% ANI (data not shown). If the highest Jaccard similarity J to one of the 168 representative genomes S Query is above 0.2475, then the corresponding genome represents the 95%-169 level LINgroup (L 95% ), belongs to. If 0.0025 < J <= 0.2475, then the corresponding genome is at least 70% 170 similar to G Query , which means they share at least the A position in the default LIN scheme. If J <= 0.0025, 171 the corresponding genome is less than 70% similar to G Query .

If
, S Query will be queried against all members of L 95% by sourmash with k = 51. The > 0.2475 173 most similar genome G Subject according to Jaccard similarity in L 95% is identified as the most similar

If
, S Query will be queried against all the members of L 95% by sourmash 0.0025 < ≤ 0.2475 176 with k = 21. The most similar genome G Subject according to Jaccard similarity in L 95% is identified as the 177 most similar genome to G Query in the entire database.

178
For the above cases, ANI between G Query , and G Subject , ANI Query , will be calculated with pyani. To 179 assign the LIN to the query genome, G Subject 's LIN, LIN Subject , will be used as the reference from A to the 180 last position that the ANI threshold is lower than or equal to ANI Query , the first position that ANI 181 threshold is larger than ANI Query will be assigned a number that has not been used with the prefix in the 182 database, and the rest of the positions will be filled with 0s. For example, ANI Query = 95.4575%, it is over 183 95% at the F position but lower than 96% at the G position, so it will use LIN Subject from A to F as LIN Query 's 184 A to F. At LIN Query 's G position, a number that has never been used together with LIN Subject 's prefix from A 185 to G will be assigned. Each of LIN Query 's H to T positions will be assigned 0.

If
, no genome in the current database has over 70% ANI to G Query , so that a new ≤ 0.0025 187 number that has never been used in the A position before will be assigned to LIN Query 's A position, and 188 the rest of LIN Query will be filled with 0s.

189
Update of database and signature file system 190 G Query , G Subject , ANI Query , and LIN Query will all be written to the database. If LIN Query creates a new 95%-level 191 LINgroup, a new directory for this LINgroup will be created and S Query will be saved in this directory as a 192 member and as a representative genome with other representative genomes, otherwise, S Query will be 193 only saved in the existing 95%-level LINgroup it belongs to.