Fast and exact quantification of motif occurrences in biological sequences

Background Identification of motifs and quantification of their occurrences are important for the study of genetic diseases, gene evolution, transcription sites, and other biological mechanisms. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high computational complexity and are impractical to be used on large motif sets. Approximated formulae, e.g. based on compound Poisson, are faster, but reliable p value calculation remains challenging. Here, we introduce ‘motif_prob’, a fast implementation of an exact formula for motif count distribution through progressive approximation with arbitrary precision. Our implementation speeds up the exact calculation, usually impractical, making it feasible and posit to substitute currently employed heuristics. Results We implement motif_prob in both Perl and C+ + languages, using an efficient error-bound iterative process for the exact formula, providing comparison with state-of-the-art tools (e.g. MoSDi) in terms of precision, run time benchmarks, along with a real-world use case on bacterial motif characterization. Our software is able to process a million of motifs (13–31 bases) over genome lengths of 5 million bases within the minute on a regular laptop, and the run times for both the Perl and C+ + code are several orders of magnitude smaller (50–1000× faster) than MoSDi, even when using their fast compound Poisson approximation (60–120× faster). In the real-world use cases, we first show the consistency of motif_prob with MoSDi, and then how the p-value quantification is crucial for enrichment quantification when bacteria have different GC content, using motifs found in antimicrobial resistance genes. The software and the code sources are available under the MIT license at https://github.com/DataIntellSystLab/motif_prob. Conclusions The motif_prob software is a multi-platform and efficient open source solution for calculating exact frequency distributions of motifs. It can be integrated with motif discovery/characterization tools for quantifying enrichment and deviation from expected frequency ranges with exact p values, without loss in data processing efficiency.


Background
Motif discovery and characterization are important for the study of gene evolution, duplication, transcription sites, and protein identification [1], as well as of genetic diseases caused by unstable repeat expansion [2,3].
Assessing the statistical significance of motif enrichment is a fundamental and challenging step of motif discovery, and can severely hamper downstream analytics. Kiesel et al. [16] pointed out that p values "Small enrichment factors can occur frequently in practice simply due to an imperfect background model that slightly underestimates the expected frequency of occurrence". In addition, p values are crucial not only in the discovery phases, but also in motif comparison and motif-motif similarity studies [17]. The classical definition of the motif enrichment problem (in terms of differences among motifs occurrences within background genome contents) has been proven to be NP-hard [18]. The p value calculation is not straightforward, and requires making assumptions on a background model of base frequencies and co-occurrence in order to derive a distribution of motif occurrences in reference genomes [19]. Several formulae-approximated and exact-and algorithms for estimating motif count distributions have been devised and implemented [20][21][22][23][24][25][26][27][28]. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high computational complexity and are impractical to be used on large data sets. Approximated formulae, e.g. based on compound Poisson, are faster, but reliable p value calculation remains challenging [19,25]. Thus, methods for p value estimation can be a bottleneck in large-scale projects. HOMER, Weeder and Peak-motifs do not report motif statistical significance, MEME uses an approximation approach (very conservative), later improved by DREME and the new simple, thorough, rapid, enriched motif elicitation (STREME) [10,15], and MFMD uses information content score and complexity scores [29].
A software that provides a comprehensive occurrence and probability estimation is the bioinformatics toolkit for Motif Statistics and Discovery (MoSDi) by Marschall [30], written in Java, featuring models based on the approximated compound Poisson and nth level Markov order, as well as (quasi-)exact combinatorial formulae to reduce computational complexity (https:// bitbu cket. org/ tobia smars chall/ mosdi). Another tool is motifcounter [31], an R-Bioconductor library implementing existing methods [27,32], as well as an improvement on the compound Poisson model. One limitation of these programs is that calculation of occurrence distribution-even using the fast compound Poissonbecomes impractical with longer motifs (10+) and longer reference genomes (millions of bases), besides large motif datasets.
Prosperi et al. [28] provided an exact formula for counting the distribution of strings that do not overlap with themselves (i.e. non-clumpable), coupled with a mathematical demonstration of its validity, under both Bernoullian and Markovian assumptions. The calculation of the formula was exponential in the genome length by the length of the motif, but the authors demonstrated that it could be calculated efficiently within an arbitrary tolerance level.
This software article describes "motif_prob", a count distribution tool suitable for long motifs and long reference genomes, implementing the exact method by Prosperi et al. [28] with the efficient error-bound algorithm. In addition to the relevance of this software piece for large-scale processing, another motivation for our work is that the majority of probability distribution or p value calculators, even the most recent ones, use heuristics. To our knowledge, the formula by Prosperi et al. is still among the most efficient for exact calculation. The proposed motif_prob implementation thus makes exact quantification suitable with large scale projects, and posits to substitute currently employed heuristics. We compare motif_prob with other tools in terms of run time and precision, showing that its exact algorithm is several orders of magnitude faster even than the approximated methods, and finally we describe use cases for long motifs in bacteria.

Theoretical formulation
The exact formula by Prosperi et al. [28], for the calculation of the frequency distribution j of a string of length m within a text of length n (m < n) over alphabet k, under the Markovian model, is where P(S) = P(a 1 ) · P(a 2 | a 1 ) · … · P(a m−1 | a m ), P(S 0,n ) = P(S 0,n−1 ) − P(S) · P(S 0,n−m ), S 0,n = S 0,n−1 · k − P(S) · k m · S 0,n−m , d 1 … d j+1 are the lengths of the j + 1 segments where the j strings divide the text of length n in exact configurations with d 1 + ··· + d j+1 = n − mj, and Formula (1) has a complexity of O(n j ), which becomes quickly intractable. However, by defining R = P(S 0,n+1 )/P(S 0,n ) as a constant, Prosperi et al. show that for any positive (arbitrarily small) number ε, there is an index η ε such that for every η > η ε then By using this approximation, the summation of the original formula can be reduced to a single step, and calculations can be stopped when the ratio P(S 0,n )/P(S 0,n−1 ) reaches a desired level of tolerance ε. Specifically, after plugging the iterative approximation (3) in (1), we obtain the final formula We note that P(j, m, n) is the same irrespective of the position of the nucleotides in a query string, e.g. AACCC and CCCAA have the same probability. This property permits to extrapolate a probability for clumpable strings by permutation, e.g. ACCA into CCAA, although the value is not guaranteed given possible overlap. Another way is to replace the first or the last character with another one that has the same frequency. All details on the derivation of the exact formula and the proof for its progressive approximation, along with comparison against other state-of-art algorithms, can be found in the original work by Prosperi et al. [28].

Implementation
Two different implementations are produced: one in Perl and another in C++. Both programs take the same input and parameters, namely: (1) a query string or multiple strings to be analyzed; (2) the length of the reference genome; and (3) the nucleotide frequencies of the genome. In alternative to the genome length and nucleotide frequencies, a FASTA file containing the genome string can be passed as input to the program. The output file reports-for each motif-the count distribution and other summary information including a flag for clumped strings, string probability, and statistics on the precision and tolerance levels.
Since the computational complexity of the formula is exponential, motif occurrences are calculated at increasing counts until the occurrence probability becomes lower than given a tolerance level ε, or the upper limit of counts j is reached. We also control estimates at each iteration in order to avoid issues with floating point operations when frequency/length ratios diverge, and to handle relatively ill-posed configurations. Given the motif m and genome g lengths, one can set a tolerance level ε such that P(0, m, n) > (1 − ε), and in general each case where (1 − P(S)) (m−m+1) > (1 − ε). This is equal to (n − m + 1)•log(1 − P(S)) > log(1 − ε), which implies n > m − 1 + log(1 − ε)/log(1 − P(S)). In the source code, we have set ε to 10 −7 and j to 500. Further, we implement the calculation of the expected number of strings and the motif 's (stationary) occurrence probability at any text position, according to Robin et al. [33]. Figure 1 provides a flowchart of the data processing pipeline, showing the required input specifications, the method's internal parameters, and the output fields.
The source code, documentation, sample datasets, and executable files are available under the MIT license at https:// github. com/ DataI ntell SystL ab/ motif_ prob.

Results
An example of the occurrence distribution for motif query sequences of length 6, calculated on a randomly generated genome of 20,000 bases, varying the nucleotide frequencies, is illustrated in Fig. 2. The difference between the equiprobable base and the more general case is evident and demonstrates how the background distribution affects the p value calculation (see real-world use case after the benchmarks). Table 1 shows run time benchmarks on different motif length and motif set size configurations, executed on a laptop machine with Intel(R) Core(TM) i9-10885H CPU @ 2.4 GHz, 32 GB RAM. Both the Perl and the C++ programs exhibit run times several orders of magnitude smaller than MoSDi, even when the latter is executed with the fast compound Poisson approximation. We set a maximum processing time of 30 min for datasets up to 400,000 motifs, and MoSDi can process them only with smaller values of k and the approximated model, while the exact model is not feasible for most of datasets.
The C++ implementation is the fastest, and the expected run time increase due to higher motif lengths is well compensated by the implementation setup.
In terms of precision, we compare the exact probability values yielded by our program with both the compound Poisson and the exact estimates of MoSDi (allowing it to switch automatically to standard/doubling algorithms to improve run time). As previously described, usually the largest errors appears near the probability mass points [28]. For all motif lengths combinations of 4 bases, over a 10,000 bases reference genome, on average the peak probability values of MoSDi and motif_prob exact differ by two orders of magnitude, e.g. if the peak probability is in the range of 10 −2 then the observed absolute difference is 10 −4 . The difference with the compound Poisson approximation is larger, on average double than the exact, but the relative ratio it is still one-two orders of magnitude smaller than the actual values. The difference becomes smaller as the sequence lengths increase.
We further test the concordance among MoSDi and motif_prob using a real motif dataset, the library of DNA-binding site matrices for Escherichia coli (https:// arep. med.  . Once again, the differences with the approximated estimation are larger but of the same level of magnitude. Figure 3 illustrates the absolute difference in probability between motif_prob and MoSDi (exact/compound Poisson) as well as the relative magnitude difference, expressed as the log10(Prob motif_prob /abs(Prob motif_prob − Prob MoSDi )), which well highlights how the difference between the two exact methods (and the compound Poisson too, although larger) is negligible with respect to the actual probability estimates.  As a final use case, we investigate the distribution of frequencies of antimicrobial resistance gene signatures found in bacteria under different GC content. Drug resistance mechanisms in bacteria involve acquisition of genes, often via mobile genetic elements, and in some cases changes within core housekeeping genes. A number of algorithms use k-mers, i.e. motifs of fixed k length, to classify antimicrobial resistance [34], as they can be handled efficiently through ad hoc data structures suitable to process high-throughput data. But assessing the importance of a k-mer with respect to their frequency in drug resistance genes is not straightforward; one issue is that bacteria and genes can have very different GC content [35]. When the GC content varies, the probability distributions of motif occurrence can change over a broad range (given also the underlying, individual A, C, G, and T content), and thus the p values of over-or under-representation. To show how the quantification can have large variance, we analyze k-mers from antimicrobial resistance genes collected in the MEGARes 2.0 database [36]. MEGARes contains 7868 genes, with an average gene length of 1030.29 nucleotide bases, 57 different antibiotic resistance classes, and 220 distinct resistance mechanisms.
From MEGARes, we select all the 3911 genes conferring resistance to beta-lactamase; we then identify all 13-mers, for a total of 453,308 motifs (50% GC content). In Table 2, we show how the count probability distribution of the 13-mers in MEGARes' beta-lactamase genes changes among bacterial species present in the human microbiome of respiratory tract [37], where we select uniformly 18 species on the basis of their GC content. The median probability of finding the aforementioned 13-mers at least once varies between 93 and 99%, and even species with a similar GC content can show different medians and interquartile ranges, such as Stomatobaculum longum (55% GC content, median p = 97%) and Kluyvera intermedia (52% GC content, median p = 93%). This variability is due to: the individual nucleotide content, which can differ even when the GC content is the same, and it directly affects the distribution (see also Fig. 2); the genome length; and the nucleotide content of the query motifs.

Conclusion
The motif_prob software is a multi-platform, open source, efficient solution for calculating exact frequency distributions of (long) motif occurrences in reference genomes using high-throughput data. We showed how our code estimates are consistent with other, slower, exact calculations, and how the run times of our code (both Perl and C++) are competitive even with the non-exact compound Poisson approximation. Specifically, motif_prob is 50-1000× faster than MoSDi exact and 60-120× faster than MoSDi compound Poisson. The current implementation is limited to non-clumpable strings, although it extrapolates a probability for clumpable strings by permutation. As future development of our work we foresee to develop an exact formula for clumpable strings and to extend the approach to generalize over motifs that can include nucleotide changes, insertions or deletions.
In conclusion, our tool can be effectively used in conjunction with motif discovery suites that process high-throughput data, allowing them to compute exact count distributions and associated p values without loss of run time performance, instead of relying on to approximations.

Availability and requirements
Project name: motif_prob Project home page: https:// github. com/ DataI ntell SystL ab/ motif_ prob Operating system(s): Multi-platform (UNIX/Linux/Mac, Windows) Programming language: Perl, C++ Other requirements: None License: MIT Any restrictions to use by non-academics: Permissible under the terms of the MIT license.