KaKs_Calculator 3.0: Calculating Selective Pressure on Coding and Non-coding Sequences

KaKs_Calculator 3.0 is an updated toolkit that is capable of calculating selective pressure on both coding and non-coding sequences. Similar to the nonsynonymous/synonymous substitution rate ratio for coding sequences, selection on non-coding sequences can be quantified as the ratio of non-coding nucleotide substitution rate to synonymous substitution rate of adjacent coding sequences. As testified on empirical data, KaKs_Calculator 3.0 shows effectiveness to detect the strength and mode of selection operated on molecular sequences, accordingly demonstrating its great potential to achieve genome-wide scan of natural selection on diverse sequences and identification of potentially functional elements at a whole-genome scale. The package of KaKs_Calculator 3.0 is freely available for academic use only at https://ngdc.cncb.ac.cn/biocode/tools/BT000001.


Introduction
Detecting natural selection on molecular sequences is of fundamental significance in molecular evolution, comparative genomics, and phylogenetic reconstruction, which can provide profound insights for revealing evolutionary processes of molecular sequences and unveiling complex molecular mechanisms of genome evolution [1]. In principle, estimating selection on DNA sequences requires a reference set of substitutions that is free from selection. As synonymous substitutions do not provoke amino acid changes due to the degeneracy of the genetic code, they are expected to be invisible to selection and thus widely used as a reference that reflects the neutral rate of evolution [2]. Consequently, the ratio of nonsynonymous substitution rate (Ka or d N ) to synonymous substitution rate (Ks or d S ), namely, x = Ka/Ks (or d N /d S ), is widely adopted to differentiate neutral mutation (x % 1) from negative (purifying) selection (x < 1) and positive (adaptive) selection (x > 1), accordingly providing a powerful tool for illuminating molecular evolution of coding sequences (see a popular package in [3]).
Nowadays, a growing body of evidence has shown that non-coding sequences, historically thought as ''junk" due to few knowledge on their function relative to coding sequences, are recognized as functional elements to play important regulation roles in multiple biological processes [4] and associate closely with various human diseases [5][6][7]. Albeit less conserved by comparison with coding sequences, a larger number of non-coding sequences have been identified highly conserved across mammalian genomes [8][9][10]. Importantly, more noncoding sequences are subject to positive selection and negative selection than previously believed, and particularly, long noncoding RNA (lncRNA) sequences do experience natural selection [11]. As a result, several computational methods have been proposed for the detection of selection acting on non-coding sequences [12], which primarily differ in how to choose a reference of unconstrained evolution, such as, synonymous substitutions of neighboring coding gene [13], intron sequences [14,15], and ancestral repeats [16]. However, there lacks of an implemented algorithm to detect the strength and mode of selective pressure on non-coding sequences, particularly considering an increasing number of non-coding studies conducted worldwide. More importantly, an integrated toolkit that is capable of detecting selection on both coding and non-coding sequences is highly desirable, which would help users achieve genome-wide scan of natural selection on diverse sequences.
Toward this end, here we present KaKs_Calculator 3.0, an updated toolkit for calculating selective pressure on both coding and non-coding sequences. Compared with previous versions [17,18] that focus solely on coding sequences, we implement an algorithm in KaKs_Calculator 3.0 that employs synonymous sites of adjacent coding sequences as a reference to estimate selective pressure acting on noncoding sequences. We test it on empirical data and demonstrate its utility in diagnosing the strength and form of molecular evolution.

Algorithm
The major update of KaKs_Calculator 3.0 is to incorporate an algorithm that is capable of estimating selective pressure on non-coding sequences. Specifically, it uses synonymous substitutions as a reference baseline (similar to [13]), which, albeit thought to be under weak selection [19][20][21], has been widely adopted for determining the strength and type of selection operated on coding sequences [22][23][24][25][26][27][28][29]. Similar to the Ka/Ks ratio for coding sequences, selective pressure on non-coding sequences (n) can be quantified as the ratio of non-coding nucleotide substitution rate (Kn) to neutral substitution rate (assumed as Ks), viz. n = Kn/Ks, where Ks is inferred from adjacent coding sequences. As the number of observed substitutions is less than the number of real substitutions, we adopt a nucleotide substitution model (e.g., JC/K2P/HKY) to correct multiple substitutions of non-coding sequences. Taking the HKY model [30] as an example, therefore, Kn can be deduced from the observed transitional and transversional substitutions (S and V, respectively) as well as four nucleotide frequencies (p A , p T , p G , and p C ) , according to Equation (1) (see Equations 1.27 and 1.28 in [31]).
where a ¼ Àlog½1 À To detect and quantify selection on non-coding sequences, KaKs_Calculator 3.0 provides users with two ways to obtain the value of neutral mutation rate or Ks, which is either calculated from adjacent coding sequences uploaded by users or just specified in a straightforward manner by users ( Figure 1). As a consequence, KaKs_Calculator 3.0 is capable of detecting selection on both coding and non-coding sequences.
KaKs_Calculator 3.0 is implemented in standard Cþþ language, enabling higher efficiency and easy compilation on different operation systems (Linux/Windows/Mac). In addition to the new functionality for estimating selection on noncoding sequences as mentioned above, it is also updated by fixing bugs and errors. The package of KaKs_Calculator 3.0, including compiled executables, a Windows application with graphical user interface (GUI), source codes, and example data, accompanying with detailed instructions and documentation, is freely available for academic use only at BioCode (https://ngdc.cncb.ac.cn/biocode/tools/BT000001), an opensource platform for archiving bioinformatics tools in the National Genomics Data Center (NGDC) [32], China National Center for Bioinformation.

Application on empirical data
To test KaKs_Calculator 3.0, we choose three empirical lncRNA genes that are extensively studied according to LncRNAWiki [7] and collect their human-mouse orthologs as well as their adjacent coding orthologs from NGDC LncBook [33] and National Center of Biotechnology Information (NCBI) RefSeq [34]. Specifically, these non-coding and coding gene symbols with accession numbers are: 1) H19 (NR_002196. According to the ratio (n) of non-coding nucleotide substitution rate to adjacent synonymous substitution rate, we reveal that, although the coding genes undergo strong purifying selection (x < 1), these three non-coding genes present diverse selective pressure (Table 1). Strikingly, HOTAIR exhibits positive selection (n > 1), whereas the rest two genes experience negative selection (n < 1). HOTAIR is a $ 2.3-kb intergenic RNA transcribed from the antisense strand of the HOXC gene cluster [36]. The result of positive selection detected on HOTAIR relative to HOXC12 is consistent well with previous findings that HOTAIR evolves faster than the neighboring genes [37]. On the contrary, MALAT1, a $ 8.7-kb noncoding RNA flanked by the highly conserved kinase-like gene SCYL1, is ubiquitously expressed in almost all human tissues, evolutionarily conserved across mammalian species [38], and associated with various cancers [39]. Thus, n = 0.464 indicates strong selective constraint on MALAT1, in accordance with its physiologic and pathophysiological function [40] and conserved RNA structure [41] as documented by previous studies. Likewise, H19, a $ 2.3-kb imprinted maternally expressed transcript located near MRPL23, is known for close association with Beckwith-Wiedemann Syndrome and also involved in tumorigenesis [42]. Our result shows that H19 presents stronger selection constraint as indicated by n = 0.296, conforming well with its conserved sequence and structure [43]. It is worth noting that one non-coding sequence may have multiple adjacent coding genes, which are specified by users and thus can lead to different estimates of Ks and n. Taken together, KaKs_Calculator 3.0 is effective in estimating natural selection on non-coding sequences, which has the potential to reveal evolutionarily selective pressures operated on diverse molecular sequences.
In addition, to test the running performance of KaKs_ Calculator, we collect an empirical large dataset that contains 15,424 human-mouse orthologous genes retrieved from RefSeq [34] and obtain their codon-based alignments by ParaAT [44] -a parallel tool for constructing multiple protein-coding DNA alignments. KaKs_Calculator 3.0 includes ten computational methods for detecting selection on coding sequences, which fall into approximate methods and maximumlikelihood methods. We choose three approximate methods, NG [23], YN [26], and MYN [27], and one maximumlikelihood method, GY [25], and test on a 64 bit x86 Intel Core i7 machine containing 4 CPU cores with each 3.40 GHz and running Windows 10. For this large-scale data analysis, we find that NG, YN, and MYN all take $ 2 min and GY takes $ 11 h, clearly showing that approximate methods are more timeefficient than maximum-likelihood ones. Considering that different users may have different preferences, it should be noted, however, that maximum-likelihood methods are believed to achieve higher accuracy and that different methods adopt different models and strategies and thus can lead to different estimates [45] (see an example in [46] where contradictory findings are produced by different methods).

Discussion
KaKs_Calculator 3.0 is significantly updated by achieving the detection of natural selection on non-coding sequences as well as coding sequences. As testified on empirical data, it is of great utility in calculating natural selection on molecular sequences, thus identifying potentially functional elements at a genome-wide scale. Future developments include the detection of selective pressure on small peptides (less than 300 nucleotides) that are encoded by small open reading frames within non-coding sequences [47][48][49] as well as the implementation of codon-based alignment procedure to help users generate input sequences in an easy-to-use manner.

Competing interests
The author has declared no competing interests.

Acknowledgments
I would like to extend special thanks to Lina Ma for constructive suggestions and discussions on this work and Zhao Li for valuable help on data collection and test. I also thank Zhuojing Fan for designing the logo as well as Qing Guo and Lin Dai for fixing a bug on Windows GUI. I am extremely grateful to a number of users for reporting bugs and sending comments since the first release of ORCID ORCID 0000-0001-6603-5060 (Zhang Zhang)