OrthoGNC: A Software for Accurate Identification of Orthologs Based on Gene Neighborhood Conservation

Orthology relations can be used to transfer annotations from one gene (or protein) to another. Hence, detecting orthology relations has become an important task in the post-genomic era. Various genomic events, such as duplication and horizontal gene transfer, can cause erroneous assignment of orthology relations. In closely-related species, gene neighborhood information can be used to resolve many ambiguities in orthology inference. Here we present OrthoGNC, a software for accurately predicting pairwise orthology relations based on gene neighborhood conservation. Analyses on simulated and real data reveal the high accuracy of OrthoGNC. In addition to orthology detection, OrthoGNC can be employed to investigate the conservation of genomic context among potential orthologs detected by other methods. OrthoGNC is freely available online at http://bs.ipm.ir/softwares/orthognc and http://tinyurl.com/orthoGNC.


Introduction
Currently, sequencing facilities are able to produce large amounts of gene and protein sequences in a short period of time. Hence, many complete genomes of organisms are available today for more in-depth comparative studies. A first step in comparative genomics is the identification of homologous and more specifically orthologous genes. Homologous genes (homologs) are originated from a gene in the last common ancestor. In 1970, Fitch classified homologs into orthologous and paralogous genes [1]. Orthologous genes (orthologs) are homologs that have evolved by speciation event in their last common ancestor. In contrast, paralogous genes (paralogs) are homologs that have evolved by gene duplication in their last common ancestor.
Identification of orthologs is more important and of great interest, since orthologs typically tend to share a similar function [2]. Thus orthology relations can be used to transfer functional annotations (including protein-protein interactions) to newly-sequenced genomes [3,4]. Moreover, by definition, only phylogeny of orthologs can reflect the true evolutionary history of the corresponding species correctly [5]. Therefore, only orthologs can be used to infer species phylogenies [6].
Despite the straightforward definition of orthology, the problem of assigning orthology is not trivial. Evolutionary events such as horizontal gene transfer (HGT) and gene loss often complicate the evolutionary history of genes. Hence, many methods and databases have been introduced to tackle the problem of orthology assignment. More than 40 methods and databases are listed in the ''Quest for Orthologs" website (http://questfororthologs.org/orthology_databases), which can mostly be classified into two major classes according to the approaches employed.
Methods such as OrthoStrapper [7], HOGENOM [8], LOFT [9], PhylomeDB [10], and OrthoReD [11] are based on phylogenetic analysis. Phylogeny-based methods seem to be more precise, with high specificity reported [7,12,13]. However, ambiguities in the inferred gene trees and species trees, as well as wrong placement of the root can lead to incorrect assignment of orthologs. In addition, these methods require large computational cost, making their usage impractical for large datasets [14][15][16].
The second class of methods usually employs a clustering algorithm on a weighted graph that is built from pairwise sequence similarities. Examples in this class include OMA [17], OrthoMCL [18], InParanoid [19], Proteinortho [20], and OrthoDB [21]. These methods, because of their tractability, have gained more popularity, particularly when used for large datasets. However, these methods are based on the molecular clock hypothesis by assuming that orthologous sequences are more similar and would fail to detect orthologs when the molecular clock hypothesis is violated [22]. Furthermore, HGT and convergent evolution as well as linage-specific gene loss ( Figure 1) can introduce false positive relations. Note that duplications prior to a speciation event (in-paralogs) and duplications after a speciation event (out-paralogs) [23] introduce one-to-many or many-to-many orthology relation, further complicating the process of orthology detection. Most similarity-based methods that employ clustering, present the orthology relations as ortholog groups instead of pairwise relations. As a result, these groups can contain in-paralogs and out-paralogs, making their usage inappropriate for studies such as phylogeny inference where one-to-one orthology relations are needed.
To increase the quality of inferred orthologs, some methods attempt to take genomic context into account [24][25][26][27][28][29]. Due to genome rearrangements, as well as gene gains/losses, genomic context is less conserved beyond the genus level. Nonetheless, it can be a strong evidence of orthology when found. Conservation of gene neighborhood can assist in distinguishing orthologs from out-paralogs [30,31]. Similarly, it can prevent misinterpretation of HGTs and genes with convergent evolution as orthologs. Another interesting advantage of employing genomic context is to reveal orthologous genes or proteins that share low sequence similarities [32]. These orthologs can be missed in clustering or homology inference steps due to trade-off in favor of specificity. Moreover, with the advent of next-generation sequencing, many closely related genomes are available today. As a result, comparative studies are extended to closely related species and even strains of a same species where genomic context is highly conserved.
In this paper, we extend the method of Jun et al. [29] in both homology detection and orthology detection to increase the accuracy of predictions. We propose OrthoGNC, a similarity-based method and a software that outputs highquality pairwise orthology relations based on gene neighborhood conservation. Moreover, OrthoGNC can be employed to investigate the conservation of genomic context among potential orthologs detected by other methods.

Method
Jun et al. tested a simple method based on gene neighborhood conservation to extract orthology relations in mammalian proteomes [29]. According to their method, two genes are orthologous if they are homologous and share at least one homologous neighbor in a neighborhood size of three upstream and three downstream genes. Also, homology between two genes are defined as Blastp E-value < 1eÀ5. We have extended this method in both homology detection and orthology detection. Similar to the aforementioned method, OrthoGNC performs three main steps to infer orthology relations: (i) computing pairwise sequence similarities, (ii) identifying homologous sequences, and (iii) inferring orthologs according to gene neighborhood conservation. However, in each step, parameters can be set to various values to provide desired output. These parameters can be easily adjusted by the user in a configuration file or using a user-friendly GUI. Moreover, in OrthoGNC, inference of orthology relation can be done in an iterative routine to produce more accurate and sensitive results.

OrthoGNC steps
OrthoGNC is implemented in Java and accepts both DNA and protein sequences in FASTA format. In addition, the order of Figure 1 An example for lineage-specific gene loss Suppose gene g has undergone a duplication event, resulting in two genes, g 1 and g 2, which is followed by a speciation event. Deletion of g 1 in lineage A and deletion of g 2 in lineage B can lead to the wrong assignment of g 2 from lineage A and g 1 from lineage B as orthologs. According to the duplication event in the last common ancestor, g 1 and g 2 are paralogs. Since duplication event occurs prior to speciation, g 1 and g 2 are out-paralogs. appearance of genes or proteins in the FASTA file must be the same as the order of their appearance in the underlying genomes.
Step 1 Like other similarity-based methods, OrthoGNC requires pairwise sequence similarities at the very beginning step. The wellknown and widely-used heuristic software BLAST is used to compute the similarity score [33]. OrthoGNC can run BLAST in multiple threads in parallel, thus improving the running time of the BLAST step significantly in multi-core CPUs. The number of simultaneous BLAST jobs can be set or auto detected according to the number of available cores and the input parameters of BLAST, such as E-value and substitution matrix, can be adjusted by the user.
Step 2 In the second step, OrthoGNC infers homology relations using pairwise similarity scores. Two sequences are assumed to be homologous if they share a significant sequence similarity (30%-35% for proteins as a rule of thumb according to Ref. [14]). To infer homology relation, OrthoGNC looks into BLAST hits to make sure that not only a certain amount of identical residues is matched but also a certain length of both sequences is covered. The minimum percentage of identical residues (T i ) and minimum percentage of coverage (T c ) can be adjusted by the user.
Step 3 In the third and last step, OrthoGNC extracts orthology relations from homologous sequences based on gene neighborhood conservation. Similar to the adaptive RBAH [20], OrthoGNC uses a ratio T b , 0 < T b < 1, to tolerate possible variances of molecular clock rate. This allows every homolog of a gene that has a score > T b Â score of the best homolog to be an orthology candidate for that gene. Afterward, genes and their candidate orthologs are investigated to determine the number of common homologous neighbors, which can be done via one of two predefined routines, namely One2One mapping and unique intersection. In the first routine, each of the 2n neighbors of a gene (n upstream and n downstream) is checked against its corresponding gene in the neighborhood of candidate orthologs to see whether they are homologs ( Figure S1A). If the number of homologous pairs exceeds a predefined threshold, then the gene and its candidate ortholog are bona fide orthologs. In the second routine, OrthoGNC counts the unique homologous matches between neighbors of a gene and neighbors of its candidate ortholog without considering co-linearity ( Figure S1B). Similar to the first routine, if the number of unique homologous matches exceeds a predefined threshold then the gene and its candidate ortholog are bona fide orthologs. Consideration of gene order in the first routine makes it more stringent while, in contrast, the second routine allows for local rearrangement and gene gains/losses. We also observed that other orthology detection parameters -maximum tolerance ratio (T b ), radius of neighborhood (N), and minimum number of common neighbors (T n ) -can also affect the number of inferred orthologs dramatically. By relaxing these parameters, OrthoGNC is able to find more orthologs; however, more false-positive relations would also be introduced. To maintain the precision of inferred orthologs, OrthoGNC identifies orthology relations iteratively, that is, the user can define more than one parameter set for multiple rounds of orthology inference. Accordingly, if in certain round of orthology inference with certain set of parameters, OrthoGNC finds an ortholog for a gene in a strain, it does not look for another ortholog of this gene in the same strain in the subsequent rounds. To clarify, suppose we find the orthology relation (g, x) in some round; in the next iterations with more relaxed parameters, new homology relations such as (g, y) and (h, x) might be introduced where y and h are in the same genomes as g and x, respectively. Inference of (g, x) using more stringent parameters implies that (g, y) and (h, x) are wrong unless (g, h) and (x, y) duplicated after the speciation event.
The number of orthology-detection rounds and the parameters in each round -T b , N, T n , and the neighbor investigation routine (NIR) to be used -can be easily configured by the user. Finally, the pairwise orthology relations are reported for each pair of input genomes.
Although the main objective of OrthoGNC is to deliver highly accurate and precise orthology relations, it can be combined with other methods to achieve higher recall. To this end, user can choose to combine the output of OrthoGNC with an arbitrary set of orthologs that is predicted by another method. In this case, if OrthoGNC is unable to find any ortholog for gene g from strain S, it outputs genes from S that are introduced as orthologs of g by the other method.

Benchmarking
Evaluation of orthology inference methods is not an easy task, because we do not know the true evolutionary history of genes. Recently, in an effort to standardize orthology benchmark [34], a public web service has been introduced to assess different methods on 66 species. Unfortunately, all of these species are evolutionary distant (beyond the genus level), making it inappropriate for our study. We thus compared OrthoGNC to other methods on both simulated and real data, notwithstanding that high conservation of genomic context is a strong and self-verifying criterion in orthology inference and has already been evaluated [29,35]. Four similarity-based methods that produce pairwise orthology relations are selected for comparison, including OMA [17], InParanoid [19], Proteinortho [20], and EGM2 [36]. In addition, OrthoGNC is also compared to Jun et al.'s method ( Figure S2), which is now a special case of OrthoGNC, where E-value = 1eÀ5, T i = 0%, T c = 0%, N = 3, T n = 1, T b = 0.0, and NIR = ''unique intersection".
OMA, InParanoid, and Proteinortho all use clustering for orthology inference, while EGM2 employs genomic context to perform iterative graph matching. The latest version of each software was acquired from their official website, and was run with default parameters. For OrthoGNC, we used different parameter configurations ( Table 1) to evaluate the effect of parameters chosen on ortholog inference in practice. We first used each configuration (Conf) to infer the orthologs in single rounds. Then, we used configurations 2-8 in sequence to infer the orthologs iteratively, with homology detection parameters fixed for all iterative rounds. We further show how OrthoGNC could be employed to refine the orthology relations that are predicted by other methods. Different parameter configurations (Table S1) have also been done to assess the impact of gene neighborhood conservation ( Figure S3).
For performance evaluation, we applied all methods on two datasets; a simulated proteome dataset and a prokaryotic proteome dataset. Prokaryotic genomes are known to be fluid [37], and many genes are subject to lose their ancestral order, due to significant amounts of rearrangements. We show that even in the presence of many rearrangements, genomic context is still highly informative in detecting accurate orthology relations for closely-related species.

Simulated data
In the absence of a gold standard, we have used the Artificial Life Framework (ALF) [38] to simulate a proteome set consisting of 30 species. ALF was previously employed for simulating bacteria-like and mammalia-like genomes to assess the impact of different evolutionary forces on orthology inference [39]. We used the same set of parameters that was used to generate bacteria-like genomes in an earlier work, by incorporating genome rearrangement event in addition to other predefined evolutionary events.

Evaluation
For the simulated dataset, we first count the number of correctly-predicted orthology relations (true positive; TP), the number of incorrectly-predicted orthologs (false positive; FP), and the number of missed orthology relations (false negative; FN). We then calculate the precision and recall for each method according to Eq. (1).
For real data, we cannot calculate precision and recall due to unknown orthology relations. Instead, we investigate how many genes for which OrthoGNC predicted an ortholog while each competing method was unable to find any orthologs or suggested another ortholog. To this end, we calculate three sets of orthology relations, namely, M Method , M 0 Method , and M 00 Method according to (2). For instance, to obtain M OMA, we look for every orthology relation (g, o) predicted by OrthoGNC, where OrthoGNC predicted ortholog o in species S for gene g, and OMA failed to predict any ortholog for g in S. For M 0 OMA , we look for every orthology relation (g, o) predicted by OrthoGNC, where OrthoGNC predicted ortholog o in species S for gene g and OMA predicted some other ortholog for g in S. For M 00 OMA, we look for every orthology relation (g, o 0 ) predicted by OMA, where OMA predicted ortholog o 0 in species S for gene g and OrthoGNC predicted some other ortholog for g in S. To put it simple, for a total number of |M 0 OMA | genes OrthoGNC predicted orthologs while for the same genes OMA predicted a total number of | M 00 OMA | other orthologs. Furthermore, we calculate the set U OrthoGNC of orthologous genes that were only predicted by OrthoGNC. We computed M Method , M 0 Method , M 00 Method , and U OrthoGNC for both the simulated data and real data For real data, in addition to calculating M Method , M 0 Method , M 00 Method , and U OrthoGNC , we built Venn diagrams using the online tool InteractiVenn [41] to provide an overall picture of the predicted orthologs for all methods. Pairwise orthology relations are introduced to InteractiVenn as a string value, in which two protein ids are separated with a delimiter character. For all predicted x-y orthology relations, we also add the y-x relations manually.

Results and discussion
In order to evaluate OrthoGNC and the effect of parameters chosen on predicting orthologs, we ran OrthoGNC using different configurations of parameters both in single rounds and iteratively (Table 1). In single rounds, we only used one parameter configuration for orthology inference, whereas in iterative mode, rounds of orthology inference were performed with parameters of Conf 2-8 sequentially. For example, in order to iteratively infer orthologs with Conf 4, three rounds of orthology inference with Conf 2-4 is done. If in a round of orthology inference, OrthoGNC finds an ortholog for gene g in a strain, it does not look for another ortholog of gene g in the same strain in the subsequent rounds.

Ortholog inferring performance of OrthoGNC on simulated data
The ortholog inferring performance on the simulated data using various methods is shown in Figure 2. Precision of OrthoGNC converges to one by choosing strict parameters as in Conf 1. However, the amount of predicted orthology relations decreases when parameters are set strictly to output stringent results. It is of note that, even in this case, OrthoGNC is able to find orthology relations that may not be found by any other method. As seen in Table 2 (M Method ), with the stringent parameter set of Conf 1, 2670 (and 99.96% of these are correct), 3750 (99.89%), 1648 (100%), and 37,893 (99.97%) orthology relations predicted by OrthoGNC were missed by Proteinortho, OMA, InParanoid, and EGM2, respectively. Moreover, with Conf 1, 332 (100% of these are correct) orthology relations (= |U OrthoGNC |) found by OrthoGNC are missed by all other methods. By relaxing homology inference parameters in Conf 2, OrthoGNC detects many more new orthology relations that are missed by other methods, although a small amount of false positives is included. This is because other methods use higher similarity cutoffs to maintain the precision, while in OrthoGNC, in presence of its gene neighborhood conservation criteria, the sequence similarity parameters can be relaxed to include more distant orthologs.
Another advantage of OrthoGNC is to distinguish the main ortholog [35] in the presence of multiple candidates. Accordingly, we were curious about the number of genes for which an ortholog with conserved neighborhood exist and are found by OrthoGNC while other methods predicted other orthologs. To this end, we calculated the sets M 0 Method and M 00 Method . As depicted in Table 2, with Conf 1, there are 936 orthology relations (g, g 0 ) in set M 0 InParanoid , such that OrthoGNC correctly predicted gene g 0 from species S as an ortholog for g, while InParanoid predicted some other ortholog(s) like o from S as an ortholog for g, resulting in 989 (co-)orthology relations like (g, o) in M 00 InParanoid . Within these 989 (co-)orthology relations in M 00 InParanoid that are not identified by OrthoGNC, only 45.29% are true positives. |M 0 Method | and |M 00 Method | are slightly better for Proteinortho, OMA, and EGM2 (Table 2). Again, note the increase in |M 0 Method | and |M 00 Method | when parameters of OrthoGNC are relaxed.
Also, it is shown that iterative inference of orthology relations can slightly improve the precision (green square in Figure 2). By relaxing parameters in subsequent rounds, OrthoGNC achieves higher recall while preserving the precision. This is more interesting for Conf 8, where the neighborhood conservation criteria were completely relaxed. In fact, it only detects orthologous genes that are reciprocally best hit (RBH). In single-round inference, RBH algorithm misses some of the orthologous genes that deviated from the molecular clock assumption. Moreover, not all RBHs are necessarily orthologous [22]. With iterative inference, these relations are dismissed if a relation satisfying the neighborhood conservation criterion exists in previous rounds. Corresponding values of |M Method |, |M 0 Method |, and |M 00 Method | for iterative orthology inference with Conf 2-8 are shown in Table 2.
Another interesting observation is that M Method and M 0 Method have consistently higher true-positive rate than M 00 Method (Table 2). This suggests that when OrthoGNC disagrees with another method on the orthologs of a gene, orthologs reported by OrthoGNC are generally more accurate than those reported, if any, by the other method.
The main objective of OrthoGNC is to deliver highly sensitive and precise orthology relations. However, as shown in Figure 2, OrthoGNC (Conf 7 single round) is superior to both OMA and Proteinortho in recall at almost the same level of precision. However, one might argue that the lower recall of OrthoGNC than InParanoid may result in a smaller Fmeasure. To achieve a higher recall, the user can choose to combine the output of OrthoGNC with any set of orthologs that is inferred by other methods as described in Methods. For instance, combination of OrthoGNC (Conf 7 iterative) with InParanoid improved recall, thus resulting in a higher F-measure (0.8570) than both OrthoGNC (0.7961) and InParanoid (0.8385) alone ( Figure 2).

Ortholog inferring performance of OrthoGNC on real data
For real data, we compare the inference output of various methods using Venn diagrams. Corresponding Venn diagrams for Conf 1, 2, 3 (iterative), and 8 (iterative) are depicted in Figure 2 Precision-recall plot of OrthoGNC and other methods on simulated data Precision and recall rates for OrthoGNC in single rounds and iteratively are compared with those using Proteinortho, OMA, InParanoid, EGM2, as well as combination of OrthoGNC and InParanoid. Parameter configurations for Conf 1-8 are listed in Table 1. Conf, configuration. Note: T i , minimum percentage of identical matches in a BLAST hit; T c, minimum percentage of coverage of query and subject sequences in BLAST hit; N, radius of neighborhood to be investigated; T n , minimum number of common neighbors (0 T n 2 * N); T b , maximum tolerance ratio from score of best hit (0 T b 1); NIR, neighborhood investigation routine. I stands for unique intersection and O stands for One2One mapping. Figure 3. We also computed M Method , M 0 Method , M 00 Method , and U OrthoGNC for this dataset ( Table 3). Our results show that although prokaryotic genomes are known to be fluid, gene neighborhood is still highly informative in detecting orthologs at genus level. In particular, some orthology relations are only detected by OrthoGNC. Moreover, orthology inference parameters as stringent as Conf 1 or 2 appear not necessary. Specifically, by changing NIR to ''unique intersection" in Conf 3, 39,148 more orthology relations are detected in comparison to Conf 2; out of which, 38,068 (97.24%) relations are also detected by at least three other methods ( Figure 3B and C). This observation confirms the high degree of local rearrangements in prokaryotic genomes [37]. Therefore, one can relax the gene neighborhood investigation method according to the genomes studied to allow more local rearrangements. Even with iterative inference by Conf 8 (Figure 3D), 132,448 out of 146,412 (90.46%) orthology relations detected by OrthoGNC are also predicted by at least three other methods, indicating the predictions by OrthoGNC agree well with the intersection of three other methods. Therefore, in addition to inferring accurate relations based on high conservation of genomic context, OrthoGNC is also able to infer much more relations that are consistent with other state-of-art methods.

Orthology refinement
In addition to orthology inference, OrthoGNC has a separate interface for refining a given set of orthology relations by investigating gene neighborhood conservation among them. For the input genomes (or proteomes) and a set of input orthologs provided by the user, OrthoGNC investigates the given relations to see whether they follow a user-defined degree of gene neighborhood conservation. As a result, the input relations get processed and saved into two separate files: one for relations that are supported by gene neighborhood conservation and one for the relations without the support. We tested this feature on orthologs predicted by the other four methods on both simulated and real data, by setting easy parameters to investigate a minimal gene neighborhood conservation (Evalue = 1eÀ02, T i = 0, T c = 0, N = 7, T n = 1, and NIR = ''unique intersection"). Percentage of correct orthologs for both supported and unsupported relations was calculated for the simulated data. As shown in Figure 4, the percentage of correctly-predicted relations are significantly higher for supported relations than unsupported ones.

Running time comparison
We also compared the running time of OrthoGNC with other methods. To do this, we ran all methods on two proteomes (M. ulcerans Agy99 and M. tuberculosis H37Rv) using a personal computer with an Intel Core i7-4702MQ 2.20 GHz processor and 6 Giga bytes of RAM. As Proteinortho automatically sets the number of concurrent threads to available cores, we manually set the number of threads to 8 for OMA and OrthoGNC to facilitate comparison; however, InParanoid and EGM2 accept no parameter on the number of concurrent threads. As shown in Table 4, only EGM2 is faster than OrthoGNC, probably because EGM2 uses its own heuristics instead of BLAST to compute the similarities.  It is worth mentioning that OrthoGNC can also be used to find co-linear blocks within species. Simply, by adjusting parameters to N = n, T n = 2n, and NIR = ''One2One Mapping", the predicted orthologs will be centered in syntenic blocks of size 2n + 1. Finding syntenic blocks is of a great interest [42][43][44], because genes residing in a syntenic block have been under evolutionary pressure and are more likely to interact and be co-expressed [45].

Conclusion
We have presented here OrthoGNC, a similarity-based software for detecting accurate orthology relations. To maintain higher accuracy, OrthoGNC is capable of inferring orthology relations in multiple rounds. OrthoGNC is very flexible and user-friendly in accepting user-defined parameters. Also, multithreaded implementation of OrthoGNC makes it fast and efficient for pipelines where high-quality orthology relations are needed. To achieve high specificity, OrthoGNC investi-gates genomic context of potential orthologs. Accuracy of OrthoGNC is validated by comparison against four competitive methods on both simulated and real data.
In addition to delivering accurate orthology relations, OrthoGNC can be employed to investigate the gene neighborhood conservation for refinement and assessment of other orthology inference methods.

Supplementary material
Supplementary material associated with this article can be found, in the online version, at https://doi.org/10.1016/j.gpb. 2017.07.002. Figure 4 Gene neighborhood conservation of predicted orthology using four other methods For each method, the stacked bar on the left side indicates the performance for the simulated data and the right one for the real data. The blue bars depict the number of orthology relations that follow the predefined parameters (E-value = 1eÀ02, T i = 0, T c = 0, N = 7, T n = 1, and NIR = ''unique intersection") and the gray bars indicate the number of orthology relations without gene neighborhood conservation. Precision of predicted orthologs on simulated data is provided in percentage on the bars for both supported (blue) and unsupported (gray) relations. Note: All methods were run on a personal computer with an Intel Core i7-4702MQ 2.20 GHz processor and 6 GB of RAM.