Paralog Explorer: A resource for mining information about paralogs in common research organisms

Introduction
Genes that arise as a result of gene duplication are known as paralogs. When paralogs retain overlapping functions, they pose a particular challenge to functional analysis [1]. Specifically, while loss-of-function experiments have been enormously successful in characterizing individual gene function, this approach can fail when a target gene has a redundant or partially redundant paralog that can compensate in its absence. For example, large-scale studies in yeast provide evidence that, in aggregate, knocking out singleton genes (those without any paralog) tends to produce a stronger phenotypic effect than knocking out one member of a paralog pair, likely due at least partially to paralog-based redundancy [2]. Similarly, gene essentiality studies across numerous human cancer cell lines revealed that paralogs were far less likely than singleton genes to be essential [3,4]. Importantly, paralogs are not a minor curiosity in eukaryotic genomes. Gene duplication and functional divergence have long been recognized as one of the most fundamental and widespread sources of evolutionary novelty [5], and paralogs are extraordinarily widespread in genomes. In the human genome, for instance, 70.5 % of all genes are estimated to have at least one paralog [6]. Similarly, 67 % of Drosophila genes have been estimated to have at least one paralog [1]. While many paralogs have diverged functionally and encode proteins with unrelated or nonoverlapping roles, there is undoubtedly a large amount of functional diversity yet to be characterized that has thus far remained invisible to standard gene loss-of-function studies.
In recent years, it has become increasingly possible to perform double or multiplex loss-of-function experiments using scalable techniques such as double RNAi or CRISPR-based methods. To date, massively parallel double-knockout CRISPR approaches have been applied primarily to cell culture experiments, where it is possible to introduce dual-sgRNA libraries targeting tens of thousands of gene pairs [7][8][9][10][11][12][13], and dual and multiplex loss-of-function experiments are rapidly being developed for model organisms [14][15][16][17][18].
To facilitate functional studies of paralogs in model organisms, both in cell culture and in vivo, we have developed a simple but effective bioinformatic tool, Paralog Explorer (https://www.flyrnai.org/tools/paralogs/web/), to identify and explore paralogous genes. Paralog Explorer allows users to retrieve paralogs based on a single-gene or multi-gene query, across a wide range of sequence similarity, and provides relevant comparative information about the retrieved paralog pairs. Paralog Explorer is based on the Drosophila RNAi Screening Center Integrative Ortholog Prediction Tool (DIOPT), which was developed to identify orthologous genes between species using an integrative approach [19]. By focusing the DIOPT algorithm within, as opposed to between, species, Paralog Explorer identifies paralogs within a given genome. Further, the resource retrieves associated public data and annotations such as chromosomal location, gene ontology annotation, and protein-protein or genetic interactors, as well as expression data from various tissues and cell lines for Drosophila and human. By providing paralog predictions alongside information such as expression profiling datasets, Paralog Explorer can help researchers predict which paralogous genes might act redundantly or otherwise in concert with one another, and thus assist in designing targeted small- or large-scale experimental studies [4,11,12,13,20,21].

Paralog information
Paralog information was obtained from DIOPT database release 8 [19]. DIOPT integrates 17 existing algorithms/resources and uses a simple voting system for rapid identification of orthologs and paralogs among major model organisms. The organisms included in the Paralog Explorer resource are the nematode worm C. elegans, the fruit fly D. melanogaster, the mouse M. musculus, the zebrafish D. rerio, and human H. sapiens. Protein alignment information, including alignment length, percent similarity, and percent identity, was also imported from DIOPT. In addition, for the genes in each paralog pair, orthologs in more distantly related species were also analyzed: yeast orthologs for human, mouse, zebrafish, worm and Drosophila paralogous genes, and Drosophila orthologs for paralogous genes in vertebrates. The common orthologs shared by both genes in a pair were identified and stored in the database for display. Data files were exported from DIOPT in text format and were further processed using a local program. The output files were uploaded into a MySQL database.
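The voting idea described above can be sketched in a few lines. This is a minimal illustration, not the DIOPT implementation: the algorithm names and the toy predictions are hypothetical, and the sketch assumes that each algorithm contributes at most one vote per unordered gene pair.

```python
from collections import Counter


def diopt_style_scores(predictions_by_algorithm):
    """Count, for each unordered gene pair, how many algorithms predict it."""
    votes = Counter()
    for pairs in predictions_by_algorithm.values():
        # Sort each pair so (a, b) and (b, a) are the same prediction, and use
        # a set so that one algorithm votes at most once per pair.
        unique_pairs = {tuple(sorted(pair)) for pair in pairs}
        votes.update(unique_pairs)
    return dict(votes)


# Hypothetical algorithm names and toy predictions, for illustration only.
toy_predictions = {
    "algorithm_A": [("gcm", "gcm2"), ("ac", "sc")],
    "algorithm_B": [("gcm2", "gcm")],
    "algorithm_C": [("gcm", "gcm2"), ("ac", "l(1)sc")],
}
scores = diopt_style_scores(toy_predictions)
```

In this toy input, the pair supported by all three hypothetical algorithms receives the highest count, which is the sense in which a vote count can act as a confidence measure.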

Integration of omics datasets
For each gene in a paralog pair, we retrieved and integrated protein-protein interaction and genetic interaction data from MIST [22]. In addition, we identified the interactors in common for each paralog pair. Tissue- or cell line-specific expression datasets were also integrated. For each Drosophila paralog pair, modENCODE tissue-, developmental stage-, cell line-, and treatment-specific expression profiles provided by FlyBase were integrated [23][24][25][26]. For human paralog pairs, tissue-specific expression data from the GTEx Portal (https://gtexportal.org/home/) as well as expression data for 490 ATCC cell lines from the Cancer Cell Line Encyclopedia (CCLE) (https://sites.broadinstitute.org/ccle/) were integrated [27,28]. In addition, Pearson correlation coefficient scores were calculated for each dataset and synexpression analysis was performed.
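As a rough sketch of the two per-pair computations described above, the following shows a plain Pearson correlation between two expression profiles and a set intersection for shared interactors. The function names are hypothetical, and any normalization or transformation applied to the expression values before correlation is not specified in the text, so none is assumed here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    if variance_x == 0 or variance_y == 0:
        # A gene with a flat profile has no defined correlation.
        return float("nan")
    return covariance / math.sqrt(variance_x * variance_y)


def shared_interactors(interactors_a, interactors_b):
    """Interactors reported for both genes of a paralog pair."""
    return set(interactors_a) & set(interactors_b)
```

Each profile would hold the expression level of one gene across the samples of a single dataset (for example, one value per tissue or cell line), so a separate score is obtained per dataset.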

Integration of other annotation
The 'slim' versions of gene ontology (GO) annotations were retrieved from NCBI. For each gene pair, common GO slim terms were identified and stored. Genome coordinates were retrieved from NCBI EntrezGene. Phenotype annotations from FlyBase (r6.45) and gene group annotations from GLAD [29] were retrieved. These annotations are subject to periodic updates in Paralog Explorer.

Web-based tool development
The Paralog Explorer web tool (https://www.flyrnai.org/tools/paralogs/) can be accessed directly or found at the 'Tools Overview' page at the DRSC/TRiP Functional Genomics Resources website (https://fgr.hms.harvard.edu/tools). The backend was written in PHP using the Symfony framework, and the front-end HTML pages take advantage of the Twig template engine. The jQuery JavaScript library with the DataTables plugin is used for handling Ajax calls and displaying table views. The Bootstrap framework and some custom CSS are used for the user interface. A MySQL database is used to store the integrated information and analysis results (e.g., Pearson correlation coefficient scores for synexpression). Both the website and databases are hosted on the O2 high-performance computing cluster, which is made available by the Research Computing group at Harvard Medical School.

Curation of a human paralog test list
To generate a list of predicted human paralog pairs to test the reliability of Paralog Explorer, we downloaded a list of 3,132 non-redundant paralog pairs from the literature [30]. This list comprises two published datasets: 1,436 gene pairs from recent small-scale duplication events, and the rest from ancient whole-genome duplication events. From this list, we excluded 15 gene pairs that we could not confidently map to NCBI Entrez gene IDs, resulting in a total of 3,117 gene pairs in our human paralog test list.

Database content and user interface features
To build Paralog Explorer, we retrieved all paralog predictions from DIOPT for human, mouse, zebrafish, fly and worm (Fig. 1). The 'DIOPT score' is the number of algorithms (e.g., 7 out of 16 for human and Drosophila ortholog mapping) that support a given prediction, which we previously showed provides a measure of confidence in each prediction [19]. Protein alignment information, including the alignment length, percent similarity, and percent identity, was also imported from DIOPT. We find that 34-69 % of paralog pairs in Paralog Explorer are supported by 4 or more algorithms and 15-39 % have a score of 6 or higher (Table 1). We also imported Gene Ontology (GO) terms [31,32], protein-protein and genetic interaction data from MIST [22], expression data from publicly available databases such as modENCODE [24], GTEx [27] and CCLE [28], and phenotype data for Drosophila from FlyBase [33] (Fig. 1).
With the Paralog Explorer web tool, users can query a specific gene of interest, a list of genes, or any one of several precomputed gene lists from GLAD [29]. In addition, users can set a filter based on DIOPT score and, for Drosophila and human genes, can set a cut-off on the expression level of both genes in a pair in transcriptomic datasets from various tissues and cell lines. Altogether, the user interface is designed to allow users to address a variety of questions. These include very straightforward questions such as: does my gene of interest have one or more paralogs? Or, which of the genes in a list (e.g., hits from a genetic screen) have paralogs? Paralog Explorer also supports more complex queries such as: what are all the paralogous genes expressed in a given tissue or cell line? What are all the paralogous genes encoding transporters that are expressed at high levels in the adult digestive system? For each query, the Paralog Explorer web tool reports the total number of paralogs identified within a genome, each of which is shown on a separate line. For each paralog pair, Paralog Explorer displays information including the DIOPT score of the paralog, the genomic location of each gene, various measures of protein alignment, and the Gene Ontology (GO) annotation for each member of the paralog pair, as well as those GO terms common to both paralogs.
For each gene in a paralog pair, Paralog Explorer also reports the top-scoring ortholog(s) from the distantly related outgroup yeast (S. cerevisiae), if such orthologs exist. This allows users to assess whether both paralogs in an animal model correspond to a single ortholog in yeast, which may assist in generating functional hypotheses. Similarly, for all vertebrate organisms, Paralog Explorer returns the closest fly ortholog for each member of a paralog pair. For example, PTPN11 and PTPN6, a paralogous gene pair in humans, are both orthologous to csw in the Drosophila genome. This information can help to clarify whether a given paralog pair is the result of a lineage-specific gene duplication, or whether the duplication predated the divergence of these lineages [34].
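The combined score and expression filter described above can be sketched as follows. This is only an illustration of the filtering logic, not the tool's backend code: the `ParalogPair` record, the `filter_pairs` function, the gene names, and all values are hypothetical, and genes missing from the expression table are assumed to be unexpressed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParalogPair:
    gene_a: str
    gene_b: str
    diopt_score: int


def filter_pairs(pairs, expression, min_score, min_expression):
    """Keep pairs that pass the score filter and in which BOTH genes are
    expressed at or above the cut-off in the chosen tissue or cell line."""
    kept = []
    for pair in pairs:
        if pair.diopt_score < min_score:
            continue
        # Assumption: a gene absent from the table counts as not expressed.
        level_a = expression.get(pair.gene_a, 0.0)
        level_b = expression.get(pair.gene_b, 0.0)
        if level_a >= min_expression and level_b >= min_expression:
            kept.append(pair)
    # Highest-confidence predictions first.
    return sorted(kept, key=lambda p: p.diopt_score, reverse=True)


# Hypothetical pairs and expression levels for a single tissue.
pairs = [
    ParalogPair("geneA", "geneB", 6),
    ParalogPair("geneA", "geneC", 2),
    ParalogPair("geneD", "geneE", 9),
    ParalogPair("geneF", "geneG", 7),
]
tissue_expression = {"geneA": 12.0, "geneB": 30.5, "geneC": 50.0,
                     "geneD": 8.0, "geneE": 15.0, "geneF": 40.0}
selected = filter_pairs(pairs, tissue_expression, min_score=4, min_expression=5.0)
```

Requiring both genes to pass the expression cut-off mirrors the description above, since a pair in which only one member is expressed in a given tissue is a poor candidate for redundancy in that tissue.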
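The reasoning in this paragraph can be written as a small heuristic. The sketch below is an assumption-laden illustration rather than part of the tool: the function name and the three labels are hypothetical, and a shared outgroup ortholog is treated only as a hint about duplication timing, not as proof.

```python
def duplication_hint(outgroup_orthologs_a, outgroup_orthologs_b):
    """Heuristic label based on the outgroup orthologs of two paralogs."""
    orthologs_a = set(outgroup_orthologs_a)
    orthologs_b = set(outgroup_orthologs_b)
    if not orthologs_a or not orthologs_b:
        # Without an ortholog for both genes, nothing can be inferred.
        return "undetermined"
    if orthologs_a & orthologs_b:
        # Both paralogs map to the same outgroup gene, which is consistent
        # with a duplication after the two lineages diverged.
        return "lineage-specific candidate"
    # Each paralog has its own outgroup ortholog, which is consistent with
    # a duplication that predated the divergence of the lineages.
    return "ancient candidate"


# Example from the text: human PTPN11 and PTPN6 both map to fly csw.
hint = duplication_hint({"csw"}, {"csw"})
```

Gene loss in the outgroup or missed ortholog predictions can produce the same pattern, so such a label would be a starting point for a proper phylogenetic analysis.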
The tool also integrates several -omics datasets of protein-protein and genetic interaction, to identify genetic and physical interactors of each gene in the paralog pair. Previous research has shown that protein interactions can be conserved after gene duplication [35], and in some cases paralogous genes which share common protein interactors may be more likely to be functionally related. This information may therefore be useful when prioritizing paralogs for further study or designing functional experiments.
For two paralogs to have redundant or partially redundant function in the cell, they must be expressed in the same cells and at the same time. Thus, when generating such hypotheses, it can be very helpful to compare expression patterns between paralogs. To facilitate this, we integrated tissue-specific and cell line-specific RNAseq data from publicly available resources such as the GTEx and CCLE portals for human genes, as well as various modENCODE RNAseq datasets for Drosophila. Pearson correlation coefficient scores for co-expression are calculated separately for each dataset and are retrieved for users. For example, users can assess the co-expression of each human or Drosophila gene pair based on either a tissue-specific or a cell line-specific dataset. For each paralog pair, users also have the option to view the expression levels of each gene in the pair from various datasets side-by-side as a bar graph (Fig. 2).
Users have the option to view a list of interacting partners for each gene or a list of interacting partners common to both genes (Fig. 2). The choice of columns to be displayed can be customized by the user, and the results table can be exported as an Excel or tab-delimited text file so that the list can easily be filtered by a parameter of choice.

Application
Paralogs exist across a very broad range of evolutionary scenarios. In the conceptually simplest cases, a gene may have a single, evolutionarily recent paralog that is highly conserved at the sequence level, and perhaps located at an adjacent location in the genome. For example, the Drosophila zinc finger transcription factors gcm and gcm2 share 48 % similarity at the amino acid level, are located just 26 kb apart from one another on the second chromosome, and have been experimentally shown to retain partially redundant functions [36]. In many other cases, a gene duplication event, or the duplication of part or all of the genome, may have occurred deep in evolutionary history, creating complex gene families composed of related genes with varying degrees of sequence and functional similarity. For example, the Hox genes [37] and most of the major developmental signaling pathways [38] underwent duplication and diversification events very early in animal evolution, leading to a scenario today where all metazoan genomes contain varying copy numbers of each member of these gene families.
In still other cases, gene families may have dramatically expanded in certain animal lineages creating exceptionally large gene families with dozens or even hundreds of members, such as the over 900 odorant receptors encoded in the mouse genome [39].
Thus, in order to be useful to researchers with various interests, Paralog Explorer should quickly, accurately, and comprehensively identify paralogs at many different scales of similarity and genomic organization, and allow the user to investigate and rank the resulting hits based on their specific research context.
We sought to test the usefulness of Paralog Explorer to identify and characterize paralogs in three typical contexts, representing a range of gene similarity and paralog number: (1) amongst recently diverged, highly conserved pairs/triplets of paralogs; (2) amongst modestly sized gene families that duplicated and diverged early in animal evolution and have been conserved as such in modern genomes; and (3) in a large gene family containing many dozens of paralogs.
To test the usefulness of Paralog Explorer on relatively simple cases, we examined a recently published list of 25 paralog pairs or triplets in the Drosophila genome that are closely related and physically linked in the genome, and for which there is evidence of transcriptional co-regulation via shared enhancers [40]. For each gene pair or triplet investigated by Levo et al., we used Paralog Explorer to identify all predicted paralogs and ranked the results by DIOPT score. The results are presented in Table 2.
In 23 of 25 cases, Paralog Explorer identified the same top-scoring paralog as was identified by manual curation [40] (Table 2), and in the remaining two instances, additional examination provided an explanation. Among the former 23 cases, Paralog Explorer returned the predicted paralog as the best-scoring DIOPT hit and allowed the viewer to quickly confirm the chromosomal location of each gene, as well as to ascertain the co-expression patterns of the gene pairs in multiple high-throughput modENCODE datasets. In several instances, Paralog Explorer identified additional high-ranking paralogs that were not listed by Levo et al. but which appear to be bona fide paralogs. For example, bowl is a closely related paralog of drm, sob, and odd, and is also located in the same genomic region. Similarly, comm3 is closely related to comm and comm2, and is located in the same genomic region (Table 2). Importantly, the existence of these additional paralogs may or may not reflect functional conservation, but Paralog Explorer allows researchers to systematically identify such genes for further study.
In addition to identifying the correct paralog as the top-scoring hit, Paralog Explorer also provides additional information that may be of interest. For nearly every gene query, Paralog Explorer identified a number of additional paralogs at varying degrees of similarity (Table 2). These results can be ranked by DIOPT score and/or by amino acid similarity, measures that are highly correlated with one another and serve as loose proxies for evolutionary conservation. Moreover, a user can also quickly determine whether such paralogs are physically linked in the genome and quickly access high-throughput co-expression datasets via hyperlinks.
Regarding the two cases for which Paralog Explorer did not return the same top hit as was identified via hand curation: in one case, ac and sc, Paralog Explorer identified an additional paralog, l(1)sc, as the top hit for ac, and sc as the second-highest hit. Thus, in this case, Paralog Explorer revealed biologically relevant information. In the other case, pyr and ths, Paralog Explorer failed to return this pair because the current algorithms integrated by the DIOPT database do not identify this pair as paralogs due to the low homology of FGF ligands [41], despite the fact that they are reciprocal best BLAST hits with one another in the Drosophila genome (E-value e-08, 33 % amino acid identity). Because Paralog Explorer is based on the DIOPT database, this error was propagated.
To extend these observations beyond Drosophila, we turned to a curated set of 3,117 human paralog pairs that includes paralogs across a wide range of sequence similarity and presumed duplication age (see Methods and [30]). For each of these paralog pairs, we entered the first gene as a query in Paralog Explorer and asked whether the literature-predicted paralog appeared as the first, second, or third highest-scoring DIOPT hit. Paralog Explorer identified the predicted paralog among the three top-scoring hits in 3,059 cases (98.1 %). In 2,301 of these cases (73.8 %), the predicted paralog was the top DIOPT hit, in 616 cases (19.8 %) it was the second-highest hit, and in 142 cases (4.6 %) it was the third-highest hit. In 55 cases (1.8 %), the predicted paralog was identified but ranked below third. We observed just three cases (0.1 %) for which the predicted paralog was not identified at all. Subsequent evaluation suggested that two of these might be nomenclature-related issues, while the other one belongs to a large family of zinc-finger proteins containing over 100 members (Supplemental file). Altogether, the results of our analysis with a curated set of human paralog pairs demonstrate that Paralog Explorer reliably identifies known paralogs.
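The rank-based evaluation described above can be sketched as follows. The helper `classify_rank` is hypothetical and assumes the hits for a query are already ordered from highest to lowest DIOPT score (how ties are broken is not stated in the text). The counts in `reported` are the ones given in this paragraph, and the percentages are recomputed from them.

```python
def classify_rank(ranked_hits, expected):
    """Place the expected paralog into one of the reporting categories."""
    try:
        rank = ranked_hits.index(expected) + 1  # 1-based rank
    except ValueError:
        return "not found"
    return {1: "first", 2: "second", 3: "third"}.get(rank, "below third")


# Counts reported in the text for the 3,117-pair human test list.
reported = {"first": 2301, "second": 616, "third": 142,
            "below third": 55, "not found": 3}
total = sum(reported.values())
percentages = {label: round(100 * count / total, 1)
               for label, count in reported.items()}
top_three = reported["first"] + reported["second"] + reported["third"]
top_three_percent = round(100 * top_three / total, 1)
```

Recomputing the percentages from the raw counts is a quick consistency check: the five categories sum to 3,117, and the top-three total of 3,059 corresponds to 98.1 %.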
Many genes belong to "gene families" comprised of multiple paralogs that duplicated and diverged at varying points during evolution, rather than to a simple pair or triplet of recently duplicated, highly similar paralogs. For example, the TGF-β genes are a family of secreted signaling ligands that arose and diversified very early in animal evolution, and today are present in varying numbers of paralogous genes in metazoan genomes; in Drosophila, there are seven TGF-β genes. Phylogenetically, the seven Drosophila ligands fall into three groups: the BMP-family ligands dpp, gbb, and scw, the Activin-family ligands daw, myo, and Actb, and the mav gene, which does not cleanly fall into either sub-family [42]. We searched Paralog Explorer using the canonical Drosophila ligand dpp, and successfully recovered all six paralogous ligands (Fig. 3). Furthermore, we noted that DIOPT scores between paralogs were generally reflective of the phylogenetic structure of the gene family [43] (Fig. 3). For example, gbb and scw display the highest DIOPT score (5), and both individually score next-highest to dpp, resembling the taxonomic structure of these three BMP-family ligands. However, we emphasize that DIOPT scores do not directly reflect phylogenetic relationships, and can depart significantly in cases where there has been significant evolutionary change along a specific branch. For example, based on DIOPT score alone, the Actb gene is most closely related to daw (DIOPT score = 4) and equally similar to myo and the other four ligands (DIOPT score = 2), whereas phylogenetic analyses reveal that Actb falls into a monophyletic Activin-like group with both daw and myo, and is more closely related to both of these two paralogs than it is to the remaining four [43] (Fig. 3).
We expanded our search of gene families to include several other highly conserved signaling pathways: the seven Drosophila Wnt ligands, the three JAK/STAT ligands (upd genes), the three Pvf ligands, and the five Spatzle ligands. Each of these gene families plays important roles during development, each one expanded very early in animal evolution, and each family has been expanded and/or contracted in various animal lineages. For each, we entered a single family member into Paralog Explorer, and in 100 % of these examples Paralog Explorer correctly returned the entire family of related paralogs (Table 3). We note that these gene families contain a broad range of divergence amongst family members, demonstrating that Paralog Explorer is able to robustly and accurately predict the full suite of paralogs for a given gene across a wide range of evolutionary divergence and amongst complex gene families. For the Wnt family ligands, we repeated the exercise of comparing DIOPT scores to known phylogenetic relationships [44] (Fig. 3). Again, dominant phylogenetic patterns of sequence conservation were reflected by DIOPT scores, while not precisely mirroring the known phylogenetic relationships. Specifically, reciprocal DIOPT scores identified wg, wnt6, and wnt4 as closely related, and a close relationship between wnt2 and wnt5, while the divergent wntD gene stood out as distinct from all other family members, all of which is reflective of known phylogenetic patterns [44]. Importantly, as with the example of TGF-β ligands shown above, results from Paralog Explorer should not be interpreted as directly reflective of phylogenetic relationships or functional conservation, but can provide potentially helpful information to generate hypotheses about genetic similarity amongst paralogs.
We then wished to know how well Paralog Explorer performed on very large gene families. We chose the odorant receptor (Or) gene family, of which there are 60 members in the Drosophila genome, as well as one pseudogene [45]. Remarkably, using Or1a as our query, Paralog Explorer returned exactly 59 paralogs, only failing to return the single pseudogenic member of this family noted in [45]. Thus, even in the case of highly expanded gene families such as the Or genes, Paralog Explorer correctly identifies all known paralogs.
In addition to providing users with the ability to identify paralogs for individual queries or lists, Paralog Explorer also has the potential to assist in large-scale bioinformatic analyses. To demonstrate one such use case, we compared the paralog annotation with a synthetic lethality screen using CRISPR-Cas9 dual targeting [46] in human cell lines. Out of 406 heterogeneous gene pairs, 21 pairs are annotated as paralogs in Paralog Explorer. Furthermore, 9 of the 21 paralog gene pairs (43 %) scored as synthetic lethal interactors with one another in at least one cell line by the criterion of FDR < 0.1, while only 20 out of 385 other gene pairs (5 %) scored. Paralogous gene pairs are much more likely to score in functional screens than are pairs of unrelated genes, and not surprisingly, more recent studies are focused on paralog gene pairs rather than randomly selected gene pairs [7,9,10,11,12,13]. Thus, we expect that Paralog Explorer will facilitate the experimental design of high-throughput screens and the mapping of functionally related genes.
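The comparison above can be reproduced from the counts given in the text. The hit rates follow directly from those counts; the one-sided hypergeometric tail probability is an added illustration of how the enrichment could be quantified and is not a test reported in the source. The function name `hypergeom_upper_tail` is hypothetical.

```python
from math import comb


def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) when `draws` items are sampled without replacement
    from a population containing `successes` positive items."""
    total = comb(population, draws)
    upper = min(draws, successes)
    favourable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return favourable / total


# Counts taken from the text.
paralog_pairs, paralog_hits = 21, 9
other_pairs, other_hits = 385, 20

paralog_rate = paralog_hits / paralog_pairs   # fraction of paralog pairs scoring
other_rate = other_hits / other_pairs         # fraction of other pairs scoring

# Probability of seeing at least 9 scoring pairs among 21 pairs drawn at
# random from all 406 pairs, of which 29 scored (illustrative only).
p_value = hypergeom_upper_tail(
    population=paralog_pairs + other_pairs,
    successes=paralog_hits + other_hits,
    draws=paralog_pairs,
    observed=paralog_hits,
)
```

Under random sampling only about 1.5 of the 21 paralog pairs would be expected to score (21 × 29 / 406), which is why the observed 9 stands out.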
There is an important caveat to Paralog Explorer, which is likely common to all paralog prediction methods. Because hypotheses of paralogy are primarily drawn from sequence conservation, gene queries which contain individual protein domains that are highly conserved may return many putative paralogs, based on the presence of shared domains across proteins that are otherwise only distantly-related. Furthermore, the sequence length of these conserved domains will impact paralog predictions, such that longer domains are more likely to score higher while relatively short domains may not reach the threshold to score via DIOPT. While the presence of shared domains across proteins may in fact reflect a true evolutionary history of gene duplication, from a practical standpoint it can lead to complex results that require sophisticated manual analysis.
As one example, we examined the Drosophila Hox genes, which are transcription factors that include the highly conserved homeodomain, a ~60 aa domain that is widespread across many transcription factors [37]. Inputting the anterior-most Drosophila Hox gene lab as a query, Paralog Explorer returns 23 predicted paralogs. These include the other Hox genes themselves (pb, Dfd, Scr, Antp, Ubx, abd-A, Abd-B), as well as the paralogous homeodomain proteins bcd, zen, and zen-2 that are located nearby in the genome, all at a DIOPT score of 2. In addition, however, Paralog Explorer returns as its highest hit the homeobox gene ro (DIOPT score 3), which is not considered a Hox gene, as well as 12 additional genes, all but one of which are known homeobox-containing genes. Importantly, this list is not comprehensive of all homeobox-containing genes. FlyBase identifies a total of 102 homeobox genes in the Drosophila genome, indicating that this hit list includes only those that reach some similarity threshold based on DIOPT scores. Thus, for genes that contain specific highly conserved domains found in many genes, users should carefully analyze the results when forming hypotheses about paralogy.
Users of Paralog Explorer can rank paralogs based on the DIOPT score, which serves as a proxy for protein sequence similarity. Importantly, the tool also provides additional valuable ways to analyze paralog relationships aside from sequence similarity. For example, to generate hypotheses about which paralog pairs are likely to have overlapping functions, users can perform co-expression analysis, as genes showing synexpression (i.e., high degrees of correlated co-expression) often operate in similar pathways and/or processes. We analyzed the relationship between similarity in protein sequence and similarity in expression pattern by comparing the DIOPT scores and the percentage of gene pairs with high synexpression scores (Pearson correlation coefficient score > 0.5) calculated based on cell line RNAseq datasets. RNAseq data from single cells or groups of homogeneous cell populations has been shown to be better than tissue-based datasets for synexpression analysis [47]. We found that gene pairs with higher sequence similarity were more likely to have higher synexpression scores for both Drosophila and human paralog pairs (Fig. 4). We emphasize that paralog pairs that have high synexpression scores but are not necessarily the absolute top DIOPT-scoring pairs can also be functionally related. For example, the human genes DUSP4 and DUSP6 have been demonstrated to be functional paralogs that share a digenic dependence in MAPK pathway-driven cancers [9]. Both genes are part of a larger gene family. Based on DIOPT scores alone, the highest-ranking paralogs for DUSP4 are DUSP1 and DUSP10, while DUSP6 ranks third. However, DUSP4 displays a higher synexpression score with DUSP6 than with any other DUSP, reflecting the fact that they are often co-expressed.
We examined three other well-characterized human paralog pairs (SMARCA2/SMARCA4, DDX3X/DDX3Y, and STAG1/STAG2) [48][49][50] and found that the well-characterized paralog pairs are the top-ranked for: both DIOPT and synexpression (SMARCA2/SMARCA4); synexpression but not DIOPT score (STAG1/STAG2); and DIOPT score but not synexpression (DDX3X/DDX3Y). Thus, we designed Paralog Explorer to allow users to rank paralog candidates according to multiple rubrics and thus to generate context-specific hypotheses about functional relevance (Table 4).
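The comparison of DIOPT scores with synexpression can be sketched as a simple grouped fraction. The function name and the toy input values below are hypothetical and are not data from Paralog Explorer; only the 0.5 threshold comes from the text.

```python
from collections import defaultdict


def fraction_synexpressed_by_score(pairs, threshold=0.5):
    """For each DIOPT score, the fraction of paralog pairs whose Pearson
    correlation exceeds the synexpression threshold."""
    totals = defaultdict(int)
    high = defaultdict(int)
    for diopt_score, correlation in pairs:
        totals[diopt_score] += 1
        if correlation > threshold:
            high[diopt_score] += 1
    return {score: high[score] / totals[score] for score in sorted(totals)}


# Toy (DIOPT score, Pearson correlation) values, for illustration only.
toy_pairs = [(2, 0.10), (2, 0.70), (2, -0.20), (2, 0.30),
             (6, 0.80), (6, 0.40), (6, 0.90), (6, 0.65)]
fractions = fraction_synexpressed_by_score(toy_pairs)
```

Plotting such fractions against the DIOPT score is one way to visualize whether more confidently predicted paralogs tend to be co-expressed more often.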

Discussion
Paralog Explorer is a tool that allows users to quickly and reliably identify paralogs of any gene(s) of interest, as well as relevant measures of their similarity, genomic location, co-expression patterns, genetic and protein interactions, and GO terms. It is important to note that identifying two or more genes as paralogs is a hypothesis about their evolutionary history (i.e., that they arose via gene duplication) rather than about their molecular function or whether they may be functionally redundant. Thus, we designed Paralog Explorer to be a flexible search tool that will allow researchers with diverse interests to generate hypotheses about paralogous genes.
We have shown that Paralog Explorer can reliably and robustly identify known paralogs across a wide range of sequence similarities. We emphasize that there is no "one size fits all" approach to deciding which paralogs are relevant for different biological questions. For this reason, Paralog Explorer allows users to rank results based on a number of measures, including DIOPT score, sequence similarity, chromosomal location and synexpression scores. We have shown that the DIOPT score is often a useful, though very coarse, proxy for phylogenetic proximity, and have also provided several examples for which the 'functionally relevant' paralog may not necessarily be the top-scoring DIOPT hit. Paralog Explorer is designed to accommodate this wide range of biological realities and to provide users with easily accessible bioinformatic information to help generate hypotheses.
We note that Paralog Explorer is built based on predictions in DIOPT, which in some instances may fail to include certain paralogs that are validated by experimental data or published literature. As DIOPT is updated and improved via user-submitted data, Paralog Explorer will be updated accordingly.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 4. The correlation of similarity in protein sequence and expression pattern was analyzed by comparing the DIOPT scores and the percentage of paralog pairs with high synexpression scores (Pearson correlation coefficient score > 0.5) calculated based on cell line RNA-seq datasets. We observed that gene pairs of higher sequence similarity were more likely to have a synexpression pattern for both Drosophila and human paralog pairs.