Data and programs in support of network analysis of genes and their association with diseases

This article describes the network-based approaches used to depict the relationships between human genetic diseases and their associated genes. To this end, monopartite disease-disease and gene-gene networks were constructed from bipartite gene-disease association networks. The bipartite networks were created by collecting and integrating data from three resources with different content, ranging from rare monogenic disorders to common complex diseases. Topological and clustering graph analyses were also performed. The methodology and programs presented here relate to the research article entitled “Network analysis of genes and their association with diseases” [1].



Value of the data
This study further highlights the need to integrate complementary data from different sources into biological networks.
Important, previously unknown, associations between genes and diseases were revealed. Based on the constructed disease-disease networks, diseases with apparently distinct phenotypic manifestations were found to share a common genetic background. This finding could be utilized in network pharmacology.

Data
The overall data analysis procedure is illustrated in Fig. 1, where the Perl (Supplementary Files 1-5) and R (Supplementary File 6) programs used for the analysis are indicated. A complete description of the data and methodology is presented in [1].

Data collection
Disease-gene association data were collected and integrated from three diverse, publicly available, comprehensive resources (NCBI's OMIM [2], NIH's GAD [3] and the NHGRI GWAS Catalog [4]). As a given disease can be associated with more than one gene, a Perl script was written to separate the multiple entries (Supplementary File 1; separate.pl).
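The separation step can be sketched as follows. This is a minimal Python illustration, not the authors' separate.pl: the tab-separated `disease<TAB>genes` layout, the comma and semicolon delimiters, the function name `split_associations`, and the sample records are all assumptions made for the example.

```python
import re

def split_associations(lines, delimiter_pattern=r"[,;]"):
    """Expand 'disease<TAB>gene1, gene2' records into one (disease, gene) pair per gene.

    The input layout and the delimiters are assumptions for illustration only.
    """
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or "\t" not in line:
            continue  # skip blank or malformed records
        disease, genes = line.split("\t", 1)
        for gene in re.split(delimiter_pattern, genes):
            gene = gene.strip()
            if gene:  # ignore empty fields such as a trailing delimiter
                pairs.append((disease.strip(), gene))
    return pairs

# Illustrative records (not taken from the article's datasets)
records = ["Breast cancer\tBRCA1, BRCA2", "Cystic fibrosis\tCFTR"]
for disease, gene in split_associations(records):
    print(f"{disease}\t{gene}")
```

Each output line then carries exactly one disease-gene pair, which is the edge-list form that a bipartite network needs.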

Disease and gene nomenclature
To maintain a consistent nomenclature and classification for diseases in our analysis, the naming conventions of the International Classification of Diseases (ICD) were used. The disease terms from the three databases were converted to ICD terms using a Perl script (Supplementary File 2; ICD.pl). Similarly, to maintain a uniform gene nomenclature across all datasets, all genes from the three databases, along with those from UniProtKB [5], were converted to the official HGNC (HUGO Gene Nomenclature Committee) [6] gene symbols using a Perl script (Supplementary File 3; Hugo.pl).
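The gene-symbol conversion can be pictured as a lookup from any known alias to the official symbol. The sketch below is a Python illustration under stated assumptions, not the authors' Hugo.pl: the function names, the case-insensitive matching, and the small alias table are hypothetical, and a real run would build the table from an HGNC download.

```python
def build_symbol_lookup(official_to_aliases):
    """Map every alias, and the official symbol itself, to the official symbol.

    Matching is case-insensitive (an assumption of this sketch). If two
    official symbols share an alias, the later entry silently wins.
    """
    lookup = {}
    for official, aliases in official_to_aliases.items():
        for name in [official, *aliases]:
            lookup[name.upper()] = official
    return lookup

def normalize_genes(genes, lookup):
    """Return (converted, unmapped): official symbols in input order,
    plus the input names that had no match."""
    converted, unmapped = [], []
    for gene in genes:
        official = lookup.get(gene.strip().upper())
        if official is None:
            unmapped.append(gene)
        else:
            converted.append(official)
    return converted, unmapped

# Toy alias table for illustration only
alias_table = {"TP53": ["P53", "LFS1"], "ERBB2": ["HER2", "NEU"]}
lookup = build_symbol_lookup(alias_table)
converted, unmapped = normalize_genes(["p53", "HER2", "ERBB2", "UNKNOWN1"], lookup)
print(converted, unmapped)
```

Returning the unmapped names separately, rather than dropping them, makes it possible to review entries that need manual curation. The same pattern would apply to mapping free-text disease names to ICD terms.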

Network processing and analysis
The bipartite networks of gene-disease associations were converted to monopartite networks of gene-gene and disease-disease interactions using a Perl script (Supplementary File 4; Bipartite.pl). This functionality is not available in other network analysis packages, so we incorporated it in a publicly available web server, PowerClust, at http://www.compgen.org/tools/powerclust. PowerClust is an easy-to-use web application for clustering analysis, network processing and visualization. Moreover, randomization procedures were performed to determine whether the highly connected nodes in the original networks have a degree that cannot occur simply by chance, given the other properties of the networks (Supplementary File 5; Random.pl). Finally, the robustness of the topological features of the projected gene-gene and disease-disease networks was assessed by employing a bipartite-specific rewiring algorithm [7] to test whether the degree distributions of the projected monopartite networks remain stable in the randomized gene-gene/disease-disease networks compared to the initial ones (Supplementary File 6; Rewire.R). The JOINT gene-disease network (generated by combining data from the individual databases) is provided as a Cytoscape network file.
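The projection and randomization ideas can be sketched in Python as follows. This is an illustration under stated assumptions, not a port of Bipartite.pl, Random.pl or Rewire.R. The projection links two nodes of one type when they share at least one neighbour of the other type, weighted by the number of shared neighbours. The randomization is a generic degree-preserving edge swap, which is not necessarily the algorithm of [7]. The function names and the toy edge list are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations

def project(pairs, onto="disease"):
    """Project (disease, gene) pairs onto one node type.

    Returns {frozenset({a, b}): number of shared neighbours}.
    """
    neighbours = defaultdict(set)
    for disease, gene in pairs:
        if onto == "disease":
            neighbours[gene].add(disease)   # diseases grouped by shared gene
        else:
            neighbours[disease].add(gene)   # genes grouped by shared disease
    weights = defaultdict(int)
    for group in neighbours.values():
        for a, b in combinations(sorted(group), 2):
            weights[frozenset((a, b))] += 1
    return dict(weights)

def rewire(pairs, n_swaps, seed=0):
    """Randomize a bipartite edge list while preserving every node's degree.

    Each step picks two edges (d1, g1), (d2, g2) and tries to replace them
    with (d1, g2), (d2, g1). The swap is rejected if it would create an
    edge that already exists.
    """
    rng = random.Random(seed)
    edges = list(dict.fromkeys(pairs))      # drop duplicate edges, keep order
    if len(edges) < 2:
        return edges
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (d1, g1), (d2, g2) = edges[i], edges[j]
        new1, new2 = (d1, g2), (d2, g1)
        if new1 in present or new2 in present:
            continue                        # rejected swap
        present -= {edges[i], edges[j]}
        present |= {new1, new2}
        edges[i], edges[j] = new1, new2
    return edges

# Toy bipartite edge list (hypothetical identifiers)
toy = [("D1", "G1"), ("D2", "G1"), ("D2", "G2"), ("D3", "G2"), ("D3", "G3")]
original = project(toy, onto="disease")
randomized = project(rewire(toy, n_swaps=100), onto="disease")
print(len(original), len(randomized))
```

Repeating the rewiring with different seeds and comparing the degree distributions of the projected networks against the original projection gives a simple check of whether the observed topology is stable under randomization.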

General Secretariat for Research and Technology of the Greek Ministry of Education and Religious
Affairs, Culture and Sports.

Transparency document. Supporting information
Transparency data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.dib.2016.07.022.