Dataset of the frequency patterns of publications annotated to human protein-coding genes, their protein products and genetic relevance

We present data concerning the distribution of scientific publications for human protein-coding genes together with their protein products and genetic relevance. We annotated the gene2pubmed dataset (Maglott et al., 2007) provided by the NCBI (National Center for Biotechnology Information) with publication years, genetic metadata corresponding to Online Mendelian Inheritance in Man (OMIM) (Hamosh et al., 2005) entries and the frequency of their appearance in Genome-Wide Association Studies (GWAS) (Buniello et al., 2019) provided by the European Bioinformatics Institute (EBI), using the KNIME® Analytics Platform (Berthold et al., 2008). The results of this data integration process comprise two datasets: 1) a dataset containing information on all human protein-coding genes that can be used to analyse the number of scientific publications in the context of the potential disease relevance of the individual genes; 2) a table with the annual and cumulated number of PubMed entries. For further interpretation of the data presented in this article, please see the research article ‘Target 2035 - probing the human proteome’ by Carter et al. (2019), https://doi.org/10.1016/j.drudis.2019.06.020.


Data
The main dataset that is provided as Supplementary File 1 and contains a list of:

Specifications Table
Subject: Chemical biology
Specific subject area: Data integration and mapping
Type of data: Tab separated files
How data were acquired: Download, integration and filtering of publicly available datasets in a KNIME workflow.
Data format: Filtered, Summarised
Parameters for data collection: Information about publications on human genes was identified via the gene2pubmed dataset. Relevance for disease phenotype was assessed via GWAS catalog and OMIM.
Description of data collection: Publicly available datasets from the NCBI (gene2pubmed, mim2gene_medgen, gene_info) and the EBI (GWAS catalog) were downloaded. Data from the GWAS catalog were filtered for p-value thresholds and, after mapping of Ensembl gene identifiers and gene symbols to NCBI gene identifiers via BioMart, all datasets were integrated and summarised in KNIME.

Data source location
The data were gathered from the National Center for Biotechnology

Value of the data
The data will be useful to analyse the research activity (as evidenced by scientific publications) on human protein-coding genes and their protein products. The dataset also provides information on the potential disease relevance or phenotype of the individual genes. As shown by Carter et al. [5], the data indicate that researchers tend to focus on a relatively small, already well-studied fraction of the proteins coded by the human genome despite evidence that many understudied proteins are potentially important for human disease phenotypes.
The analysis of these data allows the identification of genes that are understudied despite a link to disease phenotypes or an association with specific disease traits. This could stimulate research and promote the development of pharmacological tools to interrogate the understudied proteins encoded by these genes.
Dataset entries have been tagged with a number of different ID types. This allows mapping to other datasets as a basis for generation of further insights. Moreover, we are also publishing the KNIME workflow that has been used to compile and integrate the data. This will allow researchers to reproduce an updated dataset at any future point in time.
The data also provide access to the frequency of scientific publications on an annual or cumulated basis (Table 2), with overall PubMed entries per year, PubMed entries related to any genes and PubMed entries related to human genes. The gene-related data were additionally restricted to only protein-coding genes.

Table 2
Overall publication counts in PubMed with or without restriction to human and/or protein-coding genes since 1980. All PubMed: Number of all publications in PubMed for a given year. All genes: Number of all publications in gene2pubmed for a given year. All other columns are subsets of this column according to their title and based on entries in gene2pubmed.

Experimental design, materials, and methods
We downloaded the gene2pubmed dataset [1,6] and all other datasets mentioned below on 25 March 2019 to derive information on publications annotated to each gene. All of the data integration steps were carried out using a KNIME [4] workflow (KNIME version 3.5.3) that is included as Supplementary File 2 with this publication.
To annotate all PubMed identifiers (PMID) with the corresponding publication year we used an internal PubMed index. Alternatively, this could also be done using the NCBI E-Utilities [7]. The resulting list was used to generate overall counts of PMIDs per year and was joined to the initial gene2pubmed dataset. The resulting dataset was filtered for human (Homo sapiens) genes via taxonomy identifier 9606.
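The annotation, counting and species filter described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the KNIME workflow itself: the column names `#tax_id`, `GeneID` and `PubMed_ID` are assumed to match the gene2pubmed header, while the sample rows, the `pmid_to_year` lookup (standing in for the PubMed index) and the function names are hypothetical.

```python
from collections import Counter

HUMAN_TAX_ID = "9606"  # NCBI taxonomy identifier for Homo sapiens

# Hypothetical sample rows in the assumed gene2pubmed layout.
gene2pubmed = [
    {"#tax_id": "9606", "GeneID": "1", "PubMed_ID": "100"},
    {"#tax_id": "9606", "GeneID": "1", "PubMed_ID": "101"},
    {"#tax_id": "10090", "GeneID": "2", "PubMed_ID": "101"},
    {"#tax_id": "9606", "GeneID": "3", "PubMed_ID": "102"},
]

# Hypothetical PMID -> publication year lookup (made-up values).
pmid_to_year = {"100": 1990, "101": 1995, "102": 1995}


def annotate_with_year(rows, year_lookup):
    """Join each gene2pubmed row to its publication year via the PMID."""
    annotated = []
    for row in rows:
        year = year_lookup.get(row["PubMed_ID"])
        if year is not None:  # rows without a known year are dropped
            annotated.append({**row, "year": year})
    return annotated


def filter_human(rows):
    """Keep only rows annotated to Homo sapiens (taxonomy ID 9606)."""
    return [row for row in rows if row["#tax_id"] == HUMAN_TAX_ID]


# Overall number of PMIDs per publication year.
pmids_per_year = Counter(pmid_to_year.values())

# Year-annotated gene2pubmed rows restricted to human genes.
human_rows = filter_human(annotate_with_year(gene2pubmed, pmid_to_year))
```

Counting years on the PMID lookup rather than on the joined table keeps each publication from being counted once per annotated gene.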
We then joined the human gene2pubmed subset with Homo sapiens gene information [8] via NCBI gene identifiers (IDs) to annotate the list with information on types of genes and descriptive metadata for later interpretation, and used this information to filter the dataset for protein-coding genes only. The dataset was grouped by gene IDs to create a table containing gene IDs, gene symbols, gene description metadata, number of publications, year of earliest publication and year of latest publication.
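A minimal sketch of this join, filter and grouping step is given below. It assumes that the gene information file offers `Symbol`, `description` and `type_of_gene` fields and that protein-coding genes carry the value `protein-coding`; the sample records, gene symbols and the function name `summarise_protein_coding` are hypothetical.

```python
from collections import defaultdict

# Hypothetical human publication rows: (GeneID, PubMed_ID, publication year).
human_pubs = [
    ("1", "100", 1990),
    ("1", "101", 1995),
    ("1", "104", 2003),
    ("2", "102", 2001),
    ("3", "103", 2010),
]

# Hypothetical gene information keyed by NCBI gene ID.
gene_info = {
    "1": {"Symbol": "GENE_A", "description": "example protein A",
          "type_of_gene": "protein-coding"},
    "2": {"Symbol": "GENE_B", "description": "example non-coding RNA",
          "type_of_gene": "ncRNA"},
    "3": {"Symbol": "GENE_C", "description": "example protein C",
          "type_of_gene": "protein-coding"},
}


def summarise_protein_coding(pubs, info):
    """Group publications by gene ID, keeping protein-coding genes only."""
    pmids = defaultdict(set)
    years = defaultdict(list)
    for gene_id, pmid, year in pubs:
        meta = info.get(gene_id)
        if meta is None or meta["type_of_gene"] != "protein-coding":
            continue
        pmids[gene_id].add(pmid)
        years[gene_id].append(year)
    return [
        {
            "GeneID": gene_id,
            "Symbol": info[gene_id]["Symbol"],
            "description": info[gene_id]["description"],
            "n_publications": len(pmids[gene_id]),
            "first_year": min(years[gene_id]),
            "last_year": max(years[gene_id]),
        }
        for gene_id in sorted(pmids)
    ]


summary = summarise_protein_coding(human_pubs, gene_info)
```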
To obtain information about disease relevance of the individual genes, we downloaded the mim2gene_medgen file [9] from the NCBI and the GWAS [3] catalog from the EBI [10]. The mim2gene_medgen dataset was used to link genes to OMIM [2] entries. Using reported and mapped genes as well as upstream, downstream and SNP gene IDs, we created a list of potentially relevant genes per study, mapped to NCBI gene IDs. The GWAS catalog data were filtered for p-values < 10⁻⁶ to select for potentially more reliable hits. Mim2gene_medgen data were mapped directly via NCBI gene IDs. For the GWAS catalog, the gene symbols and Ensembl gene identifiers were mapped to NCBI gene IDs using BioMart and Ensembl version 86 [11]. Descriptive metadata (OMIM IDs, number of GWAS studies) from those tables and additional protein identifiers (UniProt, Swiss-Prot, InterPro and Pfam) obtained via BioMart were also added to the dataset.
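The p-value filter and the per-gene count of GWAS studies can be illustrated as follows. This sketch assumes that each association has already been mapped to NCBI gene IDs; the record layout (`study`, `p_value`, `gene_ids`), the sample values and the function name `count_gwas_studies` are hypothetical and do not reproduce the GWAS catalog file format.

```python
from collections import defaultdict

P_VALUE_THRESHOLD = 1e-6  # associations must have p < 10^-6 to be kept

# Hypothetical associations, already mapped to NCBI gene IDs.
associations = [
    {"study": "STUDY_1", "p_value": 3e-9, "gene_ids": ["1", "3"]},
    {"study": "STUDY_1", "p_value": 2e-8, "gene_ids": ["1"]},
    {"study": "STUDY_2", "p_value": 5e-5, "gene_ids": ["1"]},
    {"study": "STUDY_3", "p_value": 1e-7, "gene_ids": ["3"]},
]


def count_gwas_studies(rows, threshold=P_VALUE_THRESHOLD):
    """Count distinct studies per gene among associations below the threshold."""
    studies_per_gene = defaultdict(set)
    for row in rows:
        if row["p_value"] < threshold:
            for gene_id in row["gene_ids"]:
                studies_per_gene[gene_id].add(row["study"])
    return {gene_id: len(studies) for gene_id, studies in studies_per_gene.items()}


gwas_counts = count_gwas_studies(associations)
```

Collecting study identifiers in a set means that a study reporting several qualifying associations for the same gene is counted once for that gene.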
Finally, genes were ranked according to their highest number of annotated publications and earliest publication year to generate the dataset that is provided with this publication as Supplementary File 1.
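One possible reading of this ranking is sketched below. It assumes, without confirmation from the workflow, that genes are sorted by publication count in descending order and that ties are broken by the earlier first publication year; the field names and sample values are hypothetical.

```python
# Hypothetical per-gene summary rows.
genes = [
    {"GeneID": "1", "n_publications": 3, "first_year": 1990},
    {"GeneID": "3", "n_publications": 3, "first_year": 1985},
    {"GeneID": "4", "n_publications": 10, "first_year": 2001},
]


def rank_genes(rows):
    """Sort by publication count (descending), then first year (ascending)."""
    ordered = sorted(rows, key=lambda g: (-g["n_publications"], g["first_year"]))
    # Attach a 1-based rank to each gene without modifying the input rows.
    return [{**gene, "rank": position} for position, gene in enumerate(ordered, start=1)]


ranked = rank_genes(genes)
```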
We also merged the global PMID to publication year table with the complete gene2pubmed dataset via PMID and subsequently with the gene information file for all species [12] via gene IDs to obtain the overall yearly publication counts for PubMed with or without restriction to human and/or protein-coding genes (Table 2).
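The annual and cumulated counts reported in Table 2 could be derived from a list of publication years roughly as follows. This is a sketch only: the function name `yearly_and_cumulated` is hypothetical, and starting the cumulation at 1980 is an assumption based on the table caption.

```python
from collections import Counter
from itertools import accumulate


def yearly_and_cumulated(years, start_year=1980):
    """Return (year, annual count, cumulated count) tuples from start_year on."""
    counts = Counter(year for year in years if year >= start_year)
    ordered_years = sorted(counts)
    annual = [counts[year] for year in ordered_years]
    return list(zip(ordered_years, annual, accumulate(annual)))


# Hypothetical publication years, one entry per PMID.
publication_years = [1979, 1980, 1980, 1982, 1982, 1982]
table = yearly_and_cumulated(publication_years)
```

The same function can be applied to each subset (all PubMed, all genes, human genes, protein-coding genes) to fill the corresponding columns; years with no publications are omitted in this sketch.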