GTRD: a database on gene transcription regulation—2019 update

Abstract
The current version of the Gene Transcription Regulation Database (GTRD; http://gtrd.biouml.org) contains information about: (i) transcription factor binding sites (TFBSs) and transcription coactivators identified by ChIP-seq experiments for Homo sapiens, Mus musculus, Rattus norvegicus, Danio rerio, Caenorhabditis elegans, Drosophila melanogaster, Saccharomyces cerevisiae, Schizosaccharomyces pombe and Arabidopsis thaliana; (ii) regions of open chromatin and TFBSs (DNase footprints) identified by DNase-seq; (iii) unmappable regions where TFBSs cannot be identified due to repeats; (iv) potential TFBSs for both human and mouse using position weight matrices from the HOCOMOCO database. Raw ChIP-seq and DNase-seq data were obtained from ENCODE and SRA, and uniformly processed. ChIP-seq peaks were called using four different methods: MACS, SISSRs, GEM and PICS. Peaks for the same factor and peak calling method, but obtained under different experimental conditions (cell line, treatment, etc.), were merged into clusters. To reduce noise, such clusters for different peak calling methods were merged into meta-clusters; these were considered to be non-redundant TFBS sets. In addition, extended quality control was applied to all ChIP-seq data. A web interface to access GTRD was developed using the BioUML platform. It provides browsing and display of information, advanced search capabilities and an integrated genome browser.


Quality metrics
The common practice for assessing the quality of ChIP-seq datasets is to apply well-known quality metrics developed within the ENCODE project. For instance, metrics such as NRF, PBC1, PBC2, NSC, and RSC measure the quality of the alignment of reads to individual genomes. To estimate the quality of the output of the peak callers, the fractions of reads in the obtained peaks are analysed and metrics such as FRiP and IDR are determined. However, these metrics do not enable the researcher to control the number of false positive and false negative peaks generated by different peak callers. To address this limitation, we proposed two quality control metrics, namely FPCM (False Positive Control Metric) and FNCM (False Negative Control Metric). Both are based on capture-recapture approaches, which are well known and commonly used, for example, in ecology to estimate the abundance of individuals of a particular species, as well as the total number of species present in a given area.
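As an illustration of the capture-recapture idea mentioned above, the following minimal sketch applies the classical two-sample Lincoln-Petersen estimator to peak counts. This is only the simplest textbook estimator; the text does not state which estimator underlies FNCM, so the function name and the counts below are hypothetical and should not be read as the GTRD implementation.

```python
def lincoln_petersen(n_first: int, n_second: int, n_both: int) -> float:
    """Estimate the total population size from two independent samples.

    n_first  -- items found by the first sample (e.g. peaks from caller A)
    n_second -- items found by the second sample (e.g. peaks from caller B)
    n_both   -- items found by both samples (overlapping peaks)
    """
    if n_both <= 0:
        raise ValueError("at least one item must be shared by both samples")
    return n_first * n_second / n_both


# Hypothetical counts, chosen only for illustration.
estimated_total = lincoln_petersen(n_first=800, n_second=600, n_both=480)
missed_by_first = estimated_total - 800  # estimated peaks the first caller overlooked
print(estimated_total, missed_by_first)
```

In this analogy, two peak callers play the role of two trapping sessions, and the peaks they share play the role of recaptured individuals.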
To evaluate the quality of a given ChIP-seq dataset using the FPCM and FNCM metrics, it is first necessary to merge all peaks generated by applying the four peak callers used in the GTRD ChIP-seq pipeline (MACS, GEM, SISSRs, and PICS) to the same set of aligned reads. After this, the FNCM for each peak caller is calculated as the ratio of the …
The second metric, FPCM, is determined under the natural assumption that almost all false positive peaks must be orphans, i.e., they do not overlap with any other peak. In other words, one can expect that false positive peaks generated by a given peak caller are not confirmed by peaks generated by the other peak callers. The FPCM is determined as the ratio of the observed to the expected number of orphan peaks, where the expected number of orphan peaks is calculated using the Poisson distribution. If the value of the FPCM is close to 1.0, one can conclude that false positive peaks are almost completely absent. A high FPCM value (for instance, FPCM > 2) indicates that the majority (more than half) of the observed orphans are false positive peaks.
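The merging step and the orphan count can be sketched as follows. This is a minimal illustration under stated assumptions, not the GTRD pipeline: peaks are treated as half-open intervals, an orphan is taken to be a merged region supported by exactly one peak caller, and the expected number of orphans is passed in as a plain number because the Poisson-based calculation is not detailed in the text. The names `Peak`, `merge_peaks`, `count_orphans` and `fpcm`, as well as all coordinates, are hypothetical.

```python
from collections import namedtuple

# Hypothetical record type: half-open interval [start, end) plus the caller name.
Peak = namedtuple("Peak", "chrom start end caller")


def merge_peaks(peaks):
    """Merge overlapping peaks into regions (chrom, start, end, callers)."""
    merged = []
    for p in sorted(peaks, key=lambda p: (p.chrom, p.start, p.end)):
        if merged and merged[-1][0] == p.chrom and p.start < merged[-1][2]:
            chrom, start, end, callers = merged[-1]
            merged[-1] = (chrom, start, max(end, p.end), callers | {p.caller})
        else:
            merged.append((p.chrom, p.start, p.end, {p.caller}))
    return merged


def count_orphans(merged):
    """Count merged regions confirmed by only one peak caller (assumed definition)."""
    return sum(1 for region in merged if len(region[3]) == 1)


def fpcm(observed_orphans, expected_orphans):
    """Ratio of observed to expected orphans; the expectation is supplied by the caller."""
    return observed_orphans / expected_orphans


# Toy input, invented for illustration only.
peaks = [
    Peak("chr1", 100, 200, "MACS"),
    Peak("chr1", 150, 250, "GEM"),
    Peak("chr1", 400, 500, "PICS"),
    Peak("chr2", 100, 200, "SISSRs"),
    Peak("chr2", 180, 260, "PICS"),
    Peak("chr2", 900, 950, "PICS"),
]
merged = merge_peaks(peaks)
orphans = count_orphans(merged)
print(orphans, fpcm(orphans, expected_orphans=1.0))  # expected value is a placeholder
```

In the toy input, the two PICS-only regions are the orphans; with the placeholder expectation of 1.0 the ratio equals 2.0, which sits exactly at the FPCM > 2 example threshold mentioned above.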
To demonstrate the usefulness of FNCM and FPCM, we analysed six datasets devoted to the transcription factor c-Jun. According to the FPCM values, three merged datasets (PEAKS037012, PEAKS037013, and PEAKS037011) are heavily saturated with false positive orphans; see Table S2a. This saturation can be explained by erroneous output from PICS: the numbers of peaks generated by the four peak callers (see the last four columns of Table S2a) indicate that PICS generated far more peaks than the other peak callers. In this case, the FPCM suggests discarding all orphans because almost all of them are false positives. On the other hand, it is not necessary to remove orphans from PEAKS033434, PEAKS033441, and PEAKS033494. Considering the FNCM values (see Table S2b), SISSRs and MACS outperformed GEM and PICS when generating PEAKS037011, PEAKS037012, and PEAKS037013, while MACS and GEM outperformed SISSRs and PICS when generating PEAKS033434, PEAKS033441, and PEAKS033494. In particular, if the user chooses non-merged datasets, then the FNCM suggests selecting PEAKS037011, PEAKS037012, and PEAKS037013 generated by SISSRs, as well as PEAKS033434, PEAKS033441, and PEAKS033494 generated by MACS. It is also important to note that FNCM and FPCM complement each other. On the one hand, the merged dataset PEAKS033494 contains an unexpectedly small number of orphans, as indicated by the low value FPCM = 0.27. On the other hand, the FNCM values are also low for the datasets generated by PICS and SISSRs. One can conclude, therefore, that PICS and SISSRs overlooked numerous genuine peaks. As a result, a large number of genuine orphans were also missed.
Finally, the usefulness of FPCM can be demonstrated by the prediction of site motifs in merged peaks. For this purpose, we used two position weight matrix models (namely, MATCH and HOCOMOCO), both of which use the same matrix, JUN_HUMAN.H11MO.0.A from the HOCOMOCO database. Following traditional methods, we assessed the quality of motif prediction with the help of ROC (receiver operating characteristic) curves and their corresponding AUC (area under curve) values. We considered the full versions of the merged datasets, as well as their truncated versions (without orphans). Table S2c contains the calculated AUC values. Transitioning from the full versions to the truncated versions of the PEAKS037011, PEAKS037012 and PEAKS037013 sets substantially increases the quality of motif prediction as measured by AUC values. In particular, Figure S1 shows how the ROC curve changes for the PEAKS037011 set. According to the FPCM values, these increases were expected because the majority of the discarded orphans were false positives without genuine binding sites. In the case of PEAKS033434, PEAKS033441, and PEAKS033494, this transition does not change the AUC values because these sets contain few false positives. It is important to note that these conclusions do not depend on the choice of the position weight matrix model.
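For readers unfamiliar with the AUC, the sketch below computes it directly from two lists of motif scores, using the standard interpretation of the AUC as the probability that a randomly chosen positive scores higher than a randomly chosen negative (ties count as one half). The function name `roc_auc` and the score values are hypothetical; in particular, how the negative (background) sequences were constructed is not stated in the text, so the second list is an assumption made only for illustration.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Equals P(positive > negative) + 0.5 * P(positive == negative).
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Invented best-hit PWM scores: peaks (positives) versus background sequences (negatives).
peak_scores = [9.1, 7.4, 6.0, 8.2]
background_scores = [5.0, 6.0, 3.2, 7.9]
auc = roc_auc(peak_scores, background_scores)
print(auc)
```

Removing low-scoring false positives from the positive list raises this value, which is the effect described above for the truncated versions of the datasets.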
Figure S1. ROC curves obtained for (a) the full set of PEAKS037011 and (b) the truncated set.