Gene Ontology representation for transcription factor functions

Transcription plays a central role in defining the identity and functionalities of cells, as well as in their responses to changes in the cellular environment. The Gene Ontology (GO) provides a rigorously defined set of concepts that describe the functions of gene products. A GO annotation is a statement about the function of a particular gene product, represented as an association between a gene product and the biological concept a GO term defines. Critically,


Introduction
The Gene Ontology (GO) develops a computational model of biological systems, ranging from the molecular to the organism level, across all species in the tree of life. GO aims to provide a comprehensive representation of the current scientific knowledge about the functions of gene products, namely, proteins and non-coding RNA molecules (1)(2). GO is organized in three aspects. Molecular Functions (MF) describe activities that occur at the molecular level, such as "DNA-binding transcription factor activity" or "histone deacetylase activity". Biological Processes (BP) represent the larger processes or 'biological programs' accomplished by multiple molecular activities. Examples of broad biological process terms are "transcription" or "signal transduction". Cellular Components (CC) are the cellular structures in which a gene product performs a function, either cellular compartments (e.g., "nucleus" or "chromatin") or stable macromolecular complexes of which they are parts (e.g., "RNA polymerase II"). Together, annotations of a gene to terms from each of those aspects describe what specific function a gene product plays in a process and where this activity occurs in the cell. Ideally, every gene product should have an annotation from each of the three aspects of GO.
The specific genes expressed in a given cell define the identity and functionalities of that cell.
Regulation of transcription is highly complex and leads to differential gene expression in specific cells or under specific conditions. In human cells, it has been estimated that several thousand proteins participate in gene expression and its regulation, directly or indirectly (3) (Velthuijs, in preparation). This includes the general transcription machinery, the factors that make the chromatin more or less accessible, specific DNA-binding transcription factors, and the signaling molecules that regulate the activity of all those proteins. This complexity is difficult to represent accurately in ontological form. Tripathi et al. (4) redesigned that part of the ontology in 2013 to define precise molecular functions for the various proteins involved in transcription and its regulation. Nearly 10 years after its implementation, we had to acknowledge that this framework was too complex and difficult to navigate, leading to inconsistent annotations and thus poorly serving the user community. The work described here was also motivated by the GREEKC consortium, whose goals include developing curation tools, reengineering ontologies, developing curation guidelines and text mining tools, and developing platforms to analyze and render the molecular logic of transcription regulatory networks, for which a robust infrastructure is needed. Therefore, we thoroughly reviewed the Gene Ontology representation of molecular activities relevant to transcription, taking a simpler and more pragmatic approach that is more closely aligned with available experimental data.
We have revised the GO MF terms representing the activities of proteins involved in transcription, with input from domain experts. In addition to RNA polymerase, we defined three different types of activities that take place on the DNA to mediate or regulate transcription: general transcription factors (GTFs), DNA-binding transcription factors (dbTFs), and transcription coregulators (coTFs).
Here we present the annotation approach recommended by the GO consortium (5), applied to the recent refactoring of the transcription domain of GO. This approach aims to 1) help biocurators (annotation producers) interpret published data and correctly assign the MF terms for GTF, dbTF, or coTF to a protein, and 2) help users understand how the data are generated and how to interpret them. The annotation of factors involved in transcription and its regulation is challenging for multiple reasons. In contrast to other molecular functions, for example enzymes, where one protein or a well-defined complex catalyses a precise reaction, the measurable output of transcription activities is the result of multiple nearly simultaneous activities of GTFs, dbTFs, coTFs, and RNA polymerase; hence, individual activities can be hard to distinguish experimentally. Moreover, these factors often form large complexes, such that the level of resolution of the experimental setup is essential to determine the precise activity of any given protein. Older experimental methods often did not provide enough detail, leading to inaccurate classifications of certain proteins. In addition, researchers use "transcription factor" loosely, at times meaning GTF, dbTF, or coTF. This complicates the annotation process and necessitates solid expertise for correct interpretation of the data. The experimental data themselves are difficult to parse for unambiguous assignment of a function to a protein: typically, a single experiment is insufficient to accurately determine the function of these proteins; thus, interpretation of experimental results that investigate dbTFs must rely on pre-existing knowledge.
Also, many proteins presumed to function as dbTFs have never been experimentally demonstrated to bind DNA; their role is inferred indirectly from the presence of known specific DNA-binding domains and, in some cases, evidence of an effect on the transcription of putative direct target genes. To add to the complexity, the presence of a DNA-binding domain in a protein does not always imply that the protein functions as a dbTF (6).
To support the association of a gene with a GO term based on homologous sequences from other species, only closely related orthologs whose functions have been unambiguously characterized can be used, and only if those are consistent with the experimental data presented in the article.
GTFs function as the molecular machine that assembles with the RNA polymerase at the promoter to form the pre-initiation complex (PIC). GTFs have been characterized in several organisms, from archaea to yeast and mammalian cells (14)(15), and therefore orthology should provide strong support for the decision to associate these proteins with a child term of the MF term "GO:0140223 general transcription initiation factor activity" that is specific for RNA polymerase I, II or III. In addition, the naming of GTFs is well established across human and model organism nomenclature groups and can be used to help guide these decisions. Thus, for human GTFs the HUGO Gene Nomenclature Committee (HGNC, www.genenames.org) provides the gene symbols TAF#, for TATA-box binding protein associated factors, and GTF2# and GTF3#, for general transcription factor II and III subunits, respectively.
One example of this is CPF1, which binds the CpG dinucleotide and helps most CpG islands gain epigenomic marking (19)(20)(21).
It is important to keep in mind that DNA-binding proteins that regulate transcription are not necessarily dbTFs. Key points that help distinguish between the three activities discussed above are that (i) dbTFs bind DNA in a sequence-specific manner and regulate precise sets of genes; (ii) coTFs usually do not directly bind DNA, and when they do, they do not exhibit strong sequence specificity; (iii) coTFs often have catalytic activities (such as histone methyltransferase, protein kinase, or ubiquitin ligase), which is highly unusual in dbTFs; (iv) GTFs are required for core promoter activity and are considered to act at each promoter to promote transcription initiation (14)(22)(6).
Phylogenetic annotations are assigned by a group of biocurators with expertise in evolutionary biology, and require experimental evidence for at least one member of a clade of evolutionarily related proteins (11). The GO knowledgebase also contains GO terms assigned by automated pipelines based on protein domain (InterPro2GO) and orthology (Ensembl). InterPro2GO (12) is based primarily on local (partial) homology: protein domains are mapped to specific GO terms, and any protein with one of these domains will be annotated to the appropriate GO term(s). Ensembl Compara (13)
In cases where the primary data conflict across different articles (for example, a protein is sometimes described as a transcription factor and sometimes as a coregulator), the literature will be reviewed carefully to decide whether the annotation is incorrect (bad choice of term, wrong protein annotated), whether the knowledge has evolved, or whether the protein plays multiple roles under different conditions (i.e., acts as a DNA-binding transcription factor in certain contexts and as a cofactor in others). If no activity has yet been established, no MF annotation will be made.
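As an illustration only, the distinguishing criteria (i) to (iv) given above, together with the rule that no MF annotation is made when no activity has been established, could be sketched as a small decision helper. The class, field, and function names are hypothetical and are not part of any GO tool; the coTF label is assumed from the term names used in this article. The sketch deliberately ignores dbTF inhibitors and multifunctional proteins, and it cannot replace the expert literature review described in the text.

```python
from dataclasses import dataclass
from enum import Enum


class Activity(Enum):
    """Candidate MF classes discussed in the text (labels only, for illustration)."""
    GTF = "general transcription initiation factor activity"
    DBTF = "DNA-binding transcription factor activity"
    COTF = "transcription coregulator activity"  # assumed label
    NONE = "no MF annotation"


@dataclass(frozen=True)
class Evidence:
    """Hypothetical summary of what the reviewed literature supports for one protein."""
    required_for_core_promoter_activity: bool = False
    binds_dna_sequence_specifically: bool = False
    regulates_precise_gene_set: bool = False
    regulates_transcription: bool = False
    has_catalytic_activity: bool = False


def suggest_activity(ev):
    """Return (suggested Activity, list of notes for the biocurator)."""
    notes = []
    # (iv) GTFs are required for core promoter activity.
    if ev.required_for_core_promoter_activity:
        return Activity.GTF, notes
    # (i) dbTFs bind DNA sequence-specifically and regulate precise gene sets.
    if ev.binds_dna_sequence_specifically and ev.regulates_precise_gene_set:
        # (iii) catalytic activity is highly unusual in dbTFs, so flag it.
        if ev.has_catalytic_activity:
            notes.append("catalytic activity is unusual for a dbTF; re-check the literature")
        return Activity.DBTF, notes
    # (ii) coTFs regulate transcription without strong sequence-specific DNA binding.
    if ev.regulates_transcription:
        return Activity.COTF, notes
    # No established activity: no MF annotation is made.
    return Activity.NONE, notes


if __name__ == "__main__":
    example = Evidence(regulates_transcription=True, has_catalytic_activity=True)
    activity, notes = suggest_activity(example)
    print(activity.value, notes)
```

The helper returns a suggestion plus notes rather than a verdict, reflecting that the criteria are guides for interpretation and that some proteins legitimately have more than one function.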
Note that individual DNA-binding transcription factors can act as either activators or repressors depending on the context; hence, association of both activator and repressor terms with a single protein is not considered inconsistent. The specific conditions under which this happens, such as relevant signaling pathways, cell type, and specific target genes, may be further specified through additional context details ((31); see an example in Figure 4).

Pitfalls in annotating transcription regulators
During the review of dbTF GO annotations (6), in which over 3,000 GO annotations were reviewed, a variety of common errors in data interpretation were identified. One of the most common errors was caused by the difficulty in distinguishing a dbTF from a coTF, as the evidence for those two functions can be quite similar. To prevent this error, biocurators ensure that the protein has a sequence-specific double-stranded DNA-binding domain and conduct an exhaustive review of the literature, including articles associated with the protein's close orthologs. Furthermore, the literature supporting the dbTF activity of a protein that also has evidence for another function, in particular RNA binding, will be carefully checked before assigning a dbTF activity. The work on the human dbTF catalogue added a GO 'DNA-binding transcription factor activity' annotation to 583 proteins, and removed erroneous assignments for 256 proteins.
Transcription regulators most often act as members of complexes, some of which also contain proteins with other activities. In some cases, only some subunits of a complex interact with DNA: for instance, the RFX complex contains three members (RFX5, RFXAP and RFXANK), of which only RFX5 binds DNA directly. However, the DNA-binding ability of the complex is facilitated by all three subunits, so RFXAP and RFXANK are not coTFs (32). In this case, RFXAP and RFXANK are annotated using the "contributes to" qualifier, to indicate that they participate in, but are not directly responsible for, the activity.
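The use of the "contributes to" qualifier for the RFX complex can be illustrated with a minimal sketch of annotation records. The record layout and function name below are hypothetical and do not correspond to any GO file format or tool; "enables" is assumed here as the default relation for a subunit that is directly responsible for the activity, and the term label is a placeholder because the text does not name the exact term used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical minimal annotation record: gene, qualifier, MF term label."""
    gene: str
    qualifier: str
    term: str


def annotate_complex_activity(subunits, direct_actors, term):
    """Annotate every subunit of a complex to the same MF term.

    Subunits in `direct_actors` get the (assumed) default qualifier "enables";
    all other subunits get "contributes_to", meaning they participate in,
    but are not directly responsible for, the activity.
    """
    annotations = []
    for gene in subunits:
        qualifier = "enables" if gene in direct_actors else "contributes_to"
        annotations.append(Annotation(gene, qualifier, term))
    return annotations


# RFX complex example from the text: only RFX5 binds DNA directly.
rfx_annotations = annotate_complex_activity(
    subunits=["RFX5", "RFXAP", "RFXANK"],
    direct_actors={"RFX5"},
    term="DNA binding (placeholder label)",
)

if __name__ == "__main__":
    for ann in rfx_annotations:
        print(ann.gene, ann.qualifier, ann.term)
```

Keeping the qualifier as an explicit field makes it straightforward for a downstream user to separate subunits that carry the activity themselves from those that only contribute to it within the complex.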
Another activity that can easily be confused with that of a coTF is that of a dbTF inhibitor. These proteins interact with a dbTF, but not at the DNA, to prevent the dbTF from reaching its target genes. Well-characterized examples are the I-SMADs, SMAD6 and SMAD7 (33), which act by competing with active SMADs at receptors, thus blocking further intracellular signalling, and should be annotated to "GO:0140416 transcription regulator inhibitor activity".
It must be noted that these approaches to avoid errors in dbTF activity assignment are not unequivocal, as some proteins do have multiple functions. For example, the glucocorticoid receptor (NR3C1), which is a canonical dbTF, has recently been shown to bind double-stranded RNA motifs (34); ATF2 (activating transcription factor 2) and CLOCK are dbTFs that have been reported to also exhibit histone acetyltransferase activity (35)(36)(37)(38); some dbTFs, such as NFIB (nuclear factor I B), also function as dbTF inhibitors (39). Finally, general and sequence-specific effects can be difficult to separate, as has been established for the MYC dbTF (40).

Conclusion
The annotation approach presented here is designed to help biocurators annotate factors involved in transcription and its regulation, and to help users of GO annotations understand their meaning and the evidence behind them. This work complements the redesign of this part of the GO, which significantly simplifies the ontology structure. The new ontology structure and the present standards were applied to the review of human proteins associated with GO terms describing dbTF activity (6). We anticipate that adoption of this annotation approach by all groups who produce GO associations will increase annotation consistency across all species, for transcription and also more widely across all areas represented by GO.