The gene regulation knowledge commons: the action area of GREEKC

As computational modeling becomes more essential to analyze and understand biological regulatory mechanisms, governance of the many databases and knowledge bases that support this domain is crucial to guarantee reliability and interoperability of resources. To address this, the COST Action Gene Regulation Ensemble Effort for the Knowledge Commons (GREEKC, CA15205, www.greekc.org) organized nine workshops in a four-year period, starting September 2016. The workshops brought together a wide range of experts from all over the world working on various steps in the knowledge management process that focuses on understanding gene regulatory mechanisms. The discussions between ontologists, curators, text miners, biologists, bioinformaticians, philosophers and computational scientists spawned a host of activities aimed at standardizing and updating existing knowledge management workflows and involving end-users in the process of designing the Gene Regulation Knowledge Commons (GRKC). Here the GREEKC consortium describes its main achievements in improving this GRKC.


Introduction
Understanding how complex biological systems operate is not possible without computational modeling of data, information and knowledge. In fact, biological knowledge discovery itself is becoming increasingly dependent on computational modeling and simulation. The construction of computer models requires comprehensive knowledge of biological entities and their interactions, and abundant efforts are dedicated to providing such information in databases [1][2][3]. Despite all this, multidisciplinary collaborations between stakeholders that represent the different expert areas necessary to specify and design the various knowledge domains, formats, content and access (the knowledge life cycle) have been scant, explaining why many of these valuable knowledge domains have remained only modestly interconnected.
The analysis of gene regulation mechanisms is of high importance to systems approaches because it is key to understanding how information in the genome governs cellular differentiation and function. The complex machinery that determines which genes are active requires a dynamic interplay between different types of transcription factors, the DNA regions where they engage in gene-specific transcription regulation, and the specific epigenetic context that affects the accessibility of these regions. Comprehensive improvement of the knowledge repositories that provide detailed information about each of these types of gene regulators and their causal interactions needs input from expert groups that may not normally interact or collaborate.
The European Cooperation in Science and Technology (COST) Action Gene Regulation Ensemble Effort for the Knowledge Commons (GREEKC) is the result of an initiative that started in 2013: the Gene Regulation Consortium (GRECO, www.theGRECO.org). GRECO acquired funding from COST in 2016, allowing us to embark on a four-year journey using the different COST mechanisms (most importantly: Workshops and Working Group meetings, Training Schools and Short Term Scientific Missions). The main aim of GREEKC was to advance the coordinated building of the Gene Regulation Knowledge Commons (GRKC). This GRKC is defined by the GREEKC consortium as: "The collection of freely accessible gene regulation information resources, containing information that is well annotated with unambiguous descriptors according to quality criteria and standards that allow seamless integration and interoperability as well as automated computational access with third-party software".
From September 2016 to March 2021, GREEKC organized a series of workshops to discuss and assess efforts to produce and exploit 'knowledge' pertinent to this domain. In doing this, we followed a Responsible Research and Innovation (RRI) approach, which Schomberg [4] defined as engaging all stakeholders to optimize the deliverables of a scientific process, and to align scientific processes and outcomes to societal needs. The GREEKC consortium took this strategy as an iterative process of identifying and including stakeholders in discussions about, for instance, data curation or data sharing issues, starting with key players in the knowledge life cycle [5]. This RRI approach proved to be an extremely good fit with the main mechanism of COST Actions for facilitating discussions and establishing multidisciplinary partnerships to achieve scientific progress.

GREEKC field of operation and design
Gene regulatory mechanisms involve a complex interplay of many molecules and their causal relationships, some of which are described in Fig. 1. Different classes of biomolecules (Protein, RNA and DNA), acting often in multi-molecular complexes, are responsible for processes that enable transcription (e.g., accessibility of regulatory sequences at the DNA), drive transcription (DNA binding Transcription Factors (dbTF) and transcription cofactors (co-TF) in complexes bound to DNA) and support the use of transcripts for protein biosynthesis.
The research in this area has resulted in a wealth of information and knowledge, available in scientific publications and as large-scale datasets.
Yet, scientific results cannot be effectively shared for computational use through publications or data repositories alone. The information content of publications needs to be carefully checked, or curated, and archived in standardized formats in publicly available resources, if it is to become broadly available for computational integration and analysis [6,7]. Similarly, large-scale data must be curated and archived with proper metadata to provide well annotated resources for obtaining knowledge through computational processing and integration with other information sources.
Central to this value creation is the biocurator, who is typically an expert in a biology or bioinformatics domain. A trained biocurator is able to identify and characterize specific biological entities and interactions described in papers or large-scale data repositories and can investigate their contents for experimental or other evidence that supports particular claims about their biological function. These claims are described, or annotated, with the help of controlled vocabularies that provide standardized terms, descriptions and definitions for concepts that are relevant for a (sub)domain of biology. Domain Ontologies consist of machine-processable formal axioms and definitions of types of domain entities, hierarchically organized so that they can facilitate analysis at different levels [8][9][10], thus constituting the building blocks for representing human knowledge [11]. Describing biological entities and their relationships in specific contexts with the help of unambiguously defined ontology terms is performed in annotation workflows that follow well-defined curation guidelines, so that different biocurators are able to interpret and annotate knowledge from a paper in identical ways. Their work is supported by curation tools, which often provide additional guidance as to the annotation details that need to be provided. There are many subdomains of biology that require such annotation efforts. The focus of the GREEKC consortium has been the area of gene regulatory mechanisms (Fig. 1), but its efforts in developing knowledge gathering and sharing principles likely have value across all biological domains.
In addition to the curation of the various sources of information relevant for gene regulatory mechanisms, two other technology areas are also relevant to consider: text mining and data sharing. The curation of information from scientific literature starts with the identification of papers that have curatable information. Finding such content can be facilitated by text mining algorithms that identify and mark paper sections appropriate for subsequent manual curation. However, whereas the potential of text mining for assisting manual curation is well established, its direct integration into curation workflows has not yet been widely adopted. For those curation workflows that produce information relevant to the GRKC, the breadth of annotation detail impacts their representation, storage in a database schema and subsequent sharing mechanisms. For instance, annotations need to meet well-defined curation guidelines and storage formats, and stored data require specific 'exchange languages' (e.g. based on XML or JSON) for downloads or web services.
Taken together, these different elements of the gene regulation knowledge management life cycle served to formulate four challenges that were addressed by four working groups of the GREEKC COST Action:
WG1: The development and maintenance of ontologies and controlled vocabularies;
WG2: The development of curation guidelines and workflows for the annotation of gene regulators at different levels:
a. protein level
b. non-coding RNA level
c. nucleotide sequence recognition level (e.g. transcription factor binding sites)
d. genome level (DNA methylation status, histone modifications)
e. level of interactions, regulatory complexes and network information flow;
WG3: The exploration of text mining to identify or extract information useful for annotation of gene regulators and to facilitate the identification of literature evidence that can be used to annotate regulatory molecular entities and their regulatory interactions;
WG4: The storing and sharing of annotations of gene regulatory interactions.

Ways of working and accomplishments
COST Actions can organize the scientific domain, stimulate discussions, strive for consensus and achieve progress [12] through the organization of Workshops, Training Schools and Short Term Scientific Missions (STSMs). This paper elaborates on the results of the workshops and some of the STSMs, as they have been most instrumental in generating new ideas and consensus about approaches to develop the structure and add content to the GRKC.
While biocuration and annotation efforts relevant for the GRKC have been the central topics of GREEKC workshops, many times discussions also involved the need for improving ontologies and controlled vocabularies as well as text mining for gene regulation knowledge management. This means that much of what we achieved in GREEKC cannot be uniquely assigned to one particular Working Group but rather to the joint efforts of all groups.

WG1: ontologies
Bio-ontologies form the semantic framework for the annotation of what we know and understand about the function of biological entities and their interrelationships. Both the Gene Ontology (GO) [13,14] and the Sequence Ontology (SO) [15] are central to the description of chromatin, gene, protein and RNA components involved in gene regulatory events.
The development and maintenance of ontologies is intrinsically linked to established annotation processes and refinements thereof to keep up with evolving biological insights. Significant efforts have been made by GREEKC members to improve the annotation quality of the class of mammalian DNA binding transcription factors (dbTF) (Lovering et al., 2021, this issue), and, as a consequence, the GO molecular function subtree describing the regulation of gene expression by RNA polymerase II has also undergone major restructuring (Gaudet et al., 2021, BBAGRM-D-21-00006, this issue). In addition, the SO has been critically reviewed. In several workshops, GREEKC members talked with the external experts responsible for constructing and using the SO, and arrived at a consensus on restructuring the part of SO that specifies the description of Gene Regulatory Elements within the genome. Since the original conception of the SO, the knowledge about the nature of gene regulation and the importance of the binding of proteins to regulatory control elements in the genome (most importantly the dbTFs) has advanced considerably and has revealed an abundance of transcription factor binding sites at multiple gene regulatory locations in the genome. In addition, the new notion of Topologically Associating Domains (TADs) was not yet supported by the SO and a restructuring has now been proposed (Sant et al., 2021, this issue) to align the definition and hierarchy of the SO regulatory element subtree with our current understanding of the full breadth of protein-DNA interaction events and chromatin conformation states that impact gene expression. Finally, efforts have been launched to follow up on the Gene Regulation Ontology [16], proposed as an application ontology for capturing broadly the entities and relationships that are essential for describing gene regulation at multiple levels (protein, RNA, small molecule, genome, DNA level and epigenetic level).
The concept of the Gene Regulation Application Ontology (GRAO, https://github.com/greekc) provides an ontology framework for a knowledge base able to semantically integrate all available knowledge about gene regulatory events, allowing for complex queries addressing many aspects about regulatory context simultaneously, going well beyond the examples published for the Gene Expression Knowledge Base [17].

WG2: curation guidelines
Biocuration involves a manual or computational assessment of the validity of a particular claim that may characterize a biological entity or relation, upon which this claim can be specified with the help of proper entity identifiers (IDs), ontology terms, evidence descriptions and provenance, e.g. the identifier of the publication based on which the biocurator made the assertion. It is the central process that generates knowledge base content that provides users with high quality information. GREEKC has addressed five different subdomains of biocuration in its workshops and in several areas notable progress was made:

The protein level
GREEKC members have collaborated on the task of bringing together the knowledge that currently supports the classification of proteins as dbTFs (Lovering et al., 2021, this issue). The central role of these proteins in linking the cellular signaling machinery to the decoding of the regulatory genome has made them a prime focus of dedicated characterization and curation efforts over the years and the GREEKC review drove the re-design of the GO transcription regulation molecular functions branch and an updated set of curation guidelines (Gaudet et al., 2021, BBAGRM-D-21-00006, this issue). The updated GO transcription regulation branch also encompasses improvements in the GO structure and terms for co-transcription factors (coTFs) and general transcription factors (GTFs) and thus provides fertile ground for improved GO annotation of these protein entities with important roles in gene regulation.

The RNA level
The gene regulatory network also includes RNA molecules that interact with proteins, with other RNAs or directly with genes to mediate their action. In the last decade, strong efforts have been launched to annotate both functional and physical RNA interactions in public repositories. While there were guidelines to use the Gene Ontology to capture the role of microRNAs in gene regulation [18], no specific guidelines had been developed for the majority of other RNA roles, with the result that knowledge extracted from one source is sometimes difficult to integrate or compare with other sources. Discussions among GREEKC members led to the definition of common standards for the annotation of microRNA-mRNA and microRNA-lncRNA interactions [19]. MicroRNAs are the best-characterized regulatory RNAs, and their binding partners can be predicted using bioinformatic approaches that map the interaction site to its target genes. However, as each prediction tool provides different sets of targets for each specific microRNA, the value of experimental confirmation of a microRNA-mRNA interaction should not be underestimated [20]. Meetings and round table discussions between members of the Working Groups 1 and 2 have led to recommendations for the annotation of interactions and ontologies focusing on microRNA regulatory mechanisms [19], and annotation guidelines have been tested through an STSM. However, we have yet to do the same for functional interactions of the lncRNAs with genes and their role in transcriptional regulation.

The DNA level
Whilst the dbTFs represent the protein side of the decoding of genome information, their specific binding sites in the genome uniquely target dbTF regulatory activity to specific genes. Because of their importance, the transcription factor binding sites (TFBS) have been extensively studied to characterize their nucleotide patterns (sequence motifs) and determine features that define binding specificity [21]. A sequence motif recognized by a dbTF reflects the binding energy of a dbTF to a particular DNA segment [22], and there are many approaches to represent this relation in a computational model, from a basic consensus string to a 'black box' of advanced machine learning [23,24]. However, the gold standard is still defined by position weight matrices (PWMs), which were suggested as early as 1982 [25] and remain the most widespread and accepted way of describing dbTF binding specificity as a quantitative rather than a qualitative phenomenon [26]. PWMs are widely used to predict TFBS in the genome and annotate regulatory sequence variants [27,28]. Many TFBS motif discovery algorithms have been proposed over the years, and many experimental data sets were generated and analyzed, resulting in a multitude of motif collections, such as TRANSFAC [29], HOCOMOCO [30], CIS-BP [31], and JASPAR [32]. Creating a common understanding of how these PWMs should be used, represented, shared and interpreted was discussed in several workshops. As a result, a large-scale benchmarking was designed and carried out (aided by an STSM), resulting in a large set of publicly available performance measures that may improve the use of PWMs in practical analyses of new datasets [33].
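To make the quantitative nature of a PWM concrete, the following minimal sketch builds a log-odds matrix from a handful of aligned sites and scans a sequence for the best-scoring window. The sites, the uniform background of 0.25 and the pseudocount of 1 are assumptions chosen for illustration only; they are not taken from TRANSFAC, HOCOMOCO, CIS-BP or JASPAR, and the benchmarking in [33] may use different scoring conventions.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equally long aligned sites."""
    width = len(sites[0])
    pwm = []
    for pos in range(width):
        # Start every base at the pseudocount so unseen bases keep a finite score.
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: math.log2((counts[base] / total) / background)
                    for base in BASES})
    return pwm

def score_window(pwm, window):
    """Sum the per-position log-odds scores for one window of the sequence."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (score, offset) of the highest-scoring window in the sequence."""
    width = len(pwm)
    hits = [(score_window(pwm, sequence[i:i + width]), i)
            for i in range(len(sequence) - width + 1)]
    return max(hits)

# Toy aligned sites, invented for illustration (not an entry of any motif collection).
SITES = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
pwm = build_pwm(SITES)
score, offset = best_hit(pwm, "GGCATGACTCATTCG")
print(f"best window starts at offset {offset} with score {score:.2f}")
```

Because every window receives a score, a threshold on that score (rather than a yes/no consensus match) decides which sites are reported, which is what distinguishes a PWM from a consensus string.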

The genome level
The SO is an essential source of terms that, among others, describe sequence concepts necessary to annotate regulatory sequences and TFBS for a range of resources (e.g. Ensembl [34]). SO was improved by the restructuring of terms related to cis-regulatory modules (CRMs), which are regulatory regions where transcription factor binding sites are usually clustered to regulate various aspects of transcription. CRMs include enhancers, silencers, locus control regions, and insulators. A special type of CRM that was added to SO is the 'DNA_loop_anchor', which represents the ends of a DNA looping region. DNA looping is necessary to allow for areas of DNA that are separated by many kilobases to remain in close proximity within the cell, allowing CRMs to interact with distant genes [35]. Another set of updates to SO is the addition of terms related to topologically defined regions, which are areas where self-interaction of DNA occurs more frequently than expected by chance. An instance of self-interaction is a topologically associated domain, bordered by topologically associated domain boundaries. During interphase, DNA loop anchors are CCCTC-binding factor (CTCF) binding sites. Several studies have investigated CTCF binding to determine topologically defined regions [35].

Level of interactions, regulatory complexes and network information flow
The annotation of proteins in the GO database is based on well-established guidelines [36], but the underlying data model and output, the Gene Product Association Data (GPAD) file, does not fully support all functional details about interactions between a protein and its interacting partners. One of the most significant shortcomings is caused by the limitation of the 'annotation extension' field in the tabular GPAD file. Target genes (TGs), and other protein interacting partners, bound by the transcription factor (dbTF) of interest, are captured in the annotation extension column but the result of transcription factor binding to a gene can only be summarized by the limited vocabulary of the annotation extension [37]. The GO-CAM data model [38] aims to remedy this, by allowing a biocurator to define linked annotations that use multiple ontologies to represent all aspects involved in biological functions involving multiple biological entities, essentially from a molecular function activity flow perspective. The GO-CAM approach has been discussed in several GREEKC workshops and its members have engaged in defining a set of templates in the Noctua curation tool that will guide a biocurator in the definition of new dbTF-TG interactions (Juanes Cortés et al., 2021, BBAGRM-D-21-00018, this issue).
Transcription factors often bind as homo- or heterodimers, which then bind to co-factors to assemble the protein machinery required for transcription. GREEKC members (Velthuijs et al., 2021, this issue) used data from the IMEx Consortium databases [39] and BioGRID [40] to develop a pipeline to predict transcription factor coregulator complexes, which were subsequently validated using the CORUM (http://mips.helmholtz-muenchen.de/corum/) and hu.MAP (http://proteincomplexes.org) protein complex databases. Efforts to manually curate transcription factor and coregulator complexes in the Complex Portal database [41] have also been inspired by the GREEKC Action.
The PSI-MI standards that have been developed under the umbrella of the Human Proteome Organization's Proteomics Standards Initiative (HUPO PSI) were the starting point [42][43][44] for discussions about future needs of the network modeling community. Although the existing data formats developed by this group were capable of describing TF-TG binding, they were not designed to describe the upstream dataflow from a cellular signaling pathway to an up- or down-regulation of a set of genes. GREEKC was able to organize several events together with the Proteomics Standards Initiative and ELIXIR to define an extension of HUPO-PSI MITAB2.7 that would cover the causality associated with (gene) regulatory interactions. The general importance of this type of interaction for the use in building conceptual and mathematical models of regulation networks called for a multidisciplinary agreement involving all relevant stakeholders (WG2 and WG4 members, many also active in the PSI-MI and ELIXIR community). This resulted in the definition of CausalTAB [45], which is also known as PSI MITAB2.8. The work on causal molecular interactions also exposed the need for a set of guidelines that describe the necessary and desirable contextual details that a user would need to find in order to be able to select and incorporate such causal statements in a model. These guidelines were created and are now published as the MI2CAST checklist [46], which has been endorsed globally by a broad group of biocurators, ontology developers, curation tool developers and users of molecular causal interaction statements.
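The flat nature of the annotation extension field can be illustrated with a short parsing sketch. It assumes the commonly documented syntax in which each statement has the form relation(identifier), a comma joins statements that apply together and a pipe separates independent groups. The identifiers in the example string are placeholders invented for illustration, and the function name is hypothetical rather than part of any GO tooling.

```python
import re

# One statement looks like relation(identifier), e.g. has_input(DB:accession).
_STATEMENT = re.compile(r"^\s*([A-Za-z_]+)\(([^()]+)\)\s*$")

def parse_annotation_extension(field):
    """Split an annotation-extension string into groups of (relation, target) pairs.

    A '|' separates independent groups; a ',' joins statements within one group.
    """
    groups = []
    for group in field.split("|"):
        if not group.strip():
            continue
        pairs = []
        for item in group.split(","):
            match = _STATEMENT.match(item)
            if match is None:
                raise ValueError(f"unparsable extension statement: {item!r}")
            pairs.append((match.group(1), match.group(2)))
        groups.append(pairs)
    return groups

# Placeholder identifiers, invented for illustration only.
example = ("has_input(EXAMPLE:gene1),occurs_in(EXAMPLE:cell_type1)"
           "|has_input(EXAMPLE:gene2)")
parsed = parse_annotation_extension(example)
for number, group in enumerate(parsed, start=1):
    print(number, group)
```

Each group is only a flat list of relation and target pairs attached to a single annotation, so there is no place to state what happens downstream of the binding event; this is the gap that linked GO-CAM annotations are intended to fill.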
To the biocurator, the MI2CAST standard provides guidance in identifying contextual details that have to be minimally supplied in new annotations; to the curation tool developer it specifies the semantic resources and identifiers that should be chosen; to the user, the MI2CAST standard provides a summary of the contextual handles that are available for selecting proper data; and to the biological experimentalist, it defines the domain of study and reporting that will yield information most valuable for future computational integration and analysis. The MI2CAST standard has been implemented in the prototype curation tool causalBuilder [47], to illustrate how a Visual Syntax Markup (VSM-box) data entry template engine [48] can be used to support the presentation of an annotation standard in an organic way to a biocurator.

WG3: text mining for knowledge curation
The GREEKC consortium considered the value of text mining for aiding the curation workflow. These discussions have shown that the worlds of manual biocurators and text miners have many possible connections, but an active engagement where both sides benefit equally remains to be pursued. Text mining is an accepted method for triage, meaning the identification of e.g. a scientific paper that is likely to contain information that would satisfy a curation effort, implying it may contain the necessary information to warrant an annotation for a database. Conversely, curation is an accepted practice used to support text mining, both to assemble and prepare a text corpus that can be used for training of a text mining classifier, and for assessment of the quality of text mining results. But the results of manual curation (high-quality annotations of a limited subset of the available texts) and text mining (lower quality annotations of the widest possible range of texts) are unsatisfactory to the other expert group, which stands in the way of mutual efforts to marry the two without reservations. And to some extent the outcomes of both types of efforts may also serve different user communities: the high-quality curation resources serve the careful, cautious user, whereas the text mining result may serve the computational network analyst in settings where she is willing to accept that some of the information she is using may be of lower confidence than manually curated knowledge.
Several events have been organized by the text mining working group; the most notable results come from the collaboration between GREEKC members NTNU and BSC. They have performed a text mining effort to specifically identify and retrieve gene regulatory interactions between a DNA binding transcription factor and a target gene (TF-TG relationships). The results (www.extri.org) were integrated and compared with several established curated resources with TF-TG relationships and indicate the sizable corpus of MedLine literature with information currently not represented in curated data resources (Vazquez et al., 2021, this issue). Moreover, they also indicate the potential gap of information pertaining to proteins currently not covered by functional studies reported in the literature, as about half of the putative dbTFs do not return any MedLine record of involvement in the regulation of a target gene. The ExTRI resource is available to the computational biologist through the BioGateway database and a Cytoscape app [49], and the potential problem of false positive records is mitigated by providing full provenance to the TRI sentence detected by text mining in its PubMed abstract, so that a user may check the validity of a record and, if it is wrong, omit it from analysis.

WG4: databasing and sharing
The storing and sharing of curated information in databases provides the basis for dissemination of GRKC and thus has received particular attention in the GREEKC workshops. Among other issues, we were interested in the user perspective for GRKC and the standardization of information exchange. Regarding the former, we found that many commonly asked questions in gene regulation can be covered by a set of use cases (e.g. what are the known or predicted regulators of a gene?). For this reason, we have started to provide protocols for such use cases on the GREEKC website (https://www.greekc.org/use-cases). Regarding standardization and exchange of GRKC, the ELIXIR initiative has adopted criteria to assess the governance of knowledge bases and data repositories with the aim to identify Core Resources that comply with high governance and thus reliability standards. The identified Core Resources include several resources that contain information relevant for the GRKC, for instance GO, IntAct, UniProtKB and Ensembl. However, many additional valuable resources exist, making it imperative that careful consideration is given to make sure that their content is compliant with formats endorsed by ELIXIR Core Resources and the FAIR principles. To assess the FAIRness of GRKC tools and datasets, a semiautomated tool was developed (Bonello et al., 2021, this issue) to score resources in terms of their compliance with the FAIR principles. Each principle is individually scored and a breakdown of the criteria is provided in a report generated by the scoring tool. The SIGNOR database, for instance, abides by the FAIR principles and was an early adopter of the PSI-MI standards endorsed by IMEx.
The development of the CausalTAB / PSI MITAB2.8 format described earlier poses new demands for data exchange mechanisms, most notably the web service PSICQUIC (Proteomics Standards Initiative Common Query Interface [50]), which, at the time of writing, is only able to serve queries for the PSI-MI 2.7 format. The GREEKC discussions led to an STSM that resulted in a prototype PSICQUIC 1.0 web service that has been implemented for communication with the SIGNOR database. Future work is needed to upgrade PSICQUIC web service functionality and its integration with common tools like Cytoscape [51], which supports the import of data through the Network from Public Databases / Universal Interaction Database Client. The MedLine extracted information on TF-TG interactions from the ExTRI text mining effort described above is now available through a standard PSICQUIC web service (see http://www.ebi.ac.uk/Tools/webservices/psicquic/registry/registry?action=STATUS#, tfact2gene service). Other web services that provide access to TF-TG interactions can be launched through Cytoscape Apps. The BioGateway App [49] uses SPARQL queries (SPARQL Protocol and RDF Query Language [52]) to fetch regulatory information from the semantic web database BioGateway [53], in the form of documented interactions between transcription factors and their target genes (see www.extri.org). Likewise, the OmniPath App [54] uses a REST type service to fetch TF-TG relationships from the dedicated transcription factor activity knowledge base DoRothEA [55].
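The idea of scoring each FAIR principle individually and reporting a breakdown can be sketched as follows. This is not the tool of Bonello et al.; the checklist entries, their assignment to the four principles and the pass/fail scoring are hypothetical and serve only to show the shape of such a report.

```python
from collections import defaultdict

# Hypothetical checklist: each check is assigned to one FAIR principle
# (F = Findable, A = Accessible, I = Interoperable, R = Reusable).
CHECKS = {
    "has_persistent_identifier": "F",
    "has_searchable_metadata": "F",
    "open_access_protocol": "A",
    "uses_standard_format": "I",
    "uses_ontology_terms": "I",
    "has_licence": "R",
    "has_provenance": "R",
}

def fair_report(answers):
    """Return the fraction of passed checks for each FAIR principle.

    `answers` maps check names to True/False; missing checks count as failed.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for check, principle in CHECKS.items():
        total[principle] += 1
        if answers.get(check, False):
            passed[principle] += 1
    return {principle: passed[principle] / total[principle] for principle in "FAIR"}

# Invented answers for an imaginary resource.
answers = {
    "has_persistent_identifier": True,
    "has_searchable_metadata": False,
    "open_access_protocol": True,
    "uses_standard_format": True,
    "uses_ontology_terms": True,
    "has_licence": False,
}
report = fair_report(answers)
for principle, fraction in report.items():
    print(principle, f"{fraction:.2f}")
```

Keeping the per-principle fractions separate, instead of collapsing them into one number, lets a resource maintainer see which principle needs attention.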
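As an illustration of how a client might address a PSICQUIC-style REST service, the sketch below only assembles a query URL and does not contact any server. The base URL is a placeholder, and the path layout (search/query/...) and the parameter names format, firstResult and maxResults are assumptions based on common PSICQUIC REST usage; they should be checked against the documentation of the actual service, such as the tfact2gene service listed in the registry.

```python
from urllib.parse import quote, urlencode

def psicquic_query_url(service_base, query, fmt="tab27", first=0, max_results=50):
    """Assemble a PSICQUIC-style REST query URL (no network access is performed).

    The path layout and parameter names are assumptions, not a verified API.
    """
    path = f"{service_base.rstrip('/')}/search/query/{quote(query, safe='')}"
    params = urlencode({"format": fmt, "firstResult": first, "maxResults": max_results})
    return f"{path}?{params}"

# Placeholder base URL, hypothetical and not a real service endpoint.
BASE = "https://example.org/psicquic/webservices/current"
url = psicquic_query_url(BASE, "TP53")
print(url)
```

The format argument is where a client would request the tabular version it can parse; serving the causal columns of PSI MITAB2.8 would require services that recognise a corresponding format value.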

Discussion and future challenges
An overview of the published results of the GREEKC COST Action is shown in Table 1. In each of the areas of the Action, results have been published, either as part of this BBA-GRM special issue or elsewhere.
In the discussions about bottlenecks and solutions to enhance the GRKC, the needs of two groups were considered: on the one hand bench biologists who access detailed information on particular genes and proteins of interest and how they interact, and on the other hand computational biologists who need an abundance of computationally accessible and well-structured information resources. This requires that the content of the GRKC is both 'human readable' and browsable through a web interface, and available through an API or web service, for computational processing. Regardless of their use, annotations need to be enhanced by including information with 'richer' expression of the functions of molecular entities, the relations between entities, the 'emergent' effect of their interactions, as well as experimental evidence and biological context so as to underpin and enhance the use of this information in regulatory network building and computational analysis. To achieve this, further improvements and innovations of curation approaches and tools will be needed, so that the annotation process of not only biological entities, but of their systems interactions becomes and remains manageable. The curation tool Noctua [38], and new experimental technologies like VSM [56] provide significant steps in the direction of annotating biological systems rather than biological entities. These tools accommodate multiple entities, activation state and relation types, and provide for annotations based on multiple ontologies and supported by an elaborate set of evidence and contextual metadata. Although at the semantic level sufficient resources may be available to cover these domains individually, integrated resources are needed that interlink and support complex queries for obtaining regulatory information that spans the different levels. 
The design of the Gene Regulation Application Ontology has paved the way to produce a prototype semantic knowledge base where GRKC information is integrated together with SO regulatory sequence concepts, information from the Complex Portal and GO molecular function and biological process terms to allow users to query for regulatory mechanism information that meets both location/sequence constraints, macromolecular assemblies and gene regulatory action constraints.
Users will also need the Knowledge Commons to be as comprehensive as possible. Current literature curation efforts are too limited to cope with the increasing amount of information published on a daily basis. Therefore, access to information generated by text mining [57], as well as by automated and manual curation [58], needs more attention. Furthermore, improvements are needed in the associated metadata so that it is clear to the user what the quality and inclusion criteria are for a particular piece of information [59]. Demanding computational users will then be able to implement their own selection criteria for incorporating data into their analysis. In practice this can help ameliorate a well-known challenge in digital knowledge management: in their annotation work, biocurators generally include only cases with strong evidence ('true positives') in their database content and discard cases with weak evidence (including possible false negatives). Information that is not included in a resource may, upon closer inspection of additional or new data, find sufficient evidence to meet the database's inclusion criteria. Such information might be flagged by appropriate evidence codes, so that users may apply their own filters when exploring it in either a 'cautious' or a 'greedy' mode (Chatterjee et al., this issue).
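The idea of user-side filtering on evidence flags can be sketched as follows. This is only an illustration under simplifying assumptions: the two evidence tiers, the `Annotation` fields and the `select` function are hypothetical and do not correspond to the evidence codes of any particular resource.

```python
from dataclasses import dataclass

# Hypothetical evidence tiers; real resources define their own evidence codes.
STRONG = "strong"
WEAK = "weak"


@dataclass(frozen=True)
class Annotation:
    regulator: str
    target: str
    evidence: str  # STRONG or WEAK (illustrative labels)
    source: str    # e.g. "manual_curation" or "text_mining"


def select(annotations, mode="cautious"):
    """Filter annotations by evidence tier.

    'cautious' keeps only strongly supported annotations;
    'greedy' additionally keeps weakly supported ones.
    """
    if mode == "cautious":
        allowed = {STRONG}
    elif mode == "greedy":
        allowed = {STRONG, WEAK}
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return [a for a in annotations if a.evidence in allowed]


# Made-up example data.
EXAMPLE = [
    Annotation("TF_A", "gene_1", STRONG, "manual_curation"),
    Annotation("TF_A", "gene_2", WEAK, "text_mining"),
    Annotation("TF_B", "gene_1", WEAK, "text_mining"),
]

cautious = select(EXAMPLE, mode="cautious")
greedy = select(EXAMPLE, mode="greedy")
```

Keeping the weakly supported records in the resource and leaving the cut-off to the user means that the same content can serve both a conservative network reconstruction and a more exploratory analysis.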
Table 1 Major results achieved by GREEKC. Progress in the four areas of the GREEKC COST Action is published in this special issue (BBAGRM), or elsewhere.
Holmås et al. [49]

While modern sequencing technologies provide great power at low cost to detect transcriptional activity (e.g. by RNA-seq or Ribo-seq) or TF binding (e.g. by ChIP-seq) on a genome-wide scale, no experimental technology exists that comprehensively captures TF activity across the genome. Therefore, an area where further coordinated work is essential concerns the computational prediction of 'active' binding sites of transcription factors (including those of homo- and heterodimers), combining evidence from multiple experimental, often large-scale data sources to infer transcription factor-target gene interactions. For more than 30 years, efforts to decode a "regulatory code of transcription factors" have been undermined by the notorious ability of transcription factors to recognize quite dissimilar DNA sequences depending on the availability of different protein partners for complex formation and on local and overall chromatin accessibility profiles. Yet, massive efforts in comparative studies of dbTF binding in vitro and in vivo in a variety of cell types are gradually providing an understanding of the rules controlling recognition of particular DNA loci by dbTFs in a particular cell type or biological condition. The main bioinformatics efforts try to account for the contributions of chromatin accessibility and dbTF affinity when predicting locus-specific DNA recognition, which may help to combine dbTF specificity assayed in vitro with data from chromatin accessibility profiling of the particular cell type. If successful, such bioinformatics strategies would save researchers from an exhaustive assessment of the active regulome of DNA binding transcription factors, substituting it with reliable prediction of dbTF binding profiles at single-base resolution, and would further pinpoint dbTF target genes.
This is especially important for hard-to-get or transient cell types, and thus vital in the context of developmental biology or in studying the transcription response of different cells to particular physiological, environmental or stress conditions. Fortunately, future prospects to tackle such challenges are brightened by emerging opportunities to obtain single cell data relevant for gene regulation, such as transcriptomics, transcription factor binding and chromatin states and topologies. With support from comprehensive and well documented prior knowledge resources, such data might allow the researcher to unveil cell state-specific gene regulatory (sub)networks, which control behavior and transformation of cells existing in small quantities and/or short time frames but having a crucial impact on critical biological processes.

Precision medicine is an emerging approach that aims to develop personalized therapies for individual patients, by taking into account patient-specific disease factors to increase the efficacy of drug treatment [60,61]. Precision medicine may be based on large scale omics data collections to obtain high-resolution molecular insight into health [62], or on patient-specific mathematical models that serve as in silico patients, or 'digital twins' [63]. The builders and users of these patient-specific models are often involved in curation themselves, to make models complete and to audit literature in order to verify database information against contextual details of the processes that they are modeling. For instance, the Consortium for Logical Modeling Standards and Tools (CoLoMoTo [64]) represents scientists engaged in constructing logical models and the Disease Maps consortium generates biological process information [65] to support the analysis of many diseases.
It is noteworthy that, despite the large efforts in building resources that describe regulatory information involving molecular components, be they genes or proteins, additional efforts are still needed to obtain the information required to construct process diagrams or mathematical models that capture what we know about gene regulatory mechanisms and that have been adequately checked for validity in a specific biological setting or context. Integrating the curation world with the modeling world through these types of collaborations, possibly with the help of a future COST action, has the potential to further optimize curation and annotation processes for the Knowledge Commons.

Declaration of competing interest
The authors declare no conflict of interest.