Integration of background knowledge for automatic detection of inconsistencies in gene ontology annotation

Abstract

Motivation: Biological background knowledge plays an important role in the manual quality assurance (QA) of biological database records. One such QA task is the detection of inconsistencies in literature-based Gene Ontology Annotation (GOA). This manual verification ensures the accuracy of GO annotations based on a comprehensive review of the literature used as evidence, the Gene Ontology (GO) terms, and the annotated genes in GOA records. While automatic approaches for the detection of semantic inconsistencies in GOA have been developed, they operate within predetermined contexts, lacking the ability to leverage broader evidence, especially relevant domain-specific background knowledge. This paper investigates various types of background knowledge that could improve the detection of prevalent inconsistencies in GOA. In addition, the paper proposes several approaches to integrate background knowledge into the automatic GOA inconsistency detection process.

Results: We have extended a previously developed GOA inconsistency dataset with several kinds of GOA-related background knowledge, including GeneRIF statements, biological concepts mentioned within evidence texts, the GO hierarchy, and existing GO annotations of the specific gene. We have proposed several effective approaches to integrate background knowledge as part of the automatic GOA inconsistency detection process. The proposed approaches can improve the automatic detection of self-consistency and several of the most prevalent types of inconsistencies. This is the first study to explore the advantages of utilizing background knowledge and to propose a practical approach to incorporating such knowledge in automatic GOA inconsistency detection. We establish a new benchmark for performance on this task. Our methods may be applicable to various tasks that involve incorporating biological background knowledge.

Availability and implementation: https://github.com/jiyuc/de-inconsistency.


Introduction
Literature-based gene ontology annotations (GOA) (Carbon et al., 2021) are biological database records that provide associations between genes and information about their molecular functions, biological processes, and cellular locations, described using Gene Ontology (GO) terms and derived from evidence found in the scientific literature (Ashburner et al., 2000). The GO provides a standardized and hierarchical description of gene function. GOA can be used for a range of biological research tasks, such as gene-category analysis (Bauer, 2017), gene enrichment analysis (Huang et al., 2009), and drug discovery (Haendel et al., 2018).
Several quality issues have been identified in GOA, including inconsistency, duplication, and errors (Carbon et al., 2021; Bateman et al., 2021; Bult et al., 2019; Chen et al., 2017a,b). Faria et al. (2012) found that orthologous and homologous genes carry inconsistent GOA in the UniProtKB database (Bateman et al., 2021). A large proportion of inconsistent GOA are associated with broad GO terms located at higher levels of the Gene Ontology directed acyclic graph (DAG) (Camon et al., 2005; Škunca et al., 2012; Poux and Gaudet, 2017), which results in insufficient specificity in describing gene function information. Chen et al. (2021) found that certain instances of inconsistent GOA were linked to irrelevant GO terms. These terms were identified as keywords within the direct literature evidence, but did not describe the specific gene's molecular activity, cellular compartment, or biological function. Through an analysis of guidelines from different biological databases and oral communication with expert curators, Chen et al. (2022) established four main types of inconsistencies related to GOA.
The GO curation community has prioritized the review of GOA consistency in its routine quality maintenance procedures (Carbon et al., 2021). Compared to automatic approaches, manual detection of inconsistency in GOA is considered more reliable due to humans' comprehensive understanding of biological background knowledge. However, manual approaches for addressing inconsistencies in GOA can be both time-consuming and resource-intensive, relying on the expertise of curators in both biology and the scientific literature, as well as comprehensive guidelines (Balakrishnan et al., 2013; Poux and Gaudet, 2017; Thomas, 2017). Furthermore, these manual approaches cannot scale up with the rapidly evolving knowledge related to gene function. As a potential solution, automatic approaches incorporating human-in-the-loop curation have been proposed (Chen et al., 2021, 2022).
One existing automatic approach to GOA inconsistency detection (Chen et al., 2022) is not yet practical, as it fails to consider the complexity and variability of biological information, such as the linguistic variation in the expression of biological concepts in evidence text (e.g., apoptosis vs. cell death), particularly where this variation extends beyond simple synonymy. This deficiency relates to its limited use of background knowledge: relying only on the title and abstract of the direct literature does not fully capture the complexity of GOA or its broader context.
GOA are important resources for supporting modern biological research. However, GOA are subject to inconsistencies such as the selection of overly broad or overly specific GO terms or incorrect association with gene mentions. These inconsistencies are problematic because they can undermine the reliability of downstream biological research tasks or cause cascading errors in biological databases (Gilks et al., 2002; Goudey et al., 2022). Existing automatic approaches for detecting inconsistencies within GOA are constrained in their real-world applicability due to their limited use of relevant background knowledge. To address this issue, we investigate several types of background knowledge related to GOA and propose integration methods to improve the automatic detection of GOA inconsistencies.
Biological background knowledge plays an important role in GOA curation. Consider the following instance. The evidence text reads: "RHGF-2 RhoGEF activity is specific to the C. elegans RhoA homolog RHO-1 as determined by direct binding, GDP/GTP exchange, and serum response element-driven reporter activity" (PMID:22363657). In the supported annotation, the GO term "GDS" (GO:0005085) is erroneously assigned to the wrong gene, "RHO-1"; the misidentified gene is mentioned alongside the correct gene, "RHGF-2", in the evidence text. Human curators can easily spot the misidentified gene by referring to the GO DAG and existing annotations: "GTP binding" (GO:0005525) is an existing annotation of the gene "rho-1", and, according to the GO DAG, it is positively regulated by "GDS" (GO:0005085). Thus, the function "GDS" can only belong to the related gene "RHGF-2", co-mentioned in the evidence. Such latent constraints cannot be captured by reading the direct evidence of a single GOA instance alone, without reference to the GO DAG. In fact, many GO annotation patterns have been discovered and utilized as background knowledge to support quality assurance. For example, taxon constraints on GO terms (Huntley et al., 2014) have helped correct 1.6 million inconsistent annotations (Dimmer et al., 2010). In short, a thorough understanding of background knowledge related to GOA is critical for enabling practical automatic GOA inconsistency detection.
To the best of our knowledge, this is the first empirical investigation of approaches for integrating background knowledge to facilitate automated inconsistency detection in real-world GOA datasets. We make three key contributions:

• We find that integrating GOA-related background knowledge can lead to a substantial improvement in the performance of GOA inconsistency detection. As a result, our approaches achieve a new state-of-the-art in this area.

• We establish a framework to investigate various approaches for integrating different types of background knowledge in order to facilitate automated detection of GOA inconsistencies.

• We extend an existing dataset (Chen et al., 2022) by sourcing and adding GOA-related background knowledge to each instance, such as GeneRIF statements as alternative evidence text (Brown et al., 2015), existing GO annotation patterns obtained from QuickGO (Binns et al., 2009), hierarchical knowledge encoded from the GO DAG, and other relevant sources. We also substitute the test set in the original dataset (Chen et al., 2022) with GOA instances directly sampled from real-world databases.

Background
The GOA inconsistency detection task was first proposed by Chen et al. (2021, 2022), aiming to detect GOA self-consistency and four primary types of inconsistency at the record level: over-broad, over-specific, irrelevant GO mention, and incorrect selection of gene mention. These (in)consistencies are modeled as typed semantic relationships in a triplet of GO term, gene product, and naturally written evidence. We follow the same multi-class, single-output classification setting to detect typed inconsistencies. In this framework, the classifier determines whether a given GOA triplet is self-consistent or not and identifies the type of inconsistency reflected in the instance. Our approach involves integrating GOA-related background knowledge into the same modeling framework. Below, we summarise several types of GOA-related background knowledge commonly used by human curators to assess GOA quality. Identifying effective methods for integrating this background knowledge could significantly improve methods for automatically detecting GOA inconsistencies.
Gene Reference Into Function (GeneRIF) statements provide concise summaries of key gene function knowledge written in natural language (Brown et al., 2015). Although their original purpose was to improve the efficiency of manual evidence review, they have been repurposed to facilitate the automated extraction of gene-disease associations, surpassing the Online Mendelian Inheritance in Man (OMIM) database (Hamosh et al., 2005) as a source for extracting gene-disease relationships. There are strong semantic connections between GO and GeneRIF (Lu et al., 2006). One study suggested that the identification of GO terms in text can aid the automatic selection of candidate sentences for generating GeneRIF (Gobeill et al., 2008). Therefore, it is worth investigating whether incorporating GeneRIF statements can improve the detection of inconsistencies in GOA.
Biological concepts are mentioned as words within evidence text. The identification of these concepts has supported the implementation of many automatic GOA production systems in the BioCreative IV task (Mao et al., 2014). Since GOA capture the relationship between a gene product and a GO term, both of which are fundamental biological concepts, we assume that identifying these concepts can facilitate the automated detection of inconsistencies in GOA. Currently, several systems can be directly leveraged to extract biological concepts from PubMed articles. For instance, PubTator is capable of automatically identifying mentions of biological concepts in raw text (Wei et al., 2019), whereas ConceptMapper (Tanenblatt et al., 2010; Funk et al., 2014), MetaMap (Aronson, 2001), and the NCBO annotator (Jonquet et al., 2009) can extract GO term mentions from unstructured text.

The hierarchical structure of the GO can be modelled as a directed acyclic graph (DAG). Curators use the GO DAG to find GO terms with the most appropriate specificity for representing the function of a specific gene. Previous work proposed a graph neural network model to encode the specificity relationships in the GO DAG and improved the detection of GOA inconsistency caused by over-broad and over-specific selection of GO terms (Chen et al., 2022). This study explores whether incorporating the GO DAG or other related knowledge, including biological concept mentions, GeneRIF statements, or existing GOA patterns, can improve the identification of inconsistencies in GOA.
Consistency evaluation of a single GOA instance often considers the full collection of existing GOA records associated with a curated gene. One indicator of potential inconsistency arises when a gene product is assigned multiple distinct GO terms that essentially describe similar gene functions, suggesting that annotations for the same gene product are not uniform when characterizing those functions. Such co-annotation patterns for a given gene (Carbon et al., 2021) can help detect inconsistencies in GOA. In addition, having multiple annotators work on the same set of gene products is a strategy sometimes used to support error detection (Balakrishnan et al., 2013): discrepancies in the annotations made by different annotators indicate possible inconsistencies in the GOA database. Wood et al. (2020) utilized co-annotation patterns to develop heuristic rules aimed at detecting erroneous annotations across five species. Building on the idea of considering a collection of related annotations together, our work proposes automatic approaches for encoding the knowledge inherent in existing GOA, with the aim of investigating its potential utility for automatic detection of inconsistencies in GOA.

Methods
We modelled GOA self-consistency or inconsistencies at the record level as typed semantic relationships in a triplet of GO term, gene product, and naturally written literature title and abstract as evidence. This approach extends prior work (Chen et al., 2022) by integrating various kinds of background knowledge into each record. Our goal is to develop classification models that, given a GOA triplet and related background knowledge as input, make a classification decision as to whether the triplet is self-consistent and, if not, what type of inconsistency is reflected in the instance. Background knowledge is integrated into the classification process in several different ways, such as replacing or augmenting the input evidence text, or by encoding the background knowledge and embedding it within the classification framework. That is, the classifier makes a determination about whether the textual evidence supports the GO term annotation to the given gene product. To simplify our study, we assume that a GOA can have only one type of inconsistency and that the typed inconsistencies are independent of each other. The outline of our methods is illustrated in Figure 1.

Data
We prepared the data by sourcing relevant background knowledge for each instance in the GOA inconsistency dataset published by Chen et al. (2022).
The dataset (Chen et al., 2022) and the modified test set have been specifically designed to facilitate the investigation of automatic approaches with regard to their capability to discern the nuanced semantics of GO specificity. The over-specific instances in the dataset were constructed using directly connected child GO terms, while over-broad instances were created using ancestor GO terms. As an ancestor may not be directly linked to a descendant in the GO DAG, it can have a greater semantic distance. Consequently, over-specific instances are generally more difficult for automatic semantic inference than over-broad instances.
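The construction above can be sketched as follows. This is a minimal, dependency-free illustration, assuming a toy slice of the GO DAG with illustrative term IDs; it synthesizes over-broad variants from direct parents only, whereas the dataset may also use more distant ancestors.

```python
# Sketch of synthesizing labeled inconsistent GOA instances from a
# consistent triplet using the GO hierarchy (toy DAG; IDs illustrative).

# child -> direct parents ("is_a"/"part_of"); a tiny slice of the GO DAG
parents = {
    "GO:0004674": ["GO:0004672"],   # protein ser/thr kinase -> protein kinase
    "GO:0004672": ["GO:0016301"],   # protein kinase -> kinase activity
}
children = {}
for child, ps in parents.items():
    for p in ps:
        children.setdefault(p, []).append(child)

def synthesize(instance):
    """Create labeled over-broad / over-specific variants of a triplet."""
    go, gene, pmid = instance
    variants = []
    # over-broad: replace the GO term with an ancestor (here: direct parent)
    for p in parents.get(go, []):
        variants.append(((p, gene, pmid), "over-broad"))
    # over-specific: replace the GO term with a directly connected child
    for c in children.get(go, []):
        variants.append(((c, gene, pmid), "over-specific"))
    return variants

consistent = ("GO:0004672", "Gene:2475", "PMID:22343943")
for triplet, label in synthesize(consistent):
    print(triplet, label)
```

The asymmetry noted above falls out of this construction: child terms are always one hop away, while ancestors may be several hops away.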

GeneRIF Statements as alternative evidence text
We retrieved GeneRIF statements from NCBI as an alternative evidence context for GOA, replacing the literature title and abstract of each instance in the original dataset. GeneRIF statements typically comprise a single, manually created short sentence that describes essential gene function knowledge. Each GeneRIF statement is linked to its supporting literature, indexed by PMID and GeneID combined (PMG). All instances in the selected GOA dataset can be linked to a specific GeneRIF statement using PMG. Conversely, a GeneRIF statement may be linked to multiple GOA instances that share a PMG but have different GO terms. For example, evidence from the literature (PMID:22343943) indicates that "(Gene:2475)" enables the cellular function of "protein serine/threonine kinase activity (GO:0004674)" and is involved in the biological process of "cellular response to starvation (GO:0009267)", creating two GOA instances. However, a GeneRIF statement is not guaranteed to contain the gene function information represented by the linked GOA. GeneRIF describes only a single piece of key gene function information from the direct literature, while the literature may contain multiple pieces of gene function information and support multiple GOA instances. Therefore, it is important to investigate the effectiveness of using GeneRIF as an alternative source of evidence.
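The PMG-keyed substitution described above can be sketched as follows. The record layout and the GeneRIF sentence are simplified, hypothetical stand-ins, not actual NCBI file contents; the point is only that one GeneRIF keyed by (PMID, GeneID) may replace the evidence of several GOA instances.

```python
# Sketch: link GOA instances to GeneRIF statements via the shared
# (PMID, GeneID) key ("PMG") and substitute the evidence text.
# Field names and the GeneRIF sentence are illustrative assumptions.

generifs = {
    # (PMID, GeneID) -> GeneRIF statement (hypothetical text)
    ("22343943", "2475"): "Starvation regulates this kinase's activity ...",
}

goa_instances = [
    {"go": "GO:0004674", "gene": "2475", "pmid": "22343943",
     "evidence": "<title + abstract of PMID:22343943>"},
    {"go": "GO:0009267", "gene": "2475", "pmid": "22343943",
     "evidence": "<title + abstract of PMID:22343943>"},
]

def substitute_evidence(instances, rifs):
    """Replace literature evidence with the GeneRIF linked by PMG."""
    out = []
    for inst in instances:
        key = (inst["pmid"], inst["gene"])
        if key in rifs:  # one GeneRIF may serve several GOA instances
            inst = dict(inst, evidence=rifs[key])
        out.append(inst)
    return out

subbed = substitute_evidence(goa_instances, generifs)
```

Note that both instances above end up sharing the same GeneRIF evidence, which is exactly the one-to-many relationship discussed in the text.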

Pre-extraction of biological concepts
To identify biological concepts, we applied PubTator (Wei et al., 2019) to identify the token spans of several kinds of biological concepts in the literature title and abstract, including disease, chemical compound, species, mutation, and cell line, and applied ConceptMapper (Funk et al., 2014) to identify the token spans of GO mentions.

[Fig. 1. Overview of the proposed pipeline. Step 1: synthesize inconsistent GOA by applying heuristic rules to consistent GOA obtained from real-world databases. Step 2: construct background knowledge graphs from existing GOA and the GO DAG and encode them into node embeddings. GeneRIF statements serve as a replacement for the literature evidence of each instance; entities (concepts) are encoded either within the text using linguistic markers or through position encoding within the distributional language model; GNN-PubMedBERT integrates multiple alternatives for knowledge node embedding; flagged suspicious records are sent back to human curators for manual quality review.]

Pre-processing of GO Direct Acyclic Graph
Inferring the specificity of a single pair of GO terms by leveraging the entire GO DAG is unrealistic given the size of the graph. Consequently, we selected only the portion of the GO DAG that has direct connections to the GO terms present in the dataset.
Utilizing QuickGO (Binns et al., 2009), we retrieved the children of GO terms that are related either through the "is_a" or "part_of" relation types.
GO terms connected by other relation types, such as "regulates", are not retrieved, as they do not provide any GO specificity knowledge. We then reconstruct a sub-part of the GO DAG with vertices representing GO terms and edges representing directed hierarchical relations. This sub-graph is used for encoding GO specificity knowledge with a Graph Neural Network (GNN), introduced in Section 3.3. Direction-reversed edges with the "reverse" edge type are added to the graph, allowing for bi-directional message passing between specific and broad GO terms. Unlike prior work (Chen et al., 2022), we do not assign typed reversed edges such as "parent_is_a" or "parent_part_of". This enables the GO DAG encoder to concentrate exclusively on capturing the broad hierarchical relationships between pairs of nodes, irrespective of the particular hierarchy types involved, and contributes to a more effective encoding of hierarchical relationships. The GO specificity knowledge graph created in this manner is shared among all subsets, mimicking real-world curation scenarios in which human curators refer to a part of the GO DAG during GOA consistency checks.
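The filtering and reverse-edge construction above can be sketched as follows. This is a minimal sketch with toy term IDs; the forward edge name "child_of" is our own illustrative label, and the hierarchy tuples stand in for what would be retrieved, e.g., via QuickGO.

```python
# Sketch: build the GO sub-DAG used for specificity encoding, keeping only
# "is_a"/"part_of" edges and adding untyped "reverse" edges for
# bi-directional message passing (toy term IDs).

hierarchy = [
    # (child, relation, parent) tuples as retrieved
    ("GO:0007631", "is_a", "GO:0007610"),       # feeding behavior is_a behavior
    ("GO:0004674", "is_a", "GO:0004672"),
    ("GO:0051179", "regulates", "GO:0008150"),  # carries no specificity info
]

edges = {"child_of": [], "reverse": []}
for child, rel, parent in hierarchy:
    if rel not in ("is_a", "part_of"):
        continue  # drop "regulates" and similar relation types
    edges["child_of"].append((child, parent))
    # a single untyped "reverse" edge, rather than typed
    # "parent_is_a"/"parent_part_of" edges as in prior work
    edges["reverse"].append((parent, child))
```

The single untyped "reverse" label is the design choice discussed above: the encoder sees only the direction of the hierarchy, not which hierarchy type produced it.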

Retrieval of existing GOA
We retrieved existing GOA from the NCBI gene2go resource as a type of background knowledge for each GOA instance in the dataset. Consider the GOA instance indexed by the triplet of identifiers "GO:0015804_Gene:836729_PMID:27925655". We retrieved and added to this instance a list of GO terms that are existing annotations of "Gene:836729" supported by either the same or other direct literature, such as "GO:0015175", "GO:0005774", and "GO:0015173".
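The retrieval step can be sketched as follows. The rows are illustrative stand-ins for gene2go content (only the identifiers taken from the example above are real), and excluding the instance's own GO term is our own simplifying assumption.

```python
# Sketch: collect existing GO annotations of a gene from gene2go-style
# rows and attach them to a GOA instance as background knowledge.

gene2go = [
    # (GeneID, GO term, PMID); rows are illustrative
    ("836729", "GO:0015804", "27925655"),
    ("836729", "GO:0015175", "27925655"),
    ("836729", "GO:0005774", "12345678"),
    ("818801", "GO:0015175", "27925655"),
]

def existing_annotations(gene, exclude_go=None):
    """All GO terms already annotated to `gene`, from any literature."""
    return sorted({go for g, go, pmid in gene2go
                   if g == gene and go != exclude_go})

# background knowledge for instance GO:0015804_Gene:836729_PMID:27925655
bg = existing_annotations("836729", exclude_go="GO:0015804")
print(bg)  # -> ['GO:0005774', 'GO:0015175']
```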

Modeling
We extended a previously proposed model, which involves the direct fine-tuning of PubMedBERT (Gu et al., 2021; Chen et al., 2022), on the training and development sets. Our study extended this model by integrating the sourced background knowledge either in the input or within the inconsistency detection model. As a baseline, we adopted a random-guess strategy that assumes a uniform distribution over all categories of inconsistencies.
Pre-encoding of existing GOA patterns and GO specificity knowledge

We employed the DGL toolkit (Wang et al., 2019) to model GOA as an undirected heterogeneous graph (Fig 2). We represented each instance as a pair of two connected nodes, where one node is a tuple of PMID and Gene (PMG) and the other node is the corresponding GO term. Edges exist only between PMG and GO term nodes, and represent an annotation relation.
The local topology of the modeled GOA graph can be used to identify existing GOA patterns. For example, Fig 2 illustrates the co-annotation patterns of "GO:0015175" and "GO:0015804" across "Gene:836729" and "Gene:818801", supported by "PMID:27925655". The co-annotation of GO terms suggests that the two genes are highly related in their function; in fact, they are members of the same gene family encoding transmembrane amino acid transporter proteins, based on information from Entrez Gene (Brown et al., 2015). The identification of annotation patterns is crucial for detecting GOA inconsistencies. To achieve this, we set up a learning objective of classifying edge types on the GOA heterogeneous graph using a GNN model designed to automatically learn local topological features in existing GOA. We constructed three heterogeneous knowledge graphs based on the GOA inconsistency subsets: training, development, and held-out test. We labeled the edges as follows. An edge connects a PMG and a GO term node, representing either a consistent or an inconsistent GOA instance. Consistent GOA instances are labeled as "annotate" edges, whereas inconsistent GOA instances, irrespective of the type of inconsistency, are labeled as "not_annotate" edges. Additionally, existing GO terms that serve as background knowledge for each individual GOA instance are connected to the PMG node by "annotate" edges. After conducting multiple trials, we determined that the optimal initialization of node representations is 8-dimensional all-ones vectors for PMG nodes and random 8-dimensional vectors for GO term nodes.
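The edge-labeling scheme above can be sketched as follows. Plain tuples stand in for the DGL heterograph used in the actual implementation; the instance and background data reuse the identifiers from the example above.

```python
# Sketch of the heterogeneous GOA graph edges: PMG (PMID+Gene) nodes, GO
# term nodes, and "annotate"/"not_annotate" edge labels. Plain dicts and
# tuples stand in for the DGL heterograph.

def build_goa_edges(instances, background):
    """instances: (pmid, gene, go, is_consistent); background: pmg -> [go]."""
    edges = []
    for pmid, gene, go, consistent in instances:
        pmg = f"{pmid}_{gene}"
        label = "annotate" if consistent else "not_annotate"
        edges.append((pmg, go, label))
    # existing GO terms of the gene are attached as "annotate" edges
    for pmg, gos in background.items():
        for go in gos:
            edges.append((pmg, go, "annotate"))
    return edges

g = build_goa_edges(
    [("27925655", "836729", "GO:0015804", True)],
    {"27925655_836729": ["GO:0015175", "GO:0005774"]},
)
```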
We designed a GNN structure with three HeteroGraphConv layers (Wang et al., 2019) to encode existing GOA patterns into node embeddings (Figure 3).A HeteroGraphConv layer comprises two aggregation steps, an activation function, and batch normalization.The initial step consists of two GraphSAGE sub-layers that are responsible for aggregating information propagated through neighboring nodes of the same edge type.The second step involves aggregating the output of the two GraphSAGE sub-layers using a convolution function.This approach is specifically devised for inductively training node features across diverse graphs.We use the rectified linear unit (ReLU) as the activation function for each HeteroGraphConv layer, followed by batch normalization.
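To make the aggregation concrete, here is a toy, dependency-free sketch of one GraphSAGE-style step on the bipartite GOA graph. It omits the learned weight matrices, the per-edge-type sub-layers, and batch normalization of the real three-layer HeteroGraphConv model, keeping only "self + neighbourhood mean, then ReLU".

```python
# Toy sketch of one GraphSAGE-style aggregation step: each node's new
# feature combines its own feature with the mean of its neighbours'
# features, followed by ReLU (learned weights omitted for clarity).

def sage_step(feats, edges):
    """feats: node -> list[float]; edges: list of (src, dst) pairs."""
    nbrs = {}
    for src, dst in edges:
        nbrs.setdefault(dst, []).append(feats[src])
    new = {}
    for node, h in feats.items():
        msgs = nbrs.get(node, [])
        if msgs:
            mean = [sum(col) / len(msgs) for col in zip(*msgs)]
        else:
            mean = [0.0] * len(h)
        # "self + neighbourhood mean" combine, then ReLU
        new[node] = [max(0.0, a + b) for a, b in zip(h, mean)]
    return new

feats = {"PMG:1": [1.0, 1.0], "GO:A": [0.5, -2.0], "GO:B": [0.5, 0.0]}
edges = [("GO:A", "PMG:1"), ("GO:B", "PMG:1"), ("PMG:1", "GO:A")]
out = sage_step(feats, edges)
```

Stacking three such layers, as the model does, lets information propagate up to three hops, so a PMG node can "see" the other genes co-annotated with its GO terms.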
To enable our implemented GNN for edge type classification, we added a multi-layer perceptron (MLP) with a single fully-connected hidden layer at the end to form a classification head. The input to the MLP is two concatenated 8-dimensional node vectors, representing the vectorization of an edge between a pair of nodes. The output of the MLP is a 2-dimensional logit vector. We apply the argmax function to the logit vector to predict the edge type as either "annotate" or "not_annotate". After training on the knowledge graph constructed from the training and development GOA inconsistency subsets, we apply the pre-trained GNN to the held-out test set to obtain the node embeddings of PMG and GO term nodes, retrieved from the output of the last HeteroGraphConv layer. We encoded GO specificity knowledge into 16-dimensional node embeddings using an approach similar to that proposed in previous research (Chen et al., 2022). However, we made a slight alteration by merging the "is_a" and "part_of" edges into one type, "children_of", and the "parent_is_a" and "parent_part_of" edges into "parent_of". This modification was implemented because we observed that the previous GNN model could exploit the type of reversed edges in the pre-training objective. For instance, the reversed "parent_is_a" edge from "behavior" to "feeding behavior" could aid the GNN model in predicting the "is_a" edge type in the opposite direction. This edge-merging strategy is appropriate for training the GNN model to incorporate GO specificity knowledge, as it does not alter the hierarchical relationships between paired GO terms.

Integration of background knowledge in the input
• GeneRIF substitution: We performed fine-tuning of a new PubMedBERT model on instances where the evidence texts were substituted with GeneRIF (Gene Reference Into Function) statements.

• Entity reformulation: We performed fine-tuning of a new PubMedBERT on instances where mentions of biological concepts in the input were pre-extracted and reformulated using linguistic markers.
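The reformulation step can be sketched as follows. The marker surface forms ("[GENE] … [/GENE]" etc.) are our own illustrative choice; the paper does not specify the exact marker strings.

```python
# Sketch of in-text entity reformulation: wrap pre-extracted concept
# spans in linguistic markers before fine-tuning. Marker strings are
# illustrative assumptions, not the paper's exact surface forms.

def reformulate(text, spans):
    """spans: list of (start, end, type) character spans, non-overlapping."""
    out, cursor = [], 0
    for start, end, etype in sorted(spans):
        out.append(text[cursor:start])
        out.append(f"[{etype}] {text[start:end]} [/{etype}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

s = reformulate("ClpXP degradation machine",
                [(0, 5, "GENE"), (6, 17, "GO")])
print(s)  # -> "[GENE] ClpXP [/GENE] [GO] degradation [/GO] machine"
```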

Integration of background knowledge within model
We developed a joint model, referred to as GNN-PubMedBERT (shown in Figure 4), which integrates the pre-trained topological features of ontology and annotation with distributional semantics for automatic GOA inconsistency detection. This integration is achieved by flat concatenation of the node embeddings and the last hidden layer of PubMedBERT, which serves as the input to a single-layered MLP forming a multi-class classifier. The design of the joint model allows for different combinations of topological feature integration, such as GO DAG, existing GOA, or a combination of both (referred to as GoDg & exGOA).
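The flat concatenation can be sketched as follows. Dimensions here are illustrative stand-ins (PubMedBERT's actual hidden size is 768, and the node embeddings described above are 8- or 16-dimensional); the sketch shows only how the classifier input is assembled.

```python
# Sketch of the joint-feature step in GNN-PubMedBERT: the pooled text
# representation and the pre-trained node embeddings are flatly
# concatenated before the MLP classifier. Dimensions are illustrative.

def joint_features(text_vec, pmg_vec, go_vec):
    """Flat concatenation of distributional and topological features."""
    return list(text_vec) + list(pmg_vec) + list(go_vec)

text_vec = [0.1] * 4   # stand-in for PubMedBERT's pooled output
pmg_vec = [0.2] * 8    # existing-GOA node embedding (PMG node)
go_vec = [0.3] * 8     # GO term node embedding
x = joint_features(text_vec, pmg_vec, go_vec)
assert len(x) == 20    # classifier input dimension = sum of the parts
```

Swapping in dummy Gaussian vectors for `pmg_vec`/`go_vec` reproduces the +dummy embeddings baseline described in the Experiments section.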
We extended the PubMedBERT model with an entity encoding layer (or entity embeddings) to process the input instances (Fig 4). This layer encodes the text spans of pre-extracted biological concepts using consecutive 1s and all other input tokens using consecutive 0s. For example, assuming an input consisting of three tokens, "ClpXP", "degradation", "machine", the corresponding entity encoding will be (1, 1, 0), as "ClpXP" is a gene and "degradation" is a GO term mention, while "machine" is not a biological concept.
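The mask construction can be sketched as follows, reproducing the (1, 1, 0) example above. The token-level lookup is a simplification; the real layer marks sub-token spans and embeds the mask alongside the usual token embeddings.

```python
# Sketch of the entity encoding layer input: tokens covered by a
# pre-extracted concept span get 1, all other tokens get 0.

def entity_mask(tokens, concept_tokens):
    """Binary mask over tokens; concept_tokens is a set of entity tokens."""
    return [1 if t in concept_tokens else 0 for t in tokens]

tokens = ["ClpXP", "degradation", "machine"]
concepts = {"ClpXP", "degradation"}   # gene and GO term mentions
print(entity_mask(tokens, concepts))  # -> [1, 1, 0]
```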

Metrics
Evaluation follows a one-vs-all strategy for the detection of each specific type of inconsistency. For example, for the over-specific label, we take positives to be instances with the over-specific label and negatives to be instances with any other label. In this case, true positives (TP), false positives (FP), and false negatives (FN) are counted as follows:

• TP: the predicted label is over-specific for a positive instance
• FP: the predicted label is over-specific for a negative instance
• FN: the predicted label is not over-specific for a positive instance

We use Precision (P = TP / (TP + FP)), Recall (R = TP / (TP + FN)), and F1 score (F1 = 2 × P × R / (P + R)) as evaluation metrics.
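The definitions above can be computed directly; this minimal sketch evaluates one label (here "over-specific") against toy gold/predicted label lists.

```python
# One-vs-all precision/recall/F1 for a single inconsistency type,
# following the TP/FP/FN definitions given above.

def one_vs_all_prf(gold, pred, positive="over-specific"):
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

gold = ["over-specific", "over-broad", "over-specific", "consistent"]
pred = ["over-specific", "over-specific", "consistent", "consistent"]
p, r, f = one_vs_all_prf(gold, pred)
print(p, r, f)  # -> 0.5 0.5 0.5
```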

Experiments
We conducted a set of experiments to examine the effectiveness of integrating background knowledge through two approaches: incorporating it into the input or embedding it within the model. In particular, we compare the performance of different types of background knowledge integration alternatives (summarized in Table 1). As a baseline, we use a naive strategy of random guessing (baseline), which could theoretically achieve an average precision, recall, and F-score of 0.20, given that there are five classes with 500 instances each for evaluation. The fundamental competitor model is direct fine-tuning of PubMedBERT without any integration of background knowledge. We also fine-tuned PubMedBERT on instances where literature evidence was replaced by GeneRIF statements (+GeneRIF) to determine whether these statements, which include concise gene function descriptions, are suitable for automatic GOA inconsistency detection. Furthermore, we fine-tuned another PubMedBERT on instances with pre-extracted biological concepts, processing the instances using entity embeddings. We then examined the individual (+GoDg / +exGOA) and combined (+GoDg & exGOA) integration of the GO DAG and existing GOA using the GNN-PubMedBERT model. In addition, we generated a set of dummy node embeddings (+dummy embeddings) from a Gaussian distribution, serving as a baseline for the evaluation of graph knowledge integration. This allows us to determine whether the integration of pre-trained graph knowledge embeddings in the GNN-PubMedBERT model was effective compared to the dummy node embeddings.

Results
Table 2 provides a summary of the experimental outcomes. Overall, each system variation leveraging background knowledge demonstrates strong effectiveness over the baseline. This indicates that automatic approaches are apt for detecting inconsistencies in GOA, demonstrating their ability to make informed and reliable predictions that surpass random chance.
The joint model, GNN-PubMedBERT for integrating graph-structured background knowledge, stands out by combining the integration of GO DAG and existing GOA.It notably outperforms other background knowledge integration approaches, such as +GeneRIF, +EntityEmbedding, and +EntityReformulation, in detecting three out of the four most prevalent types of inconsistencies (excluding irrelevant GO mentions).This underscores the effectiveness of the deployed strategy of joint GNN-PubMedBERT in integrating background knowledge, further supporting its success in improving the detection of GOA inconsistencies.
Utilizing GeneRIF statements as an alternative evidence source yields a notable enhancement in Recall (+0.14) for the detection of IG-type inconsistencies compared with the competitor model (PubMedBERT), albeit with a slightly lower F1 (−0.01). However, it degrades the detection of other types of inconsistencies, most markedly over-specific inconsistencies, with a decrease of 0.22 in F1.
Utilizing entity embeddings (+EntityEmbedding) for the integration of biological concepts results in slightly better performance than in-text reformulation using linguistic markers (+EntityReformulation), except for the identification of irrelevant GO mentions (IM) (0.88 < 0.94 in F1). Neither approach improves the detection of self-consistency (PubMedBERT F1 = 0.43 cf. +EntityEmbedding F1 = 0.40 or +EntityReformulation F1 = 0.39).
Incorporating the GO DAG (+GoDg) as an additional crucial source of background knowledge improves the detection of self-consistency (+0.20 in F1), over-specific (+0.03 in F1), and over-broad inconsistencies (+0.05 in F1) compared to PubMedBERT. However, this integration does not improve the identification of inconsistencies related to irrelevant GO mentions (IM) and incorrect genes (IG), aligning with the findings of previous research using GNN-BERT (Chen et al., 2022).
On the other hand, the integration of existing GOA (+exGOA) enhances the detection of all four prevalent types of inconsistencies when compared to both the fundamental competitor (PubMedBERT) and the integration of the GO DAG (+GoDg). Despite this improvement, entity reformulation continues to outperform +exGOA, particularly in the detection of irrelevant GO mentions (IM). Notably, combining the two sources of background knowledge, the GO DAG and existing GOA (+GoDg & exGOA), exhibits the best performance in detecting most types of inconsistencies among all experimented approaches.

Discussion
In this paper, we investigated the value of utilizing biological background knowledge and developed several approaches to incorporate such knowledge to facilitate the automatic detection of GOA inconsistencies.
Here, we discuss both the strengths and limitations of each approach employed in integrating background knowledge during the experiments. Additionally, we discuss the impact of the newly introduced test set on our study and demonstrate its reliability for evaluation purposes as compared to the original BC4GO-derived test set. We present two ad-hoc examples of inconsistencies detected by the model in the real-world gene2go dataset, illustrating its practical utility on real-world GOA instances. The results of PubMedBERT compared with +GeneRIF in Table 2 reveal that GeneRIF statements are a more dependable alternative evidence source for detecting instances of incorrect gene selection (IG) than leveraging the title and abstract of the direct literature as evidence. This preference stems from the observation that literature abstracts often reference multiple genes, introducing confusion for both human and automated curation systems when identifying the correct gene for annotation. In contrast, a GeneRIF typically presents the functional context of a single gene product. This enables the automatic inconsistency detector to categorize an instance as IG, without the distraction of other gene mentions, if the annotated gene product is absent from the GeneRIF statements. This strategy notably improves Recall in the detection of IG.
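The absence-of-gene cue described above can be sketched as a simple filter. The function name and the plain substring match below are illustrative assumptions, not the paper's implementation:

```python
# Sketch of the GeneRIF-based cue: if the annotated gene product never
# appears in the GeneRIF statements for the cited paper, the instance is
# a candidate incorrect-gene (IG) inconsistency. A real system would use
# gene-name normalization rather than naive substring matching.
def is_candidate_ig(gene_symbol: str, generif_statements: list[str]) -> bool:
    """Flag a GOA instance as a candidate IG inconsistency when the
    annotated gene is never mentioned in its GeneRIF statements."""
    gene = gene_symbol.lower()
    return all(gene not in statement.lower() for statement in generif_statements)

statements = [
    "Glutamine synthetase limits beta-catenin-mutated liver cancer growth "
    "by maintaining nitrogen homeostasis and suppressing mTORC1",
]
print(is_candidate_ig("BRCA1", statements))   # gene absent -> candidate IG
print(is_candidate_ig("mTORC1", statements))  # gene mentioned -> not flagged
```

Because each GeneRIF describes a single gene product, this check avoids the multi-gene ambiguity of abstracts, which is consistent with the Recall gain reported for IG detection.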
However, relying on GeneRIF statements alone provides few cues for our implemented language models to distinguish specificity nuances between GO terms. Our model also currently faces challenges in distinguishing between related gene products that appear as highly similar text strings in the evidence text, exemplified by difficulties (comparing column IG to other columns in Table 2) in differentiating between entities such as the two protein complexes "mTORC1" and "mTORC2". For instance, the detector struggles to identify the inconsistency between "mTORC2" and a GeneRIF statement such as "Glutamine synthetase limits beta-catenin-mutated liver cancer growth by maintaining nitrogen homeostasis and suppressing mTORC1", leading to a decrease in Precision in IG detection. We hypothesize that integrating background knowledge to differentiate between gene products with highly similar text strings could offer a viable solution. The integration of existing GOA resulted in a substantial improvement in IG detection, likely attributable to the nuanced understanding of related genes provided by the knowledge graph formed by existing GOA. This integration generates diverse node embeddings that prove instrumental in identifying incorrect genes. Despite this improvement, the accessibility of GeneRIF statements, in contrast to the resource-intensive process of pre-training node embeddings on existing GOA knowledge graphs, underscores the continued value of GeneRIF as an evidence source.
Pre-extracted biological concepts have traditionally been favored for the processing of textual information. Many curation-assisting tools currently rely on keyword highlighting, such as PubTator (Wei et al., 2019). However, it is unclear whether this approach, the pre-identification of key concepts as keywords, is necessary for automatic systems as well. Our experiments demonstrate that PubMedBERT, which utilizes pretrained distributional semantics and self-attention mechanisms, is capable of identifying and aligning biological concepts within raw text without pre-extraction. Compared to the entity embedding extension (+EntityEmbedding) and entity reformulation, PubMedBERT performs consistently well in detecting most types of prevalent inconsistencies. Nonetheless, pre-extraction of biological concepts may still have some value, as it can help to reduce the variance in gene ontology representations. For example, in cases where "intake of food" and "feeding behavior" are co-mentioned in the same text but the latter is marked as an irrelevant GO term annotation, PubMedBERT may fail to detect the inconsistency because it cannot align these phrases as the same GO concept. However, entity reformulation, which employs special linguistic markers, can partially standardize the various textual representations of GO concepts, allowing the model to generalize alignment across different forms of GO term representations. We posit that pre-extraction of biological concepts may not be essential for BERT-based models if other strategies, such as using sufficient training data or pre-normalization of entities, can accommodate the variance in GO term representations.
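As an illustration of the entity reformulation idea, the sketch below wraps pre-extracted concept spans in marker tokens. The `[GO] … [/GO]` marker format and the `reformulate` helper are hypothetical, chosen only to show the mechanism:

```python
# Illustrative in-text entity reformulation: wrap pre-extracted concept
# spans in special linguistic markers so that different surface forms of
# the same GO concept share an explicit boundary signal in the input.
def reformulate(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Insert [TYPE] ... [/TYPE] markers around annotated character spans.

    spans: (start, end, entity_type) triples, assumed non-overlapping.
    """
    out, cursor = [], 0
    for start, end, etype in sorted(spans):
        out.append(text[cursor:start])
        out.append(f"[{etype}] {text[start:end]} [/{etype}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

text = "intake of food regulates feeding behavior"
spans = [(0, 14, "GO"), (25, 41, "GO")]
print(reformulate(text, spans))
```

With both "intake of food" and "feeding behavior" carrying the same marker type, the model receives an explicit signal that the two surface forms belong to the same concept class, which is the standardization effect discussed above.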
Our findings substantiate the proposition that a nuanced comprehension of GO specificity, coupled with the utilization of existing GOA, is crucial for identifying inconsistencies. By incorporating the GO DAG, the detection of inconsistencies related to GO specificity improved significantly, achieving the highest Precision in identifying over-broad (OB) inconsistencies. However, the integration of the GO DAG still faces challenges in differentiating subtle semantic differences between highly related GO terms. This is revealed by the weaker detection of over-specific (OS) inconsistencies compared to OB. Specifically, the synthesis of over-specific inconsistencies involves capturing more nuanced semantic differences, focusing only on direct children, in contrast to the broader scope of over-broad inconsistencies, which encompasses all ancestors. Furthermore, the integration of the GO DAG has led to an excessive sensitivity in flagging suspicious instances as OS, as errors arise from misidentifying IM as OS when the irrelevant GO mention is a direct child of the correct GO term annotation. The incorporation of existing GOA into the detection process improves the identification of all prevalent types of inconsistencies. Surprisingly, we observed that this background knowledge surpasses the GO DAG in detecting over-specific inconsistencies, with an increase in Precision of 0.11. This is because knowledge of GO specificity is also embedded in existing GOA: a more general GO term is likely to be linked to numerous PMG nodes, whereas a more specific term is likely to be linked to a smaller number of PMG nodes. However, the accuracy of this integration can be influenced by the distribution of GO terms used in the annotations. We noted that the model tends to make overly broad predictions for frequently appearing GO terms such as "protein binding" or "process", resulting in a decrease in Precision. In contrast, the GO DAG is not affected by the annotation distribution. Similarly, we observed that the model with the integration of existing GOA tends to predict only rarely appearing GO terms as OS, while many over-specific GO terms that are moderately distributed in existing annotations remain unidentified. We argue that while integrating existing GOA outperforms the GO DAG in detecting GOA inconsistencies, the reliability of this approach depends on the distribution of annotations within the training set. This background knowledge is not suitable for identifying inconsistencies involving extremely broad or specific GO terms.
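The asymmetry between the two specificity errors can be made concrete with a toy DAG: over-broad candidates span all ancestors of the correct term, while over-specific candidates are limited to direct children. The hierarchy, term IDs, and helper names below are illustrative assumptions:

```python
# Toy GO-style DAG: child -> list of direct parents (is_a edges).
PARENTS = {
    "GO:feeding_behavior": ["GO:behavior"],
    "GO:adult_feeding_behavior": ["GO:feeding_behavior"],
    "GO:behavior": ["GO:biological_process"],
}

def ancestors(term):
    """All terms reachable by following parent edges upward."""
    seen, stack = set(), list(PARENTS.get(term, []))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS.get(t, []))
    return seen

def specificity_error(annotated, correct):
    """Classify the annotated term relative to the correct one."""
    if annotated in ancestors(correct):
        return "OB"  # over-broad: any ancestor qualifies
    if correct in PARENTS.get(annotated, []):
        return "OS"  # over-specific: direct children only
    return "other"

print(specificity_error("GO:behavior", "GO:feeding_behavior"))               # OB
print(specificity_error("GO:adult_feeding_behavior", "GO:feeding_behavior")) # OS
```

Because the OS test is restricted to one hop while the OB test ranges over the whole ancestor set, an OS decision must separate terms that are semantically much closer, matching the harder OS detection observed above.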
The combination of the GO DAG and existing GOA may alleviate the limitations discussed for either one as a single source of background knowledge. We found this method can better distinguish semantic nuances, reflected in the reduced F1 gap between the detection of OS and OB (0.01). This method improves the detection of IM and IG inconsistencies, which the sole integration of the GO DAG cannot. It also lifts Recall in the detection of OB without sacrificing Precision, in contrast to the sole integration of existing GOA. Similarly, this approach, regulated by the GO DAG, does not make over-specific predictions for rarely appearing GO terms; it lifts Precision in the detection of OS without sacrificing Recall. We assume that certain self-regulation rules are generalized through the combination of multiple types of background knowledge in this domain-specific semantic inconsistency detection task.
A key finding of our study is that the use of dummy embeddings outperformed the fundamental competitor (PubMedBERT) in the detection of self-consistency, OB, and IG instances. We attribute this improvement to two reasons: firstly, the dummy embeddings may have acted as an independent regularizer for PubMedBERT, preventing the model from over-fitting during fine-tuning on limited data; secondly, they may have reduced the impact of noisy samples during fine-tuning, where a noisy sample refers to a consistent instance incorrectly labeled as an inconsistency or vice versa. Although the dummy embeddings achieved the same F1 as the integration of the GO DAG and existing GOA in the detection of OB, they resulted in a low Precision. Similar trade-offs between Precision and Recall were also observed in the detection of IM. We suspect this is because the dummy embeddings provide no informative support compared with the GO DAG and existing GOA. The gains made by dummy embeddings will fade if human curators become involved in expanding the existing dataset with higher-quality instances. We also encourage the exploration of alternative independent regularizers to further improve the model's generalization capability.
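A minimal sketch of such a dummy-embedding control, assuming each instance simply receives a fixed pseudo-random vector in place of GNN-pretrained node embeddings (the seeding scheme and dimensionality are our own illustrative choices, not the paper's):

```python
import random

def dummy_embedding(instance_id: str, dim: int = 8) -> list[float]:
    """Deterministic pseudo-random vector per instance.

    Carries no graph knowledge; any performance gain from using it can
    only come from a regularizing effect on the downstream model.
    """
    rng = random.Random(instance_id)  # seeded so the vector is reproducible
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

v1 = dummy_embedding("PMID:123|geneA|GO:0006281")
v2 = dummy_embedding("PMID:123|geneA|GO:0006281")
print(v1 == v2)  # same instance -> same vector
```

Comparing a model fed these vectors against one fed pretrained node embeddings isolates how much of the improvement is attributable to regularization rather than to the graph-derived information itself.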
Another contribution of this study is the introduction of a novel test set designed for evaluating inconsistency detection in GOA, addressing several limitations of existing methodologies. Firstly, our decision to create this new test set stems from the acknowledgment of significant annotation discrepancies among annotators in BC4GO (Van Auken et al., 2014), rendering it an unreliable source for simulating real-world GOA inconsistencies. In addition, only 11.5% of the evidence sentences annotated in BC4GO were extracted from the title and abstract of the literature sources; the majority were extracted from the results and discussion sections of full-text articles. Acknowledging the real-world challenges associated with accessing full articles, our approach deliberately focuses on utilizing evidence extracted entirely from the title and abstract.
To further improve the reliability of our evaluations, we developed a held-out test set derived from the NCBI gene2go database. This test set remains independent of the existing training and development subsets, as evidenced by the absence of overlap in literature PMID sources. Recognizing the importance of authentic inconsistencies, we adopted the peer-reviewed inconsistency synthesis strategy proposed by Chen et al. (2022) for test set generation, followed by a manual check.
Our experimental findings demonstrate that evaluating the PubMedBERT model on this new test set, rather than on the BC4GO-derived dataset, provides a more realistic assessment of model performance on real-world data. This comprehensive approach addresses the limitations associated with existing test sets and contributes to a more robust assessment of GOA inconsistency detection models.
In summary, our study has made significant contributions by presenting a comprehensive exploration of various types of biological background knowledge and delivering novel approaches for their integration into automatic GOA inconsistency detection (outlined in Figure 1). Through our research, we have identified distinct value in each approach and in their synergistic combination, emphasizing the role of background knowledge in enhancing semantic inconsistency detection. Specifically, our methodologies effectively leverage the distributional semantics of GOA instances and the topological features of their associated background knowledge.
In the future, we plan to collaborate with expert curators to conduct systematic qualitative evaluations. We will explore ensemble models that unify our proposed approaches into a comprehensive inconsistency detection system. Finally, we will apply an end-to-end GOA inconsistency detection system to biological databases and assess its implications for biological database curation processes.

Fig. 3. Overview of the proposed pipeline: collect manually curated GOA from the database as consistent instances; apply heuristic rules to synthesize typed inconsistencies; manually verify true inconsistencies in the synthesized dataset; apply concept recognition to mark the spans of key biological concepts within evidence text; collect the GeneRIF statement indexed by the PMID and gene of each instance; collect existing GOA and the GO DAG based on the synthesized dataset; encode knowledge into node embeddings using a GNN with an edge-type classification objective; extract pre-trained node embeddings to combine with distributional language models (PubMedBERT leverages the contextual information confined to the individual GOA instance); and develop variants of multi-type single-class classifiers to detect inconsistencies in a human-in-the-loop setting.

Fig. 2. Modeling two GOA instances as an undirected heterogeneous graph.

Fig. 4. Architecture of the joint model for GOA inconsistency detection, with the addition of an entity layer in the input. Tok* denotes a linguistic token; E* and h* denote a token's input embedding and hidden representation; [CLS] and [SEP] are special tokens marking the boundary of an input pair. The highlighted entity embedding layer illustrates a case where tok1, tok3, and tok4 are part of the pre-extracted biological concepts and are thus encoded as digit 1. This figure demonstrates the combination of h[CLS], the GO DAG, and the GOA KG.
Fig. 4 demonstrates a GOA instance with a four-layered encoding, comprising entity embeddings, token embeddings, sentence embeddings, and positional embeddings. The entity embeddings are stacked with the other three encodings, forming the complete input representation for the model.
Architecture of the Graph Neural Network (GNN) with an edge-type classification objective for encoding existing GOA patterns. Et_n denotes the initial vectorization of node n's representation; h_u^n denotes the hidden representation of node u in the n-th layer of the GNN; SAGE denotes the GraphSAGE sub-layer for the aggregation of node features passed through the same type of edge; Convolution aggregates the output of the SAGE sub-layers; ReLU denotes the Rectified Linear Unit activation function; Norm denotes batch normalization; ∥ denotes the flat concatenation of two node vectors; MLP denotes a multi-layer perceptron with one hidden layer.
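The per-edge-type aggregation in this architecture can be sketched with a toy graph. The node features, edge types, and mean-aggregation choices below are illustrative assumptions rather than the trained GNN:

```python
import statistics

# Toy heterogeneous graph: PubMed (PM), gene (G), and GO-term (GO) nodes.
FEATS = {"PM1": [1.0, 0.0], "G1": [0.0, 1.0], "GO1": [0.5, 0.5]}
EDGES = {  # edge_type -> list of (src, dst)
    "annotated_with": [("G1", "GO1")],
    "evidenced_by":   [("G1", "PM1")],
}

def sage_layer(node):
    """SAGE-style sub-layers: mean-aggregate neighbours per edge type,
    then a 'Convolution' step averages the per-type messages together
    with the node's own features."""
    per_type = []
    for etype, pairs in EDGES.items():
        nbrs = [FEATS[v] for u, v in pairs if u == node] + \
               [FEATS[u] for u, v in pairs if v == node]
        if nbrs:
            per_type.append([statistics.fmean(col) for col in zip(*nbrs)])
    return [statistics.fmean(col) for col in zip(FEATS[node], *per_type)]

def edge_representation(u, v):
    """Flat concatenation (the ∥ step) of the two node vectors, which a
    downstream MLP would classify into an edge type."""
    return sage_layer(u) + sage_layer(v)

print(edge_representation("G1", "GO1"))
```

Training the downstream classifier to recover the edge type from this concatenated vector is what forces the node embeddings to encode the annotation patterns of the existing GOA graph.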

Table 1.
Summary of the experimented approaches and their extensions (e.g. +Entity Embedding).

Table 2.
Performance comparison of system variants considering the integration of different background knowledge alternatives for detecting GOA inconsistencies on the test set. The baseline and PubMedBERT approaches did not utilize any background knowledge, while the combined GoDg & exGOA approach incorporated both GO specificity and existing GOA. Bold font indicates the highest score in each column. The generated dummy embeddings serve as a regularization technique for the PubMedBERT model, leading to improved generalization in detecting inconsistencies; this further indicates that the model's effectiveness is still limited by the quality and quantity of the available data.