LILLIE: Information extraction and database integration using linguistics and learning-based algorithms

Querying both structured and unstructured data via a single common query interface such as SQL or natural language has been a long-standing research goal. Moreover, as methods for extracting information from unstructured data become ever more powerful, the desire to integrate the output of such extraction processes with "clean", structured data grows. We are convinced that for successful integration into databases, such extracted information in the form of "triples" needs both (1) to be of high quality and (2) to have the necessary generality to link up with varying forms of structured data. It is the combination of both these aspects, which heretofore have usually been treated in isolation, where our approach breaks new ground. The cornerstone of our work is a novel, generic method for extracting open information triples from unstructured text, using a combination of linguistics and learning-based extraction methods, thus uniquely balancing both precision and recall. Our system, called LILLIE (LInked Linguistics and Learning-Based Information Extractor), uses dependency tree modification rules to refine triples from a high-recall learning-based engine, and combines them with syntactic triples from a high-precision engine to increase effectiveness. In addition, our system features several augmentations, which modify the generality and the degree of granularity of the output triples. Even though our focus is on addressing both quality and generality simultaneously, our new method substantially outperforms current state-of-the-art systems on the two widely-used CaRB and Re-OIE16 benchmark sets for information extraction. We have made our code publicly available to facilitate further research. © 2021 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
It is commonly known that some 80% of enterprise data is unstructured while only some 20% is structured [1,2]. Hence, only a relatively small part of enterprise data is stored in relational databases. Traditionally, the information retrieval and natural language processing communities focus on working with unstructured data, while the database community typically works with structured data. In order to query both structured and unstructured data via a single common query interface such as SQL or natural language [3,4], there have been several research efforts over the last years. One such approach, which we follow in our work, is to first use information extraction techniques to retrieve relevant entities (subjects and objects) and relationships (predicates) from text documents and to then populate so-called knowledge graphs or ontologies [5]. The next step is to link the generated knowledge graph with the tables of a relational database (entity linking). Finally, the combined system can be queried in either SQL or in natural language.
An example of such an end-to-end data processing pipeline is shown in Fig. 1. The inputs are text documents from medical articles such as in PubMed, as well as a relational database that stores information about genes and so-called anatomical entities, i.e. different organs in our body. First, the input text "THY1 is overexpressed in human gallbladder carcinoma" is parsed and subject, object and predicate are extracted. Next, the subject "THY1" and the object "human gallbladder carcinoma" are linked to the relational database.
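The two steps of this example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the LILLIE implementation: the table and column names (gene, anatomical_entity, extracted_triple), the hand-written triple with its predicate string, and the naive word-overlap linker are all hypothetical stand-ins for the extraction and entity linking components described later in the paper.

```python
import sqlite3


def build_demo_db():
    """Create a toy in-memory database (hypothetical schema, not from the paper)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gene (id INTEGER PRIMARY KEY, symbol TEXT);
        CREATE TABLE anatomical_entity (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE extracted_triple (
            subject TEXT, predicate TEXT, object TEXT,
            gene_id INTEGER REFERENCES gene(id),
            anatomical_entity_id INTEGER REFERENCES anatomical_entity(id)
        );
        INSERT INTO gene (symbol) VALUES ('THY1'), ('CCAT2');
        INSERT INTO anatomical_entity (name) VALUES ('gallbladder'), ('kidney');
        """
    )
    return conn


def link(conn, table, column, phrase):
    """Naive linker (assumption): match a row whose value is a word of the phrase."""
    words = {w.lower() for w in phrase.split()}
    query = f"SELECT id, {column} FROM {table} ORDER BY id"
    for row_id, value in conn.execute(query):
        if value.lower() in words:
            return row_id
    return None


def insert_triple(conn, triple):
    """Step 2: link subject and object, then store the triple with its links."""
    subject, predicate, obj = triple
    gene_id = link(conn, "gene", "symbol", subject)
    anat_id = link(conn, "anatomical_entity", "name", obj)
    conn.execute(
        "INSERT INTO extracted_triple VALUES (?, ?, ?, ?, ?)",
        (subject, predicate, obj, gene_id, anat_id),
    )
    return gene_id, anat_id


def query_linked(conn):
    """Query text-derived facts together with the structured tables."""
    return conn.execute(
        """
        SELECT g.symbol, t.predicate, a.name
        FROM extracted_triple t
        JOIN gene g ON g.id = t.gene_id
        JOIN anatomical_entity a ON a.id = t.anatomical_entity_id
        """
    ).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    # Step 1 is assumed to have produced this triple from the input sentence.
    insert_triple(conn, ("THY1", "is overexpressed in", "human gallbladder carcinoma"))
    print(query_linked(conn))
```

Storing the foreign keys next to the raw triple text is one possible design; it keeps the original phrases available while allowing ordinary SQL joins against the curated tables.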
Building such an end-to-end pipeline to enable the vision of querying structured and unstructured data via a common interface has been a long-standing research effort [6,7]. However, each of the previously-mentioned steps has typically been treated and optimized in isolation. Hence, significant potential for improvement is left unexplored when considered in the larger context of data exploration. When the triple extraction process is viewed as an integral part of this larger process of integrating and querying structured and unstructured data, we claim that two considerations must be treated simultaneously, the combination of which has previously received only little attention:
• How to optimize the effectiveness of triple extraction, balancing both recall and precision?
• How to augment the approach to increase generality, making the extracted triples suitable for linking up with varying forms of structured data stored in a relational database?
Much work on triple extraction concentrates on the first aspect only and is therefore not really optimized towards an end-to-end pipeline of both triple extraction and database integration.
Fig. 1. Example of an end-to-end data processing pipeline. Step 1 - Information Extraction: First we extract triples such as subject, predicate and object from a text document. Step 2 - Entity Linking: Afterwards we link the extracted subjects and objects to specific columns of a relational database. As a result we have an extended relational database that is enriched with information stored in text documents.
In this work we tackle an important open gap in triple extraction and database integration. In particular, we present a novel approach for extracting subject-predicate-object relational triples from unstructured text, which are then linked to relevant ontologies and inserted into a relational database. This paper is part of a greater vision of building a data exploration system with INODE [8].
The contributions of our paper are as follows:
• We combine a high-precision rule-based triple extractor with a high-recall learning-based extractor, using a novel triple refinement method.
• Our system includes additional options for output modifications, which allow the granularity and specificity of the extracted triples to be customized to a given structured database.
• Our approach outperforms current state-of-the-art systems on the two widely-used benchmark datasets CaRB and ReOIE16.
The paper is organized as follows: in Section 2 we review the related work on information extraction and entity linking for knowledge base construction; in Section 3, we give an overview of the LILLIE architecture; in Sections 4 and 5, we describe the algorithms and functions of the rule-based extractor and the learning-based extractor, respectively; in Sections 6 and 7, we show how to combine both engines, and customize their output; in Section 8, we describe how to apply our triple extractor to the task of entity linking and database insertion; in Section 9, we give a detailed analysis and evaluation of all the components of our system, and compare these to the current state-of-the-art. The paper culminates in Section 10, where we show how the enriched database can be queried and discuss performance considerations.

Related work
In this section, we provide background information for each of the distinctive modules that comprise our data processing pipeline, namely open information extraction and entity linking for knowledge base construction.

Information extraction
Information extraction systems aim at distilling structured representations of information from natural language text, usually in the form of relational triples {subject, predicate, object}, which correspond to {entity1; relationship; entity2} or n-ary propositions [9].
There are two types of information extraction systems. Closed Information Extraction (CIE) systems identify instances from a fixed and finite set of corpora, considering only a closed set of relationships between two arguments [10]. Open Information Extraction (OIE) systems, on the other hand, use a domain-independent approach and are capable of extracting entities and relationship triples from natural language sentences. Since OIE systems follow a relation-independent extraction paradigm, they can play a key role in many natural language processing (NLP) applications involving natural language understanding (NLU) and knowledge base construction from massive and heterogeneous corpora, by extracting phrases that indicate semantic relationships between entities.
In order to extract triples, most approaches try to identify linguistic extraction patterns, either hand-crafted or automatically learned from the data. The line of work on OIE starts with systems relying on distant supervision [11,12], and rule-based paradigms that focus on the grammatical and syntactic properties of the language [13,14]. An abundance of learning-based systems that leverage annotated data sources to train classifiers has been proposed [15,16], with more recent implementations making use of pretrained language models [17,18]. Despite the existence of so many approaches, however, the majority focus only on evaluating the effectiveness of different triple extraction tools on raw data, without incorporating any preprocessing strategies to limit the number of potentially uninformative triples [19].
Some more recent methods go beyond the triple extraction task by encompassing more thorough preprocessing and postprocessing strategies, including discourse analysis, coreference resolution or summarization to improve the quality of the extracted triples [20-22].

Entity Linking for knowledge base construction
Entity Linking (EL), also known as Named Entity Recognition / Disambiguation, is the task of identifying an entity mention in a text and establishing a link to an entry in a knowledge base (KB), e.g. Wikidata [23], DBpedia [24], YAGO [25]. EL systems are capable of resolving the lexical ambiguity of entity mentions and can therefore be extremely useful in a plethora of NLU applications, by enriching the information extracted via OIE systems. Moreover, by establishing links between the entity mentions and KB entities, we are able to store and utilize information in semantic graphs, facilitating semantic parsing, question answering and exploratory data analysis operations.
Earlier approaches leverage statistical models combined with feature engineering methods to achieve entity linking, viewing the problem as a word sequence labeling task [26]. More modern neural-based approaches treat the problem as a multi-class classification task, in which entities correspond to classes. The goal is to propose a list of candidate entities for each mention by encoding both the mentions and the candidate entities into vector representations, then ranking the candidates based on content similarity. Much work has been put into constructing and correlating mention and candidate entity embeddings, spanning from convolutional encoders [27,28] to recurrent [29,30] and self-attention networks [31-35].
Fig. 2. Overview of our proposed architecture of LILLIE for extracting relational triples from text documents and integrating them into a relational database. Uniquely, our system is a combination of a rule-based extractor (high precision-oriented) with a learning-based extractor (high recall-oriented).

Architecture of LILLIE
An overview of the architecture of LILLIE (LInked Linguistics and Learning-Based Information Extractor) is shown in Fig. 2. We briefly describe the main aspects of each of the components. The details will be discussed in the subsequent sections.
• Rule-based Extractor: This component follows a precision-oriented, linguistics-based approach to extract triples from unstructured text (see Section 4).
• Learning-based Extractor: This component follows a recall-oriented triple extraction approach, based on complementary OIE strategies to extract relational triples (see Section 5).
• Triple Refinement: This module efficiently combines the results of the aforementioned extractors, maintaining the best attributes of each one (see Section 6).
• Output Modification: A number of parameterization settings are introduced by this module, allowing LILLIE to adapt to different text domains (see Section 7).
• Entity Linking and Database Integration: The final part of our system aims at correlating the extracted triples with domain-specific ontologies in order to enhance their contextual value, before integrating them into a relational database (see Section 8).

The rule-based extractor
In this section we explain our precision-oriented, rule-based extractor, whose goal is to extract relational triples from a text document.
The input for the rule-based extractor is a sentence of unstructured plaintext, such as:
Long non-coding RNA CCAT2 promotes breast cancer growth and metastasis
The output is a set of annotated subject-predicate-object triples, for example:
(long non-coding RNA CCAT2 ; promotes ; breast cancer growth)
(long non-coding RNA CCAT2 ; promotes ; breast cancer metastasis)
For each component of a triple (subject, predicate and object), we identify a single base term from the input text. We then expand this term into a complete phrase, annotating each added term with the rule used to include it.
We explain this concept by using the previous example:
(long non-coding RNA CCAT2 ; promotes ; breast cancer growth)
Here, long non-coding is marked as an adjectival component of the subject, RNA is marked as a compound element, and CCAT2 is marked as the base term. With this additional information, we can alter the granularity of the given triples, for matching to more or less specific entities in structured data. For example, if "long non-coding RNA CCAT2" is not present in an ontology, we can match on more general entities such as "RNA CCAT2" or "CCAT2".
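The granularity adjustment described above can be illustrated with a short Python sketch. It assumes each token carries one annotation label; the label names, the drop order and the helper functions are hypothetical choices made for this illustration, not the annotation scheme used internally by LILLIE.

```python
from typing import List, Optional, Set, Tuple

# (token, annotation) pairs; the label names are illustrative assumptions.
AnnotatedPhrase = List[Tuple[str, str]]

SUBJECT: AnnotatedPhrase = [
    ("long", "adjectival"),
    ("non-coding", "adjectival"),
    ("RNA", "compound"),
    ("CCAT2", "base"),
]

# Order in which modifier types are dropped (assumption): adjectives first.
DROP_ORDER = ["adjectival", "compound"]


def generalizations(phrase: AnnotatedPhrase) -> List[str]:
    """Return the phrase from most specific to most general form."""
    variants = [" ".join(token for token, _ in phrase)]
    dropped: Set[str] = set()
    for label in DROP_ORDER:
        dropped.add(label)
        text = " ".join(tok for tok, ann in phrase if ann not in dropped)
        if text and text != variants[-1]:
            variants.append(text)
    return variants


def match_entity(phrase: AnnotatedPhrase, ontology: Set[str]) -> Optional[str]:
    """Return the most specific variant of the phrase found in the ontology."""
    for candidate in generalizations(phrase):
        if candidate in ontology:
            return candidate
    return None


if __name__ == "__main__":
    print(generalizations(SUBJECT))
    print(match_entity(SUBJECT, {"CCAT2", "RNA CCAT2"}))
```

Trying the most specific variant first means a match on the bare base term is only used as a last resort.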
We design our rules to be as generic as possible, and to be applicable to all possible text domains. To this end, we used the CaRB and ReOIE16 datasets for crafting these rules. The CaRB [36] and ReOIE16 [37] benchmarking sets comprise 1877 arbitrarily selected annotated sentences from various sources, designed to represent a wide variety of text domains and language styles.
To design the rules, we analyzed a set of only 30 random sentences from the CaRB development set, and designed the rules based on these, using an approach similar to a ''few shot'' learning model. We then tested these rules on the rest of the CaRB development set to assess their generalizability. This approach ensures that our system is generic, and adaptable to many different domains, as demonstrated by its performance on the ReOIE16 set, which was not seen prior to evaluation.
We believe these rules to be applicable to a wide variety of textual domains, with no modifications to the core rule set being needed as we move between datasets. In Section 7, we describe a small number of domain-specific adaptations layered on top, which can be enabled or disabled when required, to give further generalizability. This allows for high-precision annotated extractions, suitable for mapping to structured data.

Pre-processing
To begin with, the input text is parsed into a syntactic dependency tree using the Stanford Dependency Parser [38], and annotated with part-of-speech tags. The dependency tree passes through several custom pre-processing stages, before the tokens comprising the triple are extracted from the tree.
We include a purpose-built anaphora resolution procedure in our system, based on deterministic transformation rules on the dependency tree. We keep the ruleset minimal, in order to achieve a reliable, high-precision procedure, in contrast to generalized anaphora resolution systems, which we found introduced many noisy substitutions. In addition, we account for conjunctive predicates by duplicating and re-arranging branches of the dependency tree.
Fig. 3. A syntactic dependency tree before pre-processing (above) and after (below). The subject-phrase "Gastrointestinal hormonal peptides", shown in blue, has been duplicated in the sub-clause on the left, to account for the conjunctive phrase. With this pre-processing step, LILLIE enables more effective downstream processing. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
An example of this pre-processing is shown in Fig. 3 for the sentence ''Gastrointestinal hormonal peptides can cause gastrointestinal malignancies and may contribute to dysmotility''.
This procedure handles cases of co-reference resolution where the anaphora is a "wh-word" ('which' or 'who'), or a pronominal (it, he, etc.), at the head of a sub-clause, e.g. "This cancer is found in the kidneys, where it can be aggressive." Such instances are linguistically unambiguous and can therefore be resolved deterministically with very high precision. Cases where (a) there are two or more "wh-words" or pronominals in the dependent clause, or (b) the anaphora is not the head-word of a sub-clause, are explicitly not handled, as they can be interpreted ambiguously, and attempting to resolve them would break the principle of high-precision extraction.
Similarly, verbal conjunctions, such as in "He was the Prime Minister and also served as a judge." (from the CaRB development set), are handled only in cases where the subject can be unambiguously added to the conjunctive clause, but will not be split in more complex cases, such as "This gene is found in the kidney and it mutates within it.". Such cases are deferred to the Learning-Based Extractor, which provides our complementary approach. We provide an analysis of how much this approach (of not attempting to resolve ambiguity) improves F1 and AUC in Table 6.
The procedure is described formally in Algorithm 1. We use d(n, n′) to denote the syntactic dependency between nodes n and n′ of the dependency tree, where n′ is the governor. Further information concerning the syntactic and linguistic terms used in the algorithms in this work can be found in [39].
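As a rough illustration of the conjunctive-predicate handling, the following Python sketch splits a verbal conjunction into its own tree and copies the governor's subject into it. It is a simplified reading of the description above rather than a transcription of Algorithm 1: the Node class, the dependency and part-of-speech labels, and the decision to return the split clause as a separate tree are assumptions made for this example.

```python
import copy
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Toy dependency-tree node; dep is the relation to the governor."""
    token: str
    pos: str
    dep: str = "root"
    children: List["Node"] = field(default_factory=list)

    def child(self, dep: str) -> Optional["Node"]:
        return next((c for c in self.children if c.dep == dep), None)

    def words(self) -> List[str]:
        """Tokens of the sub-tree in pre-order (not sentence order)."""
        out = [self.token]
        for c in self.children:
            out.extend(c.words())
        return out


def split_verbal_conjunctions(root: Node) -> List[Node]:
    """Breadth-first pass that detaches subject-less verbal conjuncts."""
    trees = [root]
    queue = deque([root])
    while queue:
        n = queue.popleft()
        for c in list(n.children):
            if c.dep == "conj" and n.pos == "VERB" and c.pos == "VERB":
                subject = n.child("nsubj")
                # Only the unambiguous case: the conjunct has no own subject.
                if subject is not None and c.child("nsubj") is None:
                    n.children.remove(c)
                    c.dep = "root"
                    c.children = [g for g in c.children if g.dep != "cc"]
                    c.children.insert(0, copy.deepcopy(subject))
                    trees.append(c)
            queue.append(c)
    return trees


def example_tree() -> Node:
    """Hand-built parse of the Fig. 3 sentence (labels are illustrative)."""
    subject = Node("peptides", "NOUN", "nsubj", [
        Node("Gastrointestinal", "ADJ", "amod"),
        Node("hormonal", "ADJ", "amod"),
    ])
    second = Node("contribute", "VERB", "conj", [
        Node("and", "CCONJ", "cc"),
        Node("may", "AUX", "aux"),
        Node("dysmotility", "NOUN", "obl", [Node("to", "ADP", "case")]),
    ])
    return Node("cause", "VERB", "root", [
        subject,
        Node("can", "AUX", "aux"),
        Node("malignancies", "NOUN", "dobj",
             [Node("gastrointestinal", "ADJ", "amod")]),
        second,
    ])


if __name__ == "__main__":
    for tree in split_verbal_conjunctions(example_tree()):
        print(tree.token, tree.words())
```

A conjunct that already has its own subject, as in "This gene is found in the kidney and it mutates within it.", is left untouched by this sketch, mirroring the decision to defer such cases to the learning-based extractor.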

Triple extraction
After processing the original dependency tree into a set of modified trees, we can now use them to extract relational triples. To do this, we find sets of three nodes from each tree, which represent the base terms for the subject, predicate and object, respectively. We again explain this process with our running example:
Long non-coding RNA CCAT2 promotes breast cancer growth and metastasis
The base tokens are CCAT2 and promotes, for the subject and predicate respectively, and growth and metastasis for the two objects. These are identified according to a set of rules, which determine the appropriate dependencies between the three terms. The edges between the subject, predicate and object base tokens must match a set of valid edges, such as (nsubj→dobj) for a nominal subject and direct object triple.
Algorithm 1 Input Pre-processing
Input: t, a syntactic dependency tree
Output: T, a set of modified trees
for all n ∈ BreadthFirstSearch(t) do
    for all c ∈ n.children do
        if d(c, n) = conj and n is verbal then
            ...
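The base-token identification for the running example can be sketched as follows. The edge-list representation and the function name are hypothetical; only the (nsubj→dobj) pattern is taken from the text, and treating a conjunct of an object as a further object base token is an assumption made so that both growth and metastasis are found.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]  # (governor, relation, dependent)

# Only the pattern named in the text; the real rule set is larger.
VALID_PATTERNS: Set[Tuple[str, str]] = {("nsubj", "dobj")}

EXAMPLE_EDGES: List[Edge] = [
    ("promotes", "nsubj", "CCAT2"),
    ("promotes", "dobj", "growth"),
    ("growth", "conj", "metastasis"),
    ("growth", "compound", "cancer"),
    ("CCAT2", "compound", "RNA"),
]


def base_triples(edges: List[Edge],
                 patterns: Set[Tuple[str, str]] = VALID_PATTERNS
                 ) -> List[Tuple[str, str, str]]:
    """Find (subject, predicate, object) base tokens joined by valid edges."""
    by_governor: Dict[str, List[Tuple[str, str]]] = {}
    for governor, relation, dependent in edges:
        by_governor.setdefault(governor, []).append((relation, dependent))

    triples = []
    for predicate, outgoing in by_governor.items():
        for subject_rel, object_rel in sorted(patterns):
            subjects = [d for r, d in outgoing if r == subject_rel]
            objects = [d for r, d in outgoing if r == object_rel]
            # Assumption: conjuncts of an object are object base tokens too.
            conjuncts = [d for o in objects
                         for r, d in by_governor.get(o, []) if r == "conj"]
            for s in subjects:
                for o in objects + conjuncts:
                    triples.append((s, predicate, o))
    return triples


if __name__ == "__main__":
    print(base_triples(EXAMPLE_EDGES))
```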
For each of these base terms, we traverse the dependency tree depth-first, marking terms that match certain rules. These rules take the form of a lookup table, where the dependency on each edge is mapped to six rules: three inclusion rules, one each for subject, predicate and object, and, similarly, three donation rules. Depending on whether the base token is for the subject, predicate or object, two of these rules are then chosen. The inclusion rules determine whether an edge is valid to traverse, and thus whether the sub-tree can be included in the final triple. The donation rules determine whether a given sub-tree will be donated, or moved, to the subject, predicate or object.
For example, a compound edge below the subject node will be included in the subject portion of the triple. If this edge was below the predicate node, however, it would be moved to the object's subtree. This process is applied recursively to all subtrees until the subject, predicate and object trees contain only relevant tokens.
For example, an auxiliary verb, such as may or could, will be annotated and added to the predicate, and compound nouns, such as breast and cancer will be added to the subject or object. The output of this procedure is a set of triples, consisting of terms annotated with dependency rules.
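A compact way to picture the lookup table is shown in the sketch below. It is only an illustration under assumptions: the BranchRule structure, the dependency labels and every table cell not mentioned in the text are hypothetical, and the real table (abridged in Table 1) contains many more entries. Each entry encodes the three inclusion rules as a set of roles and the three donation rules as a role-to-role mapping.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass(eq=False)
class Node:
    token: str
    dep: str
    children: List["Node"] = field(default_factory=list)


@dataclass(frozen=True)
class BranchRule:
    include: Set[str]       # roles in which the branch is kept
    donate: Dict[str, str]  # role -> role that receives the branch


# Illustrative entries only; cells not stated in the text are assumptions.
RULES: Dict[str, BranchRule] = {
    "compound": BranchRule({"subject", "object"}, {"predicate": "object"}),
    "aux": BranchRule({"predicate"}, {}),
    "tmod": BranchRule({"object"}, {"predicate": "object"}),
    "mwe": BranchRule({"subject", "object"}, {}),
    "amod": BranchRule({"subject", "object"}, {}),
    "poss": BranchRule({"subject", "object"}, {}),
}

Annotated = Dict[str, List[Tuple[str, str]]]


def expand(node: Node, role: str, out: Annotated, label: str = "base") -> None:
    """Depth-first expansion of a base token, annotating tokens by rule."""
    out[role].append((node.token, label))
    for child in node.children:
        rule = RULES.get(child.dep)
        if rule is None:
            continue  # no rule for this dependency: prune the branch
        target = rule.donate.get(role)
        if target is not None:
            expand(child, target, out, child.dep)  # donation rule
        elif role in rule.include:
            expand(child, role, out, child.dep)    # inclusion rule


def extract(subject: Node, predicate: Node, obj: Node) -> Annotated:
    out: Annotated = {"subject": [], "predicate": [], "object": []}
    expand(subject, "subject", out)
    expand(predicate, "predicate", out)
    expand(obj, "object", out)
    return out


def example() -> Annotated:
    """Hand-built parse of 'Last year, he passed his exams.'"""
    he = Node("he", "nsubj")
    exams = Node("exams", "dobj", [Node("his", "poss")])
    passed = Node("passed", "root",
                  [he, exams, Node("year", "tmod", [Node("Last", "amod")])])
    return extract(he, passed, exams)


if __name__ == "__main__":
    for role, tokens in example().items():
        print(role, tokens)
```

In this toy run the temporal branch below the predicate is donated to the object, so the predicate keeps only its base token. Token order within each role is not reconstructed here.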
In total, there are 40 branch-rules used in our system. Here, we give examples of some of these rules:
• Temporal Modifiers: A temporal modifier denotes the time at which the predicate was invoked by the subject on the object. In the sentence "Last year, he passed his exams.", last year is the temporal modifier. Temporal phrases are typically attached to the verb of a clause in a dependency graph; however, we transfer all instances of temporal indicators to the object portion of the triple using the branch donation rules. In the previous example, we take passed as the predicate, and transfer the phrase last year to the object, giving the triple (he ; passed ; his exams last year). This differs from the standard parse of (he ; passed last year ; his exams), and produces a more informative triple for structured database insertion by keeping the specificity of the predicate high.
• Compound Particles: A compound particle denotes part of an idiomatic split verbal phrase, such as in the sentence "The detective closed the case down.", where down is a compound particle in the verb phrase closed down.
Using this rule, we can combine verb phrases split over long distances, and remove ambiguity in the object, by using dependency tree analysis. In the previous example, we can extract the triple (the detective ; closed down ; the case), rather than the more ambiguous interpretation: (the detective ; closed ; the case down), accounting for the clear semantic difference between the two predicates.
• Multi-Word Expressions: Multi-word expressions (MWEs) cover phrases like such as or instead of. In the sentence, ''We could find a suitable donor, at least.'', the phrase at least is attached as a MWE to the root verb.
In cases such as these, where the MWE will cause ambiguity in the predicate, even though it is a dependent branch of the root verb, we use the inclusion rules to prune its branch from the tree, to give the triple (we ; could find ; a suitable donor), rather than the direct parse of (we ; could at least find ; a suitable donor). In the subject and object, however, we leave MWEs in the output triple, as they frequently contribute necessary context in noun phrases.
• Copulas and Auxiliary Verbs: A copula generally denotes an is-a relation, as in ''CCAT2 is a gene'' or ''These tumours were malignant.'' Auxiliary verbs occur in phrases such as ''was found'' (was) or ''has been detected'' (has and been).
In a typical dependency parse, copulas are not positioned as the root of the tree, and are instead represented as a sub-branch of the object. As such, we modify the tree, by promoting the copula to the root, and positioning the object as a sub-tree of the root verb. However, in cases where copulas and auxiliary verbs interact, as in "These tumours have been malignant.", we use our rules to ensure that the extracted triple becomes (these tumours ; have been ; malignant), rather than (these tumours ; have ; been malignant), as the dependency tree suggests.
With these rules, we attempt to cover all linguistic domains, but, similarly to our pre-processing techniques, we take the approach of removing portions of the input that are linguistically ambiguous, and thus benefit from a stochastic approach, as these will be handled by the Learning-Based Extractor. As such, some dependencies, such as Clausal Subjects (e.g. in "How this gene behaves makes little sense."), or Discourse (representing elements of casual speech) are pruned from the dependency tree by our rules. We implement these rules using a table, describing when to include, remove, or transfer branches below the subject, predicate or object base token. An abridged version of this table is shown in Table 1.

The learning-based extractor
The learning-based extractor introduces a data processing pipeline that takes as input a sentence of natural language text and provides a structured representation of the extracted information in the form of OIE triples, identical to the rule-based extractor. For example, the same sentence:
Long non-coding RNA CCAT2 promotes breast cancer growth and metastasis
produces the following triple in the form of subject-predicate-object:
(long non-coding RNA CCAT2 ; promotes ; breast cancer growth and metastasis)
The extractor comprises an in-place coreference resolution module and a parallel triple extraction module that integrates three complementary OIE engines relying on both learning-based and linguistics-based components, each based on a different extraction strategy. These engines are discussed in detail in Section 5.2. The main intuition behind this approach is to enhance the performance of our extractor compared to standalone engines, while developing a recall-oriented approach to be used in conjunction with the precision-oriented rule-based extractor. More information regarding the aforementioned modules is provided in the following subsections.

In-place coreference resolution
An in-place coreference resolution module is used to improve the quality of information extraction, addressing those cases where an entity found in unstructured text is replaced by its coreferential pronoun. For example, the phrase "Mary is a nice person, I like hanging out with her." will be substituted with the coreference-resolved equivalent "Mary is a nice person, I like hanging out with Mary." To this end, we leverage the pretrained neural coreference resolution tool from AllenNLP [40], which implements a variant of the Lee et al. end-to-end coreference resolution model [41] using SpanBERT embeddings [42]. The model has been trained on the OntoNotes 5.0 dataset [43], achieving an F1-score of 78.87% on the test set. Prior to being ingested by our triple extraction module, each sentence is processed through the in-place coreference resolution component, where all mentions referring to the same entity are substituted by that entity, eventually leading to more informative triples.
An indicative example of the performed substitutions on a small extract is provided in Table 2. As shown in the example, all mentions of the type "this gene" have been replaced with the original entity "CCL26 gene", while all mentions of the type "this chemokine" have been substituted with the original entity "chemokine receptor CCR3". It is evident that the coreference-resolved text will lead to more informative triples, since all triple arguments will contain original noun phrases as unique referents instead of their coreferential pronouns.
Table 2. In-place coreference resolution using the learning-based extractor on an example taken from the OncoMX gene biomarkers database. LILLIE replaces all mentions of "this gene" with the original entity "CCL26 gene".
Original sentences: CCL26: This gene is one of two Cys-Cys (CC) cytokine genes clustered on the q arm of chromosome. Cytokines are a family of secreted proteins involved in immunoregulatory and inflammatory processes. The CC cytokines are proteins characterized by two adjacent cysteines. The cytokine encoded by this gene displays chemotactic activity for normal peripheral blood eosinophils and basophils. The product of this gene is one of three related chemokines that specifically activate chemokine receptor CCR3. This chemokine may contribute to the eosinophil accumulation in atopic diseases.
Coreference-resolved sentences: CCL26: CCL26 gene is one of two Cys-Cys (CC) cytokine genes clustered on the q arm of chromosome. Cytokines are a family of secreted proteins involved in immunoregulatory and inflammatory processes. The CC cytokines are proteins characterized by two adjacent cysteines. The cytokine encoded by CCL26 gene displays chemotactic activity for normal peripheral blood eosinophils and basophils. The product of CCL26 gene is one of three related chemokines that specifically activate chemokine receptor CCR3. Chemokine receptor CCR3 may contribute to the eosinophil accumulation in atopic diseases.
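The substitution step itself is independent of the model that finds the mentions and can be sketched in plain Python. In this sketch the cluster format (lists of inclusive token spans), the choice of the first mention as the representative, and the function name are assumptions made for illustration; in the actual system the clusters come from the neural coreference model described above.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indices


def resolve_in_place(tokens: List[str], clusters: List[List[Span]]) -> List[str]:
    """Replace every later mention of a cluster with its first mention.

    Nested or overlapping mentions are not handled in this sketch.
    """
    replacements = []
    for cluster in clusters:
        ordered = sorted(cluster)
        head_start, head_end = ordered[0]
        head = tokens[head_start:head_end + 1]
        for start, end in ordered[1:]:
            replacements.append((start, end, head))

    resolved = list(tokens)
    # Apply from right to left so earlier indices stay valid.
    for start, end, head in sorted(replacements, key=lambda r: r[0],
                                   reverse=True):
        resolved[start:end + 1] = head
    return resolved


if __name__ == "__main__":
    sentence = "Mary is a nice person , I like hanging out with her ."
    tokens = sentence.split()
    # Hypothetical model output: "Mary" (token 0) and "her" (token 11) corefer.
    clusters = [[(0, 0), (11, 11)]]
    print(" ".join(resolve_in_place(tokens, clusters)))
```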

Parallel triple extraction
Our approach aims at distilling the maximum available information from texts that can be used directly for end-user applications. To this end, we integrated three of the most popular OIE systems, each based on a different extraction strategy, into a single module. These complementary OIE engines were chosen both in terms of underlying architecture (i.e. clause-based, deep neural network-based) and also regarding the targeted corpus (i.e. some extractors focus on numerical context while others attend to conjunctive sentences). The advantage of having more triples is exploited during the output modification process (Section 7), providing the end-user with the option of a high-recall system, based on the needs of each use case. A brief explanation of the intuition behind each system is provided below:
• Open IE 5.1 [44] is a successor to the Ollie learning-based information extraction system [15]. It is based on the combination of four different linguistics-based and learning-based OIE tools; namely CALMIE (specializing in triple extraction from conjunctive sentences) [45], RelNoun (for noun relations) [46], BONIE (for numerical sentences) [47], and SRLIE (based on semantic role labeling) [48].
• ClausIE [14] follows a clause-based approach, first identifying the clause type of each sentence and then applying specific proposition extraction based on the corresponding grammatical function of the clause's constituents. It also considers nested clauses as independent sentences. Because ClausIE detects useful pieces of information expressed in a sentence before representing them in terms of one or more extractions, it is especially useful in splitting complex sentences into many individual triples.
• AllenNLP OIE system [40] formulates the triple extraction problem as a sequence BIO tagging problem and applies a bi-LSTM transducer to produce OIE tuples, which are grouped by each sentence's predicate [49]. Given that it relies on supervised learning and contextualized word embeddings to produce independent probability distributions over possible BIO tags for each word, it has the potential of discovering richer and more complex relations.
We provide an example to showcase the complementary results of the aforementioned engines, which comprise the learning-based extractor, in Table 3. The example showcases a triple extraction task on a complex sentence, containing both independent and multiple dependent clauses. The first column shows the list of derived triples and the second column presents the extraction engine that managed to identify each triple. It is evident that while some of the produced triples remain concise and highly informative (e.g. first and last extraction), others (second and third) contain redundant information with low marginal utility with respect to the use case. Even longer triples, however, can be useful given a suitable post-processing approach; this is addressed by the Triple Refinement module described in the next section. In general, each engine that comprises the learning-based extractor has different quality characteristics, from which we can benefit by combining their extractions. Examples such as this pose some of the most challenging research cases, as it is usually difficult to identify all possible individual clauses using a single extraction approach. However, the complementary strategy of the learning-based extractor manages to capture triples from different parts of the sentence, with a portion of them focusing on the main clause and the rest on the dependent clauses.
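The combination of several engines into one module can be pictured with the following sketch. The engine callables here are hypothetical stubs that return fixed triples; they stand in for wrappers around the three OIE systems and do not reflect their actual APIs. The normalization used for de-duplication is likewise an assumption.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]
Engine = Callable[[str], List[Triple]]


def normalize(triple: Triple) -> Triple:
    """Lower-case and collapse whitespace so equal triples compare equal."""
    subject, predicate, obj = (" ".join(part.lower().split()) for part in triple)
    return (subject, predicate, obj)


def extract_all(sentence: str, engines: Dict[str, Engine]) -> Dict[Triple, List[str]]:
    """Run every engine and record which engines produced each triple."""
    merged: Dict[Triple, List[str]] = {}
    for name, engine in engines.items():
        for triple in engine(sentence):
            merged.setdefault(normalize(triple), []).append(name)
    return merged


# Stub engines (assumptions): each returns hand-written triples.
def stub_engine_a(sentence: str) -> List[Triple]:
    return [("Long non-coding RNA CCAT2", "promotes",
             "breast cancer growth and metastasis")]


def stub_engine_b(sentence: str) -> List[Triple]:
    return [
        ("long non-coding RNA CCAT2", "promotes",
         "breast cancer growth and metastasis"),
        ("long non-coding RNA CCAT2", "promotes", "breast cancer growth"),
    ]


if __name__ == "__main__":
    text = "Long non-coding RNA CCAT2 promotes breast cancer growth and metastasis"
    engines = {"engine_a": stub_engine_a, "engine_b": stub_engine_b}
    for triple, sources in extract_all(text, engines).items():
        print(triple, sources)
```

Keeping the provenance of each triple mirrors the second column of Table 3 and would allow later stages to weight or filter triples per engine.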

Triple refinement
Once both triple extraction engines have output a set of triples for a given sentence, they are combined into a single unified set, which aims to maintain the high precision of the rule-based engine, but increases recall using the triples output from the learning-based engine (which is designed to produce more triples, but with lower precision). To do this, the learning-based triples are first passed through a triple refinement procedure.
This procedure maps the output triples onto the sentence's syntactic dependency tree. For every triple, each node on the tree is marked if it is part of the subject, predicate or object. We design a set of edge-rules, similarly to the rule-based extractor, which signify if an edge is valid between two marked nodes. There are three sets of rules for subject, predicate and object nodes.
For example, in the triple from Table 3: (The protein encoded by KIF1A gene ; is ; a member of the kinesin family and functions as an anterograde motor protein), the object nodes member and functions are connected by a conjunction relation, which is not a valid object edge according to the refinement ruleset. The whole branch of the functions sub-tree is pruned, and the object becomes: a member of the kinesin family. This procedure uses the inclusion rules described in Table 1 to determine which branches are kept in the final tree. In this sense, the triple refiner uses the same rules as the Rule-Based Extractor to determine which tokens and phrases are relevant. However, the crucial difference is that the Rule-Based Extractor will completely disregard the entire sub-tree of an ambiguous phrase, while the Learning-Based Extractor plus Triple Refiner will include ambiguous branches, but prune them to remove irrelevant terms. Because of this, while the Rule-Based Extractor may miss ambiguous elements, and the Learning-Based Extractor may include irrelevant elements in its output, the Triple Refiner prunes irrelevant parts of a given triple to create a best-of-both-worlds approach. This procedure is described formally in Algorithm 2, where V represents the table of inclusion rules from Table 1.
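The pruning step described above can be illustrated with a minimal sketch. It is not the implementation of Algorithm 2: the `Node` class, the function names, the hand-built parse and the set `VALID_OBJECT_DEPS` are all assumptions made for illustration, and the dependency labels only stand in for the actual object rules of Table 1.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the object rules of Table 1 (not the paper's list).
VALID_OBJECT_DEPS = {"det", "prep", "pobj", "compound", "amod"}


@dataclass
class Node:
    index: int   # token position in the sentence
    text: str
    dep: str     # label of the edge coming from the parent node
    children: list = field(default_factory=list)


def keep_valid(node, valid_deps):
    """Return the node plus all descendants reachable through valid edges only."""
    kept = [node]
    for child in node.children:
        if child.dep in valid_deps:
            kept.extend(keep_valid(child, valid_deps))
    return kept


def refine(head, valid_deps):
    """Rebuild the phrase from the surviving tokens, in sentence order."""
    tokens = sorted(keep_valid(head, valid_deps), key=lambda n: n.index)
    return " ".join(n.text for n in tokens)


# Hand-built (assumed) parse of the object phrase from the example triple.
family = Node(5, "family", "pobj",
              [Node(3, "the", "det"), Node(4, "kinesin", "compound")])
protein = Node(12, "protein", "pobj",
               [Node(9, "an", "det"), Node(10, "anterograde", "amod"),
                Node(11, "motor", "compound")])
functions = Node(7, "functions", "conj", [Node(8, "as", "prep", [protein])])
member = Node(1, "member", "attr",
              [Node(0, "a", "det"), Node(2, "of", "prep", [family]),
               Node(6, "and", "cc"), functions])

print(refine(member, VALID_OBJECT_DEPS))  # a member of the kinesin family
```

Because the conjunction edge to functions is not in the assumed rule set, the whole sub-tree below it is dropped, while the prepositional branch under member is kept.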

Algorithm 2 Triple Refiner
Input: T, a set of triples, (S, P, O), for a sentence; t, a dependency tree for that sentence; V, a mapping from {subject, predicate, object} to a list of valid dependencies
Output: R, a set of refined triples

Because of this, new triples can be obtained which the rule-based extractor was unable to find, while maintaining the high precision of the rule-based system. Additionally, because the triples are mapped to the dependency tree, they can be annotated with additional information. For example, the object in the previous triple now becomes: a member of the kinesin family, where kinesin can be easily isolated as a compound noun from the triple, for later entity linking. Mapping all triples to a dependency tree also allows us to apply the output modification procedures, described in the following section, to all triples, regardless of which engine they originated from.
After the refinement procedure, the triples are merged into a single set. If there are potential duplicates, the system favors the rule-based triple, in order to maintain high precision. For example, if two triples from different extractors have the same predicate for a given sentence, the rule-based one is chosen.
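The merge step can be sketched as follows. This is a simplified illustration rather than the system's actual code: the function name `merge_triples` is hypothetical, and duplicate detection is reduced to a case-insensitive comparison of predicates, following the example given in the text.

```python
def merge_triples(rule_based, learning_based):
    """Merge two triple lists for one sentence, favoring rule-based triples.

    A refined learning-based triple is dropped when a rule-based triple
    with the same predicate already exists (assumed duplicate criterion).
    """
    rule_predicates = {predicate.lower() for _, predicate, _ in rule_based}
    merged = list(rule_based)
    for triple in learning_based:
        if triple[1].lower() not in rule_predicates:
            merged.append(triple)
    return merged


rule_based = [("The protein", "is", "a member of the kinesin family")]
learning_based = [
    ("The protein", "is", "a member"),                       # same predicate: dropped
    ("The protein", "functions as", "an anterograde motor protein"),
]
for triple in merge_triples(rule_based, learning_based):
    print(triple)
```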

Output modification
Once the triple refinement is complete, and each triple is also mapped to a dependency tree, we can modify the final output based on several settings, which activate or disable a small number of auxiliary rules. The core rules and functionality remain generic and do not need to be modified, but these additional optional rules allow for extra flexibility if the text domain requires it. This allows our system to adapt to a wider variety of text domains and relational databases. If, for example, longer entities with additional context are required, a switch can be turned on which will provide a more expressive output. Or, if the input textual data is highly structured, compound entities can be split into smaller sub-entities for better matching. However, this modification may produce noisy results on less-structured text, since complex or ambiguous conjunctions are difficult to parse accurately, so it can be turned off in these cases. These modifications are described formally in Algorithm 3. We will discuss the details of the algorithm in the following subsections.

Algorithm 3 Output Modification Procedures

The default version keeps the shorter extracted predicate: (Blood E2F3 mRNA levels ; were ; significantly higher in lung cancer patients) While the enhanced_predicates version is as follows: (Blood E2F3 mRNA levels ; were significantly higher in ; lung cancer patients) This is illustrated in Fig. 4, where the predicate node (were) has been promoted to the root of the tree, and the original root (higher) has been changed to an adverbial modifier. This expands the tokens in the predicate (shown in red).

entity_context Setting
If the textual input data is highly descriptive, and many entities include additional contextual (temporal, location, etc.) information in the subject or object, we can disable the entity_context option to exclude potentially noisy additional clauses or phrases (see second function in Algorithm 3). This aids in entity matching, as the relevant tokens can be extracted more efficiently. With this option enabled, the output triple from the previous example would be: (Blood E2F3 mRNA levels ; were significantly higher in ; lung cancer patients when compared to either patients with benign lung diseases or healthy subjects) The benchmark evaluation results in Table 5 have the entity_context switch enabled, since human annotators tend towards more contextual labeling. However, we disable this switch when using our system for structured data integration, as the additional information causes excess noise during the entity linking procedure. This ease of adaptability to different domains and contexts, using the output modification switches, is one of the key strengths of the LILLIE system.

split_triples Setting
We can also split or merge conjunctive phrases into sub-triples (see third function in Algorithm 3). This option is best-suited for well-structured textual data, as complex conjunctions can introduce errors (i.e. lower precision) if the text is not well-formed. Consider the following sentence: Long non-coding RNA CCAT2 plays an important role in tumorigenesis, tumor growth and metastasis.
We can extract three separate triples, one for each sub-entity: (Long non-coding RNA LncRNA CCAT2 ; plays ; an important role in tumor growth) (Long non-coding RNA LncRNA CCAT2 ; plays ; an important role in metastasis) (Long non-coding RNA LncRNA CCAT2 ; plays ; an important role in tumorigenesis) Or, merge them into a single entity: (Long non-coding RNA LncRNA CCAT2 ; plays ; an important role in tumorigenesis tumor growth and metastasis) We demonstrate the effect of the split_triples enhancement in Table 7. When this switch is enabled, we gain a higher number of triples that can be linked to entities for database insertion. In this case, the enhancement was effective due to the linguistically precise nature of the scientific texts used.
It should be noted that, while the combined effect of the proposed triple refinement and output modification processes can be considered similar to that of knowledge base canonicalization (i.e. the problem of mapping each entity to its canonical form to reduce ambiguity or redundancy) [50], our approach differs significantly from the established line of work [51,52], as it allows full control of the auxiliary information captured by the extractors, based on user preference.
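The split and merge behavior of this setting can be sketched on plain strings. This is only an illustration under simplifying assumptions: in the system the conjuncts are read from the dependency tree, whereas here they are passed in directly, and the function names `join_conjuncts` and `expand_object` are hypothetical.

```python
def join_conjuncts(conjuncts):
    """Render a list of conjuncts as 'a, b and c'."""
    if len(conjuncts) == 1:
        return conjuncts[0]
    return ", ".join(conjuncts[:-1]) + " and " + conjuncts[-1]


def expand_object(subject, predicate, object_prefix, conjuncts, split_triples=True):
    """Return one triple per conjunct, or a single merged triple."""
    if split_triples:
        return [(subject, predicate, f"{object_prefix} {c}") for c in conjuncts]
    merged_object = f"{object_prefix} {join_conjuncts(conjuncts)}"
    return [(subject, predicate, merged_object)]


conjuncts = ["tumorigenesis", "tumor growth", "metastasis"]
for triple in expand_object("Long non-coding RNA CCAT2", "plays",
                            "an important role in", conjuncts):
    print(triple)
```

Calling the same function with `split_triples=False` yields a single triple whose object lists all three conjuncts, which corresponds to the merged form.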

Entity linking and database integration
The final step of our end-to-end system is to integrate the extracted triples into a relational database, in order to increase the contextual value of our information extraction pipeline. Even though our entity linking and database integration approach is generic, we describe a specific use case of applying our approach in the domain of medical research. We have chosen this use case to demonstrate LILLIE's impact of solving a challenging real-world problem [8] at the intersection of academia and industry based on solid theoretical foundations.
We followed an entity linking strategy to correlate triple subjects and objects with a specific anatomical entity from the Uberon [53] cross-species anatomy ontology (e.g. pharynx = UBERON:0006562) and with a specific biomarker from the OncoMX [54] cancer mutation and expression knowledgebase (e.g. Keratin 8 = KRT8). Finally, we integrated the linked triples into the relational database called OncoMX.

Entity linking
Since the output of our triple extractor and refiner contains additional annotations, such as compound or adjectival components, we can efficiently link textual entities from our extracted triples with existing entities in a database. For example, with the entity long non-coding RNA CCAT2, we can separate this into four sub-entities:
• long non-coding RNA CCAT2
• RNA CCAT2
• CCAT2
• long non-coding CCAT2
Each of these new entities has a different degree of granularity, and we can now match on the most specific entity available. For example, if long non-coding CCAT2 was present in our database, we select this match; however, if only the less specific CCAT2 entity was present, we can still find a match. This approach has several advantages: it allows for a more efficient search, and can handle n-grams split over several sub-phrases. It also does not require a similarity measure, which enhances precision, while maintaining high recall.
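A minimal sketch of this most-specific-first matching is given below. It assumes that the annotated triple already provides the head noun and its modifiers as separate strings; the function names `sub_entities` and `link` are hypothetical, and the exact, case-insensitive lookup is a simplification of the matching against the database.

```python
from itertools import combinations


def sub_entities(modifiers, head):
    """List candidate entities from most to least specific.

    Every order-preserving subset of the modifiers is combined with the
    head noun, starting with the subsets that keep the most modifiers.
    """
    candidates = []
    for size in range(len(modifiers), -1, -1):
        for subset in combinations(modifiers, size):
            candidates.append(" ".join([*subset, head]))
    return candidates


def link(modifiers, head, known_entities):
    """Return the most specific candidate found in the entity set, if any."""
    known = {entity.lower() for entity in known_entities}
    for candidate in sub_entities(modifiers, head):
        if candidate.lower() in known:
            return candidate
    return None


print(sub_entities(["long non-coding", "RNA"], "CCAT2"))
print(link(["long non-coding", "RNA"], "CCAT2", {"CCAT2"}))
```

No similarity measure is involved: a candidate either matches an entry exactly or it is skipped in favor of the next, less specific one.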
Using the Uberon ontology, we then concatenated two properties of the ontology (label, hasExactSynonym) to create a simple dataframe with the structure shown in Fig. 5.
The column label contains a list of different names (official label and synonyms) for each anatomical entity and was used to match the extracted triples with Uberon entities. Similarly, we collected the gene names from the OncoMX database comprising a table of 809 gene records. We then ran an iterative process to match one or more mentions of gene names with each triple's subject or object.

Database integration and enrichment
We applied the aforementioned entity linking approaches on the extracted triples in order to enrich the existing OncoMX database with the new information stemming from our information extraction system. In particular, we could increase the information context of the OncoMX database by linking it with literature mentions of genes that affect cancer development in specific Uberon anatomical entities, as shown in Fig. 6. Each row contains a PubMed ID (pmid) that corresponds to a medical article, a gene name (gene), an Uberon entity ID (uberon) and name (uberonname), as well as the extracted relational triple in subject-predicate-object format. The enriched database can then be used for querying structured and previously unstructured data via a single common query interface such as SQL or natural language.

Experiments
To evaluate our system, we first measure the performance of our triple extractor against two state-of-the-art systems, OpenIE6 [55] and IMoJIE [56], on two standard benchmark data sets. Next, we use the PubMed abstracts dataset to demonstrate the qualitative advantages of our enhancements in comparison to these systems, and to show that our approach generalizes well to a diverse set of datasets. Lastly, we show how our data can be successfully queried in a relational setting from a database enriched with triples. For reproducibility of our results, we make the source code of LILLIE available.3

Datasets
The performance of a triple extraction system is assessed as follows: for a given sentence, a system's extracted triples are compared with a set of gold-standard triples, selected by human annotators, and precision and recall are measured at the term level. We use two datasets of annotated sentences for our experiments:
• The CaRB dataset [36] contains 1282 open-domain sentences, divided into two sets of 641 sentences, for development and testing, respectively.
• Re-OIE16 [37] is an updated subset of OpenIE16 [57], an earlier corpus, with all sentences re-annotated to better reflect the needs of OIE. Similar to CaRB, this contains 600 open-domain sentences.

3 Source code of LILLIE: https://github.com/OIELILLIE/LILLIE.

Table 4
Sample extractions on the domain-generic CaRB benchmark dataset.

Original sentence
Warner Communications Inc., which is being acquired by Time Warner, has filed a $1 billion breach-of-contract suit against Sony.
A sample of the sentences contained in CaRB, and the triples we extract, are shown in Table 4.
To further demonstrate the generalizability of our approach, we also perform extractions on a set of medical journal papers, taken from a variety of disciplines. The PubMed abstracts dataset contains 116,049 sentences across 38,703 abstracts and paper titles.
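To make the notion of term-level precision and recall from the start of this subsection concrete, the following sketch scores one predicted triple against one gold triple by token overlap. It is a deliberately simplified illustration and not the CaRB evaluator: it ignores slot boundaries and the matching of multiple predictions to multiple gold triples, and the function name `term_level_scores` is hypothetical.

```python
from collections import Counter


def term_level_scores(predicted, gold):
    """Token-overlap precision and recall between two triples.

    Both triples are flattened to bags of lowercase tokens; the overlap
    is the size of the multiset intersection.
    """
    predicted_tokens = Counter(" ".join(predicted).lower().split())
    gold_tokens = Counter(" ".join(gold).lower().split())
    overlap = sum((predicted_tokens & gold_tokens).values())
    n_predicted = sum(predicted_tokens.values())
    n_gold = sum(gold_tokens.values())
    precision = overlap / n_predicted if n_predicted else 0.0
    recall = overlap / n_gold if n_gold else 0.0
    return precision, recall


predicted = ("CD73", "is", "overexpressed in many tumors")
gold = ("CD73", "is overexpressed in", "tumors")
print(term_level_scores(predicted, gold))  # (0.8333..., 1.0)
```

In this toy case, the extra token many lowers precision to 5/6, while recall stays at 1.0 because every gold token is covered.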

Performance of LILLIE's triple extraction pipeline
We first evaluate the triple extraction portion of our system LILLIE using the CaRB Evaluator 4 on the CaRB test set. In addition, we adapt the Re-OIE16 annotations to be used with the CaRB evaluator, in order to provide consistent results.
Initially, we used the CaRB development set, comprising 50% of the sentences, for development and tuning of the rule-based extractor and refiner. We developed a set of generic, domain-independent linguistic rules by analyzing a randomly-chosen sample of 30 sentences for linguistic patterns, then tested these rules on the remainder of the CaRB development set, to ensure maximum generalizability. These rules remained the same throughout all further evaluations and experiments. We then evaluated the performance of our system against the current state-of-the-art systems on the testing portion of the CaRB set, and the previously unseen Re-OIE16 benchmark. Since these two datasets are open-domain, and were not used for training or development, we believe our results on these benchmarks show that our system, and its components, such as the rule sets, are generic and adaptable to many different domains.
The results are shown in Table 5. We observe the most significant improvements for our approach (LILLIE) in the AUC scores, with an approx. 6% increase over both state-of-the-art systems, OpenIE6 and IMoJIE, on the CaRB dataset, and an increase across all metrics on the Re-OIE16 dataset. The precision/recall balance achieved by the triple refiner and combiner of LILLIE is reflected in the improved F1 scores on both datasets. In particular, precision and recall are well-matched on Re-OIE16, even though the system was tuned on the CaRB development set and Re-OIE16 was not used during the training or development of our system, which demonstrates good generalizability. This result is noteworthy as tuning the output of the triple extraction component for these benchmarks was not an isolated goal, with the component instead tailored to be part of our bigger end-to-end system. The generalizability of our domain-independent information extraction approach is also qualitatively depicted in Table 4 via a diverse selection of examples taken from the CaRB test set.

Fig. 6. Enriched OncoMX database after information extraction, entity linking and database integration.

Ablation study
In Table 6, we show the results of each component of our system, in accordance with the architecture shown in Fig. 2. We give AUC and F1 scores on both the CaRB and Re-OIE16 test sets, in order to show the effect of each individual component.
The individual scores for the Rule-Based and Learning-Based components are shown, along with the effects of pre-processing and co-reference resolution, respectively. The Rule-Based pre-processing increases the F1 and AUC scores on both datasets, and the in-place co-reference resolution increases the F1 scores on both datasets, while somewhat degrading AUC. This is due to the fact that the CaRB and Re-OIE16 datasets contain single sentences, rather than longer texts, so the positive effects of co-reference resolution are less apparent. Nevertheless, the addition of this component improves the overall results when the triple refiner is applied.
Finally, we analyze the performance of combining the Rule-Based (RB) and Learning-Based (LB) approaches as indicated in Table 6 by ''Combination of RB and LB''. In particular, we demonstrate the improvements made by the Triple Refinement process by showing the results of a simple union of the triples from both components. A raw combination of the high-recall triples and high-precision triples yields a degradation of both F1 and AUC scores on both sets. However, when using the triple refiner, all metrics are improved, with AUC showing a significant increase on both sets. These results demonstrate that each individual component of our system provides a net-positive improvement in benchmark scores, and that the components, when combined, yield a substantial improvement overall.

Error analysis
In general, LILLIE encounters errors in three main areas: firstly, in qualified phrases, such as ''Research shows that...''; secondly, in sentences which are grammatically ambiguous to parse; thirdly, in triples requiring inference. We will discuss these error areas in more detail below.
(1) Qualified phrases: A sentence containing a qualified phrase is one in which the main clause is not necessarily implied to be factual. In the sentence ''This protein leads to tumor growth.'', the meaning is unambiguous. However, if the sentence were ''Some studies claim that this protein leads to tumor growth.'', the sub-clause is not necessarily implied as fact.
Consider the following sentence from PubMed as an example of a sentence with a qualified phrase: The prevailing view of CD73 is that it is overexpressed in tumors.
LILLIE extracts the following triple:

(CD73 ; is ; overexpressed in tumors)
This is not necessarily a definitively true statement, since it is qualified as an opinion, and outputting this triple may result in ambiguous information being added to a database. Our engine does not account for such qualifying phrases, and outputs these triples as facts. However, OpenIE6 and IMoJIE produce similar unqualified triples, and currently this remains an open problem in Information Extraction.
(2) Grammatically ambiguous sentences: Our approach encounters errors in cases where the sentence is grammatically complex, such as the following, taken from the CaRB development set: US 258 and NC 122 parallel the river north before the two routes diverge northeast of Tarboro.
Here, the dependency parser may fail to produce the necessary parse, in this case due to the term ''parallel'' being misinterpreted as a noun, rather than a verb. However, in use cases such as the PubMed abstracts, where text is often unambiguous, this issue occurs rarely.
(3) Triples requiring inference: For cases where inference is required, our extractor has a lower recall than other engines. Because of our aim to build a high-precision engine for database integration, we make no attempt to infer triples that are not directly implied by the text, as this was found to add additional noise and degrade precision. For example, consider the following title of a paper on PubMed:

Association of Leptin, Visfatin, and Adiponectin With Renal Cell Carcinoma
There may be indirect triples, such as: (Leptin ; is associated with ; Renal Cell Carcinoma) Other systems attempt to extract such triples; however, when testing additional inference methods, we found them to degrade precision, and introduce many noisy superfluous triples. As such, we were unable to maintain our precision-recall balance in these cases. This is an area for further study, as a high-precision method of extracting such information would be valuable for entity linking.

Positive effect of triple enhancements
Using examples from the PubMed abstracts dataset, we now demonstrate the effects of our various triple enhancement procedures, in comparison to the triples output from the OpenIE6 and IMoJIE systems. To begin with, we show the output of both existing systems on our main example sentence: Long non-coding RNA CCAT2 plays an important role in tumorigenesis, tumor growth and metastasis.
The IMoJIE system, while achieving high precision on the test sets, does not output split triples of the form demonstrated in Section 7. Instead, it groups all conjunctive entities into a single triple: (Long non-coding RNA ; plays ; an important role in tumorigenesis , tumor growth and metastasis) This form of triple reduces precision during the entity linking process, particularly in cases where the input text is sufficiently well-formed to reliably split triples. On the other hand, the OpenIE6 system produces a similar output to our method: (Long non-coding RNA ; plays ; an important role in metastasis) However, our split_triples flag allows this behavior to be enabled or disabled depending on the nature of the input data, giving a balance between extraction precision and entity linking precision. Additionally, the pre-processing steps described in Section 4 allow for verbal conjunctions, such as in: Our system uses the subject duplication procedure shown in Fig. 3 and Algorithm 1 to obtain two triples, one for each object entity:

Database enrichment and querying
With these high-precision triples, we can more accurately link textual mentions of named entities to their corresponding entries in a knowledge base. This allows us to perform new queries that were not supported by the original database. We showcase this capability by leveraging the raw information contained in PubMed abstracts to enrich the OncoMX database from Section 8. Using the OncoMX database, we can execute structured relational queries on unstructured, textual information stored in medical articles. Such an example is provided in Fig. 7, which corresponds to the equivalent natural language question: ''What are the genes over-expressed in breast cancer that are reported in the literature?'' We are able to answer this query by exploiting the triples extracted from the PubMed medical articles (unstructured data) and their mapping to genes and anatomical entities (structured data). The goal of our query is to find all literature cases that include ''over-expression'' of a gene, specifically in breast cancer. The result is a subset of 13 genes that are reported as being over-expressed in breast cancer cases.
In a similar manner, we can extend our search to find all anatomical entities where genes are over-expressed due to cancer according to the literature. The results of this query are shown in Fig. 8. The number of returned rows is limited to 20 from the original 70 for visualization purposes.
Finally, we can focus our search on finding all literature cases derived from the triple extraction of PubMed articles which include the keywords ''cancer'' and ''biomarker'' in the extracted triples. The results shown in Fig. 9 contain the captured genes and anatomical entities, along with their corresponding subject-predicate-object relational triples.
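The kind of query discussed above can be approximated on a toy table. The sketch below is not the query from Fig. 7 and does not use the OncoMX schema: the table name `literature_triples` is hypothetical, the columns only mirror the row description given for Fig. 6, and all rows are invented placeholders (no real genes, identifiers or articles).

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table with placeholder rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE literature_triples (
               pmid INTEGER, gene TEXT, uberon TEXT, uberonname TEXT,
               subject TEXT, predicate TEXT, object TEXT)"""
    )
    rows = [
        (1, "GENE_A", "UBERON:PLACEHOLDER_1", "breast",
         "GENE_A", "is overexpressed in", "breast cancer"),
        (2, "GENE_B", "UBERON:PLACEHOLDER_2", "lung",
         "GENE_B", "is overexpressed in", "lung cancer"),
        (3, "GENE_C", "UBERON:PLACEHOLDER_1", "breast",
         "GENE_C", "is mutated in", "breast cancer"),
    ]
    conn.executemany(
        "INSERT INTO literature_triples VALUES (?, ?, ?, ?, ?, ?, ?)", rows
    )
    return conn


def overexpressed_genes(conn, anatomy):
    """Genes whose linked triples mention over-expression for one anatomy."""
    query = """SELECT DISTINCT gene FROM literature_triples
               WHERE uberonname = ? AND predicate LIKE '%overexpressed%'
               ORDER BY gene"""
    return [row[0] for row in conn.execute(query, (anatomy,))]


conn = build_demo_db()
print(overexpressed_genes(conn, "breast"))  # ['GENE_A']
```

The structured columns (gene, uberonname) and the textual triple columns are filtered in the same statement, which is the point of storing the linked triples next to the relational data.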
In order to quantify the effect of our tight integration between triple extraction and entity linking, we also perform a comparison between each system on this same dataset. For this experiment, we run OpenIE6, IMoJIE and two versions of LILLIE (with split_triples enabled and disabled, respectively) on the set of PubMed abstracts, and perform the entity linking procedure described in Section 8. For OpenIE6 and IMoJIE, we use an n-gram based search over the Uberon and OncoMX databases, as they do not provide the annotated triples of our system. This procedure simply searches the databases with all possible n-grams from the subject and object, until the longest match is found, rather than deriving the n-grams from the syntactic structure of the triple.
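The n-gram baseline used for OpenIE6 and IMoJIE can be sketched as a longest-match search. This is an illustrative approximation only: the function name `longest_ngram_match` is hypothetical, whitespace tokenization and case-insensitive exact lookup are assumptions, and the entity set stands in for the Uberon and OncoMX label lists.

```python
def longest_ngram_match(phrase, known_entities):
    """Return the longest contiguous n-gram of the phrase found in the set.

    N-grams are tried from longest to shortest, left to right, so the
    first hit is the longest match.
    """
    known = {entity.lower() for entity in known_entities}
    tokens = phrase.split()
    for n in range(len(tokens), 0, -1):
        for start in range(len(tokens) - n + 1):
            candidate = " ".join(tokens[start:start + n])
            if candidate.lower() in known:
                return candidate
    return None


print(longest_ngram_match("long non-coding RNA CCAT2", {"CCAT2", "RNA CCAT2"}))
```

Unlike the sub-entities derived from the syntactic structure, this search only sees contiguous token windows, so a non-contiguous entity such as long non-coding CCAT2 cannot be matched.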
The dataset consists of 38,703 abstracts, comprising 116,049 sentences. We extract triples from each sentence, and attempt to link each triple with the structured database. Linked triples are ones that contain both an Uberon anatomical entity and an OncoMX gene symbol, with one in the subject and the other in the object, in either order. Partial matches are not recorded. The results of this are shown in Table 7 for OpenIE6 and LILLIE. However, we were unable to run the full dataset with IMoJIE, which encountered memory issues on the large amount of data, so we show the results for a sample of 1000 abstracts (3035 sentences) in Table 8. We also show the average speed of each system, tested on an Intel Core i7-7700HQ 2.80 GHz CPU, with 32 GB RAM and NVIDIA GeForce GTX 1050 GPU. The results show that, when using our entity linking procedure, LILLIE achieves a higher ratio of relevant triples in comparison to OpenIE6. In particular, the higher-precision version, with split_triples disabled, achieves a comparable number of linked triples, on less than half the extractions overall. This shows that our system can additionally be adapted to provide a higher-precision variant, if required.

Fig. 9. Example of a SQL query on the enriched OncoMX database. It allows us to search for cancer biomarkers based on specific keywords found in the literature, showcasing LILLIE's information extraction capabilities.

Table 7
The number of extracted triples on the PubMed abstracts dataset for each system, and the number of these triples that link with both an Uberon anatomical entity and an OncoMX gene symbol. LILLIE shows a higher ratio of relevant triples (linked triples divided by extracted triples) than the state-of-the-art systems.
We report a slower runtime for our system compared to OpenIE6, and a faster runtime than IMoJIE. However, as shown in Tables 5 and 7, we extract more accurate (higher F1 and AUC scores) and more usable (higher ratio of linked triples) triples overall. In the future we will explore more advanced entity matching techniques based on transformer neural network architectures such as the ones presented in [33,58]. However, for our end-to-end data processing pipeline, the current entity linking approach already showed promising results.

Table 8
The number of extracted triples on a sample of 1000 PubMed abstracts, and the number of these triples that link with both an Uberon anatomical entity and an OncoMX gene symbol.

Conclusions
In this paper, we presented LILLIE, an end-to-end system for the enrichment of relational databases with extracted information from unstructured text. We developed (1) a precision-oriented, linguistics-based triple extraction approach using domain-independent generic rules. We combined this approach with (2) a recall-oriented, learning-based triple extraction approach to counter the loss of structural and semantic information. LILLIE not only allows these two approaches to be combined effectively, but also enables enhancements of the extracted information via parameterizable post-processing. Hence, our system is able to adapt to a diverse set of textual domains. Finally, we also leverage entity linking methods to integrate textual entities from our extracted triples into a relational database, thus increasing the contextual value of the extracted entities.
We compared LILLIE's performance with the two popular state-of-the-art OIE systems, IMoJIE and OpenIE6, on the two widely-used benchmark datasets CaRB and Re-OIE16. LILLIE shows a substantial performance gain over the existing systems in terms of AUC, Precision, Recall and F1-score. Moreover, we demonstrated the effects of our triple enhancement processes on a corpus comprising biomedical documents (PubMed abstracts) to highlight the generalizability of our approach.
Future work could investigate the integration of additional learning-based OIE extractors, employing transfer or few-shot learning techniques to enhance and extend information extraction for domain-specific or even multi-lingual corpora for which no training data is available. Another interesting line of work could address entity disambiguation using popular knowledge bases (e.g. DBpedia) or word embedding techniques to resolve complex one-to-many property mappings, further increasing the effectiveness of our system.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.