Biomedical knowledge engineering approaches driven by processing the primary experimental literature

Neuroinformatics databases derived from the literature tend to be much smaller than their bioinformatics counterparts. One neuroinformatics system, CoCoMac, describes roughly 4 × 10^4 neuroanatomical connections from 413 papers; by way of contrast, the Mouse Genome Informatics ('MGI') system contains nucleotide sequences from ~10^5 papers. The resources needed to support a large-scale database are extraordinary (MGI supports a team of 30+ professional biocurators), and even then it remains highly challenging to maintain a complete, up-to-date account of large portions of the literature. Curating knowledge from neuroscientific papers is made even more difficult because (A) the information is largely unstructured (occurring either as natural language text or tables) and (B) the information is semantically complex, typically more so than in genomic studies, where individual genes are linked to ontologies of phenotype or function. Thus, we need to (A) provide tools that accelerate the process of biocuration and (B) define a general-purpose Knowledge Representation (KR) that captures the semantics of neuroscientific knowledge in a mathematically tractable format that is also intuitively understandable to bench scientists.
Here we present an approach for the construction of knowledge bases from the biomedical literature based on a relatively simple, generally applicable KR for scientific observations called 'Knowledge Engineering from Experimental Design' ('KE-f-ED'). This approach is based on the experimental variables being studied and provides a way to represent data, significance relations and correlations in a generalized informatics framework. We describe the basic theory behind the model and demonstrate a sample implementation in the domain of neuroendocrinology. We also describe how this approach provides a framework suitable for text mining that leverages information extraction (IE) technology. In collaboration with Elsevier Science, we downloaded 39,643 full-text articles as XML documents (and 117,602 as PDFs) from multiple neuroanatomically focused journals. We used a Conditional Random Fields (CRF) model to extract individual mentions of variables from text based on a generic model of a specific experiment type (tract-tracing experiments), and we are extending this approach to construct text-mining systems for other experiment types.



Knowledge Engineering from Experimental Design
By focusing only on the variables involved in experiments, we simplify our KR task enormously. First, we partition the literature into domains based on different 'types' of experiment. Crucially, these types are defined by considering only the variables that pertain to the primary observations (rather than attempting to model how an experiment is interpreted or to capture every detail of the protocol). A suitable rule of thumb is to include only those details needed to interpret the results correctly (see Fig. 2).
Following Fig. 1B, the linkage between dependent and independent variables is key to this representation. This linkage is provided directly by consideration of the scientific protocol. Given that computer programs and scientific protocols are both sets of procedural instructions, we use the Unified Modeling Language (UML, a software engineering industry standard) to provide a modeling framework. A schema for tract-tracing experiments is shown in Fig. 3. Each dependent variable is indexed by all of the independent variables that precede it in the flowchart. Thus, this representation of a tract-tracing experiment has the following structure of variables:
Given that PHAL is an anterograde tracer, this information is enough to infer that a neuroanatomical projection exists from CA3 to LS in the rat. Note that careful ontology engineering is required to define correctly all of the variables used within the system. We anticipate automating this sort of interpretive reasoning in the future.
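The indexing rule and the PHAL inference described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `Protocol`, the function `infer_projection`, the variable names (`species`, `tracer`, `injection_site`, `labeled_region`, `labeling_density`) and the value `"present"` are hypothetical and are not taken from the schema in Fig. 3. The only facts drawn from the text are that each dependent variable is indexed by the independent variables preceding it in the flowchart, and that PHAL is an anterograde tracer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Protocol:
    """A protocol flowchart stored as a directed acyclic graph (hypothetical model)."""
    edges: Dict[str, List[str]] = field(default_factory=dict)   # node -> successor nodes
    parameters: Set[str] = field(default_factory=set)           # independent variables
    measurements: Set[str] = field(default_factory=set)         # dependent variables

    def add_flow(self, source: str, target: str) -> None:
        """Record that `source` precedes `target` in the flowchart."""
        self.edges.setdefault(source, []).append(target)
        self.edges.setdefault(target, [])

    def upstream(self, node: str) -> Set[str]:
        """Return every node from which `node` can be reached."""
        parents = {s for s, targets in self.edges.items() if node in targets}
        found = set(parents)
        for parent in parents:
            found |= self.upstream(parent)
        return found

    def index_of(self, measurement: str) -> Set[str]:
        """Independent variables that precede, and therefore index, a measurement."""
        return self.upstream(measurement) & self.parameters


# Assumed (illustrative) tract-tracing flowchart: a simple chain of variables.
tract_tracing = Protocol()
tract_tracing.parameters = {"species", "tracer", "injection_site", "labeled_region"}
tract_tracing.measurements = {"labeling_density"}
chain = ["species", "tracer", "injection_site", "labeled_region", "labeling_density"]
for source, target in zip(chain, chain[1:]):
    tract_tracing.add_flow(source, target)

ANTEROGRADE_TRACERS = {"PHAL"}  # stated in the text; other tracers are not modeled here


def infer_projection(obs: Dict[str, str]) -> Optional[Tuple[str, str, str]]:
    """Return (species, origin, target) if an anterograde tracer produced labeling."""
    if obs["tracer"] in ANTEROGRADE_TRACERS and obs["labeling_density"] != "none":
        return (obs["species"], obs["injection_site"], obs["labeled_region"])
    return None  # no inference is attempted for tracers of unknown direction


# Hypothetical observation; "present" is a placeholder value, not reported data.
observation = {
    "species": "rat",
    "tracer": "PHAL",
    "injection_site": "CA3",
    "labeled_region": "LS",
    "labeling_density": "present",
}

print(sorted(tract_tracing.index_of("labeling_density")))
print(infer_projection(observation))  # ('rat', 'CA3', 'LS')
```

Storing the flowchart as a graph means the index of each measurement is derived from the protocol rather than declared by hand, which is the property the representation relies on.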

Challenge 1: scaling up biocuration with text mining
We present preliminary data from our efforts to implement a text-mining application based on the assumption that an individual experiment is essentially a collection of independent and dependent variables. We use Natural Language Processing to identify named entities corresponding to these variables and their values in the results sections of the full-text articles. Preliminary results show that performance is quite high for automatically reproducing annotations of text based on our tract-tracing model (Fig. 4).

Depth of Representation
The goal of the KE-f-ED model is to be able to represent the primary observations from any study. To test this idea, we examined a short passage of text from a typical scientific narrative (a colleague's grant proposal, consisting of 6 statements citing 11 studies) and represented the experimental evidence supporting each statement as a separate KE-f-ED model (Fig. 5). The citance shown in Fig. 5A reads: "Finally, recent intriguing data from Dallman's lab have raised the possibility that at least part of these inhibitory effects are mediated by altering energy metabolism [Laugero, et al. (2001) Endocrinology, 142(7..."
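Returning to the text-mining step described above, a sequence labeler such as a CRF typically assigns one label per token, and the labeled tokens are then grouped into variable mentions. The sketch below shows only that grouping step, assuming a BIO (begin/inside/outside) label scheme. The source does not state which encoding was used, and the function `decode_bio`, the label names and the example sentence are hypothetical. No CRF is trained or run here.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-labeled tokens into (variable_name, mention_text) pairs."""
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the mention being built, if any.
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))

    for token, label in zip(tokens, labels):
        starts_new = label.startswith("B-") or (
            label.startswith("I-") and label[2:] != current_type
        )
        if starts_new:
            flush()
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-"):
            current_tokens.append(token)   # continue the current mention
        else:                              # "O": token is outside any mention
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


# Hypothetical sentence and labels, for illustration only.
tokens = ["PHAL", "was", "injected", "into", "field", "CA3", ";",
          "labeled", "fibers", "were", "found", "in", "LS"]
labels = ["B-tracer", "O", "O", "O", "B-injection_site", "I-injection_site", "O",
          "O", "O", "O", "O", "O", "B-labeled_region"]

mentions = decode_bio(tokens, labels)
print(mentions)
```

Each recovered pair names a variable from the experiment model and the text span that supplies its value, which is the form needed to populate a record for a tract-tracing experiment.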
Acknowledgements: This work was funded by grant R01 GM-083871 from NIGMS and support from the Center for Health Informatics (CHI) at ISI. We are grateful to Alan Watts and Arshad Khan for their support in developing the basic ideas of the KE-f-ED model, as well as many of our colleagues at ISI and USC for their ongoing contributions and ideas.

Figure 5: Conceptual demonstration of the KE-f-ED model. (A) A typical 'citance' or 'citation sentence' (attrib. Marti Hearst, UC Berkeley) providing a statement based on a cited experimental paper. (B) The design of the relevant experiment from the cited paper, showing a computationally tractable representation of a complex physiological experiment. (C) This computational representation can capture and represent three essential elements of scientific knowledge: data, statistically significant relations between data points, and correlations between variables. The inset graphs are taken from Fig. 4 of Laugero, et al. (2001), showing that, after adrenalectomy, expression of CRF mRNA in the Paraventricular Hypothalamic Nucleus is almost perfectly correlated with sucrose intake on day 5 of the experiment.

Figure 6: Left: Flowchart of initial bare-bones, proof-of-concept build using UML model files and Excel spreadsheets; Right: Summary view of the populated system in a browser.