Introduction

The storage, management, exchange and description of ‘omics-based investigations, such as metabolomics, present challenges to biologists and bioinformaticians (Brooksbank and Quackenbush 2006; Fiehn et al. 2006; Sansone et al. 2006; Shulaev 2006). The Metabolomics Standards Initiative (MSI, http://msi-workgroups.sf.net) Working Groups have recognized that the establishment of reporting standards, such as minimal information requirements and exchange formats with defined semantics, is necessary to enable efficient data sharing and meaningful data mining (Castle et al. 2006). Often these requirements are captured as free text, which is subject to ambiguities, redundancy and typographical errors, and as such reduces the power of computational approaches to retrieve the information and unambiguously interpret the experimental procedures.

Adding an interpretive annotation layer to the textual information is commonly done with representational artefacts (RA) such as structured controlled vocabularies (CVs) and/or ontologies (Cimino and Zhu 2006; Schulze-Kremer 1998), consisting of related representational units (RU). A CV is a set of terms (or RUs), defined by an authority or through community agreement, in most cases formalized as an is-a hierarchy of terms (a taxonomy, although within the bio community this term is often used in the more restricted sense of a biological species taxonomy). Each RU is described by means of attributes such as identifier, name, definition and definition source (Smith et al. 2006). A CV is a simple and intuitive way of inserting an interpretive layer of semantics amongst terms used by different experimentalists to describe (annotate) an experimental parameter, for example a type of sample treatment or instrument, in an unambiguous way. Compared to ontologies, CVs are rather informal and lightweight representational artefacts. An ontology is a more explicit and formal representation of the knowledge in a domain, lying at the top end of the semantic complexity scale. Ontologies are semantically rich representations, containing CV terms as classes as well as their properties, and logical statements for characterising those classes and the ways in which they can or cannot be related to each other.
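
As an illustration only, a CV term with the attributes listed above and a single is-a parent can be sketched in a few lines of Python. The class name, field names, identifiers and definitions below are hypothetical and are not taken from any MSI or OBO resource.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Term:
    """One representational unit (RU) of a controlled vocabulary."""
    identifier: str
    name: str
    definition: str
    definition_source: str
    is_a: Optional[str] = None  # identifier of the parent term, if any


def ancestors(term_id: str, cv: Dict[str, Term]) -> List[str]:
    """Return the names of all is-a ancestors of a term, nearest first."""
    names = []
    seen = {term_id}  # guards against accidental cycles in the hierarchy
    parent = cv[term_id].is_a
    while parent is not None and parent not in seen:
        seen.add(parent)
        names.append(cv[parent].name)
        parent = cv[parent].is_a
    return names


# Hypothetical identifiers and definitions, for illustration only.
cv = {
    "EX:0001": Term("EX:0001", "instrument",
                    "A device used in an investigation.", "example"),
    "EX:0002": Term("EX:0002", "spectrometer",
                    "An instrument that measures spectra.", "example",
                    is_a="EX:0001"),
    "EX:0003": Term("EX:0003", "NMR spectrometer",
                    "A spectrometer that records NMR spectra.", "example",
                    is_a="EX:0002"),
}

print(ancestors("EX:0003", cv))
```

An ontology would go beyond this structure by adding typed relations, properties and logical constraints on the classes.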

By way of defined semantics, ontologies provide regimentations of a given terminology, while the defined syntax increases the interoperability between systems exchanging information. Ontologies facilitate the development of systems for data annotation and natural language processing and thereby ontology-based knowledge representations can extend the power of computational approaches to information retrieval, interpretation of experimental procedures, data exploration and knowledge discovery (Blake and Bult 2006). This potential has encouraged several scientific communities, including those operating in the metabolomics domain, to develop ontologies to be used for data annotation (Bodenreider and Stevens 2006; Field and Sansone 2006; Lan et al. 2003; Rubin et al. 2006; Schulze-Kremer 2002; Shulaev 2006; Stevens et al. 2006).

This article describes the working strategy, the developmental phases, the current activities and the challenges of the MSI Ontology Working Group (MSI OWG, http://msi-ontology.sf.net) in its effort to reach a broad consensus in the community on the formal semantic representation that is required to describe metabolomics investigations unambiguously.

The MSI OWG working strategy

The MSI OWG brings together members from diverse backgrounds and perspectives, including metabolomics practitioners, chemometricians, computer scientists, bioinformaticians (data managers, systems developers and data analysts) and ontology engineers.

Scope

The scope of the MSI OWG is to support the activities of the (1) Biological Context Metadata sub-WGs as well as the (2) Chemical Analysis, (3) Data Processing and (4) Exchange Format WGs (Sumner et al. 2007; Morrison et al. 2007; Griffin et al. 2007; van der Werf et al. 2007; Fiehn et al. 2007). The minimal reporting requirements identified by the first three WGs will inform the development of data exchange standards (Hardy and Taylor 2007) in order to provide a common mode of transport for the information between systems. Our work will ultimately provide a formal semantic interpretation for the format, by developing a common semantic framework to enable the metabolomics user communities to consistently annotate the experimental process and ensure meaningful exchange of their datasets. The MSI OWG has been conceived as a ‘single point of focus’ for communities where independent activities—to develop terminologies and databases for metabolomics investigations—are underway. Interoperability among these systems is the key driving force behind this endeavour.

Operating plan

The MSI OWG plans to (1) reach a consensus on a core set of CVs and (2) develop a corresponding ontology. Specifically the CVs and ontology aim to

  • Assist with the representation of study designs, protocols and instrumentation used, data generated and the types of analyses performed on them;

  • Provide a consensual set of terms for the consistent semantic description of data across disparate metabolomics resources (software and databases, both private and public).

The development of the CVs and an ontology for metabolomics is a long iterative process relying on the following stakeholders to provide input:

  • MSI OWG members as developers of the CVs and ontology;

  • Ontology experts/knowledge engineers to provide advice about the engineering of the ontology and practical use cases for an ontology-driven application;

  • Last but not least, metabolomics practitioners/domain experts to provide use cases for the ontology, validate the CVs and ontology produced and advise on additional terms to be included.

Operating principles

The MSI OWG operates under the assumption that no one group or community alone can bridge the ‘semantic gap’, and that a synergistic effort is the only way forward. We work cooperatively and maintain a public website with the names of participating members to remain approachable, inclusive and transparent while the size of the group and the complexity of the tasks increase. We communicate via two mailing lists. The first list is open to the public, including those who only wish to be kept informed of progress, while the other list is ‘closed’ and available only to those willing to (1) share the terminology they currently use and (2) invest time and expertise in such a collaborative endeavour. Our documents are publicly available via the MSI OWG webpage and drafts of the ontology are posted under the Open Biomedical Ontologies (OBO, http://obo.sf.net; Rubin et al. 2006) umbrella. Readers, potential users and developers wishing to send feedback to this and other MSI WGs can also use the following email address: msi-workgroups-feedback[at]lists.sourceforge.net.

Fortunately, there is a generally accepted view that concerted efforts are required across the functional genomics and systems biology communities to work towards harmonised and interoperable reporting standards. From the very outset, we have striven to reduce the duplication of efforts across ‘omics domains, where commonality exists, through extensive liaisons with other standards initiatives (described in the next sections) and other ontology communities under the OBO Foundry (Smith et al. under review; http://obofoundry.org). Common standards will benefit the entire scientific community by simplifying the task of data integration, and will facilitate the work of software developers, vendors and equipment manufacturers by reducing the time and costs involved in implementing standards-compliant products (Quackenbush 2004).

Developmental phases

Phase 1—Use cases and CV

As described in the section above, ultimately our work will provide a semantic framework for the exchange format, to be agreed upon by the Data Exchange WG, that describes the minimal reporting requirements relevant for the interpretation of metabolomics investigations. Since both the definition of minimal reporting requirements and the development of a data exchange format represent work in progress, our work should be considered explorative and at a very early stage.

Domain coverage and resources

To prioritise our work, we have divided the domain coverage into two main components. Figure 1 shows the components in the investigation workflow: the general components (investigation design, origin of the sample and characteristics, sample treatments, sample collection and computational analysis) and the technology-dependent components (instrument-specific sample preparation, instrumental analysis and data pre-processing). Conforming to a generally accepted view that duplication and incompatibility should be avoided, the CVs (and a subsequent ontology) for the general investigation components are being built as a collaborative effort with standardization initiatives in other ‘omics domains, such as the Human Proteome Organization Proteomics Standards Initiative (HUPO-PSI, http://psidev.sf.net) (Hermjakob 2006; Taylor et al. 2006) and the Microarray Gene Expression Data Society (MGED, http://www.mged.org), as part of the Ontology for Biomedical Investigations (OBI), further described below. OBI promises to be particularly useful for describing the biological sample and is ultimately expected to fulfil the ontological requirements of the MSI Biological Context Metadata sub-WGs. The CV for the technology-dependent components will be our primary focus, starting with the Nuclear Magnetic Resonance (NMR) spectroscopy sub-component. For the Mass Spectrometry (MS) sub-component the OWG will build on previous work by the PSI MS Ontology WG. The CVs for the Chromatography sub-component, shared by both the proteomics and metabolomics domains, will be developed in close collaboration with the PSI Sample Processing Ontology WG. Every effort will be made to cover as many components as possible and, similarly, to evaluate and build on existing public sources of terms (Allen et al. 1995; Bodenreider 2004; de Matos et al. 2006; Jenkins et al. 2004; Kanehisa et al. 2006; Lindon et al. 2005; Soldatova and King 2006; Spasic et al. 2006; Vranken et al. 2005; Wishart et al. 2007).

Fig. 1

The figure shows the main components in a metabolomics investigation workflow. Technology-dependent components of an investigation are shown in the box with vertical lines. Components common to other ‘omics investigations are shown in boxes with horizontal lines.

Naming conventions and metadata recommendations

At present, neither unified naming conventions nor common metadata elements have been agreed upon by the ontology-oriented communities for naming and annotating RUs within RAs as well as the RA as a whole (Rickard et al. 2004; Supekar and Musen 2005). Naming conventions prescribe how CV terms and ontological classes should be named and formulated in a consistent manner to unify term appearance, reducing redundancy and increasing precision. These conventions would also provide guidance to the ontology engineer on how to handle content-related issues, for example Definition and Synonym (semantic naming conventions), and how to tackle lexical issues, such as term/class name length, allowed character set and format, word separators and word tense (syntactic naming conventions). Metadata elements belong to different categories. For example, descriptive metadata adds useful information on RUs, such as definitions or examples, while administrative metadata provides information such as when and how an RU or RA was created (release date, version, authority etc.). In the absence of naming conventions and metadata elements applicable to our case, we have started working on such common recommendations in collaboration with the PSI Ontology WGs (Schober et al. 2007). The use of such common conventions would be pivotal in the development and maintenance of the ontology resource by the large participating communities. First drafts of the naming conventions and metadata ontologies are available from our webpage (http://msi-ontology.sourceforge.net/recommendations).
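
To make the idea of syntactic naming conventions concrete, the following minimal Python sketch checks a term name against a few rules of the kind mentioned above (name length, allowed character set, word separators). The specific limit, character set and rule wording are assumptions chosen for illustration; they are not the conventions drafted by the MSI OWG or the PSI Ontology WGs.

```python
import re
from typing import List

# Assumed rules, for illustration only.
MAX_NAME_LENGTH = 60
ALLOWED_CHARACTERS = re.compile(r"[A-Za-z0-9 \-]+")  # single space as word separator


def check_name(name: str) -> List[str]:
    """Return the list of convention violations for a term name (empty if none)."""
    if not name:
        return ["name is empty"]
    problems = []
    if len(name) > MAX_NAME_LENGTH:
        problems.append("name too long")
    if not ALLOWED_CHARACTERS.fullmatch(name):
        problems.append("disallowed character")
    if name != name.strip() or "  " in name:
        problems.append("irregular whitespace")
    return problems


for candidate in ["NMR probe", "pulse_sequence", "NMR  probe"]:
    print(candidate, check_name(candidate))
```

A check of this kind could be run over a whole term list before it is circulated, so that reviewers can concentrate on definitions rather than on formatting.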

CVs master list

CV master lists for each sub-component will be created iteratively, requiring continuous interactions among the ontology engineers, the domain experts and the other MSI WGs, especially while both the minimal reporting requirements and the format are work in progress. We work according to the following steps:

  1. Start from an initial list of terms for a sub-component from a certain resource (database models, glossaries etc.); add definitions for each term and make these compliant to the naming conventions. Keep track of the relationships between the terms (is_a, part_of etc) if provided for the ontology development phase;

  2. Structure the terms in an is_a hierarchy (taxonomy) for sorting and redundancy removal;

  3. Discuss the CVs within the OWG, and then circulate to the practitioners in the relevant metabolomics area. This will ensure that the lists are as complete as possible, that we obtain valid definitions and will aid ontology construction later on;

  4. Explore the use of text mining over a relevant collection of metabolomics papers to identify frequently used terms and enrich the term list;

  5. Once general agreement has been reached on the initial CVs, further resources will be processed in turn by deciding which of their terms should be incorporated into the initial CV. For each of these terms synonyms, definitions and relationships will be identified as before;

  6. When all resources for a given sub-component have been exhausted, it will be determined which domains remain to be covered. At this stage, we will need to actively collaborate with both the metabolomics practitioners and the other MSI WGs, particularly with the Data Exchange WG, to ensure the quality and completeness of the proposed CV.

Iterative building of such informal ontology models helps to expand our list of terms, relations, their definition or meaning, and additional information such as examples to clarify meaning where appropriate.
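
The structuring step of this procedure (arranging terms in an is_a hierarchy for sorting and redundancy removal) can be sketched as follows. This is a simplified illustration under stated assumptions: the function names and example terms are hypothetical, and redundancy is approximated by comparing names after a basic normalisation, whereas in practice such decisions rest with the domain experts.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def normalise(name: str) -> str:
    """Lower-case and unify separators so near-identical names compare equal."""
    return " ".join(name.replace("_", " ").replace("-", " ").lower().split())


def find_redundant(names: List[str]) -> Dict[str, List[str]]:
    """Group names that normalise to the same string (candidate duplicates)."""
    groups = defaultdict(list)
    for name in names:
        groups[normalise(name)].append(name)
    return {key: members for key, members in groups.items() if len(members) > 1}


def render_taxonomy(is_a: Dict[str, Optional[str]]) -> List[str]:
    """Return the is_a hierarchy as indented lines, siblings sorted by name."""
    children = defaultdict(list)
    for child, parent in is_a.items():
        children[parent].append(child)
    lines: List[str] = []

    def walk(node: Optional[str], depth: int) -> None:
        for child in sorted(children[node]):
            lines.append("  " * depth + child)
            walk(child, depth + 1)

    walk(None, 0)  # terms whose parent is None are the roots
    return lines


# Hypothetical terms, for illustration only.
print(find_redundant(["pulse sequence", "Pulse-Sequence", "probe"]))
hierarchy = {
    "instrument": None,
    "spectrometer": "instrument",
    "probe": "instrument",
    "NMR spectrometer": "spectrometer",
}
print("\n".join(render_taxonomy(hierarchy)))
```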

Phase 2—Ontology

The OWG’s ultimate goal is to combine the CVs master lists and add further formal structure to create a formal ontology. To achieve this goal, the OWG engages with leading experts in the field of ontology and other ontology communities under the OBO and the OBO Foundry umbrellas. The OBO Foundry is a recent initiative that aims to establish a framework for semantic interoperability in the field of life science. To ensure consistent evolution of the ontologies, the OBO Foundry leaders have issued a set of development recommendations, which will be enhanced in the course of time as new aspects of ontology best practice become established. These recommendations will include the use of (1) an upper formal ontology, the OBO Upper Biomedical Ontology (UBO), currently being developed and based on the Basic Formal Ontology (BFO, http://www.ifomis.uni-saarland.de/bfo), to define the top-level class framework under which knowledge representation will be carried out and (2) the Relation Ontology (Smith et al. 2005), providing well-characterized relations to describe how entities relate to each other (e.g., the foundational relations is_a and part_of, but also temporal and spatial relations such as develops_from and located_in). The OBO Foundry also addresses housekeeping needs for ontology maintenance and editing, recommending, among other things, the Web Ontology Language (OWL, http://www.w3.org/2003/08/owlfaq) and OBO as the formats for distribution.
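
For readers unfamiliar with the OBO flat-file format, the sketch below serialises a single term, with an is_a parent and one additional typed relation, as an OBO-style [Term] stanza. It is a minimal illustration rather than a complete or validated writer: the function name, identifiers and definition text are hypothetical, and only a handful of tags (id, name, def, is_a, relationship) are covered.

```python
from typing import List, Optional, Tuple


def obo_stanza(term_id: str, name: str, definition: str, source: str,
               is_a: Optional[str] = None,
               relationships: Optional[List[Tuple[str, str]]] = None) -> str:
    """Serialise one term as an OBO-style [Term] stanza (simplified)."""
    # Escape backslashes and double quotes inside the quoted definition.
    escaped = definition.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        "[Term]",
        f"id: {term_id}",
        f"name: {name}",
        f'def: "{escaped}" [{source}]',
    ]
    if is_a:
        lines.append(f"is_a: {is_a}")
    for relation, target in relationships or []:
        lines.append(f"relationship: {relation} {target}")
    return "\n".join(lines)


# Hypothetical identifiers and definition, for illustration only.
stanza = obo_stanza(
    "EX:0003",
    "NMR probe",
    "The part of an NMR spectrometer that holds the sample.",
    "EX:curator",
    is_a="EX:0001",
    relationships=[("part_of", "EX:0002")],
)
print(stanza)
```

The same information could equally be expressed in OWL, where is_a corresponds to a subclass axiom and part_of to an object property restriction.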

The OWG directly participates in OBI (previously titled FuGO, http://obi.sourceforge.net, Whetzel et al. 2006), an international collaborative project, initiated in 2005, which aims to build a cross-domain ontology as a resource for the annotation of biological, medical and environmental investigations. OBI is an OBO Foundry project set to provide terms that can be used to annotate investigations and the protocols, instrumentation and materials used in those investigations, along with the data generated and analyses performed. OBI brings together HUPO-PSI, the MGED Society and other communities, and the MSI OWG represents the metabolomics domain in this collaborative effort. According to the OBI working strategy, the general experimental components are built collaboratively, while each participating community proposes an informal ontology model relevant to its specific domain. The MSI OWG will provide technology-dependent components by using the relevant OBI “leaf nodes” (e.g., Instrument) as top-level classes. These are then harmonized and positioned within the common BFO top-level ontology to ensure reuse and integration with other existing bio-ontologies, as described in Rosse et al. 2005. The OBI project is driven by a coordinating committee, bringing together representatives of the participating communities, while guidance on design and engineering is provided by an Advisory Board, including recognised ontology experts. OBI is being developed in OWL using the Protégé ontology editor (Noy et al. 2003). Use cases and terms from each community, minutes of the teleconferences, reports from face-to-face workshops and presentations are available from the project website. An initial version of the top-level structure of OBI, using the BFO, is available at the OBO website (http://obo.sourceforge.net/cgi-bin/detail.cgi?obi). A first draft of OBI will be considered ‘completed’ when the general (common) experimental component and all the technology-dependent components have been developed and harmonized (redundancy removed).

Current activities

The MSI OWG posts and maintains the ontological components built under the OBO umbrella, in anticipation of OBI being completed. In these first months of our activity, we have focused on NMR experiments in the context of metabolomics investigations. The NMR.owl (available at the OBO site: http://obo.sourceforge.net/cgi-bin/detail.cgi?nmr) is a pure taxonomy of 247 classes in OWL format annotated with metadata through annotation properties. This ontological component encompasses terms of different types: (1) methods, (2) instruments, (3) parameters that can be measured, and other terms. Once collected, the initial terms were normalised according to the proposed naming conventions (synonyms, acronyms and abbreviations added where known) and taxonomized using the Protégé editor (Fig. 2). Subsequently these were placed (binned) under the relevant OBI and BFO classes. To populate the initial list of terms and then to refine the ontology, we are currently exploring a text-mining approach over a relevant domain-specific collection of MEDLINE abstracts (http://www.ncbi.nlm.nih.gov/entrez/) and full papers (especially the Material and Methods sections) where available from PubMed Central (http://www.pubmedcentral.gov/) (Spasic et al. 2007).
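
The general idea of mining text for candidate terms can be conveyed with a deliberately simple sketch that ranks two-word phrases by the number of texts in which they occur. This is not the approach of Spasic et al. (2007), whose details are not given here; the stop-word list, function name and example sentences are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# A small, assumed stop-word list; a real application would use a fuller one.
STOPWORDS = {"a", "an", "and", "at", "by", "for", "in", "of", "on",
             "the", "to", "was", "were", "with"}


def candidate_terms(texts: Iterable[str], top: int = 5) -> List[Tuple[str, int]]:
    """Rank two-word phrases by the number of texts they occur in."""
    counts: Counter = Counter()
    for text in texts:
        words = re.findall(r"[a-z0-9]+", text.lower())
        # A set is used so that each phrase counts once per text.
        bigrams = {f"{first} {second}"
                   for first, second in zip(words, words[1:])
                   if first not in STOPWORDS and second not in STOPWORDS}
        counts.update(bigrams)
    return counts.most_common(top)


# Invented sentences standing in for abstracts or Methods sections.
texts = [
    "Spectra were acquired with a standard pulse sequence.",
    "The pulse sequence used presaturation.",
    "Each pulse sequence was repeated.",
]
print(candidate_terms(texts, top=1))
```

Phrases that recur across many documents but are absent from the current term list would then be passed to the domain experts for review.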

Fig. 2

A fragment of the NMR ontology in Protégé, with the class/subclass relationship (taxonomy) on the left and the description of the term/class on the right-hand side (including definition and administrative metadata). Top-level classes, imported from OBI or BFO, are displayed in the hierarchy in a lighter shade.

The initial source of terms for the CV is Rubtsov et al. (2007). As stated before, the minimal reporting requirements and the format are both work in progress conducted by other MSI WGs; therefore, the ontology for the NMR sub-component should not be considered complete at this stage. The NMR.owl has also served as a test bed to evaluate the BFO top-level ontology as well as technical issues such as the OWL import, the cross-ontology referencing mechanism, modularisation, constraint inheritance and the usage of RA and RU metadata annotation properties in Protégé. Overall, this experience has been an excellent use case for practising our working strategy and our collaboration with the larger OBI group.

Concluding remarks

Every effort will be made to meet the group goals in a timely fashion, although the MSI OWG members are geographically distributed and central funds do not exist for the MSI WGs’ activities. One of the major bottlenecks in building bio-ontologies is the lack of a unified methodology and tools for collaborative development, making large collaborative endeavours more challenging (Castro et al. 2006). The MSI OWG and the OBI WG pose scenarios in which domain experts are geographically widespread and the structure of the ontology is constantly evolving. Consequently, face-to-face workshops have proved to be the most efficient way to significantly advance the project. In addition, the sociological barriers involved in these kinds of large-scale collaborations can be far more challenging, and extensive liaison is necessary between communities. Managing this process of consensus building from start to finish requires ample time, resources and expertise. The time invested to identify commonalities and synergies with other projects, such as OBI, is often limited due to a lack of resources. The massively collaborative nature of the ontology undertaking requires frequent face-to-face workshops to create the optimal conditions for building consensus. Teleconferences and web meetings are also used, but these are generally short and are not an ideal mechanism for efficient collaborative development; rarely are they as effective as the direct interactions established at face-to-face workshops. Unfortunately, it is very difficult to hold such workshops without central funds, and such funds have previously been difficult to obtain in competition with more traditional scientific projects. In the special issue of the journal OMICS (Field and Sansone 2006), twenty invited manuscripts describe different standardisation initiatives, focusing on the successes and pitfalls encountered and the lessons learned. This issue also includes a special call for action for further recognition of the importance of global ‘omics standardisation activities (Brooksbank and Quackenbush 2006), where the authors eloquently describe the Herculean efforts that are often accomplished ‘on the side’ and without formal funding, simply because the lack of standardisation is an unacceptable state of affairs for ‘omics researchers and is repeatedly proving to be a significant bottleneck in the collection, querying, processing and sharing of data.