1 Introduction

The development of biomedical applications contributes to the significant increase in the quantity of generated data. This is why there is an urgent need for intelligent tools that support the integration, harmonization and retrieval of this data. In this context, the exploitation of Knowledge Organization Systems (KOS) is promising [1]. According to Hodge [2], KOS are intended to encompass all types of schemes: classifications and categorizations, taxonomies, thesauri, as well as semantic networks and ontologies. These schemes, which are defined using different knowledge representation models, are widely used in the biomedical field for various purposes. This is for instance the case for the ICD-9-CM classification that is used in billing systems for reporting diagnosis; the MedDRA terminology used to encode drug reports; the NCI Thesaurus (NCIT) implemented in the cancer research nomenclature; and SNOMED CT (SCT) which helps in organizing the content of Electronic Health Records.

This heterogeneity in the resources available for ensuring semantic interoperability between systems is one of the specificities of the biomedical domain, and it is mainly due to the size of the domain. Since each existing KOS is dedicated to a specific subfield, as is the case for the Foundational Model of Anatomy [3], which represents the phenotypic structure of the human body, or the Gene Ontology [4] for genetics, users are forced to implement several KOS in their systems to optimize the coverage of the domain of interest. In addition, the frontiers between these subfields are quite fuzzy, so some elements of the KOS overlap: the characteristics or information needed to define the semantics of a real-world object are duplicated in more than one KOS. For example, the code C80099 denoting the “Menopause, Premature” concept in the NCIT could be used to annotate a report dealing with this medical problem; however, the same concept has a different code in ICD-9-CM (256.31).

This basic example underlines the necessity to define mappings, which express semantic correspondences between entities belonging to different KOS [5]. Mappings are important because users looking for reports annotated with C80099 may also be interested in documents tagged with the corresponding code in ICD-9-CM, since these are supposed to contain relevant information as well. In order to return such information, the system must be able to understand that both concepts, coming from different sources, are equivalent. To cope with this aspect of element overlapping in the biomedical domain, institutions like the U.S. National Library of Medicine (NLM) define mappings between KOS, which are periodically released and distributed through applications like the Unified Medical Language System (UMLS) [6] and BioPortal [7].

Another specificity of the biomedical domain concerns the rapid evolution of biomedical knowledge as pointed out by Baneyx & Charlet [8]. This implies that KOS must be updated in order to respect, as faithfully as possible, the evolution of the domain. As a consequence, new versions of the same KOS are frequently released, and the already established mappings between KOS have to be maintained each time a new version of a KOS is released. Even if the Semantic Web community has intensively studied ontology evolution [9], many aspects deserve a closer attention, in particular in the biomedical domain. An example might be the impact of changes in the established mappings.

In this context, the purpose of this paper is twofold. First of all, it aims at providing a quantitative and qualitative analysis of the mapping maintenance problem in the biomedical domain, with an emphasis on the lack of existing works in the field. This study has been realized on both the main KOS of the domain (e.g., NCIT, ICD-9-CM, SCT and MedDRA) and the mappings provided by the BioPortal. Second, based on this analysis, we sketch the outlines of the DyKOSMap approach. This approach aims at maintaining (adapting) mappings that exist between biomedical KOS in a more automatic way, by taking into account the semantic relationship type of the existing mappings combined with the evolution of these KOS.

The remainder of the paper is organized as follows: Sect. 2 presents both the results of our study regarding the mapping maintenance problem as well as the related works of the field. Section 3 introduces the DyKOSMap approach proposed in order to cope with the lack of existing approaches and the specificities of the problem previously introduced. We will also introduce the main characteristics of our approach through concrete examples borrowed from the biomedical field. We discuss the main strengths and weaknesses of the proposed approach in Sect. 4. Finally, Sect. 5 wraps up with concluding remarks and outlines future work.

2 The Mapping Maintenance Problem

Even if mapping maintenance has been recognized as an important problem to be tackled in the biomedical domain, little work has been proposed to solve it. As a consequence, it is necessary to have a clear understanding of the problem characteristics as well as of the elements that must be taken into account in the design of a solution. The analysis of the problem provided in this section is based on the observations we have made on the evolution of various KOS that are MedDRA, SCT, ICD-9-CM and the NCIT, and on the impact of their evolution on the mappings established between them.

2.1 Quantitative Analysis of the Problem

By virtue of the size of existing biomedical KOS (millions of elements in SCT), their dynamic nature and the quantity of mapped elements (e.g., about 100,000 mappings between ICD-9-CM and SCT), the solutions provided to maintain mappings have to be as automatic as possible in order to support the knowledge engineers in charge of this laborious and error-prone task. To support this argument, we made a quantitative analysis of KOS evolution and its impact on existing mappings. Our study consisted in observing all the changes that occurred in NCIT from March 2010 (version 10.01) to October 2011 (version 11.09) and identifying the number of mappings between NCIT and ICD-9, and between NCIT and MedDRA (v. 12.0), affected by these changes. According to information provided by the BioPortal application, 1,162 different concepts were mapped from NCIT (v. 10.01) to ICD-9, and 6,195 different concepts were mapped from NCIT (v. 10.01) to MedDRA. In October 2011 (after 19 months), 583 NCIT elements had been modified (merged, split or removed). This was, for instance, the case of the concept code C80099 (Menopause, Premature), which was originally mapped to the concept code 256.31 (Premature menopause) of ICD-9, and to the concept labeled “Menopause” associated with the code 10027308 in MedDRA. In the new NCIT version, the concept C80099 was merged with the concept C62595 (Premature_Ovarian_Failure). Consequently, due to modifications in NCIT, the previously established mappings are no longer valid and therefore cannot be exploited by the underlying biomedical systems unless they are updated. Our analysis shows that 0.8 % of the changes in NCIT affected 0.4 % of the mappings established between NCIT and ICD-9. For the mappings between NCIT and MedDRA, 7.37 % of the changes in NCIT (between the considered versions) affected 0.69 % of the mappings.
These observations indicate that mappings can hardly be maintained manually in an acceptable time, because of the time spent identifying the impacted mappings. At the same time, since only a small fraction of the mappings is affected, it is unreasonable to re-compute the whole set of mappings each time a KOS evolves. This shows the real need for intelligent techniques and tools to cope with the maintenance of mappings affected by KOS evolution.

2.2 Qualitative Analysis of the Problem

We define mapping maintenance as:

“The modifications performed on the mappings established between KOS in order to keep them valid when these KOS evolve.”

According to this definition, several dimensions have to be considered. We will highlight various facets of these dimensions through a set of experiments which is part of our study. The definition implies that mappings must already exist between KOS before they evolve in order to be maintained. Moreover, additional sets of experiments must show that not only modifications occurring in KOS impact the mappings, but also the types of modification (e.g., the addition or removal of elements) and, in turn, the KOS model (e.g., ontology, thesauri, classification, etc.).

In this context, some attempts have been proposed in the literature. The majority of them focus on database schemas and are therefore strongly linked to the Entity-Relationship model. This is why they cope inadequately with the specificities of heterogeneous KOS models. Fagin et al. [10] show how two fundamental operators on schema mappings, namely composition and inversion, can be used to address the mapping adaptation problem in the context of schema evolution. Due to the nature of database schemas, the adaptation process does not take into account the type of changes and their consequences. The work done by An et al. [11] concerns the maintenance of mappings established between ontologies and XML schemas. A formal solution is provided to (semi-)automatically adapt the mappings in order to maintain their validity when XML schemas evolve. An et al. present a plan which consists of several strategies to adapt a new semantic mapping. The provided solution clearly depends on the specificities of XML schemas and their limited possibilities of change. Additional investigation is required to figure out whether the proposed technique can be applied to other KOS models.

In the case of ontologies, Martins & Silva [12] have proposed an original method for maintaining ontology mappings impacted by the element removal operation. Based on this method and depending on the nature of the ontology change strategy (elementary or composite changes), several cases have to be taken into consideration. In this approach, user intervention is necessary when the evolution leads to inconsistent ontologies. The mapping maintenance strategies are based on elementary descriptions of changes in a log, which may hardly support the maintenance process. Moreover, the proposal is too dependent on the Semantic Bridge Ontology (SBO) model, which is not a common way to define mappings. Recently, Khattak et al. [13] proposed an approach providing the benefits of mapping reconciliation between updated ontologies. This approach basically uses the change history of ontology to reduce the time required for reconciling mappings among ontologies. It considers the modified ontological elements, and not the nature of their changes and re-computes the mappings using matching algorithms with only the changed elements as input. In their work on ontology evolution, Hartung et al. [14] are interested in mappings, but between two versions of the same ontology. This implies that the KOS models do not interfere with the maintenance process. They provide a formally described set of changes that can occur within ontology. Nevertheless, to identify changes that have occurred, not all information provided in ontology is exploited, and the way similarity between ontology elements is computed to determine the type of changes is not explained.

Despite the improvements brought by these proposals in tackling the mapping maintenance problem, they either fail to take into account the type of changes that occurred in a KOS, or support only a specific KOS model and therefore must be adapted to cope with the specificities of the biomedical domain. To overcome these gaps, we introduce the DyKOSMap framework, which addresses the mapping maintenance problem. The design of the framework is in line with the definition of mapping maintenance since it takes into account the type of changes in a KOS, the underlying heterogeneous KOS model and the information provided by existing mappings.

3 The DyKOSMap Approach for Biomedical KOS Mapping Maintenance

According to our study, changes occurring in a KOS, independent of its model, can have a direct impact on the mappings that have been established between the modified KOS and another one. This idea is the foundation of the approach supported by the DyKOSMap framework. Its main objective is to support, in a (semi-)automatic way, the maintenance of existing mappings between heterogeneous KOS through a coherent integration of the types of changes that affect a KOS, its underlying knowledge representation model and the rules governing the overall process. The results of the process consist of a set of up-to-date mappings and a description of the history of mapping changes, which is used to retrieve information about old mappings.

3.1 Influence of the Types of Changes in the Evolving KOS

The types of changes affecting a KOS have to be determined from the difference between two versions of the same KOS. These types of change are usually complex ones corresponding to sets of basic (atomic) change combinations [15]. Atomic changes are additions or removals of basic KOS entities; e.g., the addition of a concept or an attribute, or the removal of a relation. A merge, a split or any other aggregation of these changes are examples of types of complex changes. For instance, the removal of the concept C62595 (Premature_Ovarian_Failure) from NCIT version 10.01 and the combination of its content with the concept C80099 (Menopause, Premature) from NCIT version 11.09 is a complex change corresponding to a merge operation.
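As a minimal sketch of how atomic changes could be derived from the difference between two versions, assume that each KOS version is available as a dictionary mapping a concept code to a set of (attribute, value) pairs. This representation, the function name `atomic_changes` and the `synonym` attribute are assumptions made for illustration only; they are not taken from NCIT or from the DyKOSMap framework.

```python
def atomic_changes(old, new):
    """Return the atomic additions and removals between two KOS versions.

    `old` and `new` map a concept code to a set of (attribute, value) pairs.
    """
    changes = []
    # Concepts present only in the old version were removed.
    for code in sorted(old.keys() - new.keys()):
        changes.append(("remove_concept", code))
    # Concepts present only in the new version were added.
    for code in sorted(new.keys() - old.keys()):
        changes.append(("add_concept", code))
    # For surviving concepts, compare their attribute sets.
    for code in sorted(old.keys() & new.keys()):
        for attribute in sorted(old[code] - new[code]):
            changes.append(("remove_attribute", code, attribute))
        for attribute in sorted(new[code] - old[code]):
            changes.append(("add_attribute", code, attribute))
    return changes


# Hypothetical content for the two concepts of the running example.
v1 = {
    "C62595": {("label", "Premature_Ovarian_Failure")},
    "C80099": {("label", "Menopause, Premature")},
}
v2 = {
    "C80099": {
        ("label", "Menopause, Premature"),
        ("synonym", "Premature_Ovarian_Failure"),
    },
}
changes = atomic_changes(v1, v2)
```

Here the list contains one concept removal and one attribute addition. Taken separately they are atomic changes; recognizing that together they form a merge is the role of the complex-change identification discussed next.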

The recognition from one version to another of types of complex changes operated in a KOS can be easier if these possible types are already known. We thus propose an approach based on the exploitation of generic descriptions of types of complex changes via change patterns (CPs) as it is done by Hartung et al. [14]. That way, in the previous example, the identification of the complex changes could be based on a merge pattern that applies at the moment a concept is deleted and most of its content appears in another (new) concept (i.e., there is a similarity between them). The patterns are generic, and when applied, they are instantiated with entities of the evolving KOS.

CPs need to be defined and represented according to KOS models in order to take into account the underlying knowledge representation model. In order to improve the CP design, an empirical observation of old releases of different types of biomedical KOS is necessary over a significant period of time. Until today, we have studied CPs with regard to ICD-9-CM. Table 1 illustrates some examples of CPs designed observing ICD-9-CM. They present possible variations of merge and split complex changes. Besides those, other variations may also be found.

Table 1. Examples of change patterns

In the examples of Table 1, C{x?}v1 denotes concepts from a KOS version 1, while C{x?}v2 represents KOS elements which were added, removed or modified in version 2 of the same KOS. “Merge A” describes a situation where CAv1 and CBv1 were removed from a KOS and CCv2 appears in the next version of the KOS. All three elements are semantically similar, and the identifier code of CCv2 is different from the codes of CAv1 and CBv1. The semantic similarity value between these elements can be calculated using different methods. Currently, we measure the distance between the strings used to describe concepts (e.g., label, attribute values) and we check whether the applied changes are relevant from the semantic point of view using the MetaMap application. In the same light, the “Split B” pattern describes a situation where part of the information attached to the CAv1 concept is used to create a new concept, CBv2, while CAv2 represents the modified CAv1. This situation frequently occurs when the attributes of a concept are modified. These different change patterns may lead to divergent ways of maintaining the mappings. The possible patterns or their variations for other biomedical KOS models are under study.
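The “Merge A” description can be sketched as a simple detector. The sketch below makes several assumptions that are not part of the original study: it uses Python's `difflib.SequenceMatcher` as a stand-in for the string distance, an arbitrary threshold of 0.7, labels only (no other attribute values), and it omits the semantic check entirely. The codes and labels are invented for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar(a, b, threshold=0.7):
    """Case-insensitive string similarity test (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def find_merge_a(removed, added, threshold=0.7):
    """Return (CAv1, CBv1, CCv2) code triples matching the "Merge A" pattern.

    `removed` maps codes of concepts deleted from version 1 to their labels;
    `added` maps codes of concepts new in version 2 to their labels.
    """
    instances = []
    for (code_a, label_a), (code_b, label_b) in combinations(sorted(removed.items()), 2):
        for code_c, label_c in sorted(added.items()):
            # The new concept must carry a different identifier code.
            if code_c in (code_a, code_b):
                continue
            # All three elements must be similar to one another.
            if (similar(label_a, label_b, threshold)
                    and similar(label_a, label_c, threshold)
                    and similar(label_b, label_c, threshold)):
                instances.append((code_a, code_b, code_c))
    return instances


# Hypothetical concepts, not taken from ICD-9-CM.
removed = {"X1": "Fracture of arm, open", "X2": "Fracture of arm, closed"}
added = {"X3": "Fracture of arm", "X4": "Diabetes mellitus"}
instances = find_merge_a(removed, added)
```

The pattern itself is generic; the returned triples are its instances, bound to concrete entities of the evolving KOS.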

There are many different approaches for identifying changes in the context of ontology evolution. One of them specifically deals with this identification by computing the difference between different versions of the same ontology. However, within this approach few proposals explicitly attempt to recognize complex changes [16-18], and the specificities of KOS models are not taken into account. The patterns described in Table 1 have been designed based on the observation of the evolution of the elements of ICD-9-CM. There are not many attempts in the literature to design CPs empirically, i.e., based on experiments observing data evolution over time. Javed et al. [19] proposed a graph-based pattern discovery algorithm that analyzes ontology change logs to attempt an automatic derivation of possible change patterns. Shaban-Nejad [20] utilizes category theory for representing changes as patterns in a framework used for analyzing biomedical ontological changes. None of these approaches defines patterns empirically or considers that changes may have distinct meanings according to different KOS models.

3.2 Techniques to Maintain Mappings

According to the selected CPs and KOS models, a set of actions (add, modify or delete) has to be carried out to update or refine the mappings. As for the CPs, we propose to define these actions in a generic way. The actions are pre-defined components of heuristics stored in a catalog. When a heuristic applies, it is instantiated: its variables are replaced by concepts or relationships of the KOS.

Example of a heuristic:

“For the CP “Split A” (Table 1), add the “part of” relationship between CBv2 (respectively CCv2) and all concepts once linked by an equivalence relationship to CAv1”.

In this example, we assume that “part of” is a possible type of semantic relationship which can be established between the involved biomedical KOS elements. But, this heuristic does not indicate all the actions having to be executed. If the model of the KOS that is affected by evolution can represent hyperonymy or hyponymy relations between its elements, and if CAv1 was linked to a more general concept, then CBv2 and CCv2 will also be linked to this concept after evolution. In that case, new mappings that are a kind of heritage of the old mappings related to CAv1, must also be added in relation to CBv2 and CCv2. Therefore, the actions to make mappings evolve depend on the type of changes in the KOS (taken into account through patterns in the “For” part of the heuristics above) and also on the KOS model.
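The example heuristic for “Split A” could be sketched as follows, assuming a mapping is represented as a (source, relation, target) tuple. This representation, the function name `apply_split_a` and the concept codes are hypothetical. The sketch covers only the action stated in the heuristic; the inherited mappings discussed above, which depend on the KOS model, are not handled.

```python
def apply_split_a(mappings, ca_v1, new_concepts):
    """Add "part of" mappings for the concepts resulting from a split.

    For every concept once linked to `ca_v1` by an equivalence relationship,
    each concept in `new_concepts` (CBv2, CCv2) receives a "part of" mapping
    to it. Existing mappings are left untouched.
    """
    added = []
    for source, relation, target in mappings:
        if source == ca_v1 and relation == "equivalence":
            for concept in new_concepts:
                added.append((concept, "part of", target))
    return mappings + added


# Hypothetical codes: CA was split into CB and CC.
mappings = [
    ("CA", "equivalence", "T1"),
    ("CA", "broader", "T2"),   # not an equivalence, so it triggers nothing
    ("CZ", "equivalence", "T3"),
]
updated = apply_split_a(mappings, "CA", ["CB", "CC"])
```

Instantiating the heuristic amounts to binding `ca_v1` and `new_concepts` to the entities found in the corresponding CP instance.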

The approach that we propose aims at keeping the mappings valid after the evolution of a KOS, and also at keeping track of their history. This is done through a coherent integration of the instances of CPs, the corresponding set of actions suggested by the heuristics and the last (current) version of the mappings, and their associated history. More precisely, the system has to identify the mappings that will be reviewed first, i.e., all mappings between the addressed KOS and the other KOS. Then, based on both the information from the instantiated CPs and the actions selected according to each CPs instance, the maintenance is performed on the impacted mappings only. In parallel, the system records the modifications of the mappings updating, thus of the mappings history. We obtain consistent and up-to-date mappings together with the history of mapping changes and KOS. How to manage the consistency of the mappings when they are updated is still an open research problem that will be tackled in future investigations.
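Restricting maintenance to the impacted mappings could be sketched as a partition of the mapping set, assuming each CP instance lists the concepts it involves. The dictionary layout of a CP instance, the function name `split_by_impact` and the `mapped_to` relation label are assumptions made for illustration; the source does not state the relation type of the C80099 mapping.

```python
def split_by_impact(mappings, cp_instances):
    """Partition mappings into (impacted, untouched).

    A mapping is impacted when its source concept takes part in at least
    one change pattern instance.
    """
    changed = set()
    for instance in cp_instances:
        changed.update(instance["concepts"])
    impacted = [m for m in mappings if m[0] in changed]
    untouched = [m for m in mappings if m[0] not in changed]
    return impacted, untouched


mappings = [
    ("NCIT:C80099", "mapped_to", "ICD9:256.31"),   # relation label assumed
    ("NCIT:C84382", "equivalence", "ICD9:V25.2"),
]
cp_instances = [
    {"pattern": "merge", "concepts": {"NCIT:C80099", "NCIT:C62595"}},
]
impacted, untouched = split_by_impact(mappings, cp_instances)
```

Only the `impacted` list is handed to the heuristics, while `untouched` is carried over unchanged to the new mapping version.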

To illustrate how CPs and actions from heuristics are combined to maintain mappings in practice, let us reconsider the example regarding “Premature Menopause”. In this example, the concept code C80099 (Menopause, Premature) from NCIT is mapped to the concept code 256.31 (Premature menopause) in ICD-9, and to 10027308 (Menopause) in MedDRA. In the new version of NCIT, the concept C80099 is merged with the concept C62595 (Premature_Ovarian_Failure) and C62595 is removed from the KOS. The various CPs concerning NCIT are under study and may diverge from those studied for ICD-9-CM (Table 1). The mappings between the newly merged concept of NCIT and concepts from MedDRA and ICD-9 must be maintained, but their relationships may change.

In this example, let us assume that the CP which applies considers that the remaining concept includes the removed concept. This CP is close to the “Merge B” pattern presented in Table 1. The heuristics associated to this NCIT CP must propose the following actions to maintain the mappings between NCIT and ICD-9:

(1) Copy the existing relationships between the concept C62595 and the concept of ICD-9 and add them to the modified concept C80099. In this example, once the description of C62595 has been added to C80099, the links to 256.31 and 10027308 may still be considered valid; in some cases more fine-grained constraints will need to be analyzed to determine whether the relationships of the mappings have to be reconsidered or not;

(2) Delete the mapping related to C62595 and delete possible redundant mappings related to C80099 (if any).

The system has to interpret and transform these actions into basic tasks that are performed on the mappings. The execution of these tasks depends on the formalism (language) used to express mappings. The previous example shows a merge CP, and the adaptation actions performed are copy, add, and delete. A similar heuristic must be applied to update the mappings between NCIT and MedDRA.
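The two actions above can be sketched as basic tasks on a list of (source, relation, target) tuples. This is a sketch under stated assumptions: the tuple representation, the function name `apply_merge`, the `mapped_to` relation label and the mapping attached to C62595 are hypothetical, since the source does not list the mappings of C62595 or the relation types involved.

```python
def apply_merge(mappings, removed, kept):
    """Maintain mappings after `removed` has been merged into `kept`.

    Action (1): relationships of the removed concept are copied to the
    kept concept. Action (2): mappings of the removed concept are deleted,
    as are redundant mappings of the kept concept.
    """
    result = []
    for source, relation, target in mappings:
        if source == removed:
            # Copying to `kept` and deleting the original in one step.
            source = kept
        candidate = (source, relation, target)
        if candidate not in result:  # drop redundant mappings
            result.append(candidate)
    return result


mappings = [
    ("C80099", "mapped_to", "ICD9:256.31"),
    ("C62595", "mapped_to", "ICD9:256.31"),  # hypothetical mapping
    ("C62595", "mapped_to", "ICD9:256.39"),  # hypothetical mapping
]
maintained = apply_merge(mappings, removed="C62595", kept="C80099")
```

The sketch keeps relation types unchanged; deciding whether they must be refined after the merge is a separate step.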

In this example, the new mapping relations between the merged concept and the previously mapped concepts will be the inverse of “part of”. However, defining the exact semantics of the modified relationships is sometimes complex, given the new definition of the concepts. Possible refinements of the relationships in the mappings could consider the new attributes and relationships aggregated between the concepts when merging.

A real example that illustrates a complex change leading to a mapping refinement is related to the concept code C84382 (Sterilization) in the NCIT, which is mapped to ICD9/V25.2 (Sterilization) and to MedDRA/10054640 (Sterilization). Figure 1 presents an illustration of this case. After evolution, C84382 (Sterilization) in NCIT is split into C84382 and C95019 (Disinfectant) (as the “Split B” change pattern in Table 1). Consequently, the relationship of “Equivalence” type between C84382 (Sterilization) and ICD9/V25.2 (Sterilization) on one hand and MedDRA/10054640 (Sterilization) on the other hand have to be redefined. Mappings between some of these elements already existed but the relationships were not the same. The “equivalence” relation has to be converted into a “part-of” relation. In this case, the CP that has to be recognized is “split”, and an example of instantiated action to be applied is (explained in natural language): “Modify the <equivalence> relation into <part-of> from <KOSa_C1> = NCIT_C84382 to <KOSb_C2> = ICD9_V25.2”.
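The instantiated action quoted above could be executed as sketched below, with each modification also recorded in a mapping history. The tuple representation of mappings, the function name `modify_relation` and the layout of history entries are assumptions made for illustration.

```python
def modify_relation(mappings, history, source, target, old_relation, new_relation):
    """Replace `old_relation` by `new_relation` in the mapping from
    `source` to `target`, appending the change to `history`."""
    updated = []
    for mapping in mappings:
        if mapping == (source, old_relation, target):
            replacement = (source, new_relation, target)
            history.append({"action": "modify", "before": mapping, "after": replacement})
            updated.append(replacement)
        else:
            updated.append(mapping)
    return updated


mappings = [
    ("NCIT_C84382", "equivalence", "ICD9_V25.2"),
    ("NCIT_C84382", "equivalence", "MedDRA_10054640"),
]
history = []
# Instantiated action: <KOSa_C1> = NCIT_C84382, <KOSb_C2> = ICD9_V25.2
mappings = modify_relation(
    mappings, history, "NCIT_C84382", "ICD9_V25.2", "equivalence", "part-of"
)
```

A second instantiation of the same action, with `<KOSb_C2>` bound to the MedDRA concept, would handle the other mapping.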

Fig. 1. Example of mapping maintenance

3.3 Towards the DyKOSMap Framework Definition

The general process proposed for mapping maintenance can be performed each time a new version of a KOS is released. In this section, we describe the workflow related to the mapping evolution process. An illustration of the DyKOSMap framework supporting the whole process is depicted in Fig. 2.

Fig. 2. The DyKOSMap framework

First, the changes between two different versions of a KOS are identified (CI in Fig. 2). To support this task, our approach relies on the use of previously designed sets of CPs. CPs aim at recognizing what types of changes (e.g., merge, split) were performed on the elements of the two KOS versions. CPs are specific to the KOS models. The selection of the CPs depends on the KOS model and on the changes between the two versions of the KOS under study. Instances of CPs are generated by the CI module. The result serves as input to the Mapping Evolution (ME) module.

The Mapping Evolution (ME) module aims at maintaining mappings up-to-date with respect to the occurred changes, and at generating the histories (for keeping track) of both KOS (KOS history in Fig. 2) and mapping evolution (mapping history in Fig. 2). To support these tasks, the module relies on the use of previously designed sets of CPs and heuristics. Thus, modifying the CPs and/or the heuristics may also modify the way mappings are maintained in the ME process as a whole. The heuristics that have to be satisfied are different according to the possible changes of a KOS model identified by CPs. Heuristics selected by the ME module are coherent to the instantiated CPs obtained from the CI module. The decision of clearly separating CPs and heuristics is justified by the fact that they must be able to evolve over time, independent of the mapping maintenance process. Selected heuristics drive the overall mapping maintaining phase by providing the right actions to be done to maintain affected mappings without recalculating all the mappings.
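The separation between CP instances and heuristics could be sketched as a catalog lookup: the ME step receives CP instances from the CI step and picks the heuristic registered for each pattern. All names (`CATALOG`, `mapping_evolution`, the two heuristic functions), the dictionary layout of CP instances and the `mapped_to` relation label are hypothetical; the sketch does not describe the actual DyKOSMap implementation, which is still under development.

```python
def redirect_on_merge(mappings, instance):
    """Move mappings of the removed concept to the kept one, without duplicates."""
    redirected = []
    for source, relation, target in mappings:
        if source == instance["removed"]:
            source = instance["kept"]
        if (source, relation, target) not in redirected:
            redirected.append((source, relation, target))
    return redirected


def weaken_on_split(mappings, instance):
    """Turn equivalence mappings of the split concept into part-of mappings."""
    return [
        (s, "part-of" if s == instance["source"] and r == "equivalence" else r, t)
        for s, r, t in mappings
    ]


# The catalog can evolve independently of the maintenance process.
CATALOG = {"merge": redirect_on_merge, "split": weaken_on_split}


def mapping_evolution(mappings, cp_instances, catalog=CATALOG):
    """Apply the heuristic matching each CP instance and record the history."""
    history = []
    for instance in cp_instances:
        heuristic = catalog[instance["pattern"]]
        updated = heuristic(mappings, instance)
        history.append({"pattern": instance["pattern"], "before": mappings, "after": updated})
        mappings = updated
    return mappings, history


mappings = [
    ("NCIT_C84382", "equivalence", "ICD9_V25.2"),
    ("NCIT_C62595", "mapped_to", "ICD9_256.31"),  # hypothetical mapping
]
cp_instances = [
    {"pattern": "split", "source": "NCIT_C84382"},
    {"pattern": "merge", "removed": "NCIT_C62595", "kept": "NCIT_C80099"},
]
current, history = mapping_evolution(mappings, cp_instances)
```

Replacing an entry of `CATALOG` changes how a given pattern is handled without touching `mapping_evolution`, which mirrors the design decision of keeping CPs and heuristics separate from the process itself.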

4 Discussion

Even though KOS evolution and mapping methods have been investigated in the literature, and several of these proposals have been applied to the biomedical domain, approaches for dealing with the mappings in the course of evolution are still lacking. There is no solution yet that could benefit from such proposals to maintain mappings according to the evolution aspects of biomedical KOS, which may contribute to decrease the time spent to review and keep mappings valid after KOS evolution.

Due to this lack, after each evolution step and KOS release, the software applications and humans in charge of maintaining the integration between the different KOS need to recalculate and validate all mappings. This strategy places the stress on the development of optimal matching algorithms and efficient validation strategies. Proposals in the literature are still not able to define how evolution may impact the mappings, or how to deal with mapping maintenance in the biomedical domain. Such an investigation must take into account the types of changes, which may differ according to the heterogeneous KOS models, and it requires a better understanding of the influences and impacts of evolution on mappings. Moreover, the current mapping maintenance solutions are not able to deal with the complexity brought by these heterogeneous KOS models. They frequently focus on specific models, mostly database schemas. In addition, these solutions do not consider change patterns based on the KOS models, although such patterns may improve the way mappings are maintained over time.

To address this problem, we proposed an approach that handles mapping maintenance and KOS evolution as an integrated process. The framework supporting this approach is under development to accomplish the ME process and ensure the dynamic reconciliation of biomedical KOS. The idea is to split the entire ME process into several steps. For each one, approaches can be found in the literature and will be thoroughly evaluated in future work in order to identify the ones that best fit the framework. We will investigate their limitations, use them within our approach and propose improvements and adaptations. However, comparing the results of these algorithms is beyond the scope of this paper. We focused on the general description of our approach, on the justification of the necessity of such a solution and on the framework that will support it. The examples provided real illustrations of how the solution may work and of its possible outcomes.

In the proposed approach we assume that a first set of mappings has already been built and validated, and we consider that these mappings will evolve over time based on the ME process. The final outcome provided by the framework (the evolved mappings and the history of the modifications) can be used to allow the retrieval and integration of information annotated with concepts from old versions of a KOS. This can avoid updating every knowledge source that uses the modified KOS as annotation reference, and it can spare end-users from having to store every version of a KOS in their systems in order to perform retrieval activities.

Although this paper presents preliminary research results in suggesting a framework for maintaining mappings, it is important to consider the severity and urgency of the elucidated problem, of gaps in the literature which still need to be bridged, and of the benefits that a solution could have on applications. DyKOSMap is original for the issues we raised, and once first steps toward the problem are overcome, it might provide a way for performing a dynamic reconciliation of biomedical KOS.

5 Conclusion

Biomedical applications need to explore the evolution of the concepts and of the mappings between different KOS over time in order to enable publishing, integrating and retrieving data in an easier way. Dealing with the evolution effects of biomedical KOS is urgent, and mappings are the most affected elements. They deserve research attention for their adequate maintenance after evolution. Furthermore, biomedical vocabulary is frequently implemented in very heterogeneous KOS models, which increases the complexity of the issue. In this paper we presented the DyKOSMap approach for handling the changes and mapping evolution based on KOS complex changes identification. The proposed framework presents special phases based on the use of change patterns and on heuristics in a process designed to make the solution more adaptable. Our aim for future works is to develop a deeper investigation of each of the framework steps. Empirical experiments will be conducted in order to observe and understand the real effects of the evolution in the mappings established. The results from these experiments will lead to the heuristics definition. A tool based on this framework is envisioned, as well as case studies in real biomedical application scenarios.