Elsevier

Information Systems

Volume 34, Issue 8, December 2009, Pages 724-752

Using semantic components to search for domain-specific documents: An evaluation from the system perspective and the user perspective

https://doi.org/10.1016/j.is.2009.04.005

Abstract

We seek to leverage an expert user's knowledge about how information is organized in a domain and how information is presented in typical documents within a particular domain-specific collection, to effectively and efficiently meet the expert's targeted information needs. We have developed the semantic components model to describe important semantic content within documents. The semantic components model for a given collection (based on a general understanding of the type of information needs expected) consists of a set of document classes, where each class has an associated set of semantic components. Each semantic component instance consists of segments of text about a particular aspect of the main topic of the document and may not correspond to structural elements in the document. The semantic components model represents document content in a manner that is complementary to full text and keyword indexing. This paper describes how the semantic components model can be used to improve an information retrieval system. We present experimental evidence from a large interactive searching study that compared the use of semantic components in a system with full text and keyword indexing, where we extended the query language to allow users to search using semantic components, to a base system that did not have semantic components. We evaluate the systems from a system perspective, where semantic components were shown to improve document ranking for precision-oriented searches, and from a user perspective. We also evaluate the systems from a session-based perspective, evaluating not only the results of individual queries but also the results of multiple queries during a single interactive query session.

Introduction

Some information needs of domain experts are highly specific and precision-oriented in that the need can be satisfied by just one, or a few, documents to answer a question or to support a decision. Consider, for example, Kirsten, a hypothetical family practitioner who is seeing a long-standing patient with severe asthma who is newly pregnant. Kirsten needs to decide whether to continue her patient's current medications or to change to a different regimen to protect the fetus. Kirsten does not need documents that describe asthma, or documents that recommend diagnostic tests for asthma. She knows her patient has asthma. She needs a document that describes the safety, or risks, of specific asthma medications in pregnant women, and she needs to find it quickly because she is only allotted 10 minutes for the entire patient visit.

In a classic article, Belkin et al. put forth the anomalous state of knowledge (ASK) hypothesis, suggesting that “an information need arises from a recognized anomaly in the user's state of knowledge concerning some topic or situation and that, in general, the user is unable to specify precisely what is needed to resolve that anomaly” [1]. We posit that the ASK hypothesis does not always apply to domain experts when they search for information. An expert in a domain is likely to have a mental framework for understanding what kinds of information she needs, to understand how entities that are important in the domain usually relate to each other, and to know what kind of information will satisfy a particular information need. Domain experts often have extensive knowledge about how information in the domain is typically organized and expressed within documents. This claim is supported by observations in multiple domains. Dillon [2] showed that experienced researchers have a mental model of typical academic articles. When given pieces of cut up articles, with approximately every other paragraph removed, experimental subjects rapidly assembled the fragments into an order that followed an Introduction–Method–Result–Discussion format. In related work, Bishop [3] described a series of focus groups, interviews, and usability tests investigating how academic researchers use structural components of scientific journal articles (such as figures, tables, references, author lists, methods sections). She studied how researchers use structural components of articles to select which documents to use, how they use structural components to read and comprehend the documents, and how they use structural components to extract, transform and use the information from the documents in their own work. 
During a usability study of an experimental digital library system for forest management, the investigators observed that the forestry professionals exhibited a striking familiarity with the organization of long documents, rapidly homing in on sections of interest [4]. When physicians were tasked with a scenario of familiarizing themselves with the medical record of a patient for whom they were to assume responsibility, the physicians rapidly focused attention on the relevant portions of the relevant documents in the medical record, attending only to information that would influence the scenario [5].

Information retrieval (IR) systems for domain-specific digital libraries typically return documents based on scores related to matching the words in user queries to document representations consisting of words extracted from document text (full text indexing) or keywords, usually assigned from a controlled vocabulary (keyword indexing). Our work seeks to exploit user knowledge by including domain-specific characteristics of document types and content to supplement existing representations of document content. We have previously introduced the semantic components model for representing document content in a way that complements existing indexing techniques [6], [7]. The semantic components model allows users to express queries against domain-specific components of documents as well as against whole documents.
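As an illustration only, the idea of querying against a semantic component in addition to the whole document can be sketched as follows. The class name `Document`, the document classes `drug` and `disease`, the component names, and the term-counting score with a fixed `boost` are all hypothetical choices made for this sketch; they are not the ranking function or schema used in the study.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Document:
    doc_id: str
    doc_class: str          # hypothetical document class, e.g. "drug"
    full_text: str          # what full text indexing would see
    # semantic component name -> text segments about that aspect of the topic
    components: Dict[str, List[str]] = field(default_factory=dict)


def term_count(text: str, terms: List[str]) -> int:
    """Count occurrences of the query terms in whitespace-tokenized text."""
    words = text.lower().split()
    return sum(words.count(t.lower()) for t in terms)


def score(doc: Document, terms: List[str],
          component: Optional[str] = None, boost: float = 2.0) -> float:
    """Toy score: whole-document matches, plus boosted matches that fall
    inside the requested semantic component (if one is given)."""
    base = float(term_count(doc.full_text, terms))
    if component is None:
        return base
    segments = " ".join(doc.components.get(component, []))
    return base + boost * term_count(segments, terms)


def example_docs() -> List[Document]:
    """Two made-up documents, used only to exercise the sketch."""
    return [
        Document("a", "drug",
                 "salbutamol dosage and precautions during pregnancy",
                 {"precautions": ["precautions during pregnancy"]}),
        Document("b", "disease",
                 "asthma diagnosis in pregnancy pregnancy tests",
                 {"diagnosis": ["asthma diagnosis in pregnancy"]}),
    ]


if __name__ == "__main__":
    docs = example_docs()
    for comp in (None, "precautions"):
        ranked = sorted(docs, key=lambda d: score(d, ["pregnancy"], comp),
                        reverse=True)
        print(comp, [d.doc_id for d in ranked])
```

In this toy setting a whole-document query for "pregnancy" favors the document that merely repeats the word, whereas restricting the query to a hypothetical "precautions" component favors the document whose relevant segment discusses it.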

In this paper, we report on the retrieval performance of an implementation of the semantic components model on top of an existing information retrieval system. Our general goal was to answer two questions:

  1. Can semantic components aid precision-oriented searches by improving document ranking from a system perspective?

  2. Does the addition of semantic components to a search system result in a better searching experience from a user perspective?

To do this, we compared two systems, one with semantic components and one without, in an empirical, realistic, interactive searching study.

Evaluations of IR systems are often carried out in the context of existing test collections that include a document collection, a set of queries, and relevance judgments. Evaluating a model with a new form of indexing and a corresponding new query model precludes use of existing test collections. Evaluating retrieval systems in interactive settings adds more realism but also new challenges as compared with traditional laboratory-style IR research. Evaluation paradigms for interactive information retrieval were recently listed as one of the grand challenges for information retrieval research [8]. This research represents a combined approach. We studied domain experts in a controlled setting while they interactively searched for documents relevant to four search scenarios designed to mimic real information needs. We analyzed the search results from multiple perspectives, using traditional methods to evaluate the results returned by an experimental search system and also evaluating results from the perspective of the searchers themselves. We consider not only the results from single queries, but the overall results of search sessions, where a search session consists of all the searches submitted by one searcher for one scenario as part of an interactive session using one search system.

The study involved users and documents from sundhed.dk [9], the national Danish health portal. Intended for both healthcare professionals and citizens in Denmark, sundhed.dk has been operational since 2001 and contained nearly 25,000 documents at the time of our study. sundhed.dk uses a combination of full text indexing and manual keyword indexing with both uncontrolled terms and terms from three health-related controlled vocabularies. Our “control” system (System 1) used the existing full text and keyword indexing over the entire collection of documents copied from sundhed.dk. The “experimental” system (System 2) used the existing indexing plus semantic component indexing over the same collection. The users we studied are family physicians who use the portal as part of their medical practices. In this experiment we studied semantic components in the medical domain, but the semantic components model itself is not limited to any particular domain.

The contributions of this paper are:

  1. A description of a prototype implementation of the semantic components model on top of an existing document collection and operational information retrieval system.

  2. The results of a searching study, in which 30 physician users of an existing digital library each performed four realistic searching tasks, that demonstrate the ability of the semantic components model to enhance precision-oriented retrieval of documents in a domain-specific setting from a system perspective.

  3. The results of the same searching study as evaluated from the user perspective with respect to finding documents that the user judged as satisfying the searching tasks.

  4. A session-based evaluation of the searching study results in addition to the evaluations based on the results of individual queries.

This paper is an expanded version of a conference paper presented at the 2007 Conference on Information and Knowledge Management (CIKM 2007) [10]. In this paper we add evaluations of the study results from the user perspective and from a session-based perspective, and expand the related work section.

The remainder of the paper is organized as follows. We provide an overview of the semantic components model in Section 2. In Section 3 we review areas of related work. In Sections 4 and 5 we describe our experimental methods and results, respectively. We discuss the results in Section 6 and conclude in Section 7.

Section snippets

Semantic components

Documents in domain-specific collections can be classified into document classes by grouping documents that will tend to contain the same kinds of information. Document classes may be based on the type of topic that is the main focus or the main purpose of the document. The appropriate classification scheme depends on the nature of the document collection and the domain. In health-related collections that we have analyzed, topic type is a natural axis for classification. For example, we have

Related work

Document classes are related to the familiar notion of document genre. A number of authors have suggested using document genre to improve information retrieval (such as [12], [13], [14]). Freund et al. analyzed the information tasks of software engineers [15] and demonstrated a correlation between information tasks and document genre [14].

Turner et al. [16] created a model of documents in the public health domain in which genre was one component. They used content analysis and expert users to

Experimental methods

In this study we focused on two research questions. Our first question was Can semantic components aid precision-oriented searches by improving document ranking from a system perspective? We addressed this question by comparing document rankings from two search systems for users’ queries using a reference standard based on the relevance judgments of a family practitioner with expertise in research and clinical care. Our second question was Does the addition of semantic components to a search

Search performance evaluated from a system perspective for single queries

System 2 achieved a higher mean performance for each scenario, and for all scenarios combined, using either MAP or nDCG. Over all scenarios, the improvement was 35.5% as measured by MAP and 25.6% by nDCG. Analysis of variance found no interaction between system and scenario for either MAP or nDCG. The difference between Systems 1 and 2 was statistically significant for both MAP (p<0.02) and nDCG (p<0.01). As expected, given the varying difficulty of the scenarios, the difference among scenarios
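For readers unfamiliar with the two measures named above, the following is a minimal sketch of how average precision (whose mean over queries gives MAP) and nDCG are commonly computed. The function names, the log2 discount, and the choice to normalize against all judged documents are assumptions of this sketch; the excerpt does not state the exact gain values or cutoffs used in the study.

```python
import math
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant doc appears,
    divided by the total number of relevant docs."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: List[List[str]],
                           judgments: List[Set[str]]) -> float:
    """MAP: average of per-query average precision."""
    if not runs:
        return 0.0
    return sum(average_precision(r, j)
               for r, j in zip(runs, judgments)) / len(runs)


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


def ndcg(ranking: List[str], grades: Dict[str, float]) -> float:
    """DCG of the ranking divided by the DCG of an ideal ordering of all
    judged documents (unjudged documents contribute zero gain)."""
    gains = [grades.get(doc_id, 0.0) for doc_id in ranking]
    ideal_dcg = dcg(sorted(grades.values(), reverse=True))
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Both measures reward placing relevant documents near the top of the ranking, which is why they suit precision-oriented searches; nDCG additionally accounts for graded relevance.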

Evaluation perspectives: user versus system

We evaluated our experimental search systems from both a system-oriented perspective and a user-oriented perspective. Both perspectives are important for gauging the potential usefulness of a new approach to indexing and searching such as ours. The system-perspective evaluation measures how well an algorithm ranks documents and is independent of whether a user recognized that a document might be useful and opened it. The system perspective is also independent of variations in how strictly users

Summary and future work

We have described an implementation of semantic components on top of an existing domain-specific digital library and have presented experimental evidence demonstrating that semantic components can enhance document retrieval. Our results are from a realistic interactive searching study, in which 30 domain experts searched on four scenarios. We analyzed the searching results from both a system-oriented perspective and a user-oriented perspective. From the system-oriented perspective, the

Acknowledgments

This work was supported in part by the National Science Foundation, Grant nos. 0514238, 0511050 and 0534762 and by the National Library of Medicine Training Grant 5-T15-LM07088. The physician participants were paid by a grant from the Kvalitetsudviklingsudvalget for Almen Praksis I Aarhus Amt. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the views of the National Science Foundation.

We thank Vibeke Luk, our primary

References (54)

  • Susan L. Price, Lois M. Delcambre, Marianne Lykke Nielsen, Using semantic components to express clinical questions...
  • N.J. Belkin, Some(what) grand challenges for information retrieval, SIGIR Forum (2008)
  • sundhed.dk. URL: http://www.sundhed.dk, accessed: 6 December...
  • Susan L. Price, Marianne Lykke Nielsen, Lois M.L. Delcambre, Peter Vedsted, Semantic components enhance retrieval of...
  • Susan L. Price. Semantic components: a model for enhancing retrieval of domain-specific information, Dissertation in...
  • Kevin Crowston et al., Can document-genre metadata improve information access to large digital collections?, Library Trends (2003)
  • Andreas Rauber, Alexander Müller-Kögler. Integrating automatic genre analysis into digital libraries, in: Proceedings...
  • Luanne Freund, Elaine G. Toms, Charles L.A. Clarke, Modeling task-genre relationships for IR in the workplace, in:...
  • Luanne Freund, Elaine G. Toms, Julie Waterhouse, Modeling the information behaviour of software engineers using a...
  • Anne M. Turner et al., Modeling public health interventions for improved access to the gray literature, Journal of the Medical Library Association (2005)
  • William C. Mann et al., Rhetorical structure theory: description and construction of text structures

  • Simone Teufel et al., Summarizing scientific articles: experiments with relevance and rhetorical status, Computational Linguistics (2002)
  • Elizabeth D. Liddy, Kenneth A. McVearry, Woojin Paik, Edmund Yu, Mary McKenna, Development, implementation and testing...
  • Gretchen P. Purcell et al., Development and evaluation of a context-based document representation for searching the medical literature, International Journal on Digital Libraries (1997)
  • Jack G. Conrad, Daniel P. Dabney, A cognitive approach to judicial opinion structure: applying domain expertise to...
  • Chris D. Paice, Paul A. Jones, The identification of important concepts in highly structured technical papers, in:...
  • Marti A. Hearst, Clustering versus faceted categories for information exploration, Communications of the ACM (2006)
1 Present address: Microsoft Corporation, One Microsoft Way, Redmond, WA 98052, USA.
