Exploring the use of concept spaces to improve medical information retrieval

doi:10.1016/S0167-9236(00)00097-X

Decision Support Systems

Volume 30, Issue 2, 27 December 2000, Pages 171-186

Abstract

This research investigated whether techniques used successfully in previous information retrieval research could be applied to the more challenging area of medical informatics. It was performed on CANCERLIT, a biomedical document collection testbed provided by the National Cancer Institute (NCI) that contains information on all types of cancer therapy. Three thesauri were compared: one based on MeSH terms, one based solely on terms from the document collection, and one based on the Unified Medical Language System (UMLS) Metathesaurus. The quality, or usefulness, of the terms each thesaurus suggested was explored, with the ultimate goal of improving CANCERLIT information search and retrieval.

Researchers affiliated with the University of Arizona Cancer Center evaluated lists of related terms suggested by the different thesauri for 12 directed searches in the CANCERLIT testbed. The preliminary results indicated no statistically significant differences among the thesauri in either term recall or term precision. Surprisingly, there was almost no overlap among the relevant terms suggested by the different thesauri for a given search, which suggests that a combined thesaurus approach could significantly improve recall.
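To make the evaluation measures concrete, the following Python sketch computes term precision and term recall for each thesaurus and for their union. It assumes the usual set-based definitions (precision as the share of suggested terms judged relevant, recall as the share of all relevant terms that were suggested); the abstract does not spell out the exact formulas used in the study. The function names, the three term lists, and the relevance judgments are hypothetical and were constructed to show non-overlapping suggestions; they are not data from the experiment.

```python
def term_precision(suggested, relevant):
    """Share of suggested terms that were judged relevant (assumed definition)."""
    suggested = set(suggested)
    if not suggested:
        return 0.0
    return len(suggested & set(relevant)) / len(suggested)


def term_recall(suggested, relevant):
    """Share of all relevant terms that this source suggested (assumed definition)."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(suggested) & relevant) / len(relevant)


# Hypothetical suggestions for one directed search (illustrative only).
suggestions = {
    "mesh": {"neoplasms", "antineoplastic agents", "drug therapy", "radiotherapy"},
    "concept_space": {"tamoxifen", "breast cancer cells", "clinical trial",
                      "estrogen receptor"},
    "umls": {"mammary carcinoma", "chemotherapy regimen", "tumor marker",
             "hormone therapy"},
}

# Hypothetical pool of terms judged relevant by an evaluator.
relevant = {"antineoplastic agents", "drug therapy", "tamoxifen",
            "estrogen receptor", "mammary carcinoma", "hormone therapy"}

# A combined thesaurus is modeled here as the plain union of all suggestions.
combined = set().union(*suggestions.values())

for name, terms in {**suggestions, "combined": combined}.items():
    print(f"{name:14s} precision={term_precision(terms, relevant):.2f} "
          f"recall={term_recall(terms, relevant):.2f}")
```

In this constructed case each thesaurus contributes a disjoint set of relevant terms, so the union raises recall while precision stays the same. Real term lists would not be this tidy.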

Introduction

Medicine is a dynamic field incorporating numerous specialties, each with its own preferred terminology. This diversity of vocabularies can be an obstacle for medical professionals requiring access to current medical information [16]. While advances in medical database technology have improved information accessibility, retrieval speed, and searching flexibility, they have not resolved the problems of vocabulary differences among biomedical specialties, variations in indexing and classification systems, or variations in information access systems. Medical information retrieval, as a specialized case of information retrieval, is subject to classic information retrieval problems such as “information overload” [3], the “vocabulary problem” (semantic barrier [17]), synonymy, and polysemy. It also has some interesting problems of its own [12], [14], [15].

The community of medical information users is extremely varied in its level of biomedical expertise, its familiarity with various biomedical indexing vocabularies and its information usage requirements. For example, biomedical expertise ranges from patients and families encountering terms for the first time, to specialists in focused research areas who are considered experts. Compounding this problem is the fact that there is no single commonly accepted biomedical indexing vocabulary. This lack of an information standard and the existence of thousands of different medical databases containing information that can be formatted, indexed and stored in a variety of different ways make it difficult, if not impossible, to locate and exchange medical information. Users requiring information from a variety of medical sources may have to learn several different information retrieval systems and several different indexing vocabularies to locate the information they need.

Depending on medical information usage requirements, the goals of indexing vocabularies may conflict. For example, biomedical research information, databases of clinical studies or drug trials, and medical insurance databases all need to have data organized or summarized by categories (generalization). However, primary care professionals dealing with individual patient records require a detailed, precise and expressive vocabulary that can accurately describe patient information [11], [13]. Patient records can be a composite of every potential data format (numeric, free text, tables, graphs, images and audio). Patient record information systems, therefore, require a standard vocabulary that can specialize (the direct opposite of generalization) and accommodate a massive quantity of highly variable and volatile information, thereby increasing medical information system challenges.

We are investigating improving medical information retrieval by building on techniques successfully applied to other information retrieval domains (e.g. Worm/Fly Genome, the Internet, and a large scientific abstract collection). Previous research demonstrated that the creation of automatically generated concept spaces (thesauri) is an efficient, effective technique to improve document precision and recall in directed searches of large information spaces. The Worm/Fly genome research [5] indicated that a combined thesaurus approach could improve recall without sacrificing precision. Currently, we are investigating augmenting automatically generated concept space terms with terms from existing medical thesauri: Medical Subject Headings (MeSH) and the Unified Medical Language System (UMLS) Metathesaurus, both developed and maintained by the National Library of Medicine (NLM).
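As an illustration of what an automatically generated concept space might look like, the sketch below builds a small co-occurrence thesaurus in Python. It is a deliberately simplified stand-in, not the weighting scheme used in the research cited above: the function name `build_concept_space`, the asymmetric weight (co-occurrence count divided by the source term's document frequency), and the toy documents are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def build_concept_space(docs):
    """Map each term to related terms ranked by an asymmetric co-occurrence weight.

    docs is a list of term sets, one per document. The weight from a to b is
    (documents containing both a and b) / (documents containing a).
    This is an assumed, simplified weighting for illustration.
    """
    doc_freq = Counter()
    co_freq = Counter()
    for terms in docs:
        terms = set(terms)
        doc_freq.update(terms)
        # Count each unordered pair once per document.
        for a, b in combinations(sorted(terms), 2):
            co_freq[(a, b)] += 1

    space = {}
    for (a, b), n in co_freq.items():
        space.setdefault(a, []).append((b, n / doc_freq[a]))
        space.setdefault(b, []).append((a, n / doc_freq[b]))

    # Strongest associations first; ties broken alphabetically.
    for term in space:
        space[term].sort(key=lambda pair: (-pair[1], pair[0]))
    return space


# Hypothetical index terms for four documents.
docs = [
    {"tamoxifen", "breast cancer", "estrogen receptor"},
    {"tamoxifen", "breast cancer"},
    {"breast cancer", "mammography"},
    {"tamoxifen", "estrogen receptor"},
]

space = build_concept_space(docs)
for related, weight in space["estrogen receptor"]:
    print(f"estrogen receptor -> {related}: {weight:.2f}")
```

The weight is asymmetric by design: in the toy data every document mentioning "estrogen receptor" also mentions "tamoxifen", but the reverse does not hold, so the two directions receive different weights.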

Literature review

There are four major approaches to textual medical information retrieval: keyword indexing and retrieval (the traditional method), statistically based methods (Salton-based syntactic techniques), relevance feedback (using searcher feedback to improve future searches), and semantic methods (including extensions to Salton's techniques and Natural Language Processing).
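For readers unfamiliar with the statistically based family, the following Python sketch ranks documents against a query using TF-IDF weights and cosine similarity, a common textbook form of the vector space model. It is offered only as a minimal illustration under stated assumptions (raw term counts, a natural-log IDF, and a hypothetical three-document collection); it does not reproduce any system evaluated in the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector per document, plus the IDF table."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / df) for term, df in doc_freq.items()}
    vectors = [{t: count * idf[t] for t, count in Counter(doc).items()}
               for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return indices of documents with nonzero similarity, best match first."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms absent from the collection get zero weight.
    q = {t: count * idf.get(t, 0.0) for t, count in Counter(query).items()}
    scored = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in scored if score > 0.0]


# Hypothetical tokenized abstracts.
docs = [
    ["tamoxifen", "breast", "cancer", "therapy"],
    ["lung", "cancer", "screening"],
    ["tamoxifen", "estrogen", "receptor"],
]

print(rank(["tamoxifen", "therapy"], docs))
```

A keyword system would treat the two matching documents alike on the term "tamoxifen"; the weighted model ranks the first document higher because it also matches the rarer term "therapy".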

CANCERLIT experiment

CANCERLIT contains bibliographic records (predominantly abstracts) from biomedical journals on research related to cancer biology, etiology, screening, prevention, and treatment published between 1963 and today. Approximately 200 core journals account for the majority of the collection. Additional citations come from journals, scientific meeting proceedings, books, dissertations, technical reports, and other publications. The National Cancer Institute (NCI) and NLM share processing costs,

Conclusions and future directions

Different users with different goals approach large information spaces in different ways. We focused on medical researchers and a highly technical, research-based biomedical document collection (CANCERLIT). This type of medical information user is a very technical, extremely focused expert who is intimately familiar with a particular section of the information space. Our subjects were interested in very narrow, directed searches. Due to their busy schedules, they had no interest in browsing or

Acknowledgements

This project was funded primarily by a grant from the NCI, “Information Analysis and Visualization for Cancer Literature” (1996–1997); a grant from the NLM, “Semantic Retrieval for Toxicology and Hazardous Substance Databases” (1996–1997); an NSF/CISE “Intelligent Internet Categorization and Search” project (1995–1998); and the NSF/ARPA/NASA Illinois Digital Library Initiative project, “Building the Interspace” (1994–1998).

We would like to thank the following individuals for their generous

References (21)

J.N. Guidi et al., Information retrieval and genomics: An introduction, Computers in Biology and Medicine (1996)
P. Srinivasan, Optimal document-indexing vocabulary for MEDLINE, Information Processing & Management (1996)
B.T. Bartell et al., Representing documents using an explicit model of their similarities, Journal of the American Society for Information Science (1995)
M.J. Bates, Subject access in online catalogs: A design model, Journal of the American Society for Information Science (1986)
D.C. Blair et al., An evaluation of retrieval effectiveness for a full-text document-retrieval system, Communications of the ACM (1985)
H. Chen et al., Automatic construction of networks of concepts characterizing document databases, IEEE Transactions on Systems, Man and Cybernetics (1992)
H. Chen et al., A concept space approach to addressing the vocabulary problem in scientific information retrieval: An experiment of the Worm Community System, Journal of the American Society for Information Science (1997)
G.C. Chute et al., Latent semantic indexing of medical diagnoses using UMLS semantic structures
J.J. Cimino et al., From ICD9-CM to MeSH using the UMLS: A how-to guide
S. Deerwester et al., Indexing by latent semantic analysis, Journal of the American Society for Information Science (1990)

There are more references available in the full text version of this article.

Dr. Andrea Houston is an Assistant Professor in the Information Systems and Decision Sciences Department at Louisiana State University. She received her PhD from the Department of Management Information Systems at The University of Arizona in 1998, her MBA from the University of New Hampshire (1981) and her BA (1976) from the University of Pennsylvania. She worked in industry as a project leader and systems analyst for over 15 years before returning to academia. Her research interests include information retrieval, medical informatics, digital libraries and electronic publishing, human factors in Human/Computer Interaction, and Natural Language Processing. She is also interested in software team development and organizational memory. She is a member of ACM, IEEE, AIS and ASIS.

Dr. Hsinchun Chen is the McClelland Professor of Management Information Systems at the University of Arizona and head of the UA/MIS Artificial Intelligence Lab. He received his PhD in Information Systems from New York University in 1989. He is the author of more than 70 articles covering semantic retrieval, search algorithms, knowledge discovery, and collaborative computing in leading information technology publications. He serves on the editorial boards of the Journal of the American Society for Information Science and Decision Support Systems. He is an expert in digital library and knowledge management research whose work has been featured in various scientific and information technology publications including Science, Business Week, NCSA Access Magazine, WEBster, and HPCWire.

Susan M. Hubbard, RN, has been the Director of the International Cancer Information Center (ICIC) since 1984 and is an Associate Director of the National Cancer Institute. She received a BS from the Honors College of the University of Connecticut and a Master's in Public Administration from the American University (1993). As the Director of NCI's International Cancer Information Center, she directs the NCI's efforts to identify, implement, and evaluate state-of-the-art communication technologies to maximize the potential impact of advances in cancer research on health care. She directed the development of PDQ, NCI's primary mechanism for communicating state-of-the-art information about cancer. The Department of Health and Human Services (DHHS) designated ICIC as a “reinvention laboratory” in 1994 under Vice President Albert Gore's National Performance Review (NPR). The International Cancer Information Center has also received the NPR “Hammer Award” and the Department of Health and Human Services' Continuous Improvement Program Award. Ms. Hubbard has published extensively in nursing and medical journals and texts.

Bruce R. Schatz is Director of the Community Architecture for Network Information Systems (CANIS) Laboratory at the University of Illinois at Urbana–Champaign. He is the Principal Investigator of the Digital Libraries Initiative project and the DARPA Information Management Program, which performs research in information systems, building analysis environments to support community repositories (Interspace), and in information science, performing large-scale experiments in semantic retrieval for vocabulary switching. Dr. Schatz holds faculty appointments in Library and Information Science, Computer Science, Neuroscience, and Health Information Sciences. He is also a Senior Research Scientist at the National Center for Supercomputing Applications (NCSA), serving as the scientific advisor for digital libraries and information systems. He has served in this role since 1989, including the period during which NCSA developed Mosaic.

Robin Sewell received her Doctor of Veterinary Medicine degree from Washington State University (1986) and her MLA from the University of Arizona (1997). She currently serves as the AI Lab's Program Coordinator and a Research Specialist. Her interests are in Medical Informatics and the National Library of Medicine's Unified Medical Language System.

Dorbin Ng received his PhD in 2000 from the Department of Management Information Systems at the University of Arizona, from which he also received a BS in Business Administration majoring in Management Information Systems and Finance (1990) and an MS in MIS (1993). He is currently a Systems Scientist in the Computer Science Department at Carnegie Mellon University working on the Informedia Digital Video Library Project. His research interests include digital libraries, intelligence and multimedia information retrieval, semantic interoperability for information analysis and knowledge management environments, large-scale knowledge discovery using high-performance supercomputers, search engine and user interface development on the Internet, neural-network computing, and collaborative computing.
