Exploring the use of concept spaces to improve medical information retrieval
Introduction
Medicine is a dynamic field incorporating numerous specialties, each with its own preferred terminology. This diversity of vocabularies can be an obstacle for medical professionals requiring access to current medical information [16]. While advances in medical database technology have improved information accessibility, retrieval speed, and searching flexibility, they have not resolved the problems of vocabulary differences among biomedical specialties, variations in indexing and classification systems, or variations in information access systems. Medical information retrieval, as a specialized case of information retrieval, is subject to classic information retrieval problems such as “information overload” [3], the “vocabulary problem” (semantic barrier [17]), synonymy, and polysemy. It also has some interesting problems of its own [12], [14], [15].
The community of medical information users is extremely varied in its level of biomedical expertise, its familiarity with various biomedical indexing vocabularies and its information usage requirements. For example, biomedical expertise ranges from patients and families encountering terms for the first time, to specialists in focused research areas who are considered experts. Compounding this problem is the fact that there is no single commonly accepted biomedical indexing vocabulary. This lack of an information standard and the existence of thousands of different medical databases containing information that can be formatted, indexed and stored in a variety of different ways make it difficult, if not impossible, to locate and exchange medical information. Users requiring information from a variety of medical sources may have to learn several different information retrieval systems and several different indexing vocabularies to locate the information they need.
Depending on medical information usage requirements, the goals of indexing vocabularies may conflict. For example, biomedical research information, databases of clinical studies or drug trials, and medical insurance databases all need to have data organized or summarized by categories (generalization). However, primary care professionals dealing with individual patient records require a detailed, precise and expressive vocabulary that can accurately describe patient information [11], [13]. Patient records can be a composite of every potential data format (numeric, free text, tables, graphs, images and audio). Patient record information systems, therefore, require a standard vocabulary that can specialize (the direct opposite of generalization) and accommodate a massive quantity of highly variable and volatile information, thereby increasing medical information system challenges.
We are investigating whether medical information retrieval can be improved by building on techniques successfully applied in other information retrieval domains (e.g., the Worm/Fly genome, the Internet, and a large collection of scientific abstracts). Previous research demonstrated that automatically generated concept spaces (thesauri) are an efficient, effective means of improving document precision and recall in directed searches of large information spaces. The Worm/Fly genome research [5] indicated that a combined thesaurus approach could improve recall without sacrificing precision. Currently, we are investigating augmenting automatically generated concept space terms with terms from existing medical thesauri: Medical Subject Headings (MeSH) and the Unified Medical Language System (UMLS) Metathesaurus, both developed and maintained by the National Library of Medicine (NLM).
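To make the idea of a combined thesaurus concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simple asymmetric co-occurrence weight, df(a, b) / df(a), as a stand-in for whatever weighting the concept space technique actually uses, and it merges the automatically derived neighbours with a small hand-written dictionary that plays the role of a manual thesaurus such as MeSH. The function names, the threshold `min_weight`, and the toy documents are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_concept_space(docs, min_weight=0.5):
    """Map each term a to related terms b with weight df(a, b) / df(a).

    The weight is asymmetric: a rare term can point strongly to a common
    term without the reverse being true. This weighting is an assumption
    made for illustration only.
    """
    doc_freq = Counter()
    pair_freq = Counter()
    for terms in docs:
        unique = set(terms)
        doc_freq.update(unique)
        # Count every ordered pair of distinct terms once per document.
        pair_freq.update(permutations(sorted(unique), 2))
    space = defaultdict(dict)
    for (a, b), n in pair_freq.items():
        weight = n / doc_freq[a]
        if weight >= min_weight:
            space[a][b] = weight
    return dict(space)


def expand_query(term, space, manual_thesaurus, limit=5):
    """Suggest co-occurrence neighbours first, then manual thesaurus terms."""
    ranked = sorted(space.get(term, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    suggestions = [t for t, _ in ranked[:limit]]
    for t in manual_thesaurus.get(term, []):
        if t not in suggestions:
            suggestions.append(t)
    return suggestions


# Toy indexed documents (hypothetical), each already reduced to index terms.
DOCS = [
    ["breast cancer", "tamoxifen", "estrogen receptor"],
    ["breast cancer", "tamoxifen"],
    ["breast cancer", "mammography"],
    ["lung cancer", "smoking"],
]
# Hand-made stand-in for entries taken from a manual thesaurus.
MANUAL = {"breast cancer": ["breast neoplasms"]}

space = build_concept_space(DOCS)
print(expand_query("tamoxifen", space, MANUAL))
print(expand_query("breast cancer", space, MANUAL))
```

In this sketch the automatically generated terms reflect the vocabulary actually used in the collection, while the manual entries contribute controlled-vocabulary synonyms that may never co-occur with the searcher's term in any document.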
Literature review
There are four major approaches to textual medical information retrieval: keyword indexing and retrieval (the traditional method), statistically based methods (Salton-based syntactic techniques), relevance feedback (using searcher feedback to improve future searches), and semantic methods (including extensions to Salton's techniques and Natural Language Processing).
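Two of these approaches can be sketched together in a few lines. The example below is an illustration under stated assumptions rather than a description of any system reviewed here: it uses TF-IDF weighting with cosine similarity as a representative statistically based method, and the Rocchio formula (positive feedback only) as one well-known form of relevance feedback. The parameter values `alpha` and `beta`, the function names, and the toy documents are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return ({term: tf * idf} per tokenized document, idf table)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vecs = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vecs, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rank(query_vec, doc_vecs):
    """Document indices ordered by decreasing similarity (ties by index)."""
    scores = [(cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    return [i for _, i in sorted(scores, key=lambda p: (-p[0], p[1]))]


def rocchio(query_vec, relevant_vecs, alpha=1.0, beta=0.75):
    """Move the query toward the centroid of documents judged relevant."""
    new = {t: alpha * w for t, w in query_vec.items()}
    for vec in relevant_vecs:
        for t, w in vec.items():
            new[t] = new.get(t, 0.0) + beta * w / len(relevant_vecs)
    return new


# Toy collection (hypothetical). Document 1 says "neoplasm", not "tumor".
DOCS = [
    ["tumor", "chemotherapy", "response"],
    ["neoplasm", "chemotherapy", "trial"],
    ["tumor", "imaging"],
    ["diet", "exercise"],
]
doc_vecs, idf = tfidf_vectors(DOCS)
query = {"tumor": idf["tumor"]}

before = rank(query, doc_vecs)
# The searcher marks document 0 as relevant; the query absorbs its terms.
revised = rocchio(query, [doc_vecs[0]])
after = rank(revised, doc_vecs)
print(before, after)
```

Before feedback, the document indexed under "neoplasm" has zero similarity to the query "tumor", which is the vocabulary problem in miniature. After feedback it gains a nonzero score through the shared term "chemotherapy", although it still ranks below documents that use the searcher's own term.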
CANCERLIT experiment
CANCERLIT contains bibliographic records (predominantly abstracts) from biomedical journals on research related to cancer biology, etiology, screening, prevention, and treatment published between 1963 and today. Approximately 200 core journals account for the majority of the collection. Additional citations come from journals, scientific meeting proceedings, books, dissertations, technical reports, and other publications. The National Cancer Institute (NCI) and NLM share processing costs,
Conclusions and future directions
Different users with different goals approach large information spaces in different ways. We focused on medical researchers and a highly technical, research-based biomedical document collection (CANCERLIT). This type of medical information user is a very technical, extremely focused expert who is intimately familiar with a particular section of the information space. Our subjects were interested in very narrow, directed searches. Due to their busy schedules, they had no interest in browsing or
Acknowledgements
This project was funded primarily by a grant from the NCI “Information Analysis and Visualization for Cancer Literature” (1996–1997), a grant from the NLM, “Semantic Retrieval for Toxicology and Hazardous Substance Databases” (1996–1997), an NSF/CISE “Intelligent Internet Categorization and Search” project (1995–1998), and the NSF/ARPA/NASA Illinois Digital Library Initiative project, “Building the Interspace” (1994–1998).
We would like to thank the following individuals for their generous
References (21)
- et al. Information retrieval and genomics: An introduction. Computers in Biology and Medicine (1996)
- Optimal document-indexing vocabulary for MEDLINE. Information Processing & Management (1996)
- et al. Representing documents using an explicit model of their similarities. Journal of the American Society for Information Science (1995)
- Subject access in online catalogs: A design model. Journal of the American Society for Information Science (1986)
- et al. An evaluation of retrieval effectiveness for a full-text document-retrieval system. Communications of the ACM (1985)
- et al. Automatic construction of networks of concepts characterizing document databases. IEEE Transactions on Systems, Man and Cybernetics (1992)
- et al. A concept space approach to addressing the vocabulary problem in scientific information retrieval: An experiment on the Worm Community System. Journal of the American Society for Information Science (1997)
- et al. Latent semantic indexing of medical diagnoses using UMLS semantic structures
- et al. From ICD9-CM to MeSH using the UMLS: A how-to guide
- et al. Indexing by latent semantic analysis. Journal of the American Society for Information Science (1990)
Dr. Andrea Houston is an Assistant Professor in the Information Systems and Decision Sciences Department at Louisiana State University. She received her PhD from the Department of Management Information Systems at The University of Arizona in 1998, her MBA from the University of New Hampshire (1981) and her BA (1976) from the University of Pennsylvania. She worked in industry as a project leader and systems analyst for over 15 years before returning to academics. Her research interests include looking at information retrieval issues, medical informatics, digital libraries and electronic publishing, human factors in Human/Computer Interaction, and Natural Language Processing. She is also interested in software team development and organizational memory. She is a member of ACM, IEEE, AIS and ASIS.
Dr. Hsinchun Chen is the McClelland Professor of Management Information Systems at the University of Arizona and head of the UA/MIS Artificial Intelligence Lab. He received the PhD degree in Information Systems from New York University in 1989. He is the author of more than 70 articles covering semantic retrieval, search algorithms, knowledge discovery, and collaborative computing in leading information technology publications. He serves on the editorial boards of the Journal of the American Society for Information Science and Decision Support Systems. He is an expert in digital library and knowledge management research whose work has been featured in various scientific and information technology publications including Science, Business Week, NCSA Access Magazine, WEBster, and HPCWire.
Susan M. Hubbard, RN, is the Director of the International Cancer Information Center (ICIC) (since 1984) and an Associate Director of the National Cancer Institute. She received a BS from the Honors College of the University of Connecticut and a Master's in Public Administration from the American University (1993). As the Director of ICIC, she directs the NCI's efforts to identify, implement, and evaluate state-of-the-art communication technologies to maximize the potential impact of advances in cancer research on health care. She directed the development of PDQ, NCI's primary mechanism for communicating state-of-the-art information about cancer. The Department of Health and Human Services (DHHS) designated ICIC as a “reinvention laboratory” in 1994 under Vice President Albert Gore's National Performance Review (NPR). ICIC has also received the NPR “Hammer Award” and the Department of Health and Human Services' Continuous Improvement Program Award. Ms. Hubbard has published extensively in nursing and medical journals and texts.
Bruce R. Schatz is Director of the Community Architecture for Network Information Systems (CANIS) Laboratory at the University of Illinois at Urbana–Champaign. He is the Principal Investigator of the Digital Libraries Initiative project and the DARPA Information Management Program, which performs research in information systems building analysis environments to support community repositories (Interspace), and in information science performing large-scale experiments in semantic retrieval for vocabulary switching. Dr. Schatz holds faculty appointments in Library and Information Science, Computer Science, Neuroscience, and Health Information Sciences. He is also a Senior Research Scientist at the National Center for Supercomputing Applications (NCSA), serving as the scientific advisor for digital libraries and information systems. He has served in this role since 1989, including the period during which NCSA developed Mosaic.
Robin Sewell received her Doctor of Veterinary Medicine degree from Washington State University (1986) and her MLA from the University of Arizona (1997). She currently serves as the AI Lab's Program Coordinator and a Research Specialist. Her interests are in Medical Informatics and the National Library of Medicine's Unified Medical Language System.
Dorbin Ng received his PhD in 2000 from the Department of Management Information Systems at the University of Arizona, from which he also received a BS in Business Administration majoring in Management Information Systems and Finance (1990) and an MS in MIS (1993). He is currently a Systems Scientist in the Computer Science Department at Carnegie Mellon University working on the Informedia Digital Video Library Project. His research interests include digital libraries, intelligence and multimedia information retrieval, semantic interoperability for information analysis and knowledge management environments, large-scale knowledge discovery using high-performance supercomputers, search engine and user interface development on the Internet, neural-network computing, and collaborative computing.