Challenging conventional assumptions of automated information retrieval with real users: Boolean searching and batch retrieval evaluations
Introduction
A common assumption held by many researchers in the information retrieval (IR) field is that “natural language” searching (i.e., the entry of search terms without Boolean operators and the relevance ranking of results) is superior to searching using Boolean operators (Salton, 1991). Some research to support this notion comes from “batch” retrieval evaluations, in which a test collection of fixed queries, documents, and relevance judgments is used in the absence of real searchers to determine the efficacy of one retrieval system versus another. It has also been advocated that this approach to evaluation can be generalized to real-world searching (Salton & Buckley, 1988).
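As a toy contrast (our illustration, not from any of the cited systems): Boolean searching returns the unranked set of documents satisfying the operators, while natural language searching accepts a bag of query terms and ranks every document by how well it matches.

```python
# Three toy documents represented as term sets.
docs = {
    1: {"hubble", "telescope", "discovery", "galaxy"},
    2: {"hubble", "budget", "congress"},
    3: {"telescope", "ground", "observatory"},
}

# Boolean searching: explicit operators, unranked set of matches.
boolean_hits = {d for d, terms in docs.items()
                if "hubble" in terms and "telescope" in terms}   # {1}

# Natural language searching: no operators; rank all documents by term overlap.
query = {"hubble", "telescope", "discovery"}
ranked = sorted(docs, key=lambda d: len(docs[d] & query), reverse=True)
# Document 1 ranks first; documents 2 and 3 tie with one matching term each.
```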
Previous research comparing Boolean and natural language systems has yielded conflicting results. The first study to compare Boolean and natural language searching with real searchers was the CIRT study, which found roughly comparable performance between the two when utilized by search intermediaries (Robertson & Thompson, 1990). Turtle found, however, that expert searchers using a large legal database obtained better results with natural language searching (Turtle, 1994). We have performed several studies of medical end-user searching comparing Boolean and natural language approaches. Whether using recall-precision metrics in bibliographic (Hersh, Buckley, Leone, & Hickam, 1994) or full-text databases (Hersh & Hickam, 1995), or using task-completion studies in bibliographic (Hersh, Pentecost, & Hickam, 1996) or full-text databases (Hersh et al., 1995), the results have been comparable for both types of systems.
Likewise, there is also debate as to whether the results obtained by batch evaluations, consisting of measuring recall and precision in the non-interactive laboratory setting, can be generalized to real searchers. Much evaluation research dating back to the Cranfield studies (Cleverdon & Keen, 1966) and continuing through the Text Retrieval Conference (TREC) (Harman, 1993) has been based on entering fixed query statements from a test collection into an IR system in batch mode with measurement of recall and precision of the output. It is assumed that this is an effective and realistic approach to determining the system's performance (Sparck Jones, 1981). Some have argued against this view, maintaining that the real world of searching is more complex than can be captured with such studies. These authors point out that relevance is not a fixed notion (Meadow, 1985), interaction is the key element of successful retrieval system use (Swanson, 1977), and relevance-based measures do not capture the complete picture of user performance (Hersh, 1994). If batch searching results cannot be generalized, then system design decisions based on them are potentially misleading.
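The batch methodology described above reduces to a simple computation once the retrieved set and the relevance judgments are fixed; a minimal sketch (our illustration, with made-up document IDs):

```python
# Batch-style evaluation: a fixed query's output scored against fixed
# relevance judgments, with no searcher in the loop.
def recall_precision(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

# 2 of the 5 relevant documents appear in a result set of 4.
r, p = recall_precision(retrieved=[1, 2, 3, 4], relevant=[2, 4, 7, 9, 11])
# r == 0.4 (recall), p == 0.5 (precision)
```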
We used the TREC interactive track to test the validity of these assumptions. The TREC-7 and TREC-8 interactive tracks used the task of instance recall to measure the success of searching. Instance recall is the proportion of the distinct instances of a topic that the searcher correctly identifies (Hersh & Over, 2000). For example, a searcher might be asked to identify all the discoveries made by the Hubble telescope; each discovery is an instance, and the proportion of discoveries correctly listed is the instance recall. This is in contrast to document recall, which is measured by the proportion of relevant documents retrieved. Instance recall is a more pertinent measure of user success at an IR task, since users are less likely to want to retrieve multiple documents covering the same instances. This paper reviews the results of our experiments in the TREC-7 and TREC-8 interactive tracks, where we assessed: (a) Boolean versus natural language searching (Hersh et al., 1998) and (b) batch versus actual searching evaluation results, respectively (Hersh et al., 1999).
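The instance recall measure can be made concrete with a small sketch (our illustration; the instance names are invented, not from the TREC topics):

```python
# Instance recall: distinct instances found / distinct instances in the
# assessors' answer key. Unlike document recall, covering the same
# instance in several documents earns no extra credit.
def instance_recall(instances_found, answer_key):
    return len(set(instances_found) & set(answer_key)) / len(set(answer_key))

answer_key = {"dark matter map", "galaxy age", "black hole",
              "comet impact", "nebula"}
# The searcher's saved documents cover three distinct discoveries,
# one of them twice.
found = ["galaxy age", "black hole", "galaxy age", "comet impact"]
# instance_recall(found, answer_key) == 0.6
```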
Commonalities across studies
There were a number of methods common to both experiments, which we present in this section. Both studies used instance recall as the outcome (or dependent) variable. Each observation consisted of a searcher, who belonged to a group (librarian type in the TREC-7 experiment; librarian vs. graduate student in the TREC-8 experiment), and a measurement of instance recall for each question (a total of eight questions in the TREC-7 experiment and six in the TREC-8 experiment).
All other data collected were
Comparing Boolean versus natural language searching
The main goal of our TREC-7 interactive experiment was to compare searching performance with Boolean and natural language interfaces in a specific population of searchers, namely experienced information professionals. A secondary goal of the experiment was to identify attributes associated with successful searching in this population.
Assessing the validity of batch-oriented retrieval evaluations
The goal of our TREC-8 experiment was to assess whether IR approaches achieving better performance in batch evaluations could translate that effectiveness to real users. This was done in a three-stage experiment. In the first stage we identified an “improved” weighting measure that achieved the best results over “baseline” TF*IDF with previous (TREC-6 and TREC-7) interactive track queries and relevance judgments. Next, we used the TREC-8 instance recall task to compare searchers using the
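The baseline weighting referred to above can be sketched generically (our sketch; the actual “baseline” and “improved” TREC formulations differ in their exact normalization details):

```python
import math

# A generic TF*IDF document score: sum over query terms of
# term frequency * log(N / document frequency).
def tfidf_score(query_terms, doc_terms, doc_freq, n_docs):
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)       # term frequency in the document
        df = doc_freq.get(term, 0)       # number of documents containing the term
        if tf and df:
            score += tf * math.log(n_docs / df)
    return score

doc_freq = {"hubble": 10, "telescope": 50, "the": 1000}
doc = ["hubble", "telescope", "hubble", "the"]
score = tfidf_score(["hubble", "telescope"], doc, doc_freq, n_docs=1000)
# "hubble" is rarer (lower df) than "telescope", so it contributes more weight.
```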
Conclusions
The results of these experiments challenge two assumptions widely held in the IR field: that natural language systems are superior to Boolean systems, and that the results of batch searching experiments generalize to evaluations with real users. Our experiments showed that for experienced users searching TREC-style queries on TREC databases, Boolean and natural language searching yield comparable instance recall. Furthermore, results obtained from
Acknowledgements
This study was funded in part by Grant LM-06311 from the US National Library of Medicine.
References
- Allen, B. (1992). Cognitive differences in end-user searching of a CD-ROM index. In Proceedings of the 15th annual...
- Chin, J., Diehl, V., & Norman, K. (1988). Development of an instrument measuring user satisfaction of the...
- Cleverdon, C., & Keen, E. (1966). Aslib Cranfield research project: Factors determining the performance of indexing...
- Dumais, S., & Schmitt, D. (1991). Iterative searching in an online database. In Proceedings of the human factors...
- et al. (1986). Learning to use a text editor: some learner characteristics that predict success. Human–Computer Interaction.
- Harman, D. (1993). Overview of the first Text Retrieval Conference. In Proceedings of the 16th annual international ACM...
- Hersh, W. (1994). Relevance and retrieval evaluation: perspectives from medicine. Journal of the American Society for Information Science.
- Hersh, W., Buckley, C., Leone, T., & Hickam, D. (1994). OHSUMED: an interactive retrieval evaluation and new large test...
- Hersh, W., Elliot, D., Hickam, D., Wolf, S., Molnar, A., & Leichtenstein, C. (1995). Towards new measures of...
- et al. (1995). An evaluation of interactive Boolean and natural language searching with an on-line medical textbook. Journal of the American Society for Information Science.