Challenging conventional assumptions of automated information retrieval with real users: Boolean searching and batch retrieval evaluations
Introduction
A common assumption held by many researchers in the information retrieval (IR) field is that “natural language” searching (i.e., the entry of search terms without Boolean operators and the relevance ranking of results) is superior to searching using Boolean operators (Salton, 1991). Some research to support this notion comes from “batch” retrieval evaluations, in which a test collection of fixed queries, documents, and relevance judgments is used in the absence of real searchers to determine the efficacy of one retrieval system versus another. It has also been advocated that this approach to evaluation can be generalized to real-world searching (Salton & Buckley, 1988).
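As a toy contrast (our illustration, not from any of the cited systems): Boolean searching returns the unranked set of documents satisfying the operators, while natural language searching accepts a bag of query terms and ranks every document by how well it matches.

```python
# Three toy documents represented as term sets.
docs = {
    1: {"hubble", "telescope", "discovery", "galaxy"},
    2: {"hubble", "budget", "congress"},
    3: {"telescope", "ground", "observatory"},
}

# Boolean searching: explicit operators, unranked set of matches.
boolean_hits = {d for d, terms in docs.items()
                if "hubble" in terms and "telescope" in terms}   # {1}

# Natural language searching: no operators; rank all documents by term overlap.
query = {"hubble", "telescope", "discovery"}
ranked = sorted(docs, key=lambda d: len(docs[d] & query), reverse=True)
# Document 1 ranks first; documents 2 and 3 tie with one matching term each.
```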
Previous research comparing Boolean and natural language systems has yielded conflicting results. The first study to compare Boolean and natural language searching with real searchers was the CIRT study, which found roughly comparable performance between the two when utilized by search intermediaries (Robertson & Thompson, 1990). Turtle found, however, that expert searchers using a large legal database obtained better results with natural language searching (Turtle, 1994). We have performed several studies of medical end-user searching comparing Boolean and natural language approaches. Whether using recall-precision metrics in bibliographic (Hersh, Buckley, Leone, & Hickam, 1994) or full-text databases (Hersh & Hickam, 1995), or using task-completion studies in bibliographic (Hersh, Pentecost, & Hickam, 1996) or full-text databases (Hersh et al., 1995), the results have been comparable for both types of systems.
Likewise, there is also debate as to whether the results obtained by batch evaluations, consisting of measuring recall and precision in the non-interactive laboratory setting, can be generalized to real searchers. Much evaluation research dating back to the Cranfield studies (Cleverdon & Keen, 1966) and continuing through the Text Retrieval Conference (TREC) (Harman, 1993) has been based on entering fixed query statements from a test collection into an IR system in batch mode with measurement of recall and precision of the output. It is assumed that this is an effective and realistic approach to determining the system's performance (Sparck Jones, 1981). Some have argued against this view, maintaining that the real world of searching is more complex than can be captured with such studies. These authors point out that relevance is not a fixed notion (Meadow, 1985), interaction is the key element of successful retrieval system use (Swanson, 1977), and relevance-based measures do not capture the complete picture of user performance (Hersh, 1994). If batch searching results cannot be generalized, then system design decisions based on them are potentially misleading.
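The batch methodology described above reduces to a simple computation once the retrieved set and the relevance judgments are fixed; a minimal sketch (our illustration, with made-up document IDs):

```python
# Batch-style evaluation: a fixed query's output scored against fixed
# relevance judgments, with no searcher in the loop.
def recall_precision(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

# 2 of the 5 relevant documents appear in a result set of 4.
r, p = recall_precision(retrieved=[1, 2, 3, 4], relevant=[2, 4, 7, 9, 11])
# r == 0.4 (recall), p == 0.5 (precision)
```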
We used the TREC interactive track to test the validity of these assumptions. The TREC-7 and TREC-8 interactive tracks used the task of instance recall to measure the success of searching. Instance recall is the proportion of the distinct instances of a topic that the searcher correctly identifies (Hersh & Over, 2000). For example, a searcher might be asked to identify all the discoveries made by the Hubble telescope; each discovery is an instance, and the proportion of discoveries correctly listed is the instance recall. This is in contrast to document recall, which is measured by the proportion of relevant documents retrieved. Instance recall is a more pertinent measure of user success at an IR task, since users are less likely to want to retrieve multiple documents covering the same instances. This paper reviews the results of our experiments in the TREC-7 and TREC-8 interactive tracks, where we assessed: (a) Boolean versus natural language searching (Hersh et al., 1998) and (b) batch versus actual searching evaluation results, respectively (Hersh et al., 1999).
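The instance recall measure can be made concrete with a small sketch (our illustration; the instance names are invented, not from the TREC topics):

```python
# Instance recall: distinct instances found / distinct instances in the
# assessors' answer key. Unlike document recall, covering the same
# instance in several documents earns no extra credit.
def instance_recall(instances_found, answer_key):
    return len(set(instances_found) & set(answer_key)) / len(set(answer_key))

answer_key = {"dark matter map", "galaxy age", "black hole",
              "comet impact", "nebula"}
# The searcher's saved documents cover three distinct discoveries,
# one of them twice.
found = ["galaxy age", "black hole", "galaxy age", "comet impact"]
# instance_recall(found, answer_key) == 0.6
```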
Commonalities across studies
There were a number of methods common to both experiments, which we present in this section. Both studies used instance recall as the outcome (or dependent) variable. Each observation consisted of a searcher, who belonged to a group (librarian type in the TREC-7 experiment; librarian vs. graduate student in the TREC-8 experiment), and a measurement of instance recall for each question (a total of eight questions in the TREC-7 experiment and six in the TREC-8 experiment).
All other data collected were
Comparing Boolean versus natural language searching
The main goal of our TREC-7 interactive experiment was to compare searching performance with Boolean and natural language interfaces in a specific population of searchers, namely experienced information professionals. A secondary goal of the experiment was to identify attributes associated with successful searching in this population.
Assessing the validity of batch-oriented retrieval evaluations
The goal of our TREC-8 experiment was to assess whether IR approaches achieving better performance in batch evaluations could translate that effectiveness to real users. This was done in a three-stage experiment. In the first stage we identified an “improved” weighting measure that achieved the best results over “baseline” TF*IDF with previous (TREC-6 and TREC-7) interactive track queries and relevance judgments. Next, we used the TREC-8 instance recall task to compare searchers using the
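The baseline weighting referred to above can be sketched generically (our sketch; the actual “baseline” and “improved” TREC formulations differ in their exact normalization details):

```python
import math

# A generic TF*IDF document score: sum over query terms of
# term frequency * log(N / document frequency).
def tfidf_score(query_terms, doc_terms, doc_freq, n_docs):
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)       # term frequency in the document
        df = doc_freq.get(term, 0)       # number of documents containing the term
        if tf and df:
            score += tf * math.log(n_docs / df)
    return score

doc_freq = {"hubble": 10, "telescope": 50, "the": 1000}
doc = ["hubble", "telescope", "hubble", "the"]
score = tfidf_score(["hubble", "telescope"], doc, doc_freq, n_docs=1000)
# "hubble" is rarer (lower df) than "telescope", so it contributes more weight.
```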
Conclusions
The results of these experiments challenge two assumptions widely held in the IR field: that natural language systems are superior to Boolean systems, and that the results of batch searching experiments generalize to evaluations with real users. Our experiments showed that for experienced users searching TREC-style queries on TREC databases, Boolean and natural language searching yield comparable instance recall. Furthermore, results obtained from
Acknowledgements
This study was funded in part by Grant LM-06311 from the US National Library of Medicine.
References
- Allen, B. (1992). Cognitive differences in end-user searching of a CD-ROM index. In Proceedings of the 15th annual...
- Chin, J., Diehl, V., & Norman, K. (1988). Development of an instrument measuring user satisfaction of the...
- Cleverdon, C., & Keen, E. (1966). Aslib Cranfield research project: Factors determining the performance of indexing...
- Dumais, S., & Schmitt, D. (1991). Iterative searching in an online database. In Proceedings of the human factors...
- et al. (1986). Learning to use a text editor: some learner characteristics that predict success. Human–Computer Interaction.
- Harman, D. (1993). Overview of the first Text Retrieval Conference. In Proceedings of the 16th annual international ACM...
- Hersh, W. (1994). Relevance and retrieval evaluation: perspectives from medicine. Journal of the American Society for Information Science.
- Hersh, W., Buckley, C., Leone, T., & Hickam, D. (1994). OHSUMED: an interactive retrieval evaluation and new large test...
- Hersh, W., Elliot, D., Hickam, D., Wolf, S., Molnar, A., & Leichtenstein, C. (1995). Towards new measures of...
- et al. (1995). An evaluation of interactive Boolean and natural language searching with an on-line medical textbook. Journal of the American Society for Information Science.