The challenge of commercial document retrieval, Part I: Major issues, and a framework based on search exhaustivity, determinacy of representation and document collection size

https://doi.org/10.1016/S0306-4573(01)00024-3

Abstract

With the growing focus on what is collectively known as “knowledge management”, a shift continues to take place in commercial information system development: a shift away from the well-understood data retrieval/database model, to the more complex and challenging development of commercial document/information retrieval models. While document retrieval has had a long and rich legacy of research, its impact on commercial applications has been modest. At the enterprise level most large organizations have little understanding of, or commitment to, high quality document access and management. Part of the reason for this is that we still do not have a good framework for understanding the major factors which affect the performance of large-scale corporate document retrieval systems. The thesis of this discussion is that document retrieval—specifically, access to intellectual content—is a complex process which is most strongly influenced by three factors: the size of the document collection; the type of search (exhaustive, existence or sample); and, the determinacy of document representation. Collectively, these factors can be used to provide a useful framework for, or taxonomy of, document retrieval, and highlight some of the fundamental issues facing the design and development of commercial document retrieval systems. This is the first of a series of three articles. Part II (D.C. Blair, The challenge of commercial document retrieval. Part II. A strategy for document searching based on identifiable document partitions, Information Processing and Management, 2001b, this issue) will discuss the implications of this framework for search strategy, and Part III (D.C. Blair, Some thoughts on the reported results of Text REtrieval Conference (TREC), Information Processing and Management, 2002, forthcoming) will consider the importance of the TREC results for our understanding of operating information retrieval systems.

Introduction

A radical shift began to take place in the 1990s in the focus of commercial information processing software development: a shift away from the better-understood data retrieval/database model, and to the more complex and challenging development of commercial document retrieval models. This shift in focus came none too soon: management theorist Peter Drucker has stated that we have entered the third major evolution in the “… concept and structure of organizations… the shift from the command-and-control organization, the organization of departments and divisions, to the information-based organization, the organization of knowledge specialists.” (Drucker, 1988). Much of this organizational information, perhaps most of it, takes the form of documents (e.g., reports, messages, letters, journal and magazine articles, memos, minutes of meetings, research bulletins, etc.), and it is easy to see why documents are important: they are often the organizing and interpretive medium that gives data, figures and other information meaning within an organizational context. In short, documents are the medium where organizational memory or intelligence resides. Poor access to document content means poor access to the knowledge that an organization creates or acquires.

Interleaf has estimated that more than 1 billion documents are being created each day in North America, and that executives spend 40% of their time dealing directly with documents. The Gartner Group has estimated that as much as 90% of a corporation's information is contained in its documents. This brings a new urgency to the task of providing access to this enormous, and growing, volume of document-based information: lawsuits are lost or settled out of court because supporting documents cannot be found; internal studies and analyses are redone because the documented results of the original work cannot be located; managers make decisions that are sub-optimal or even incorrect because they are not aware of important relevant information that exists in other parts of the organization or is available from commercial document databases; scarce grant money is allocated to research that has already been done and published because neither the authors of the grant nor the referees are aware that similar work has already been completed. The consequences of poor document retrieval can be striking:

The effects of a failure of “document control” can be dramatic. A major utility company was required to shut down four nuclear reactors because of lost repair instructions (at a loss of $2m per day). The US Department of Defense estimates that half of all military accidents result from missing or inaccurate technical information. A major airline was fined $10k per take-off because of out-of-date maintenance information. A major drug company lost its entire R&D investment owing to inability to provide timely documentation (Fleischer, 1990).

Clearly document, or text, retrieval has taken a prominent place in commercial information system development. But are we ready for this shift? Do we have the software tools to build large-scale commercial document retrieval systems, and, more importantly, do we have a good conceptual understanding of what factors influence document retrieval in the corporate context? A clear “yes” cannot be given. Certainly, there has been a rich history of theoretical work in document retrieval (Salton & McGill, 1983; van Rijsbergen, 1979) and a history of small-system tests. But tests of actual large-scale commercial applications of this theory are still rare, and the market-place has been saturated with systems whose theoretical antecedents date back to the 1950s. [The periodic Text REtrieval Conferences (TRECs) have attempted to evaluate the retrieval effectiveness of comparatively large IR systems (though not in operational settings). But their conclusions must be taken with some caution. See Blair (2002, forthcoming).]


Data retrieval versus document retrieval

Why can we not build commercial document retrieval systems based on the better-understood data retrieval model? Clearly, Data Base Management Systems (DBMSs) represent a relatively well-understood framework that should give us a firm foundation on which to build new commercial document retrieval systems. In addition, DBMSs have a wide selection of support software and utilities designed to facilitate database management (e.g., telecommunications interfaces, data loading programs, concurrency

Document retrieval and the problem of scale

This scaling problem is central to the problem of document retrieval, so it is important that we examine some of the factors that influence it more closely. Document retrieval is critically dependent on how the documents are represented on a particular system, and this system of representation comprises a kind of “language” in which document content or context can be described and searchers' requests can be expressed. Consequently, the properties of this “document language” can influence the

Zipf and the trial-and-error nature of document retrieval

One might ask, reasonably, why the mere increase in the number of documents that are represented by the word “computer” increases the indeterminacy of retrieving documents “about computers”. Is it not possible that the increase in the number of documents represented by “computer” simply increases the number of documents that are useful to the searcher? This does not necessarily happen. To understand why this is the case, we need to look again at Zipf's model of language. We had mentioned
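The statistical regularity in Zipf's model is usually stated as a rank-frequency relationship: when the words of a large text are ranked by how often they occur, frequency falls roughly in inverse proportion to rank, so rank × frequency stays roughly constant. The following is a minimal sketch of how such a table can be computed. The function name `rank_frequency`, the simple tokenizer, and the toy sample sentence are assumptions made for illustration and do not come from the article.

```python
from collections import Counter
import re


def rank_frequency(text):
    """Return (rank, word, frequency) tuples, most frequent word first."""
    # Crude tokenizer (an assumption): lower-case runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # most_common() orders by descending frequency; ties keep first-seen order.
    return [
        (rank, word, freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]


# Toy text, far too small to show the regularity itself.
sample = "the computer and the network and the computer and the printer"
table = rank_frequency(sample)
for rank, word, freq in table:
    # Under Zipf's model, rank * freq is roughly constant in a large corpus.
    print(rank, word, freq, rank * freq)
```

A sample this small only demonstrates the mechanics. The regularity is a property of large bodies of text, so the product in the last column should not be expected to level off here.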

Vocabulary balance and the competing forces of language

Zipf postulated that the statistical regularities which he noticed in language were the result of the competition of two forces in language: unification and diversification (Cherry calls these forces “Personal” and “Social” in his informed application of Zipf's work to communication theory (Cherry, 1971)). Basically, language needs two kinds of words to work efficiently: general words which can be used in a variety of contexts (and have a variety of related meanings), and specific words with

Document description and the small-system effect

Ironically, biasing the document representation vocabulary towards description tends to make small document retrieval systems of several hundred documents work better. Since many of the older tests of document retrieval effectiveness were done on such small systems, the results tended to confirm the effectiveness of “over-description”. A prominent study of document retrieval language used on a small system came precisely to this conclusion:

Many, many alternative access words are needed for

What do we know about large-scale commercial document retrieval systems?

Since document retrieval systems, when used to access intellectual content, do not scale up as easily as data retrieval systems, it becomes imperative to treat large-scale systems as fundamentally different from small-scale systems. That is, large-scale document retrieval systems may require not only a different design model, but may also require a substantially different theoretical foundation than small systems (Blair, 1990). It also means that usually we cannot infer the reliability of

Search exhaustivity and database size

We have already discussed how database size can significantly affect the retrieval of intellectual content. Search exhaustivity has a similarly important influence. An exhaustive search on a document retrieval system is one in which the inquirer needs to see all, or nearly all, of the documents which are useful to him. This can be contrasted with what might be called a sample search, where the inquirer does not need all of the useful documents on the system. For example, a lawyer preparing to

The determinacy of representation: content and context

The third and final component in this framework for document retrieval is based on how precisely the documents can be represented on any given system; this is the system of representation through which we can provide access to the intellectual content of a given set of documents by including them in specific logical or intellectual categories. In more formal terms, it is the system of representation that, for any search, can provide an ordering of the documents in the database from those most

A framework for document/text retrieval

Fig. 2 gives the basic framework for document retrieval broken down by database size, search type (exhaustivity/sample), and the level of representational determinacy (content/context). This framework breaks down document retrieval into eight classes. An example of the type of retrieval for each class would be:

  1. (Large DB: exhaustive: content) Corporate Litigation Support. (“Did the defendants write anything objecting to the contract changes?”)

    • Research and Development. (“What work has been done
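The eight classes follow from combining the three two-valued dimensions named above (2 × 2 × 2 = 8). A minimal sketch of that enumeration is given here. The value labels are taken from the text, but the constant names, the function name `retrieval_classes`, and the order in which the combinations are produced are assumptions, so apart from the first class the order need not match the numbering used in Fig. 2.

```python
from itertools import product

# Dimension values as named in the text. The enumeration order is an
# assumption and may differ from the class numbering in the article's Fig. 2.
DB_SIZES = ("large", "small")
SEARCH_TYPES = ("exhaustive", "sample")
DETERMINACY = ("content", "context")


def retrieval_classes():
    """Return every (database size, search type, determinacy) combination."""
    return list(product(DB_SIZES, SEARCH_TYPES, DETERMINACY))


classes = retrieval_classes()
for size, search, representation in classes:
    # Printed in the same "(Large DB: exhaustive: content)" form used in the list.
    print(f"({size.capitalize()} DB: {search}: {representation})")
```

Treating each class as a plain tuple keeps the three dimensions independent, which makes it straightforward to group or compare classes along any one of them.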

Degrees of difficulty

What can we say comparatively about these eight classes of document searching? In general:

A. Exhaustive content searches are more difficult than sample searches
B. Searches for precise intellectual content on large document collections are usually more difficult than searches for the same material on small document collections, ceteris paribus
C. Content searches based on descriptions of low determinacy are usually less precise than those based on descriptions of high determinacy (e.g., context),

Conclusion

The thesis of the first part of this two-part article has been that document retrieval is a complex process which is strongly influenced by at least three major factors: the size of the document collection; the type of search (exhaustive, existence or sample); and, the determinacy of document representation. Collectively, these factors can be used to provide a useful framework or taxonomy of the major kinds of document searches. Such a framework helps to highlight the fundamental issues facing

Acknowledgements

The author wishes to thank M.E. (Bill) Maron of the University of California, Berkeley, Don Swanson of the University of Chicago, Scott Serich of George Washington University, Bruce Hill of the University of Michigan and Steven Kimbrough of the University of Pennsylvania, for their comments on earlier versions of this paper, and their discussions of the issues raised in this article.

References (33)

  • Blair, D. C. (2001). The challenge of commercial document retrieval, Part II: A strategy for document searching based on identifiable document partitions. Information Processing and Management.
  • Blair, D. C. (2002). Some thoughts on the reported results of TREC. Information Processing and Management...
  • Blair, D. C., et al. (1985). An evaluation of retrieval effectiveness for a full-text document retrieval system. Communications of the ACM.
  • Brookes, B. C. (1968). The derivation and application of the Bradford-Zipf distribution. Journal of Documentation.
  • Brooks, T. A. (1993). All the right descriptors: A test of the strategy of unlimited aliasing. Journal of the American Society for Information Science.
  • Cherry, C. (1971). On human communication: A review, a survey, and a criticism.