The challenge of commercial document retrieval, Part I: Major issues, and a framework based on search exhaustivity, determinacy of representation and document collection size
Introduction
A radical shift began to take place in the 1990s in the focus of commercial information processing software development: a shift away from the better-understood data retrieval/database model, and toward the more complex and challenging development of commercial document retrieval models. This shift in focus came none too soon: management theorist Peter Drucker has stated that we have entered the third major evolution in the “… concept and structure of organizations… the shift from the command-and-control organization, the organization of departments and divisions, to the information-based organization, the organization of knowledge specialists.” (Drucker, 1988). Much of this organizational information, perhaps most of it, takes the form of documents (e.g., reports, messages, letters, journal and magazine articles, memos, minutes of meetings, research bulletins, etc.), and it is easy to see why documents are important: they are often the organizing and interpretive medium that gives data, figures and other information meaning within an organizational context. In short, documents are the medium where organizational memory or intelligence resides. Poor access to document content means poor access to the knowledge that an organization creates or acquires.
Interleaf has estimated that more than 1 billion documents are being created each day in North America, and that executives spend 40% of their time dealing directly with documents. The Gartner Group has estimated that as much as 90% of a corporation's information is contained in its documents. This brings a new urgency to the task of providing access to this enormous, and growing, volume of document-based information: lawsuits are lost or settled out of court because supporting documents cannot be found; internal studies and analyses are redone because the documented results of the original work cannot be located; managers make decisions that are sub-optimal or even incorrect because they are not aware of important relevant information that exists in other parts of the organization or is available from commercial document databases; scarce grant money is allocated to research that has already been done and published because neither the authors of the grant nor the referees are aware that similar work has already been completed. The consequences of poor document retrieval can be striking:
The effects of a failure of “document control” can be dramatic. A major utility company was required to shut down four nuclear reactors because of lost repair instructions (at a loss of $2m per day). The US Department of Defense estimates that half of all military accidents result from missing or inaccurate technical information. A major airline was fined $10k per take-off because of out-of-date maintenance information. A major drug company lost its entire R&D investment owing to inability to provide timely documentation (Fleischer, 1990).
Clearly document, or text, retrieval has taken a prominent place in commercial information system development. But are we ready for this shift? Do we have the software tools to build large-scale commercial document retrieval systems, and, more importantly, do we have a good conceptual understanding of what factors influence document retrieval in the corporate context? A clear “yes” cannot be given. Certainly, there has been a rich history of theoretical work in document retrieval (Salton & McGill, 1983; van Rijsbergen, 1979) and a history of small-system tests. But tests of actual large-scale commercial applications of this theory are still rare, and the market-place has been saturated with systems whose theoretical antecedents date back to the 1950s. [The periodic Text REtrieval Conferences (TRECs) have attempted to evaluate the retrieval effectiveness of comparatively large IR systems (though not in operational settings). But their conclusions must be taken with some caution. See Blair (2002, forthcoming).]
Section snippets
Data retrieval versus document retrieval
Why can we not build commercial document retrieval systems based on the better-understood data retrieval model? Clearly, Data Base Management Systems (DBMSs) represent a relatively well-understood framework that should give us a firm foundation on which to build new commercial document retrieval systems. In addition, DBMSs have a wide selection of support software and utilities designed to facilitate database management (e.g., telecommunications interfaces, data loading programs, concurrency
Document retrieval and the problem of scale
This scaling problem is central to the problem of document retrieval, so it is important that we examine some of the factors that influence it more closely. Document retrieval is critically dependent on how the documents are represented on a particular system, and this system of representation comprises a kind of “language” in which document content or context can be described and searchers' requests can be expressed. Consequently, the properties of this “document language” can influence the
Zipf and the trial-and-error nature of document retrieval
One might ask, reasonably, why the mere increase in the number of documents that are represented by the word “computer” increases the indeterminacy of retrieving documents “about computers”. Is it not possible that the increase in the number of documents represented by “computer” simply increases the number of documents that are useful to the searcher? This does not necessarily happen. To understand why this is the case, we need to look again at Zipf's model of language. We had mentioned
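The appeal to Zipf's model can be made concrete with the usual statement of Zipf's law, under which the frequency of a word is roughly inversely proportional to its frequency rank. The following is a minimal sketch, not the article's own procedure: the function names `rank_frequency` and `zipf_expected` and the sample sentence are assumptions introduced purely for illustration.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common()
    return [(rank, word, count) for rank, (word, count) in enumerate(counts, start=1)]


def zipf_expected(top_count, rank):
    """Idealized Zipf frequency: count of the top-ranked word divided by rank."""
    return top_count / rank


# Hypothetical sample text; real collections are far larger.
sample = (
    "the computer retrieves the document and the searcher reads the document "
    "about the computer"
)
table = rank_frequency(sample)
top_count = table[0][2]

# Print observed counts next to the idealized 1/rank prediction.
for rank, word, count in table[:4]:
    print(rank, word, count, round(zipf_expected(top_count, rank), 2))
```

On a text this short the fit is only suggestive; the regularity Zipf described becomes visible on large bodies of text.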
Vocabulary balance and the competing forces of language
Zipf postulated that the statistical regularities which he noticed in language were the result of the competition of two forces in language: unification and diversification (Cherry calls these forces “Personal” and “Social” in his informed application of Zipf's work to communication theory (Cherry, 1971)). Basically, language needs two kinds of words to work efficiently: general words which can be used in a variety of contexts (and have a variety of related meanings), and specific words with
Document description and the small-system effect
Ironically, biasing the document representation vocabulary towards description tends to make small document retrieval systems of several hundred documents work better. Since many of the older tests of document retrieval effectiveness were done on such small systems, the results tended to confirm the effectiveness of “over-description”. A prominent study of document retrieval language used on a small system came precisely to this conclusion:
Many, many alternative access words are needed for
What do we know about large-scale commercial document retrieval systems?
Since document retrieval systems, when used to access intellectual content, do not scale up as easily as data retrieval systems, it becomes imperative to treat large-scale systems as fundamentally different from small-scale systems. That is, large-scale document retrieval systems may require not only a different design model, but may also require a substantially different theoretical foundation than small systems (Blair, 1990). It also means that usually we cannot infer the reliability of
Search exhaustivity and database size
We have already discussed how database size can significantly affect the retrieval of intellectual content. Search exhaustivity has a similarly important influence. An exhaustive search on a document retrieval system is one in which the inquirer needs to see all, or nearly all, of the documents which are useful to him. This can be contrasted with what might be called a sample search, where the inquirer does not need all of the useful documents on the system. For example, a lawyer preparing to
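The distinction between exhaustive and sample searches can be expressed in terms of recall, the standard measure of the fraction of useful documents that a search retrieves. The sketch below is illustrative only: the threshold values `EXHAUSTIVE_RECALL` and `SAMPLE_RECALL`, the document identifiers, and the function names are assumptions, not figures from the article. It also assumes the full set of useful documents is known, which in an operational system it generally is not.

```python
def recall(retrieved, useful):
    """Fraction of the useful documents that the search actually retrieved."""
    useful = set(useful)
    if not useful:
        return 1.0  # nothing to find, so nothing was missed
    return len(useful & set(retrieved)) / len(useful)


def satisfies(retrieved, useful, required_recall):
    """True if the search reaches the recall level the inquirer needs."""
    return recall(retrieved, useful) >= required_recall


# Hypothetical thresholds: an exhaustive search needs nearly all useful
# documents, while a sample search needs only some of them.
EXHAUSTIVE_RECALL = 0.95
SAMPLE_RECALL = 0.25

useful_docs = {"d1", "d2", "d3", "d4"}
retrieved_docs = ["d1", "d2", "d9"]

print(recall(retrieved_docs, useful_docs))
print(satisfies(retrieved_docs, useful_docs, EXHAUSTIVE_RECALL))
print(satisfies(retrieved_docs, useful_docs, SAMPLE_RECALL))
```

The same retrieved set can therefore be adequate for a sample search and inadequate for an exhaustive one, which is why the type of search matters when judging a system.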
The determinacy of representation: content and context
The third and final component in this framework for document retrieval is based on how precisely the documents can be represented on any given system; this is the system of representation through which we can provide access to the intellectual content of a given set of documents by including them in specific logical or intellectual categories. In more formal terms, it is the system of representation that, for any search, can provide an ordering of the documents in the database from those most
A framework for document/text retrieval
Fig. 2 gives the basic framework for document retrieval broken down by database size, search type (exhaustivity/sample), and the level of representational determinacy (content/context). This framework breaks down document retrieval into eight classes. An example of the type of retrieval for each class would be:
1. (Large DB: exhaustive: content) Corporate Litigation Support. (“Did the defendants write anything objecting to the contract changes?”)
Research and Development. (“What work has been done
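The eight classes follow from crossing the three two-valued dimensions of the framework. The sketch below enumerates them; it is a hedged illustration in which the label "Small DB" and the ordering of classes after the first are assumptions, since only the "(Large DB: exhaustive: content)" class is spelled out in the text here.

```python
from itertools import product

# The three dimensions of the framework. "Small DB" is an assumed label.
DB_SIZES = ("Large DB", "Small DB")
SEARCH_TYPES = ("exhaustive", "sample")
DETERMINACY = ("content", "context")


def framework_classes():
    """Enumerate every combination of size, search type and determinacy."""
    return [
        f"({size}: {search}: {level})"
        for size, search, level in product(DB_SIZES, SEARCH_TYPES, DETERMINACY)
    ]


classes = framework_classes()
# The numbering printed here is illustrative and may not match Fig. 2.
for number, label in enumerate(classes, start=1):
    print(number, label)
```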
Degrees of difficulty
What can we say comparatively about these eight classes of document searching? In general:
A. Exhaustive content searches are more difficult than sample searches
B. Searches for precise intellectual content on large document collections are usually more difficult than searches for the same material on small document collections, ceteris paribus
C. Content searches based on descriptions of low determinacy are usually less precise than those based on descriptions of high determinacy (e.g., context),
Conclusion
The thesis of the first part of this two-part article has been that document retrieval is a complex process which is strongly influenced by at least three major factors: the size of the document collection; the type of search (exhaustive, existence or sample); and, the determinacy of document representation. Collectively, these factors can be used to provide a useful framework or taxonomy of the major kinds of document searches. Such a framework helps to highlight the fundamental issues facing
Acknowledgements
The author wishes to thank M.E. (Bill) Maron of the University of California, Berkeley, Don Swanson of the University of Chicago, Scott Serich of George Washington University, Bruce Hill of the University of Michigan and Steven Kimbrough of the University of Pennsylvania, for their comments on earlier versions of this paper, and their discussions of the issues raised in this article.
References (33)
Indeterminacy in the subject access to documents
Information Processing and Management
(1986)
An extended relational document retrieval model
Information Processing and Management
(1988)
et al.
Full-text information retrieval: further analysis and clarification
Information Processing and Management
(1990)
Ranking techniques and the empirical log law
Information Processing and Management
(1984)
Baker, G. P., & Hacker, P. M. S. (1985). Vagueness and determinacy of sense. In Wittgenstein: Meaning and...
Searching biases in large interactive document retrieval systems
Journal of the American Society for Information Science
(1980)
The data-document distinction in information retrieval
Communications of the ACM
(1984)
The management of information: basic distinctions
Sloan Management Review
(1984)
Language and representation in information retrieval
(1990)
Blair, D. C. (1999). The data-document distinction revisited. Working Paper, University of Michigan, Ann...
The challenge of commercial document retrieval, Part II: A strategy for document searching based on identifiable document partitions
Information Processing and Management
An evaluation of retrieval effectiveness for a full-text document retrieval system
Communications of the ACM
The Derivation and Application of the Bradford-Zipf Distribution
Journal of Documentation
All the right descriptors: a test of the strategy of unlimited aliasing
Journal of the American Society for Information Science
On human communication: A review, a survey, and a criticism