Real life, real users, and real needs: a study and analysis of user queries on the web
Introduction
A panel session at the 1997 ACM Special Interest Group on Research Issues In Information Retrieval conference entitled “Real Life Information Retrieval: Commercial Search Engines” included representatives from several Internet search services. Doug Cutting represented Excite, one of the major services. Graciously, he offered to make available a set of user queries as submitted to his service for research. The analysis we present here on the nature of sessions, queries, and terms resulted from this offer. Interestingly, the first two authors expressed their interest independently of each other, then met via email, exchanged messages and data, and conducted collaborative research exclusively through the Internet, before ever meeting in person at a Rutgers conference in February 1998, when the results were first presented. In itself, this is an example of how the Internet changed and is changing the conduct of research.
We will argue in the conclusions that real life Internet searching is changing information retrieval (IR) as well. While Internet search engines are based on IR principles, Internet searching is very different from IR searching as traditionally practised and researched in online databases, CD-ROMs and online public access catalogs (OPACs). Internet IR is a different IR, with a number of implications that could portend changes in other areas of IR as well.
With the phenomenal increase in usage of the Web, there has been growing interest in the study of a variety of topics and issues related to use of the Web. For instance, on the hardware side, Crovella and Bestavros (1996) studied client-side traffic, and Abdulla, Fox and Abrams (1997) analyzed server usage. On the software side, there have been many descriptive evaluations of Web search engines (e.g. Lynch, 1997). Statistics of Web use appear regularly (e.g. Kehoe et al., 1997; FIND/SVP, 1997), but as soon as they appear, they are out of date. The coverage of various Web search engine services was analyzed in several works. A recent article on this topic by Lawrence and Giles (1998) attracted a lot of attention. The pattern of Web surfing by users was analyzed as well (Huberman, Pirolli, Pitkow & Lukose, 1998). However, to date there has been no large-scale, quantitative or qualitative study of Web searching.
How do users search the Web? What do they search for on the Web? This study addresses these questions on a large scale and in a scholarly manner. Given the recent exponential yearly increase in the estimated number of Web users, the lack of scholarly research on Web searching is surprising and disappointing. In contrast, there has been an abundance of studies of online public access catalog (OPAC) users. Many of these studies are reviewed in Peters (1993). Similarly, there are numerous studies of users of traditional IR systems. The combined proceedings of the International Conference on Research Issues in Information Retrieval (ACM SIGIR) present many of these studies.
In the area of Web users, however, there were only two narrow studies that we could find. One focused on the THOMAS system (Croft, Cook & Wilder, 1995) and contained some general information about users at that site. However, this study focused exclusively on the THOMAS Web site, did not attempt to characterize Web searching in a systematic way, and was devoted primarily to a description of the THOMAS system. The second paper was by Jones, Cunningham and McNab (1998) and focused again on a single Web site, the New Zealand Digital Library, which contains computer science technical reports. Given the technical nature of this site, it is questionable whether these users represent Web users in general. Compared to the numerous studies of OPAC and IR system use, the body of Web user studies is small but growing.
In this paper, we report results from a major and ongoing study of users’ searching behavior on the Web. We examined a set of transaction logs of users’ searches from Excite (http://www.excite.com). This study involved real users, using real queries, with real information needs, using a real search engine. The strength of this study is that it involved a real slice of life on the Web. The weakness is that it involved only a slice: an observable artifact of what the users actually did, without any information about the users themselves or about the results and uses. Users are anonymous, but we can identify one or a sequence of queries originating with a specific user. We know when they searched and what they searched for, but we do not know anything beyond that. We report on artifactual behavior, but without a context. However, the observation and analysis of such behavior provide fascinating and surprising insight into the interaction between users and the search engines on the Web. More importantly, this study provides detailed statistics currently lacking on Web user behavior. It also provides a basis for comparison with similar studies of user searching of more traditional IR and OPAC systems.
The Web has a number of search engines. The approaches to searching, including algorithms, displays, modes of interaction and so on, vary from one search engine to another. Still, all Web search engines are IR tools for searching highly diverse and distributed information resources as found on the Web. But by the nature of the Web resources, they are faced with different issues requiring different solutions than the search engines found in well organized systems, such as in DIALOG, or in lab experiments, such as in the Text Retrieval Conference (TREC) (Sparck Jones, 1995). Moreover, from all that we know, Web users span a vastly broader and thus probably different population of users (Spink, Bateman & Jansen, 1999) and information needs, which may greatly affect the queries, searches, and interactions. Thus, it is of considerable interest to examine the similarities and/or differences in Web searching compared to traditional IR systems. In either case, it is potentially a very different IR.
The significance of this study is the same as all other related studies of IR interaction, queries and searching. By axiom and from lessons learned from experience and numerous studies:
“The success or failure of any interactive system and technology is contingent on the extent to which user issues, the human factors, are addressed right from the beginning to the very end, right from theory, conceptualization, and design process to development, evaluation, and to provision of services” (Saracevic, 1997).
Related IR studies
In this paper, we concentrate on users’ sessions, queries, and terms as key variables in IR interaction on the Web. While there are many papers that discuss many aspects of Web searching, most of those are descriptive, prescriptive, or commentary. Other than the two mentioned previously, we could not find any similar studies of Web searching. However, there were several studies that included data on searching of existing, mostly commercial, IR systems, and we culled data from those to provide a
Background on Excite and data
Founded in 1994, Excite Inc. is a major public Internet media company that offers free Web searching and a variety of other services. The company and its services are described at its Web site (http://www.excite.com) and thus are not repeated here. Only the search capabilities relevant to our results are summarized.
Excite searches are based on the exact terms that a user enters in the query. Capitalization is disregarded, however, with the exception of the logical commands AND, OR, and AND NOT. Stemming
Results
First, what is the pattern of user queries? We looked at the number of queries by each specific user and how successive queries differed from other queries by the same user. We classified the 51,474 queries as unique, modified, or identical, as shown in Table 1.
A unique query was the first query by a user (this represents the number of users). A modified query is a subsequent query in succession (second, third …) by the same user with terms added to, removed from, or both added to and removed
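The classification described above can be illustrated with a minimal sketch. It is not the authors' procedure: the record layout `(user_id, query)`, the function name `classify_queries`, and the sample log are all hypothetical, and the sketch assumes that an identical query is one whose terms match the same user's immediately preceding query, since the definition in the text is truncated.

```python
from collections import Counter


def classify_queries(log):
    """Label each (user_id, query) record as 'unique', 'modified', or 'identical'.

    Assumptions (not from the paper): `log` is in chronological order, and
    each query is compared only with the same user's previous query, using
    the whitespace-delimited terms exactly as typed.
    """
    last_terms = {}  # user_id -> term list of that user's previous query
    labels = []
    for user_id, query in log:
        terms = query.split()
        if user_id not in last_terms:
            labels.append("unique")      # first query seen for this user
        elif terms == last_terms[user_id]:
            labels.append("identical")   # same terms as the previous query
        else:
            labels.append("modified")    # terms added, removed, or both
        last_terms[user_id] = terms
    return labels


# Hypothetical log records, for illustration only.
sample_log = [
    ("u1", "information processing"),
    ("u1", "information processing"),
    ("u1", "information processing systems"),
    ("u2", "rutgers"),
]
labels = classify_queries(sample_log)
counts = Counter(labels)
```

Under these assumptions, the count of "unique" labels equals the number of distinct users in the log, which matches the remark in the text that unique queries represent the number of users.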
Failure analysis
Next, we turn to a discussion of the surprisingly high number of incorrect uses or mistakes. Among the users who employed each operator, 50% made a mistake in using the Boolean AND, 28% made an error in using OR, and 19% used AND NOT incorrectly; however, only 47 users, a negligible percentage, used AND NOT at all. The most common mistake was not capitalizing the Boolean operator, as required by the Excite search engine. For example, a correct query would be: information AND processing. The most common
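As a rough illustration of the capitalization mistake, the sketch below flags operator-like words that are not fully upper-case. It is a heuristic only and not the method used in the study: the function name `uncapitalized_operators` is hypothetical, and a lower-case "and" or "or" may be an ordinary conjunction rather than a failed operator, so a flagged query is a candidate for inspection, not a confirmed error.

```python
import re

# Operator-like words in any letter case. "and not" is listed first so that
# it is reported once rather than as a separate "and".
_OPERATOR_LIKE = re.compile(r"\b(and\s+not|and|or)\b", re.IGNORECASE)


def uncapitalized_operators(query):
    """Return the operator-like words in `query` that are not fully upper-case.

    Heuristic sketch: the engine described in the text treats only the
    capitalized forms AND, OR, and AND NOT as logical commands.
    """
    hits = []
    for match in _OPERATOR_LIKE.finditer(query):
        word = match.group(0)
        if word != word.upper():
            # Normalize internal whitespace and case for reporting.
            hits.append(" ".join(word.split()).lower())
    return hits
```

For instance, `uncapitalized_operators("information and processing")` reports the lower-case `and`, while the correctly capitalized query from the text, `information AND processing`, yields an empty list.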
Terms
We also analyzed user queries according to the terms they included. A term was any series of characters bounded by white space. There were 113,793 terms (all terms from all queries). After eliminating duplicate terms, there were 21,862 unique terms, counted without regard to case (in other words, all upper-case letters are here reduced to lower case). In this distribution logical operators AND, OR, NOT were also treated as terms, because they were used not only as operators but also as conjunctions (we
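The term definition above translates directly into a short counting routine. The following is a minimal sketch, assuming the queries are available as a list of strings; the function name `term_statistics` and the sample queries are hypothetical and are not data from the study.

```python
from collections import Counter


def term_statistics(queries):
    """Count total and unique terms across a list of query strings.

    A term is any run of characters bounded by white space, and terms are
    lower-cased before counting, so operators such as AND are counted as
    ordinary terms.
    """
    terms = [term.lower() for query in queries for term in query.split()]
    frequencies = Counter(terms)
    return len(terms), len(frequencies), frequencies


# Hypothetical queries, for illustration only.
total, unique, freq = term_statistics(
    ["Information AND processing", "information retrieval", "web  search"]
)
```

Here `total` counts every term occurrence and `unique` counts distinct lower-cased terms, which correspond to the two figures reported in the text; `freq.most_common()` would give the ranked term distribution.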
Conclusions and future research
We investigated a large sample of searches on the Web, represented by logs of queries from Excite, a major Web search provider. However, we consider this study a starting point. We have begun the analysis of a new sample of over 1 million queries. We will compare the results from this study with those of the larger study to isolate similarities and/or differences. In this larger study, we will address many of the research questions raised in this paper. While Web search engines follow the
Acknowledgements
The authors gratefully acknowledge the assistance of Graham Spencer, Doug Cutting, Amy Smith and Catherine Yip of Excite Inc. in providing the data and information for this research. Without the generous sharing of data by Excite Inc., this research would not have been possible. We also acknowledge the generous support of our institutions for this research and the useful comments of the anonymous reviewers.
References (20)
- et al. Shared user behavior on the World Wide Web.
- et al. An analysis of search terminology used by humanities scholars: The Getty online searching project report. Library Quarterly (1993).
- et al. Providing government information on the Internet: experiences with THOMAS.
- et al. Self-similarity in World Wide Web traffic evidence and possible causes.
- Online searching: Measures that discriminate among users with different types of experience. Journal of the American Society for Information Science (1981).
- FIND/SVP 1997. The 1997 American Internet User Survey....
- Effects of search experience and subject knowledge on the search tactics of novice and experienced searchers. Journal of the American Society for Information Science (1993).
- et al. Strong regularities in World Wide Web surfing. Science (1998).
- et al. Searchers, the Subjects They Search, and sufficiency: A Study of a Large Sample of Excite Searches.
- et al. Real life information retrieval: a study of user queries on the Web. SIGIR Forum (1998).