Real life, real users, and real needs: a study and analysis of user queries on the web

doi:10.1016/S0306-4573(99)00056-4

Information Processing & Management

Volume 36, Issue 2, 1 March 2000, Pages 207-227

https://doi.org/10.1016/S0306-4573(99)00056-4

Abstract

We analyzed transaction logs containing 51,473 queries posed by 18,113 users of Excite, a major Internet search service. We provide data on: (i) sessions: changes in queries during a session, number of pages viewed, and use of relevance feedback; (ii) queries: the number of search terms, and the use of logic and modifiers; and (iii) terms: their rank/frequency distribution and the most highly used search terms. We then shift the focus of analysis from the query to the user to gain insight into the characteristics of the Web user. With these characteristics as a basis, we then conducted a failure analysis, identifying trends among user mistakes. We conclude with a summary of findings and a discussion of their implications.

Introduction

A panel session at the 1997 ACM Special Interest Group on Research Issues In Information Retrieval conference entitled “Real Life Information Retrieval: Commercial Search Engines” included representatives from several Internet search services. Doug Cutting represented Excite, one of the major services. Graciously, he offered to make available a set of user queries as submitted to his service for research. The analysis we present here on the nature of sessions, queries, and terms resulted from this offer. Interestingly, the first two authors expressed their interest independently of each other, then met via email, exchanged messages and data, and conducted collaborative research exclusively through the Internet, before ever meeting in person at a Rutgers conference in February 1998, when the results were first presented. In itself, this is an example of how the Internet changed and is changing the conduct of research.

We will argue in the conclusions that real life Internet searching is changing information retrieval (IR) as well. While Internet search engines are based on IR principles, Internet searching is very different from IR searching as traditionally practised and researched in online databases, CD-ROMs and online public access catalogs (OPACS). Internet IR is a different IR, with a number of implications that could portend changes in other areas of IR as well.

With the phenomenal increase in usage of the Web, there has been growing interest in the study of a variety of topics and issues related to use of the Web. For instance, on the hardware side, Crovella and Bestavros (1996) studied client-side traffic, and Abdulla, Fox and Abrams (1997) analyzed server usage. On the software side, there have been many descriptive evaluations of Web search engines (e.g. Lynch, 1997). Statistics of Web use appear regularly (e.g. Kehoe et al., 1997; FIND/SVP, 1997), but as soon as they appear, they are out of date. The coverage of various Web search engine services was analyzed in several works. A recent article on this topic by Lawrence and Giles (1998) attracted a lot of attention. The pattern of Web surfing by users was analyzed as well (Huberman, Pirolli, Pitkow & Lukose, 1998). However, to date there has been no large-scale, quantitative or qualitative study of Web searching.

How do users search the Web? What do they search for on the Web? These questions are addressed in a large-scale and academic manner in this study. Given the recent yearly exponential increase in the estimated number of Web users, this lack of scholarly research is surprising and disappointing. In contrast, there has been an abundance of studies of users of online public access catalogs (OPACs). Many of these studies are reviewed in Peters (1993). Similarly, there are numerous studies of users of traditional IR systems. The combined proceedings of the International Conference on Research Issues in Information Retrieval (ACM SIGIR) present many of these studies.

In the area of Web users, however, there were only two narrow studies that we could find. One focused on the THOMAS system (Croft, Cook & Wilder, 1995) and contained some general information about users at that site. However, this study focused exclusively on the THOMAS Web site, did not attempt to characterize Web searching in a systematic way, and is devoted primarily to a description of the THOMAS system. The second paper was by Jones, Cunningham and McNab (1998) and focused again on a single Web site, the New Zealand Digital Library, which contains computer science technical reports. Given the technical nature of this site, it is questionable whether these users represent Web users in general. There is a small but growing body of Web user studies compared to the numerous studies of OPAC and IR system use.

In this paper, we report results from a major and ongoing study of users’ searching behavior on the Web. We examined a set of transaction logs of users’ searches from Excite (http://www.excite.com). This study involved real users, using real queries, with real information needs, using a real search engine. The strength of this study is that it involved a real slice of life on the Web. The weakness is that it involved only a slice — an observable artifact of what the users actually did, without any information about the users themselves or about the results and uses. Users are anonymous, but we can identify one or a sequence of queries originating with a specific user. We know when they searched and what they searched for, but we do not know anything beyond that. We report on artifactual behavior, but without a context. However, the observation and analysis of such behavior provide for a fascinating and surprising insight into the interaction between users and the search engines on the Web. More importantly, this study provides detailed statistics currently lacking on Web user behavior. It also provides a basis for comparison with similar studies of user searching of more traditional IR and OPAC systems.

The Web has a number of search engines. The approaches to searching, including algorithms, displays, modes of interaction and so on, vary from one search engine to another. Still, all Web search engines are IR tools for searching highly diverse and distributed information resources as found on the Web. But by the nature of the Web resources, they are faced with different issues requiring different solutions than the search engines found in well organized systems, such as in DIALOG, or in lab experiments, such as in the Text Retrieval Conference (TREC) (Sparck Jones, 1995). Moreover, from all that we know, Web users span a vastly broader and thus probably different population of users (Spink, Bateman & Jansen, 1999) and information needs, which may greatly affect the queries, searches, and interactions. Thus, it is of considerable interest to examine the similarities and/or differences in Web searching compared to traditional IR systems. In either case, it is potentially a very different IR.

The significance of this study is the same as all other related studies of IR interaction, queries and searching. By axiom and from lessons learned from experience and numerous studies:

“The success or failure of any interactive system and technology is contingent on the extent to which user issues, the human factors, are addressed right from the beginning to the very end, right from theory, conceptualization, and design process to development, evaluation, and to provision of services” (Saracevic, 1997).

Section snippets

Related IR studies

In this paper, we concentrate on users’ sessions, queries, and terms as key variables in IR interaction on the Web. While there are many papers that discuss many aspects of Web searching, most of those are descriptive, prescriptive, or commentary. Other than the two mentioned previously, we could not find any similar studies of Web searching. However, there were several studies that included data on searching of existing, mostly commercial, IR systems, and we culled data from those to provide a

Background on Excite and data

Founded in 1994, Excite Inc. is a major public Internet media company which offers free Web searching and a variety of other services. The company and its services are described at its Web site (http://www.excite.com), and thus are not repeated here. Only the search capabilities relevant to our results are summarized.

Excite searches are based on the exact terms that a user enters in the query; however, capitalization is disregarded, with the exception of the logical commands AND, OR, and AND NOT. Stemming

Results

First, what is the pattern of user queries? We looked at the number of queries by each specific user and how successive queries differed from other queries by the same user. We classified the 51,473 queries as unique, modified, or identical, as shown in Table 1.

A unique query was the first query by a user (this represents the number of users). A modified query is a subsequent query in succession (second, third …) by the same user with terms added to, removed from, or both added to and removed
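As a rough illustration of the classification described in this section, the sketch below labels each query in a time-ordered log. It is not the authors' procedure. The function name `classify_queries`, the `(user_id, query)` log format, and the rule that an identical query is one whose whitespace-separated terms exactly match the same user's previous query are all assumptions made for the example.

```python
def classify_queries(log):
    """Label each query as 'unique', 'modified', or 'identical'.

    `log` is an iterable of (user_id, query) pairs in time order
    (a hypothetical format, not the Excite transaction log layout).
    """
    previous_terms = {}  # user_id -> term list of that user's previous query
    labels = []
    for user_id, query in log:
        terms = query.split()
        if user_id not in previous_terms:
            # First query seen for this user.
            labels.append("unique")
        elif terms == previous_terms[user_id]:
            # Assumption: identical means same terms as the previous query.
            labels.append("identical")
        else:
            # Terms were added, removed, or both.
            labels.append("modified")
        previous_terms[user_id] = terms
    return labels


sample_log = [
    ("u1", "cats"),
    ("u1", "cats dogs"),
    ("u2", "excite"),
    ("u1", "cats dogs"),
]
print(classify_queries(sample_log))
```

Under this rule the count of "unique" labels equals the number of distinct users, which matches the text's remark that unique queries represent the number of users.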

Failure analysis

Next, we turn to a discussion of the surprisingly high number of incorrect uses or mistakes. Among users who used the Boolean AND, 50% made a mistake in its use; 28% made an error in the use of OR; and only 19% used AND NOT incorrectly, although only 47 users, a negligible percentage, used AND NOT at all. The most common mistake was not capitalizing the Boolean operator, as required by the Excite search engine. For example, a correct query would be: information AND processing. The most common
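The capitalization mistake described above lends itself to a simple check. The following is a minimal sketch, not the method used in the study: the function name `lowercase_operators` and the operator set are assumptions, and the heuristic cannot tell an intended operator from an ordinary conjunction (as in "rock and roll"), so it only flags candidates for manual review.

```python
# Tokens that act as Boolean operators only when fully capitalized.
OPERATORS = {"AND", "OR", "NOT"}


def lowercase_operators(query):
    """Return the terms that spell an operator but are not fully capitalized."""
    return [
        term
        for term in query.split()
        if term.upper() in OPERATORS and term != term.upper()
    ]


print(lowercase_operators("information and processing"))  # flags 'and'
print(lowercase_operators("information AND processing"))  # flags nothing
```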

Terms

We also analyzed user queries according to the terms they included. A term was any series of characters bounded by white space. There were 113,793 terms (all terms from all queries). After eliminating duplicate terms, there were 21,862 unique terms that were non-case sensitive (in other words, all upper cases are here reduced to lower case). In this distribution logical operators AND, OR, NOT were also treated as terms, because they were used not only as operators but also as conjunctions (we
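The term counts described in this section can be reproduced in outline with a few lines of code. This is a sketch under stated assumptions rather than the authors' tooling: the function name `term_statistics` is hypothetical, terms are split on white space as the text defines them, and case is folded to lower case before duplicates are merged.

```python
from collections import Counter


def term_statistics(queries):
    """Return (total terms, unique terms, terms ranked by frequency)."""
    counts = Counter(
        term.lower()             # case-insensitive, as in the text
        for query in queries
        for term in query.split()  # a term is bounded by white space
    )
    total_terms = sum(counts.values())
    unique_terms = len(counts)
    ranked = counts.most_common()  # basis for a rank/frequency distribution
    return total_terms, unique_terms, ranked


total, unique, ranked = term_statistics(
    ["Information AND processing", "information retrieval"]
)
print(total, unique, ranked[0])
```

Because operators are not filtered out, AND, OR, and NOT are counted as terms here, which is consistent with how the text says they were treated in the distribution.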

Conclusions and future research

We investigated a large sample of searches on the Web, represented by logs of queries from Excite, a major Web search provider. However, we consider this study as a starting point. We have begun the analysis of a new sample of over 1 million queries. We will compare the results from this study with those of the larger study to isolate similarities and/or differences. In this larger study, we will address many of the research questions raised in this paper. While Web search engines follow the

Acknowledgements

The authors gratefully acknowledge the assistance of Graham Spencer, Doug Cutting, Amy Smith and Catherine Yip of Excite Inc. in providing the data and information for this research. Without the generous sharing of data by Excite Inc. this research would not be possible. We also acknowledge the generous support of our institutions for this research and the useful comments of the anonymous reviewers.

References (20)

G. Abdulla et al.
Shared user behavior on the World Wide Web
M.J. Bates et al.
An analysis of search terminology used by humanities scholars: The Getty online searching project report
Library Quarterly
(1993)
W.B. Croft et al.
Providing government information on the Internet: experiences with THOMAS
M.E. Crovella et al.
Self-similarity in World Wide Web traffic evidence and possible causes
C.H. Fenichel
Online searching: Measures that discriminate among users with different types of experience
Journal of the American Society for Information Science
(1981)
FIND/SVP 1997. The 1997 American Internet User Survey....
I. Hsieh-yee
Effects of search experience and subject knowledge on the search tactics of novice and experienced searchers
Journal of the American Society for Information Science
(1993)
B.A. Huberman et al.
Strong regularities in World Wide Web surfing
Science
(1998)
B.J. Jansen et al.
Searchers, the Subjects They Search, and sufficiency: A Study of a Large Sample of Excite Searches
B.J. Jansen et al.
Real life information retrieval: a study of user queries on the Web
SIGIR Forum
(1998)


