DOI: 10.1145/1507509.1507512
research-article

Distinguishing humans from robots in web search logs: preliminary results using query rates and intervals

Published: 09 February 2009

ABSTRACT

The workload on web search engines is actually multiclass, being derived from the activities of both human users and automated robots. It is important to distinguish between these two classes in order to reliably characterize human web search behavior, and to study the effect of robot activity. We suggest an approach based on a multi-dimensional characterization of search sessions, and take first steps towards implementing it by studying the interaction between the query submittal rate and the minimal interval of time between different queries.
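As a rough illustration of the two dimensions named in the abstract, the sketch below computes, per user, the number of queries submitted and the minimal interval between different queries, and then applies a simple two-threshold rule. The record layout, the function names, and both threshold values are assumptions made for this example; they are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical thresholds, for illustration only (not values from the paper).
MAX_HUMAN_QUERIES = 100    # queries submitted by one user in the log
MIN_HUMAN_INTERVAL = 1.0   # seconds between two different queries


def session_features(records):
    """Compute per-user features from (user_id, timestamp_seconds, query) records.

    Returns {user_id: (query_count, min_interval)}, where min_interval is the
    smallest gap between consecutive *different* queries, or None if the user
    never submitted two different queries in a row.
    """
    by_user = defaultdict(list)
    for user, ts, query in records:
        by_user[user].append((ts, query))

    features = {}
    for user, events in by_user.items():
        events.sort()  # order each user's queries by time
        intervals = [
            t2 - t1
            for (t1, q1), (t2, q2) in zip(events, events[1:])
            if q1 != q2
        ]
        features[user] = (len(events), min(intervals) if intervals else None)
    return features


def classify(query_count, min_interval):
    """Label a user by combining both features; disagreement stays unclassified."""
    many = query_count > MAX_HUMAN_QUERIES
    fast = min_interval is not None and min_interval < MIN_HUMAN_INTERVAL
    if many and fast:
        return "robot"
    if not many and not fast:
        return "human"
    return "unclassified"
```

Leaving users "unclassified" when the two features disagree is a design choice of this sketch: it reflects the idea that a single threshold on one dimension may not be reliable on its own.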

          Reviews

          Donald Harris Kraft

          This paper raises an interesting issue about distinguishing between humans and robots performing a search, when processing Web search logs. This may not seem like an important issue, but it is necessary to ensure proper evaluation of these logs. Moreover, one can also use the methodology employed in this paper to assess the impact of robot searches on the search engines. Duskin and Feitelson note that past work generally considered that humans do not employ Boolean operators as extensively as software agents, such as those involved with meta-search engines. In their analysis, the authors use a variety of measures, including the number of queries submitted (average rate), the minimal interval between successive queries, the rate at which queries are typed, the duration of sessions of continuous activity, the time of day of the queries, and the regularity of submitted queries (whether the same query was submitted very often or if the queries were submitted at regular intervals). The authors use three logs (AlltheWeb, AltaVista, and MSN) as their data sources, over a few days. They find that classification by number of queries and by minimal interval between queries is quite useful. Moreover, they test the thresholds by which classification (human or robot) is made, to find the best thresholds for the classification decision. Finally, Duskin and Feitelson note that this is only a first step. More testing is needed to find reliable classification methods, since they feel that one simple threshold is not accurate enough. Readers, especially those involved with Web search log analysis, should consider this most interesting paper.

          Online Computing Reviews Service
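The regularity measures mentioned in the review (the same query submitted very often, or queries submitted at regular intervals) could be quantified in many ways. The sketch below shows one possible pair of statistics: the coefficient of variation of the gaps between queries, and the share of the most frequent query. Both function names and the choice of statistics are assumptions for illustration, not the measures defined by the authors.

```python
from collections import Counter
from statistics import mean, pstdev


def interval_regularity(timestamps):
    """Coefficient of variation of the gaps between consecutive queries.

    Values near 0 mean the queries are almost evenly spaced.
    Returns None when there are fewer than two gaps to compare.
    """
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if len(gaps) < 2:
        return None
    avg = mean(gaps)
    if avg == 0:
        return 0.0  # all queries share one timestamp: perfectly regular
    return pstdev(gaps) / avg


def top_query_share(queries):
    """Fraction of submissions that repeat the single most frequent query."""
    if not queries:
        return None
    counts = Counter(queries)
    return max(counts.values()) / len(queries)
```

A user with a very low `interval_regularity` or a very high `top_query_share` would be a candidate for closer inspection; any cut-off for either value would have to be chosen and validated against real logs.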

          • Published in

            WSCD '09: Proceedings of the 2009 workshop on Web Search Click Data
            February 2009
            95 pages
            ISBN: 9781605584348
            DOI: 10.1145/1507509

            Copyright © 2009 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States
