Search markets and search results: The case of Bing
Introduction
According to Alexa.com (2012), web search is central to 12 of the world's 25 most visited websites: Google (rank 1), Yahoo! (4; search and portal, now owned by Microsoft and with results driven by Bing), Baidu (5), Google India (14), Yahoo! Japan (15), Google Germany (17), Yandex (18), MSN (19; portal and Bing search), Google Hong Kong (21), Google Japan (22), Bing (23) and Google UK (25). Although the broad details of how search engines work are known, the details of their operations, and particularly the ranking of results and spam filtering, are unknown and seem to be closely guarded commercial secrets. It is known that the same query will generate different results over time and between search engines, but the same query can also generate different results at the same time for different users, based upon their geographic location or search preferences. The nature and extent of this variation are unclear.
Some search engines, including Bing, segment users into search markets when calculating the results of a query. These search markets are based upon geographic location and language. For example, one Bing search market is English-India: people in India searching Bing with English as their default setting will get different results from people searching in English in the US (the English-USA market). The search market is chosen by Bing, with users having (as of November 2012) the option to override it by clicking the Preferences icon on the Bing.com home page and then the Change your country/region link. Bing's list of 40 options includes 10 that specify a language, of which three include English: Arab countries, Canada, and United States (all areas where other popular languages are spoken). In contrast, Google appears to have more fine-grained, location-based results (perhaps, in part, for its map-based search results) because its results pages (as of November 2012) include a Change location link with a free-text field that recognizes individual towns. Google also apparently has 188 different national or regional variants branded by domain name, such as Google.co.uk (Google.com, 2012). For both search engines, the regional results include international results, and the user can separately request that only results from a certain country or region are returned. There seems, however, to have been no research into the impact of search markets on search engine results.
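The market-segmentation mechanism can be illustrated with a minimal sketch. The endpoint URL and the parameter names (`q`, `mkt`) below are chosen for illustration only; the actual Bing API 2.0 request format and authentication are different and not reproduced here. The point is that the same query string, sent with different language-region codes, constitutes two distinct requests that can return different result sets.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real Bing API 2.0 URL and authentication differ.
SEARCH_ENDPOINT = "https://api.example.com/search"

def build_market_query(query: str, market: str) -> str:
    """Build a search request URL pinned to one search market.

    `market` is a language-region code such as "en-US" or "en-IN";
    the same query sent with different codes may return different results.
    """
    return SEARCH_ENDPOINT + "?" + urlencode({"q": query, "mkt": market})

# The same query in two markets yields two distinct requests.
us_url = build_market_query("information science", "en-US")
in_url = build_market_query("information science", "en-IN")
```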
Section snippets
Problem statement
Differences between search results are of particular interest in the field of webometrics, which often involves counting web search matches for large sets of queries. Many studies need lists of URLs matching a search that are as complete as possible or, alternatively, hit-count estimates (figures reported near the top of a search results page estimating the total number of results) as proxies for these (Ortega & Aguillo, 2009; Park, 2010; Spörrle & Tumasjan, 2011; Thelwall et al., 2010,
Search engines and search results
Although the performance and algorithms used by the major commercial search engines are not public, some general information is known about how search engines work from publications (Brin & Page, 1998) and patents (Page, 2001) produced by their architects. In addition, some information science research has investigated the output of search engines, typically focusing on variations in results over time.
Research questions
The research questions concern the extent of variation in the top 10 results and in all results returned for a query. The questions concerning the top 10 are most relevant to ordinary users, who typically may not visit any results beyond the first page; the questions concerning all URLs are most relevant to webometric studies, although in both cases the results may vary for different types of query.
- In terms of the overlaps between the results sets for the same query, do the top 10 and all search results vary more over
Methods
The overall research design was to conduct a series of identical searches in different search markets, at a series of different points in time, and to compare the results for the extent of overlap between them, using the Bing API 2.0 as the data source. The Bing API allows programmers to access the Bing search engine on a limited basis. The choice of the API was made partly because it is used in webometric research and partly to ensure reliable results. An alternative way to collect the data
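The design described above can be sketched as a nested loop over markets and collection dates. In the sketch below, `fetch_results` is a placeholder standing in for the Bing API 2.0 call; it is stubbed so that the structure runs without network access, and the market codes and dates are an illustrative subset, not the paper's actual set.

```python
from itertools import combinations

MARKETS = ["en-US", "en-GB", "en-IN", "en-AU"]   # illustrative subset
TIME_POINTS = ["2012-01", "2012-02", "2012-03"]  # illustrative dates

def fetch_results(query, market, when):
    # Placeholder for the API call; returns the set of result URLs.
    # Stubbed deterministically so the sketch runs offline.
    return {f"http://example.org/{market}/{when}/{i}" for i in range(10)}

def collect(query):
    """Run the same query in every market at every time point."""
    return {(m, t): fetch_results(query, m, t)
            for m in MARKETS for t in TIME_POINTS}

results = collect("information science")

# Comparisons between markets at one time point...
pairs_between = list(combinations(MARKETS, 2))
# ...versus comparisons within one market across time points.
pairs_over_time = list(combinations(TIME_POINTS, 2))
```

Each pair of result sets would then be compared for overlap, which is what the analysis in the Results section measures.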
Results
The first question asked whether search results varied more over time or more between markets. The top 10 results between different markets at the same point in time (Table 1, diagonal values) overlapped approximately the same amount as between the same market at different points in time (Table 1, off-diagonal values), at least for gaps of one or two months. Although the overall difference was significant at p < 0.001 using an independent samples t-test, the overall Jaccard similarity difference
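The overlap measure and significance test named here can be reproduced in outline. The Jaccard similarity of two URL sets is the size of their intersection divided by the size of their union, and an independent-samples t statistic with pooled variance compares two groups of per-query similarities. The URL sets and numbers below are toy values, not the paper's data, and the hand-rolled t statistic omits the p-value computation.

```python
from statistics import mean, variance

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two URL sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def t_statistic(x, y):
    """Independent-samples t statistic with pooled variance
    (degrees of freedom would be len(x) + len(y) - 2)."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / (pooled * (1 / nx + 1 / ny)) ** 0.5

# Toy top-10 result sets for one query in two hypothetical markets:
# 5 URLs shared out of 15 distinct URLs -> Jaccard similarity of 1/3.
top10_us = {f"url{i}" for i in range(10)}
top10_in = {f"url{i}" for i in range(5, 15)}
overlap = jaccard(top10_us, top10_in)
```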
Discussion
There is a surprising amount of variation among the URLs returned by the Bing API for the different English search markets. This finding should not be taken as indicative of all queries because of potential variations between different types of search, such as academic, educational, cultural, or commercial queries. Indeed, it seems likely that academic queries would exhibit less variation, and cultural and commercial queries more. Similarly, the results may
Conclusions
The top 10 results for the queries tested here showed substantial variations with the average overlap between any pair of search markets being less than 50%. There was more overlap between the full sets of results, with a majority of URLs being the same, on average, between pairs of different search markets. The extent of overlap between different search markets' results was about the same as the overlap for the same market a month later for the top 10 results, but less for the complete set of
Acknowledgment
This paper is supported by ACUMEN (Academic Careers Understood through Measurement and Norms), grant agreement number 266632, under the Seventh Framework Programme of the European Union. The funding source had no role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the article for publication.
References (40)
- Brin & Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems (1998).
- Ortega & Aguillo. Mapping world-class universities on the web. Information Processing & Management (2009).
- et al. A study of results overlap and uniqueness among major web search engines. Information Processing & Management (2006).
- et al. Search engine coverage bias: Evidence and possible causes. Information Processing & Management (2004).
- et al. Word co-occurrences on webpages as a measure of the relatedness of organizations: A new webometrics concept. Journal of Informetrics (2010).
- Alexa top 500 global sites.
- et al. Capacity planning for vertical search engines.
- Methods for measuring search engine performance over time. Journal of the American Society for Information Science and Technology (2002).
- et al. Dynamics of search engine rankings — A case study.
- et al. The lifespan of ‘informetrics’ on the web: An eight year study (1998–2006). Scientometrics (2008).
- A method for measuring the evolution of a topic on the web: The case of “informetrics”. Journal of the American Society for Information Science and Technology.
- Integration of news content into web results.
- Google basics: Learn how Google discovers, crawls, and serves web pages.
- A longitudinal study of web pages continued: A report after six years. Information Research.
- Google scholar citations and Google Web/URL citations: A multi-discipline exploratory analysis. Journal of the American Society for Information Science and Technology.
- Accessibility of information on the web. Nature.
- A three-year study on the freshness of web search engine databases. Journal of Information Science.
Mike Thelwall is professor of information science and leader of the Statistical Cybermetrics Research Group at the University of Wolverhampton, UK, and a research associate at the Oxford Internet Institute. He has developed tools for gathering and analyzing web data, including hyperlink analysis, sentiment analysis, and content analysis for Twitter, YouTube, blogs and the general web. His publications include 210 refereed journal articles, seven book chapters, and two books, including Introduction to Webometrics. He is an associate editor of the Journal of the American Society for Information Science and Technology and sits on three other editorial boards.
David Wilkinson is a member of the Statistical Cybermetrics Research Group in the School of Technology at the University of Wolverhampton, UK, as well as head of the mathematics subject area. He conducts link analysis and pure mathematics research and has published 10 refereed journal articles in journals such as the Journal of the American Society for Information Science and Technology, Journal of Information Science, and Information Processing & Management.