Abstract
Linkage analysis is widely assumed to be of significant benefit to web search, and it is known to be implemented by many major search engines. Why, then, have so few TREC participants been able to demonstrate its benefits scientifically in recent years? In this paper we put forward reasons for the disappointing results found in many TREC experiments and, by examining the linkage structure of the WWW, identify the linkage density a dataset requires in order to faithfully support experiments in linkage-based retrieval. Based on these requirements, we report on methodologies for synthesising such a test collection.
Cite this article
Gurrin, C., Smeaton, A.F. Replicating Web Structure in Small-Scale Test Collections. Information Retrieval 7, 239–263 (2004). https://doi.org/10.1023/B:INRT.0000011206.23588.ab