Abstract
The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web. The central issue we address within our framework is the distillation of broad search topics, through the discovery of “authoritative” information sources on such topics. We propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of “hub pages” that join them together in the link structure. Our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis.
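The abstract does not spell out the procedure, so the following is only a minimal sketch of the mutually reinforcing relationship between hubs and authorities, under stated assumptions: a page's authority score is taken to be the sum of the hub scores of the pages linking to it, a page's hub score is the sum of the authority scores of the pages it links to, and both score vectors are rescaled to unit Euclidean length after each round. The function names, the dictionary-based graph representation, and the fixed iteration count are illustrative choices, not details taken from the paper.

```python
import math


def normalize(scores):
    """Rescale a score dict to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0.0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hubs_and_authorities(links, iterations=50):
    """Iterate hub and authority scores on a small link graph.

    links: dict mapping a page to the list of pages it links to
           (hypothetical representation, chosen for illustration).
    Returns (hub, auth), two dicts keyed by page.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # Authority update: add up hub scores over incoming links.
        new_auth = {page: 0.0 for page in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # Hub update: add up the fresh authority scores over outgoing links.
        new_hub = {page: 0.0 for page in pages}
        for src, targets in links.items():
            for dst in targets:
                new_hub[src] += new_auth[dst]

        auth = normalize(new_auth)
        hub = normalize(new_hub)

    return hub, auth
```

If the link graph is written as an adjacency matrix A, one round of this update multiplies the authority vector by AᵀA and the hub vector by AAᵀ, which is one way to see the connection to eigenvectors that the abstract mentions.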